
Image Compression Using Simulated Annealing

Aritra Dutta*, Geonwoo Kim†, Meiqin Li‡, Carlos Ortiz Marrero§, Mohit Sinha¶, Cole Stiegler‖

August 5–14, 2015

Mathematical Modeling in Industry XIX
Institute of Mathematics and its Applications

Proposed by: 1QB Information Technologies
Mentors: Michael P. Lamoureux¹, Pooya Ronagh²

Abstract

Over the course of this two-week workshop, we developed an algorithm for image compression using an optimized implementation of simulated annealing intended to solve Ising spin problems. Our motivation is to be able to execute this algorithm using the 1QBit interface for the D-Wave quantum computational hardware system. We also explore a combination of simulated annealing and regression techniques and compare their performance. Finally, we discuss ways to optimize our algorithm in order to make it feasible for a D-Wave architecture.

*University of Central Florida   †Pusan National University   ‡Texas A&M University   §University of Houston   ¶University of Minnesota, Twin Cities   ‖University of Iowa   ¹Pacific Institute for the Mathematical Sciences   ²1QB Information Technologies, Vancouver, B.C.


1 Introduction

Simulated annealing (SA) is a commonly used algorithm for heuristic optimization. Inspired by the study of thermal processes, this algorithm has been particularly successful at providing approximate solutions to NP-hard problems [4]. The algorithm essentially performs a Monte Carlo simulation along with a transition function. At the beginning of the simulation the algorithm is encouraged to explore the landscape of the objective function, while the probability of abrupt transitions is slowly lowered. If the process is done slowly enough, then after a few repetitions one can hope to find the optimal value.
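
As a concrete illustration of the loop just described, here is a minimal SA sketch with a geometric cooling schedule. The objective, neighbor function, and parameter values are placeholders for illustration; this is not the optimized implementation of [5] used below.

```python
import math
import random

def simulated_annealing(objective, neighbor, x0, T0=1.0, cooling=0.999, steps=10000):
    """Minimize `objective` by SA with a geometric cooling schedule (a sketch)."""
    x, fx = x0, objective(x0)
    best, fbest = x, fx
    T = T0
    for _ in range(steps):
        y = neighbor(x)                  # propose a random transition
        fy = objective(y)
        # Always accept downhill moves; accept uphill moves with
        # probability exp(-(fy - fx)/T), which shrinks as T is lowered.
        if fy <= fx or random.random() < math.exp(-(fy - fx) / T):
            x, fx = y, fy
            if fx < fbest:
                best, fbest = x, fx
        T *= cooling                     # slowly lower the temperature
    return best, fbest
```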

Our goal is to apply this algorithm to image compression and image reconstruction. We use an implementation of SA optimized to solve Ising spin type problems [5]. That implementation's original purpose was to provide a highly optimized version of SA and compare its performance against a D-Wave device. It turns out that finding the lowest energy state in an Ising model is equivalent to solving a quadratic unconstrained binary optimization problem (QUBO).

This investigation is motivated by the assumption that our algorithm could be implemented in the processor produced by D-Wave Systems, Inc. An advantage of the D-Wave system is that it can produce a spectrum of optimal and suboptimal answers to a QUBO/Ising optimization problem, rather than merely the lowest energy point. This work grew out of the question of whether the sparse recovery problem is feasible for the D-Wave architecture.

2 The Ising Model

The Ising spin model is a widely used model that can describe any system of individual elements interacting via pairwise interactions [1].

Definition 2.1 (Ising Problem). Let $G = (V, E)$ be a graph on $n$ vertices with vertex set $V$ and edge set $E$, and let $s_i \in \{-1, 1\}$ for $i \in V$. For a given configuration $s = (s_1, s_2, \ldots, s_n) \in \{-1, 1\}^n$, the energy of this system is given by

$$H(s) = \sum_{k \in V} h_k s_k + \sum_{(i,j) \in E} J_{ij} s_i s_j = \langle h, s \rangle + \langle s, Js \rangle,$$

where $h = (h_1, \ldots, h_n) \in \mathbb{R}^n$ and $J = (J_{ij}) \in M_n(\mathbb{R})$.


The simulated annealing implementation [5] we will be using is designed to minimize the function $H(s)$.
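
For concreteness, the energy in Definition 2.1 is straightforward to evaluate; the sketch below (ours, with a dense $J$ for simplicity) also shows the single-spin-flip move that an SA loop for this energy would typically propose.

```python
import numpy as np

def ising_energy(s, h, J):
    """H(s) = <h, s> + <s, J s> for a spin configuration s in {-1, +1}^n."""
    return h @ s + s @ (J @ s)

def flip_one_spin(s, rng=None):
    """Candidate move for an SA loop: flip one randomly chosen spin."""
    rng = rng or np.random.default_rng()
    t = s.copy()
    i = rng.integers(len(s))
    t[i] = -t[i]
    return t
```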

3 Image Compression/Reconstruction Problem

The underlying optimization problem studied here is to find a sparse solution to the underdetermined linear system $Ax = b$, where $A$ is an $m \times n$ matrix, $b$ is an $m$-vector, and $m \le n$. The problem can be written as

$$\min_{x \in \{0,1\}^n} \|x\|_0 \quad \text{subject to} \quad \|Ax - b\|_2^2 \le \delta. \tag{1}$$

Note that for a binary vector $x \in \{0,1\}^n$, the two norms $\|x\|_0$ and $\|x\|_1$ are identical.

This problem can be interpreted as an image compression problem, where the goal is to find a sparse binary vector $x$ such that its image under $A$ is close to the original image $b$. Alternatively, it can be thought of as an image reconstruction problem, where the goal is to recover a sparse binary vector $x$ that was corrupted by $A$, given access only to the corrupted image $b$. Here we discuss only the image compression problem.

4 QUBO and the Ising Model

It turns out that problem (1) can be relaxed to the following unconstrained optimization problem,

$$\min_{x \in \{0,1\}^n} \|x\|_0 + \lambda \|Ax - b\|_2^2 \tag{2}$$

for a large penalty parameter $\lambda > 0$. Let $e := (1, \ldots, 1) \in \{0,1\}^n$ and consider the following reformulation of (2):

$$\begin{aligned}
\|x\|_0 + \lambda \|Ax - b\|_2^2 &= \langle e, x \rangle + \lambda \langle Ax - b, Ax - b \rangle \\
&= \langle e, x \rangle + \lambda \langle Ax, Ax \rangle - 2\lambda \langle Ax, b \rangle + \lambda \langle b, b \rangle \\
&= \langle e, x \rangle + \lambda \langle A^* A x, x \rangle - 2\lambda \langle Ax, b \rangle + \lambda \langle b, b \rangle \\
&= \langle e - 2\lambda A^* b, x \rangle + \langle \lambda A^* A x, x \rangle + \lambda \langle b, b \rangle.
\end{aligned}$$

Notice that (2) is a QUBO problem.


Definition 4.1 (QUBO). Let $g(x) = \langle x, Qx \rangle + \langle c, x \rangle$, where $Q$ is a symmetric matrix, $c \in \mathbb{R}^n$, and $x = (x_1, \ldots, x_n) \in \{0,1\}^n$. The quadratic unconstrained binary optimization problem is to minimize the function $g(x)$ over $\{0,1\}^n$.

In order to use the solver we need to formulate our problem as an Ising problem. Consider the following change of variables,

$$x = \frac{1}{2}(s + e) \implies x \in \{0,1\}^n \text{ for } s \in \{-1,1\}^n.$$

Now equation (2) becomes

$$\min_{s \in \{-1,1\}^n} \frac{1}{2} e^T (s + e) + \lambda \left\| \frac{1}{2} A(s + e) - b \right\|_2^2.$$
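
The algebra above translates directly into code. The sketch below (ours, not part of the solver) builds the QUBO coefficients $Q = \lambda A^* A$ and $c = e - 2\lambda A^* b$ from (2), dropping the constant term $\lambda \langle b, b \rangle$ since it does not affect the minimizer, and evaluates the objective after the change of variables $x = \frac{1}{2}(s + e)$.

```python
import numpy as np

def qubo_from_compression(A, b, lam):
    """Coefficients Q, c with g(x) = <x, Qx> + <c, x> equal, up to the
    constant lam*<b, b>, to ||x||_0 + lam*||Ax - b||_2^2 on binary x."""
    n = A.shape[1]
    Q = lam * (A.T @ A)
    c = np.ones(n) - 2.0 * lam * (A.T @ b)
    return Q, c

def qubo_objective_in_spins(Q, c, s):
    """Evaluate the QUBO objective after the substitution x = (s + e)/2."""
    x = 0.5 * (s + 1.0)
    return x @ (Q @ x) + c @ x
```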

5 Results

In this section we discuss our implementation of the problem described in section 3, illustrating how our algorithm effectively reduces the image size while maintaining a reasonable level of quality. Recall our problem is

$$\min_{x \in \{0,1\}^n} \|x\|_1 + \lambda \|Ax - b\|_2^2,$$

where $b$ is the original image and $A$ is our blurring operator, in the form of a convolution expressed explicitly as a sparse matrix. In order to define a convolution, we must first choose an appropriate kernel. We found that Gaussian and averaging disk kernels performed significantly better than other types of kernels, so we limit our discussion to these.

We use the normalized mean square error (NMSE) as our performance measure,

$$\mathrm{NMSE} = \frac{\|Ax - b\|_2}{\|b\|_2},$$

where $b$ is an array of grayscale values representing a target image and $Ax$ is the blurring convolution applied to the binary compression $x$.

The following computations were performed on a MacBook Pro running OS X 10.10 with a 2.3 GHz Intel Core i7 and 16 GB of memory. For our images, we used Matlab's built-in clown and mandrill images. Each took approximately 40 seconds for the entire computation, most of which was spent in the C solver written by Troyer et al. [5]. Since the solver is standing in for the quantum annealer, the long computation time is not concerning, as true quantum hardware would execute this much faster. The original clown image, the binary image, and the smoothed binary image are shown in Figure 1, with parameter $\lambda = 100$.

Figure 1: True image, binary representation, and smoothed reconstruction.

Initially, we started with a rather large penalty to encourage $\|Ax - b\|$ to be as small as possible, at the expense of a less sparse result. On average, the binary images had 30-35% nonzero entries in the solution before being blurred. With $\lambda = 100$ we obtained the results shown in Table 1 for the different kernels. Note that smaller NMSE values indicate better performance.

Table 1: NMSE results for various smoothing kernels. (Smaller is better.)

    Kernel Type   Size   Standard Deviation   NMSE
    Gaussian      3x3    1                    0.1689
    Gaussian      5x5    1                    0.1591
    Gaussian      7x7    1                    0.1576
    Gaussian      3x3    0.9                  0.1720
    Gaussian      5x5    0.9                  0.1573
    Gaussian      7x7    0.9                  0.1557
    Gaussian      3x3    0.8                  0.1837
    Gaussian      5x5    0.8                  0.1673
    Gaussian      7x7    0.8                  0.1586
    Disc          3x3    n/a                  0.1878
    Disc          5x5    n/a                  0.1542
    Disc          7x7    n/a                  0.1830
    Disc          9x9    n/a                  0.2015

In an attempt to further improve our results, we combined the blurred images $Ax$ in a couple of different ways. The first was a simple averaging combination, where we took the mean of $\{A_1 x_1, A_2 x_2, \ldots\}$ and found the difference between that mean and the original image. The other combination took the element-wise maximum among the elements of $\{A_1 x_1, A_2 x_2, \ldots\}$ and used that as the final value for that entry of $Ax$. The averaging combination proved more effective with a larger penalty, where the results were already quite good but the combination made little difference. The maximum combination was more effective with the sparser images corresponding to a lower penalty, effecting a significant improvement in error in those cases. We believe that in the sparse case this can be thought of as kernel selection, where the best-shaped kernel for a given pixel region is selected in the binary representation and then blurred by the appropriate operation.
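
A sketch of the two combination rules and of the error measure (ours; `blurred_list` stands for the collection $\{A_1 x_1, A_2 x_2, \ldots\}$):

```python
import numpy as np

def combine_average(blurred_list):
    """Element-wise mean of {A1 x1, A2 x2, ...}."""
    return np.mean(blurred_list, axis=0)

def combine_max(blurred_list):
    """Element-wise maximum of {A1 x1, A2 x2, ...}."""
    return np.max(blurred_list, axis=0)

def nmse(Ax, b):
    """Normalized mean square error ||Ax - b||_2 / ||b||_2."""
    return np.linalg.norm(Ax - b) / np.linalg.norm(b)
```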

Combining the 3x3, 5x5, and 7x7 Gaussians with standard deviation 0.9 together with the 5x5 and 7x7 discs, the max-combination error was 0.1736 and the average-combination error was 0.1454. Combining the 3x3, 5x5, and 7x7 Gaussians with standard deviation 0.8 together with the 3x3 and 9x9 discs, the max-combination error was 0.1958 and the average-combination error was 0.1444. In general, with a large penalty, the average-combination error was smaller than the max-combination error. The best result above was from the average-combination with error 0.1444, which is visually rather close to the original image and is shown in Figure 2.

Figure 2: Average-combination of kernels, NMSE = 0.1444, $\lambda = 100$.

We also used a small penalty to encourage sparseness, making the $\|x\|_1$ term more influential in the minimization problem. With penalty $\lambda = 0.7$, the results had an average of 10% nonzero entries in the binary solutions. Increasing sparseness further is possible with a lower penalty, but at the expense of a large increase in error.

Combining the 3x3, 5x5, and 7x7 Gaussians with standard deviation 0.9 together with the 5x5 and 7x7 discs, the max-combination error was 0.5131 and the average-combination error was 0.5945. Combining the 3x3, 5x5, and 7x7 Gaussians with standard deviation 0.8 together with the 3x3 and 9x9 discs, the max-combination error was 0.5298 and the average-combination error was 0.6024.

In general, with a small penalty, the average-combination error was larger than the max-combination error. The best result above was the max-combination error of 0.5131, shown in Figure 3. However, the large sparseness results in poor image reconstruction.

Figure 3: Max-combination of kernels, NMSE = 0.5131, $\lambda = 0.7$.

Overall, this approach is valid as an image compression and reconstruction technique, at least with the larger penalty, since the method converts a grayscale (8-bit) image into a binary (1-bit) image, a reduction in memory storage of 87.5%. The remaining memory required to reconstruct the image to the roughly 86% accuracy found above is at most two or three parameters: the type, size, and possibly standard deviation of the kernel. Thus we have accomplished a large reduction in memory required with only a small loss in image accuracy.

Having examined exact NMSE results for two specific penalty values, we now investigate the relationship between the penalty $\lambda$ of the least squares term in the objective function, the sparsity of the binary images, and the NMSE of the blurred images over a range of penalty values. Sparsity as depicted below is the number of nonzero elements (ones) in the binary image divided by the total number of pixels (64000). As the actual error and sparsity varied little between different types of Gaussian kernels, a 3x3 Gaussian kernel with standard deviation 0.9 was used to produce these results. First we look at the relationship between penalty and sparsity, as shown in Figure 4.

Figure 4: Sparsity vs. penalty.

One can see that the sparsity starts quite low and rapidly levels out as the penalty increases. This is unsurprising: with a very low penalty our minimization problem simply minimizes the number of ones in the image with no constraint, while with a higher penalty the least squares term dominates and we effectively ignore the sparsity component of the objective function. Next, we compare the penalty and the NMSE of the objective function, shown in Figure 5.

Again, the results are not surprising. The least squares term in our objective function is effectively a squared Frobenius norm, so as we increase the penalty to weight the objective function more heavily towards the least squares term, we see a corresponding decrease in the NMSE. For both of the above plots, applying a log transform to the penalty reveals a roughly inverse relationship, though not an exact one, as shown in Figure 6.

Finally, we look at the relationship between sparsity and error, shown in Figure 7. Unsurprisingly, as the image becomes less sparse (the percent of nonzero entries goes up), we see a decrease in NMSE.

Ultimately, the increase in sparsity (a lower value of the sparsity measure) does not justify the corresponding increase in error. While 5% sparsity would take a sixth of the memory of 30% sparsity, the tradeoff between about 20% error for the less sparse image and about 90% error for the sparser image is not nearly worth it. By virtue of compressing our image to a binary image, we have already accomplished a significant memory reduction with minimal increase in error; a further, smaller, reduction in memory usage does not justify a massive decrease in the quality of the image.

Figure 5: NMSE vs. penalty.

Figure 6: NMSE vs. log(penalty).

Figure 7: NMSE vs. sparsity.

6 Regression Techniques

In this section, we discuss our exploration of other image reconstruction methods, used to compare and contrast with the SA implementation developed above. In particular, we considered both ordinary least squares and ridge regression (Tikhonov regularization) to reconstruct the image.

6.1 Least squares and ridge regression

Let $A \in \mathbb{R}^{m \times n}$ be the measurement matrix, such that $m > n$ and $\mathrm{rank}(A) = n$. Unless the measurements are perfect, the image vector $b$ lies outside the column space of $A$. Therefore it is hard to find an element $x \in \mathbb{R}^n$ which gives an exact solution to the overdetermined system

$$Ax = b, \tag{3}$$

even when the underlying target is in the range of $A$. One can still obtain an approximate solution to (3) by solving the following minimization problem:

$$x_{LS} = \arg\min_{x \in \mathbb{R}^n} \|Ax - b\|_2^2. \tag{4}$$


Figure 8: The projection $p = Ax$ is closest to $b$; so $x$ minimizes $E = \|b - Ax\|_2^2$.

This least squares solution is given by $x_{LS} := (A^T A)^{-1} A^T b$. Note that finding $x_{LS}$ involves inverting the matrix $A^T A$. If $m \approx n$ and the matrix $A$ is ill-conditioned, then $A^T A$ is singular or "nearly singular". Moreover, if $x_{LS}$ has all $n$ components non-zero, then it is not suitable as a sparse vector for explaining the data. To give preference to a particular solution with desirable properties one can solve the regularized problem

$$\min_{x \in \mathbb{R}^n} \|Ax - b\|_2^2 + \tau \|x\|_2^2, \tag{5}$$

where $\tau > 0$ is a fixed balancing parameter. In Figure 9, the solid blue area represents the constraint region of $\|x\|_2^2$, while the red ellipses are the contours of the least squares error function. The ridge regression solution to (5) is given by $x_{Ridge} = (A^T A + \tau I_n)^{-1} A^T b$. The minimum eigenvalue of $A^T A + \tau I_n$ is greater than or equal to $\tau$, which guarantees the invertibility of $A^T A + \tau I_n$. If the measurement matrix $A$ is augmented with the $n$ additional rows $\sqrt{\tau} I_n$, and the vector $b$ with $n$ zeros, then (5) can also be viewed as a least squares problem on the augmented dataset

$$\begin{pmatrix} A \\ \sqrt{\tau} I_n \end{pmatrix} x = \begin{pmatrix} b \\ 0_n \end{pmatrix}.$$

Therefore (5) is equivalent to solving

$$x_{Ridge} = \arg\min_{x \in \mathbb{R}^n} \left\| \begin{pmatrix} A \\ \sqrt{\tau} I_n \end{pmatrix} x - \begin{pmatrix} b \\ 0_n \end{pmatrix} \right\|_2^2. \tag{6}$$
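
The equivalence between (5) and (6) is easy to check numerically; the following small sketch (ours) compares the closed-form ridge solution with the least squares solution of the augmented system. Both return the same vector up to numerical precision.

```python
import numpy as np

def ridge_direct(A, b, tau):
    """x_Ridge = (A^T A + tau I_n)^{-1} A^T b."""
    n = A.shape[1]
    return np.linalg.solve(A.T @ A + tau * np.eye(n), A.T @ b)

def ridge_augmented(A, b, tau):
    """Same solution via least squares on the augmented system (6)."""
    n = A.shape[1]
    A_aug = np.vstack([A, np.sqrt(tau) * np.eye(n)])
    b_aug = np.concatenate([b, np.zeros(n)])
    return np.linalg.lstsq(A_aug, b_aug, rcond=None)[0]
```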


Figure 9: Ridge regression estimate in $\mathbb{R}^2$.

6.2 Implementation

Recall that the image obtained from the SA has binary entries. Let $x^* \in \{0,1\}^n$ be the binary image obtained from SA. Denote by $T := \{i : x^*_i = 1\}$ the support set of $x^*$. We form a truncated matrix $A_T$ from $A$, keeping only the columns indexed by $T$, so that $A_T \in \mathbb{R}^{m \times |T|}$, where $|T|$ is the cardinality of the set $T$. We use $A_T$ to solve a "truncated least squares" system:

$$x_{TLS} = \arg\min_{x \in \mathbb{R}^{|T|}} \|A_T x - b\|_2^2. \tag{7}$$

Finally, we replace $x_T$ by $x_{TLS}$. Next, we use "truncated ridge regression" to solve the minimization problem:

$$x_{TR} = \arg\min_{x \in \mathbb{R}^{|T|}} \|A_T x - b\|_2^2 + \tau \|x\|_2^2. \tag{8}$$

As before, we replace $x_T$ by $x_{TR}$.

Figures 10 and 11 show a comparison of the results from our SA implementation, least squares, and ridge regression. Observe that the SA implementation is quite successful in comparison to these other two methods, and that truncated ridge regression does better than truncated least squares.
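
A sketch of this refitting step (ours): with $\tau = 0$ it solves the truncated least squares problem (7), and with $\tau > 0$ the truncated ridge problem (8), reusing the augmentation trick from section 6.1.

```python
import numpy as np

def truncated_fit(A, b, x_binary, tau=0.0):
    """Refit the nonzero entries of the SA solution on T = {i : x*_i = 1}."""
    T = np.flatnonzero(x_binary)          # support set of the binary image
    A_T = A[:, T]                         # truncated matrix A_T
    if tau > 0:                           # ridge via the augmented system
        A_T = np.vstack([A_T, np.sqrt(tau) * np.eye(len(T))])
        b = np.concatenate([b, np.zeros(len(T))])
    x_T = np.linalg.lstsq(A_T, b, rcond=None)[0]
    x = np.zeros(len(x_binary))
    x[T] = x_T                            # replace x_T by the refit values
    return x
```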


Figure 10: (a) Original image, (b) SA reconstruction.

Figure 11: Comparison between (a) truncated least squares and (b) truncated ridge regression.


7 SPGL1: A Solver for Large-Scale Sparse Optimization

In this section we discuss our use of SPGL1, a standard large-scale sparse solver (see reference [3]), and compare it with our SA results in reconstructing the image.

7.1 Outline of the method and results

Solving the system $Ax = b$, where $A \in \mathbb{R}^{m \times n}$ with $m \ll n$, suffers from ill-posedness. The classic sparse convex optimization problems which attempt to solve the system are

1. $\min_x \|x\|_1$ subject to $Ax = b$. (BP)

2. $\min_x \|x\|_1$ subject to $\|Ax - b\|_2 \le \sigma$. (BP$_\sigma$)

3. $\min_x \|Ax - b\|_2$ subject to $\|x\|_1 \le \tau$. (LS$_\tau$)

Homotopy approaches, basis pursuit denoising (BPDN) as a cone program, BPDN as a linear program, and projected gradient methods are the classic approaches to solving the above problems. If $b \in \mathcal{R}(A)$ and $b \ne 0$, denote by $x_\tau$ the optimal solution of (LS$_\tau$). In SPGL1 one considers the single-parameter function

$$\phi(\tau) = \|r_\tau\|_2, \quad \text{with } r_\tau := b - A x_\tau,$$

which gives the optimal value of (LS$_\tau$) for each $\tau > 0$. The method thus reduces to finding a root of

$$\phi(\tau) = \sigma.$$

14

Page 16: Image Compression Using Simulated Annealing · Image Compression Using Simulated Annealing Aritra Dutta⇤,GeonwooKim †,MeiqinLi ‡, Carlos Ortiz Marrero§,MohitSinha ¶,ColeStiegler

In order to derive the dual of (LS$_\tau$), this method solves the equivalent problem

$$\min_{r,x} \|r\|_2 \quad \text{subject to} \quad Ax + r = b, \quad \|x\|_1 \le \tau.$$

The dual of this problem is

$$\max_{y,\lambda} \left\{ \min_{r,x} \left\{ \|r\|_2 - y^T (Ax + r - b) + \lambda (\|x\|_1 - \tau) \right\} \right\},$$

for $\lambda > 0$. Finally, the dual of (LS$_\tau$) reduces to

$$\max_{y,\lambda} \; b^T y - \tau \lambda \quad \text{subject to} \quad \|y\|_2 \le 1, \quad \|A^T y\|_\infty \le \lambda.$$

Theorem: With this setup, the following holds:

1. The function $\phi$ is convex and non-increasing.

2. For all $\tau \in (0, \tau_{BP})$, $\phi$ is continuously differentiable, $\phi'(\tau) = -\lambda_\tau$, where the optimal dual variable is $\lambda_\tau = \|A^T y_\tau\|_\infty$ with $y_\tau = r_\tau / \|r_\tau\|_2$.

3. For $\tau \in [0, \tau_{BP}]$, $\|x_\tau\|_1 = \tau$, and $\phi$ is strictly decreasing.

The algorithm: Based on Newton's method, iterate

$$\tau_{k+1} = \tau_k + \Delta\tau_k, \quad \text{where} \quad \Delta\tau_k = \frac{\sigma - \phi(\tau_k)}{\phi'(\tau_k)}.$$

In Figure 12, we contrast the results from SA minimization with the SPGL1 results. It is interesting to note that SPGL1 immediately gives a grayscale image, as it optimizes over a continuous range of $x$ values, while the SA result shown here uses only binary values. The reconstructed image in Figure 1 is a better indication of the good results obtainable with SA.
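
The root-finding step can be sketched as follows (ours); `phi` and `phi_prime` stand for oracles that solve (LS$_\tau$) and return $\|r_\tau\|_2$ and $-\lambda_\tau$ respectively, which in SPGL1 are supplied by a projected-gradient Lasso solver.

```python
def newton_root(phi, phi_prime, sigma, tau0=0.0, tol=1e-6, max_iter=20):
    """Find tau with phi(tau) = sigma via tau_{k+1} = tau_k + dtau_k."""
    tau = tau0
    for _ in range(max_iter):
        f = phi(tau)
        if abs(f - sigma) < tol:
            break
        tau += (sigma - f) / phi_prime(tau)   # dtau_k = (sigma - phi)/phi'
    return tau
```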

8 Other Attempts

While working on the project, we had the idea that we might be able to use SA to remove systematic blur from an image, such as the simulated motion blur shown in Figure 13. It was an interesting idea, but it is not clear that we obtained useful results, so we simply mention it here as an idea possibly worth pursuing.


Figure 12: Comparison between (a) SA and (b) SPGL1 minimization.


Figure 13: Blurred image.

9 Tuning and Optimizing the Algorithm

A significant concern with a real implementation on the quantum optimizer is that the computing hardware has a limited number of nodes to represent data in the Ising model. For instance, current hardware from D-Wave limits this to about 1000 nodes. The image compression algorithm requires hundreds of thousands of nodes, which is problematic for existing hardware, so we had to consider methods for breaking the large problem into smaller, computable pieces.

In this section we discuss an efficient way to tune and optimize the algorithm. Recall that the original linear system is $Ax = b$, where $A \in \mathbb{R}^{m \times n}$ and $m < n$. In our examples, $m$ is very large, and solving the system requires a huge amount of memory or compute nodes. However, things are better if one can find a $B \in \mathbb{R}^{k \times m}$, with $k < m$, such that for a predefined $\epsilon > 0$, $\|\bar{x} - x\|_2 < \epsilon$, where $\bar{A}\bar{x} = \bar{b}$ with $\bar{A} = BA$ and $\bar{b} = Bb$.

9.1 Optimizing the Algorithm: Reducing Rows

Construct a vector $\bar{b}$ from $b$ by sorting its entries by magnitude, so that $\bar{b} = (b^{(i)}) \in \mathbb{R}^m$ with $|b^{(1)}| \ge |b^{(2)}| \ge \cdots \ge |b^{(m)}| \ge 0$. For a tolerance $0 < \alpha \le 1$, choose $\bar{b}(1:k)$ such that

$$\frac{\|\bar{b}(1:k)\|_2}{\|\bar{b}\|_2} > \alpha. \tag{9}$$

Let $S$ be the corresponding support set (the indices of the selected entries), and construct $\bar{A} \in \mathbb{R}^{|S| \times n}$. We form $B = \left(e_{i_1}\; e_{i_2}\; \cdots\; e_{i_k}\right)^T$, where $e_{i_j}$ is a $1 \times m$ vector with a 1 in the $i_j$-th position and 0 elsewhere. In summary, $B$ acts as an indicator matrix which constructs $\bar{A}$ based on (9). We use SA on $\bar{A}x = \bar{b}$ to reconstruct the image, as shown in Figure 14.

Figure 14: Reduced-row SA reconstruction (number of rows reduced from 64000 to 10843).
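
A sketch of this row-selection step (ours), returning $\bar{A} = BA$ and $\bar{b} = Bb$ for a given tolerance $\alpha$:

```python
import numpy as np

def reduce_rows(A, b, alpha):
    """Keep the smallest leading set of rows, in decreasing order of |b|,
    whose partial norm satisfies criterion (9)."""
    order = np.argsort(-np.abs(b))              # sort |b| in decreasing order
    cum = np.sqrt(np.cumsum(b[order] ** 2))     # ||b_sorted(1:k)||_2, k = 1..m
    k = int(np.searchsorted(cum, alpha * np.linalg.norm(b), side='right')) + 1
    S = np.sort(order[:min(k, len(b))])         # support set; rows kept by B
    return A[S, :], b[S]                        # A_bar = B A, b_bar = B b
```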

Indeed this is a memory-efficient reconstruction. Originally $A$ had 64000 rows; using the indicator matrix $B$ with $\alpha = 0.7$ reduced the number of rows to 10843. On the other hand, we also sacrificed quality in the reconstructed image. For a better reconstruction we target smaller blocks of the image instead of the entire matrix. We divide $b$ into sub-vectors $b_i$ and partition $A$ into corresponding blocks $A_{ij}$, and solve $\sum_{j=1}^{p} A_{ij} x_j = b_i$ for each $i$. We then construct $x_{\mathrm{Recovered}} = (x_j : 1 \le j \le p)$, where $x_j$ solves this system for each $i$. Next we apply the row-reduction idea to each block $A_{ij}$: using the previous technique we solve $\sum_{j=1}^{p} \bar{A}_{ij} \bar{x}_j = \bar{b}_i$, where $\bar{A}_{ij} = B_i A_{ij}$ and $\bar{b}_i = B_i b_i$, and we obtain the recovered image as $\bar{x}_{\mathrm{Recovered}} = (\bar{x}_j : 1 \le j \le p)$, where $\bar{x}_j$ solves the reduced system for each $i$. For a predefined $\epsilon > 0$, we can guarantee $\sum_{j=1}^{p} \|x_j - \bar{x}_j\|_2 < \epsilon$. The result of row-reduced block reconstruction is shown in Figure 15.

Figure 15: Row-reduced block SA reconstruction.

We partitioned the image array into 40 sub-matrices, compressed each block, and reconstructed. One can notice the partition lines in the SA-reconstructed image in Figure 15. To avoid them, we use overlapping block partitions of the image and apply the row-reduced SA reconstruction technique to each block. At the end we merge the reconstructed overlapping blocks and obtain a much better image, as shown in Figure 16.

Figure 16: Overlapping-block, row-reduced SA reconstruction.
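
A sketch of the overlapping-block partition (ours); the report does not specify the block size, the overlap width, or how the overlapping reconstructions were merged, so averaging the overlaps below is an assumption.

```python
import numpy as np

def overlapping_blocks(image, block, overlap):
    """Split a 2-D image into overlapping blocks, keyed by top-left corner."""
    step = block - overlap
    rows, cols = image.shape
    return [((r, c), image[r:r + block, c:c + block])
            for r in range(0, rows - overlap, step)
            for c in range(0, cols - overlap, step)]

def merge_blocks(blocks, shape):
    """Merge reconstructed blocks, averaging wherever they overlap."""
    acc, cnt = np.zeros(shape), np.zeros(shape)
    for (r, c), blk in blocks:
        br, bc = blk.shape
        acc[r:r + br, c:c + bc] += blk
        cnt[r:r + br, c:c + bc] += 1
    return acc / np.maximum(cnt, 1)
```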


10 Conclusion

To summarize, our compression algorithm accomplished a large reduction in memory while maintaining a minimal loss in image accuracy when the penalty was large enough. We obtained the best reconstruction by using an average of kernels. We found that by encouraging sparsity (decreasing the penalty) we lose accuracy. After trying to reconstruct the image with the SPGL1 solver, we found no improvement over a reconstruction using a kernel.

Now that we have developed a working algorithm for image compression, a natural next step is to test it on a quantum annealer. This is where our optimization methods might come in handy when implementing the algorithm on a D-Wave system.

References

[1] Z. Bian, F. Chudak, W. G. Macready, and G. Rose, "The Ising model: teaching an old problem new tricks," D-Wave Systems technical report, Aug. 30, 2010. Available from www.dwavesys.com/resources/publications.

[2] E. van den Berg and M. P. Friedlander, "Probing the Pareto frontier for basis pursuit solutions," SIAM J. on Scientific Computing, vol. 31, no. 2, pp. 890-912, Nov. 2008.

[3] E. van den Berg and M. P. Friedlander, "A solver for large-scale sparse reconstruction," http://www.cs.ubc.ca/labs/scl/spgl1, June 2007.

[4] V. Cerny, "Thermodynamical approach to the traveling salesman problem: An efficient simulation algorithm," J. Optimization Theory and Applications, vol. 45, no. 1, pp. 41-51, Jan. 1985.

[5] S. V. Isakov, I. N. Zintchenko, T. F. Rønnow, and M. Troyer, "Optimized simulated annealing for Ising spin glasses," Computer Physics Communications, vol. 192, pp. 265-271, Jul. 2015.
