

Computational Intelligence Lab FS2011

Stefan Heule

August 22, 2011

License: Creative Commons Attribution-Share Alike 3.0 Unported (http://creativecommons.org/licenses/by-sa/3.0/)


Contents

1 Introduction
2 Dimension Reduction
  2.1 Principal Component Analysis (PCA)
    2.1.1 Computing PCA
    2.1.2 Successive Variance Maximization
    2.1.3 Minimum Error Formulation
    2.1.4 Matrix Viewpoint
  2.2 Singular Value Decomposition
    2.2.1 SVD Component Matrices
    2.2.2 Singular Values
    2.2.3 SVD Interpretation
    2.2.4 Range, Nullspace, Rank
    2.2.5 SVD to Reveal Structure
    2.2.6 Closest Rank-k Matrix
    2.2.7 Computing the SVD
    2.2.8 SVD of Symmetric Matrices
    2.2.9 Relationship to PCA
3 Clustering
  3.1 The K-means Algorithm
  3.2 Model Order Selection
  3.3 Clustering Stability
4 Soft Clustering
  4.1 Probability Basics
    4.1.1 Categorical Distribution
    4.1.2 Gaussian Distribution
    4.1.3 Multivariate Gaussian Distribution
    4.1.4 Multidimensional Moment Statistics
    4.1.5 Conditional Probability
  4.2 Introduction to Mixture Models
  4.3 Gaussian Mixture Models
  4.4 Expectation Maximization for GMM
5 Sampling
  5.1 Monte Carlo Methods
    5.1.1 Expectation from Samples
  5.2 Importance Sampling
  5.3 Rejection Sampling
6 Role Mining and Multi-Assignment Clustering
  6.1 Role-Based Access Control
  6.2 Problem Definition
  6.3 Evaluation Criteria for Boolean Matrix Decomposition
    6.3.1 Evaluating a Matrix Reconstruction
    6.3.2 Evaluating Inference Quality
  6.4 SVD for Role Mining
  6.5 Cross-Validation
  6.6 Simple Approach Using a Variant of K-means
  6.7 The RoleMiner Algorithm
  6.8 DBPsolver
7 Sparse Coding
  7.1 Comparison to Principal Component Analysis
  7.2 Compressive Sensing
  7.3 Summary of Sparse Coding
  7.4 Overcomplete Dictionaries
    7.4.1 Signal Coding
    7.4.2 Noisy Observations
    7.4.3 Matching Pursuit (MP)
    7.4.4 Support Recovery of MP
    7.4.5 Sparse Coding for Inpainting
  7.5 Dictionary Learning
    7.5.1 Coding Step
    7.5.2 Dictionary Update
    7.5.3 Initialization


1 Introduction

The core problem we consider is matrix factorization (MF) of a given data matrix into two or three unknown matrices. Many important data analysis techniques can be written as MF problems by specifying:

• Type of data: e.g. boolean (x_{d,n} ∈ {0, 1}) or non-negative real.

• Approximation: e.g. exact X = A · B or minimal Frobenius norm min_{A,B} ‖X − A · B‖_F^2.

• Constraints on unknown matrices: e.g. A has to be a basis or B must be sparse.

Often, MF is computationally NP-hard, therefore we need efficient and low-error approximation algorithms. We study the following techniques:

1. Dimension reduction

2. Single-assignment clustering

3. Multi-assignment clustering

4. Sparse coding

Definition 1.1 (Orthogonality). Two vectors in an inner product space V are orthogonal if their inner product is zero.

An orthogonal matrix is a square matrix with real entries whose columns (or rows) are orthogonal unit vectors.

Theorem 1.1. The inverse of an orthogonal matrix is its transpose.


Figure 1: Projection of data points.

2 Dimension Reduction

Given N vectors in D dimensions, we want to find the K most important axes and project the data onto them. We reduce the dimensions because:

• Some features may be irrelevant

• We want to visualize high dimensional data, e.g., in 2 or 3 dimensions.

• The intrinsic dimensionality may be smaller than the number of features recorded in the data.

We want the most interesting dimensions:

• data compression

• feature selection

• complexity reduction

2.1 Principal Component Analysis (PCA)

PCA is the orthogonal linear projection of high-dimensional data onto a low-dimensional subspace, as shown in Figure 1.

We have the following objectives when performing a PCA:

1. Minimize the error ‖x_n − x̃_n‖^2 between a point x_n and its approximation x̃_n.

2. Reveal interesting information: maximize variance.

It is possible to show that both objectives are formally equivalent.

2.1.1 Computing PCA

To perform PCA on a data matrix X of size D ×N :

1. Compute the zero-centered data X̄ by subtracting the mean of the sample, X̄ = X − M.

2. Calculate the covariance matrix Σ = (1/N) X̄ X̄^T.


3. Compute the eigenvectors and eigenvalues of the covariance matrix.

4. Compute the projection of X̄ onto the K largest principal components U_K = [u_1, . . . , u_K], i.e., Z_K = U_K^T X̄.
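The four steps above map directly onto a few lines of NumPy. The following is a minimal sketch (the function name and toy data are mine, not from the lecture):

```python
import numpy as np

def pca(X, K):
    """Project the D x N data matrix X onto its K largest principal components."""
    # 1. Zero-center the data by subtracting the sample mean of each dimension.
    mean = X.mean(axis=1, keepdims=True)              # D x 1
    X_bar = X - mean
    # 2. Covariance matrix Sigma = (1/N) X_bar X_bar^T.
    N = X.shape[1]
    Sigma = X_bar @ X_bar.T / N
    # 3. Eigenvectors/eigenvalues of the symmetric covariance matrix.
    eigvals, eigvecs = np.linalg.eigh(Sigma)          # ascending order
    # 4. Keep the K eigenvectors with the largest eigenvalues and project.
    U_K = eigvecs[:, np.argsort(eigvals)[::-1][:K]]   # D x K
    Z_K = U_K.T @ X_bar                               # K x N
    return U_K, Z_K, mean

# Example: reduce 5-dimensional toy data to 2 dimensions.
X = np.random.randn(5, 100)
U_K, Z_K, mean = pca(X, K=2)
X_approx = U_K @ Z_K + mean    # approximate reconstruction in the original basis
```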

2.1.2 Successive Variance Maximization

Consider a set of observations x_n, n = 1, . . . , N with x_n ∈ R^D. We want to project the data onto a K < D dimensional space while maximizing the variance of the projected data.

For K = 1 we define the direction of the projection as u_1 and set ‖u_1‖_2 = 1; only the direction of the projection is important.

For the original data with sample mean x̄, we have the following covariance:

Σ = (1/N) ∑_{n=1}^{N} (x_n − x̄)(x_n − x̄)^T

For the projected data, with new mean u_1^T x̄, the variance is given by

(1/N) ∑_{n=1}^{N} (u_1^T x_n − u_1^T x̄)^2 = (1/N) ∑_{n=1}^{N} (u_1^T (x_n − x̄))^2
  = (1/N) ∑_{n=1}^{N} u_1^T (x_n − x̄)(x_n − x̄)^T u_1
  = u_1^T Σ u_1

We can formulate this as a maximization problem, where we maximize the variance of the projected data. We seek

max_{u_1} u_1^T Σ u_1

such that ‖u_1‖_2 = 1. We use the Lagrangian of this problem,

L := u_1^T Σ u_1 + λ_1 (1 − u_1^T u_1)

and set its partial derivative with respect to u_1 to zero:

∂L/∂u_1 = 0  ⟹  Σ u_1 = λ_1 u_1

Therefore, u_1 is an eigenvector of Σ and λ_1 is its associated eigenvalue. Furthermore, λ_1 is also the variance of the projected data, since λ_1 = u_1^T Σ u_1. This is called the principal direction.

With a similar approach we find that the second principal direction is the eigenvector with the second largest eigenvalue.

The eigenvalue decomposition of the covariance matrix

Σ = U Λ U^T

contains all relevant information.

For a K ≤ D dimensional projection space we can choose the K eigenvectors with the largest associated eigenvalues.


Figure 2: Singular Value Decomposition

2.1.3 Minimum Error Formulation

We define an orthonormal basis u_d, d = 1, . . . , D of R^D. The scalar projection (magnitude) of x_n onto u_d is given by

z_{n,d} = x_n^T u_d

The associated projection onto u_d amounts to z_{n,d} u_d, and thus we can represent every data point by

x_n = ∑_{d=1}^{D} z_{n,d} u_d = ∑_{d=1}^{D} (x_n^T u_d) u_d

2.1.4 Matrix Viewpoint

We can also consider PCA from a matrix factorization viewpoint by representing the data as a matrix

X = [x1, . . . ,xN ]

The corresponding zero-centered data is

X̄ = X − M, where M = [x̄, . . . , x̄]

We factor X̄ by computing its projection onto U_K = [u_1, . . . , u_K] (the eigenvectors with the K largest eigenvalues):

Z_K = U_K^T X̄

To approximate X we return to the original basis:

X̃ = U_K · Z_K

Of course, if we use K = D, we obtain a perfect reconstruction.

2.2 Singular Value Decomposition

The singular value decomposition (SVD) is a widely used technique to decompose a matrix into several component matrices, exposing many of the useful and interesting properties of the original matrix like rank, nullspace, orthogonal basis of column and row space, etc. Applications which deploy SVD include computing pseudoinverses, least squares fitting of data, matrix approximation, and determining the rank, range and null space of a matrix.

Theorem 2.1 (SVD Theorem). Let A be any real M by N matrix, A ∈ R^{M×N}; then A can be decomposed as A = U D V^T.


Figure 3: Left and right singular vectors

• U is an M × M orthogonal matrix such that U^T U = I_M.

• D is an M ×N diagonal matrix.

• V^T is an N × N orthogonal matrix such that V^T V = I_N.

2.2.1 SVD Component Matrices

The elements of D are non-zero only on the diagonal and are called the singular values. An important relation to the rank r of the matrix is

d_{i,i} > 0 for 1 ≤ i ≤ r,    d_{i,i} = 0 for r + 1 ≤ i ≤ n,    d_{i,j} = 0 for i ≠ j

By convention, the ordering of the singular vectors is determined by high-to-low sorting of the singular values, with the highest singular value in the upper left corner of D.

d1 ≥ d2 ≥ · · · ≥ dr > dr+1 = · · · = dn = 0

The first r columns of U are called the left singular vectors; they form an orthonormal basis for the space spanned by the columns of the original matrix A. Similarly, the first r rows of V^T are the right singular vectors; they form an orthonormal basis for the row space of A. This is illustrated in Figure 3.

We can also interpret the columns of U as the eigenvectors of A A^T. This can be verified using the SVD:

A A^T = U D V^T V D U^T = U D^2 U^T

Similarly, the rows of V^T (or the columns of V) are the eigenvectors of A^T A, since

A^T A = V D U^T U D V^T = V D^2 V^T

2.2.2 Singular Values

The singular values of a matrix A represent the lengths of the semi-axes of the hyper-ellipsoid E defined as E = {Ax | ‖x‖_2 = 1}. Hence, they are non-negative.

2.2.3 SVD Interpretation

Suppose we are given a matrix A that records the ratings of movies by users (rows correspond to users, columns to movies). Then,


• U: user-to-concept affinity matrix.

• V: movie-to-concept affinity matrix.

• D: the diagonal elements of D represent the expressiveness of each concept in the data.

2.2.4 Range, Nullspace, Rank

SVD provides an explicit representation of the range and nullspace of a matrix.

• The right singular vectors corresponding to vanishing singular values of A span the nullspace of A.

• The left singular vectors corresponding to the non-zero singular values of A span the range of A.

2.2.5 SVD to Reveal Structure

SVD can be viewed from three mutually compatible points of view:

1. SVD is a method for transforming correlated variables into a set of uncorrelated ones that better expose the various relationships among the original data items.

2. SVD is a method for identifying and ordering the dimensions along which data points exhibit the most variation.

3. SVD is a data reduction method: once we have identified the directions of most variation, we can obtain the best approximation of the original data points using fewer dimensions.

2.2.6 Closest Rank-k Matrix

We can use this decomposition to find the closest rank-K approximation. Let A be a matrix of rank R. If we wish to approximate A by a matrix of lower rank K, then

A_K = ∑_{k=1}^{K} d_k u_k v_k^T

is the closest matrix in the L2 norm. More formally, we have

min_{B, rank(B)=K} ‖A − B‖_2 = ‖A − A_K‖_2

Moreover, the magnitudes of the nonzero singular values provide a measure of how close A is to a matrix of lower rank in the Euclidean norm:

‖A − A_K‖_2 = d_{K+1}

The spectral norm of a matrix A is the largest singular value of A:

‖A‖_2 = σ_max(A)
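As an illustration, the rank-K approximation can be computed with a truncated SVD in NumPy; a small sketch (names are mine):

```python
import numpy as np

def rank_k_approximation(A, K):
    """Return the closest rank-K matrix to A in the spectral (and Frobenius) norm."""
    U, d, Vt = np.linalg.svd(A, full_matrices=False)   # d: singular values, descending
    return U[:, :K] @ np.diag(d[:K]) @ Vt[:K, :]

A = np.random.randn(6, 4)
A2 = rank_k_approximation(A, K=2)
# The approximation error in the spectral norm equals the (K+1)-th singular value.
print(np.linalg.norm(A - A2, 2), np.linalg.svd(A, compute_uv=False)[2])
```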


2.2.7 Computing the SVD

Given a matrix A

1. Compute its transpose A^T and then A^T A.

2. Determine the eigenvalues of A^T A and sort them in descending order. Take their square roots to obtain the singular values of A.

3. Construct the diagonal matrix D by placing the singular values in descending order along its diagonal.

4. Use the eigenvalues from the second step to compute the eigenvectors of A^T A. Place them (in the right order) along the columns of V and compute V^T.

5. Compute the inverse of the diagonal matrix, D^{-1}, using (D^{-1})_{ii} = 1/D_{ii}. Compute the matrix U = A V D^{-1}.
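For illustration, a sketch of this procedure in NumPy, assuming A has full column rank so that D is invertible (in practice one would simply call np.linalg.svd):

```python
import numpy as np

def svd_via_eig(A):
    """Compute A = U D V^T following the steps above (assumes full column rank)."""
    # Steps 1-2: eigenvalues of A^T A, sorted descending; singular values are their square roots.
    eigvals, V = np.linalg.eigh(A.T @ A)
    order = np.argsort(eigvals)[::-1]
    eigvals, V = eigvals[order], V[:, order]
    d = np.sqrt(np.clip(eigvals, 0.0, None))
    # Step 3: diagonal matrix D of singular values.
    D = np.diag(d)
    # Step 5: U = A V D^{-1}.
    U = A @ V @ np.diag(1.0 / d)
    return U, D, V.T

A = np.random.randn(5, 3)
U, D, Vt = svd_via_eig(A)
print(np.allclose(A, U @ D @ Vt))   # True up to numerical error
```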

2.2.8 SVD of Symmetric Matrices

Theorem 2.2. If S is a real and symmetric matrix (i.e., S = S^T), then S = U D U^T, where the columns of U are the eigenvectors of S and D is a diagonal matrix with the eigenvalues of S.

2.2.9 Relationship to PCA

Given any m× n matrix A,

• The matrix A A^T is a symmetric matrix describing the similarity between the rows of the matrix A. The eigenvectors [u_1, . . . , u_n] of A A^T are the principal components in a PCA of A (equal up to a constant, assuming A is centered) and are the left singular vectors in the SVD decomposition of A.

• The squared singular values of A are the eigenvalues of AAT and ATA.


Figure 4: Quantization of the space RD

3 Clustering

Given a set of data points in a D-dimensional Euclidean space, we want to find a meaningful partition of the data. The partition should group together similar data points (put them in one group/cluster), while the different clusters should be as dissimilar as possible. Partitioning the data in this way can uncover similarities between data points and give rise to data compression schemes.

We can quantize the space R^D (as shown in Figure 4). A data point x ∈ R^D is represented by one of the K centroids u_k ∈ R^D for k = 1, . . . , K. We assign x to the centroid u_{k*} that has minimal distance.

Definition 3.1 (The Clustering Problem). Consider N data points in a D-dimensional space. Each data vector is denoted by x_n, n = 1, . . . , N. The goal is to partition the data into K clusters. More formally, we seek vectors u_1, . . . , u_K that represent the centroid of each cluster. A data point x_n belongs to cluster k if the Euclidean distance between x_n and u_k is smaller than the distance to any other centroid.

We want to minimize the following cost function

J(U, Z) = ‖X − U Z‖_2^2 = ∑_{n=1}^{N} ∑_{k=1}^{K} z_{k,n} ‖x_n − u_k‖_2^2

The vectors z_n are the assignments of data points to clusters, and we can have various constraints on Z. For now, we consider hard assignments, i.e., Z ∈ {0, 1}^{K×N} with ∑_k z_{k,n} = 1 for all n. That is, exactly one entry per column is set to 1.

We have two problems:

• Determine the set of cluster centroids uk.

• Compute the assignment matrix Z.

Finding a solution to one of them (given the other) is analytically solvable.

We can write (hard) assignments down in two alternative notations. Either we use a vector z ∈ {1, . . . , K}^N where every entry indicates, for each data point, the index of the cluster it is assigned to, e.g.,

z = (2, 3, 4, 2, 2, 1)^T

Alternatively, we can use a matrix Z ∈ {0, 1}^{K×N}, e.g.,

Z =
[ 0 0 0 0 0 1
  1 0 0 1 1 0
  0 1 0 0 0 0
  0 0 1 0 0 0 ]

3.1 The K-means Algorithm

We alternate between two steps:

• Assigning data points to clusters. Let k*(x_n) denote the index of the cluster whose centroid has minimal distance to the data point:

k*(x_n) = arg min_k { ‖x_n − u_1‖_2^2, . . . , ‖x_n − u_K‖_2^2 }

• Updating the cluster centroids based on the data points assigned to them. We compute the mean/centroid of a cluster as

u_k = ( ∑_{n=1}^{N} z_{k,n} x_n ) / ( ∑_{n=1}^{N} z_{k,n} )    for all k

Then, the K-means algorithm can be described as

1. Initialize with a random choice for u_1^{(0)}, . . . , u_K^{(0)} (or use data points, or some other initialization) and set t = 1.

2. Cluster assignment. Solve for all n

k*(x_n) = arg min_k { ‖x_n − u_1^{(t−1)}‖_2^2, . . . , ‖x_n − u_K^{(t−1)}‖_2^2 }

3. Centroid update. The centroids are given by

u_k^{(t)} = ( ∑_{n=1}^{N} z_{k,n}^{(t)} x_n ) / ( ∑_{n=1}^{N} z_{k,n}^{(t)} )    for all k

4. Increment t and repeat from step 2 until ‖u_k^{(t)} − u_k^{(t−1)}‖_2^2 < ε for all k, or t = t_finish.


The computational cost of each iteration is O(KN) and convergence is guaranteed. K-means optimizes a non-convex objective and hence we can only guarantee to find a local minimum. The algorithm can be used to compress data (lossy, if K < N) by storing only the centroids and the assignments of data points to clusters.

Problems of the approach include the following:

• The non-convex objective guarantees only a local minimum, and the algorithm is sensitive to initialization. To some extent, this can be overcome by restarts.

• The optimal number of clusters K is unknown. This problem is referred to as model order selection or model validation.

3.2 Model Order Selection

We have a trade-off between two conflicting goals:

• Data fit. We want to predict the data well; e.g., for clustering,

J(U, Z) = ‖X − U Z‖_2^2 = ∑_{n=1}^{N} ∑_{k=1}^{K} z_{k,n} ‖x_n − u_k‖_2^2

should ideally be zero.

• Complexity. We want a model that is not very complex, which is often measured by the number of free parameters.

We can use the negative log-likelihood of the data; for K-means:

−ℓ(U, Z | X) = J(U, Z) = ‖X − U Z‖_2^2 = ∑_{n=1}^{N} ∑_{k=1}^{K} z_{k,n} ‖x_n − u_k‖_2^2

We can use (at least) two heuristics for choosing the parameter K. We achieve a balance between data fit (measured by the cost function −ℓ(U, Z | X)) and complexity, which we measure by the number of free parameters κ(U, Z) = K · D.

• Akaike Information Criterion (AIC):

AIC(U, Z | X) = −ℓ(U, Z | X) + κ(U, Z)

• Bayesian Information Criterion (BIC):

BIC(U, Z | X) = −ℓ(U, Z | X) + (1/2) κ(U, Z) · ln N

We can then repeat the analysis for different values of K and choose the value of K with the smallest AIC/BIC.
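Assuming the K-means cost is used as the negative log-likelihood, the two criteria can be evaluated for a range of K roughly as follows (this sketch reuses the kmeans function sketched in Section 3.1; all names are mine):

```python
import numpy as np

def kmeans_cost(X, U, z):
    """Negative log-likelihood proxy: sum of squared distances to the assigned centroids."""
    return float(((X - U[:, z]) ** 2).sum())

def model_order_selection(X, K_range):
    D, N = X.shape
    scores = {}
    for K in K_range:
        U, z = kmeans(X, K)                    # kmeans() from the sketch above
        nll = kmeans_cost(X, U, z)
        kappa = K * D                          # number of free parameters
        aic = nll + kappa
        bic = nll + 0.5 * kappa * np.log(N)
        scores[K] = (aic, bic)
    return scores

# Choose the K with the smallest AIC (or BIC) value.
scores = model_order_selection(np.random.randn(2, 300), range(1, 8))
best_K_aic = min(scores, key=lambda K: scores[K][0])
```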

3.3 Clustering Stability

The idea behind the notion of stability is the following: if we repeatedly sample data points from the same underlying model or data-generating process and apply a clustering algorithm, then a good algorithm should return clusterings that do not vary much from one sample to another. Using the K-means algorithm, this selection strategy reduces to choosing the number of clusters for which the clustering is stable.

We can use the following high-level stability test for a given set of data points and a given number of clusters:

1. Generate perturbed versions of the set (e.g., by adding noise or drawing sub-samples).

2. Apply the clustering algorithm on all versions.

3. Compute the pair-wise distances between all clusterings.

4. Compute the instability as the mean distance between all clusterings.

This procedure can be repeated for different numbers of clusters, and then one chooses the number that minimizes the instability.

To measure the distance between two clusterings that are defined on the same data points, we proceed as follows:

1. Compute the distance between the two clusterings by counting the points on which they disagree.

2. Repeat over all permutations of the cluster labels (as the same cluster might have different labels).

3. Choose the permutation with minimal distance.

In other words,

d = min_π ‖Z − π(Z′)‖_0
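For small K, the minimal disagreement over all label permutations can be computed by brute force; a small sketch (helper name is mine, labelings given as integer vectors):

```python
import numpy as np
from itertools import permutations

def clustering_distance(z1, z2, K):
    """Minimal number of points on which two labelings disagree, over all label permutations."""
    best = len(z1)
    for perm in permutations(range(K)):
        relabeled = np.array([perm[k] for k in z2])
        best = min(best, int((np.asarray(z1) != relabeled).sum()))
    return best

# Example: identical clusterings up to renaming of the labels have distance 0.
print(clustering_distance([0, 0, 1, 2], [2, 2, 0, 1], K=3))   # 0
```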

4 Soft Clustering

The term soft clustering is ambiguous, as it can refer to the algorithm or the model:

1. Algorithmic: soft K-means. In the K-means algorithm, a data point is assigned to one and only one cluster. A better estimation of the means is obtained by assigning to each cluster a probability that the data point belongs to it. This is known as the Gaussian mixture model (GMM). Note that the model still assumes that a data point belongs to only one cluster.

2. Model relaxation: One can also relax the hard constraint

z_{k,n} ∈ {0, 1} and ∑_{k=1}^{K} z_{k,n} = 1

and replace it with the following soft constraint:

z_{k,n} ∈ [0, 1] and ∑_{k=1}^{K} z_{k,n} = 1

This is known as semi-NMF.


Two modifications of our objective are possible:

1. Gaussian mixture models. A different clustering objective which arises from a generative probabilistic model. GMMs cannot be written in matrix factorization form. The softness refers to the probability of hypotheses and is therefore algorithmic in nature.

2. Semi non-negative matrix factorization. Relax the constraints of the K-means objective to z_{k,n} ∈ [0, ∞) and drop the constraint on the sum ∑_{k=1}^{K} z_{k,n}.

The motivation is to have a probabilistic clustering model. We have a generative model:

• Goal: Explain the observed data {x_n}_{n=1}^{N} by a probabilistic model p(x).

• We assume the parametric form of the model to be chosen a priori and look at the GMM.

• The model has parameters that need to be learned in order to explain the observed data well. For instance, the K-means centroids u_k are an example of such parameters.

4.1 Probability Basics

4.1.1 Categorical Distribution

A categorical distribution is a discrete distribution over K events that assigns probability π_k to the k-th event. There are two encodings for the random variable: index coding and vector coding.

• Index coding. The random variable takes the index of the event that occurred.

• Vector coding. The random variable is a vector of size K, and if event k occurs then the k-th element is set to one and all others to zero.

We will use the vector coding, where the sample space is the set of 1-of-K encoded random vectors z of dimension K with

∑_{k=1}^{K} z_k = 1 and z_k ∈ {0, 1}

The probability mass function is defined as

p(z | π) = ∏_{k=1}^{K} π_k^{z_k}

where π_k represents the probability of seeing element k. We have

π_k ≥ 0 and ∑_{k=1}^{K} π_k = 1

4.1.2 Gaussian Distribution

The probability density function of the Gaussian distribution is given by

p(x | µ, σ) = N(x | µ, σ) = 1/(σ √(2π)) exp( −(x − µ)^2 / (2σ^2) )


The variable µ stands for the mean of the distribution and coincides with the mode and the median. σ^2 is the variance. The probability for X ∈ [a, b] is given as

P(a < X < b) = ∫_a^b p(x) dx

4.1.3 Multivariate Gaussian Distribution

The multivariate Gaussian distribution is a generalization to higher dimensions with sample space X ⊆ R^D. A random variable X = (X_1, . . . , X_D)^T has a multivariate normal distribution if every linear combination of its components (i.e., Y = a_1 X_1 + . . . + a_D X_D) has a univariate normal distribution. Its definition is

p(x | µ, Σ) = N(x | µ, Σ) = 1/( (√(2π))^D |Σ|^{1/2} ) exp( −(1/2)(x − µ)^T Σ^{−1} (x − µ) )

4.1.4 Multidimensional Moment Statistics

The expectation is the vector of component expectations

E [X] = (E [X1] , . . . ,E [XD])T

The variance can then be generalized to the covariance:

Cov(X, Y) = ∫_X ∫_Y p(x, y)(x − µ_X)(y − µ_Y) dx dy = E_{X,Y}[(x − µ_X)(y − µ_Y)]

For random variables X_1, . . . , X_D we record the covariances in the covariance matrix Σ as Σ_{ij} = Cov(X_i, X_j). This generalizes the notion of variance to sets of random variables in multiple dimensions.

4.1.5 Conditional Probability

Bayes' rule. The conditional probability of A given B (the posterior) is given by

p(A | B) = p(B | A) p(A) / p(B)

We call p(A) the prior, p(B | A) the likelihood, and p(B) the evidence.

4.2 Introduction to Mixture Models

A mixture of K probability densities is defined as

p(x) = ∑_{k=1}^{K} π_k p(x | θ_k)

Each probability distribution p(x | θ_k) is a component of the mixture and has its own parameter θ_k. Almost any continuous density can be approximated using a sufficient number of component distributions. For a Gaussian component distribution, the parameters θ_k are given by the mean µ_k and the covariance Σ_k.

Mixture models are constructed from (1) the component distributions of the form p(x | θ_k) and (2) the mixing coefficients π_k that give the probability of each component. In order for p(x) to be a proper distribution we have to ensure that

π_k ≥ 0 and ∑_{k=1}^{K} π_k = 1

Thus, the parameters π_k define a categorical distribution which represents the probability of each component distribution.

4.3 Gaussian Mixture Models

The Gaussian mixture model (GMM) uses Gaussians as the component distributions; thus

p(x) = ∑_{k=1}^{K} π_k N(x | µ_k, Σ_k)

Given data points x_1, . . . , x_N we then aim at learning the parameters µ_k and Σ_k such that we approximate the data as closely as possible. This is equivalent to finding the parameters that maximize the likelihood of the data.

Generative viewpoint. Given the model parameters Σ, µ, and π, sample the data x_n as follows:

1. Sample a cluster index k according to the probabilities π_k.

2. Sample a data point x_n from the distribution p(x_n | µ_k, Σ_k).

Parameter estimation reverses this process and goes from the data to the parameters.

To estimate the parameters, we assume that the data points x_n are independent and identically distributed (i.i.d.). The probability or likelihood of the observed data X, given the parameters, can thus be written as

p(X | π, µ, Σ) = ∏_{n=1}^{N} p(x_n) = ∏_{n=1}^{N} ∑_{k=1}^{K} π_k N(x_n | µ_k, Σ_k)

We aim at finding the parameters that maximize the likelihood of the data:

(π̂, µ̂, Σ̂) ∈ arg max_{π,µ,Σ} p(X | π, µ, Σ)

To simplify the expression, we take the logarithm, such that the product becomes a sum:

(π̂, µ̂, Σ̂) ∈ arg max_{π,µ,Σ} ∑_{n=1}^{N} ln [ ∑_{k=1}^{K} π_k N(x_n | µ_k, Σ_k) ]

This equation does not have a closed-form analytic solution, and here we present a solution based on expectation maximization. The intuition is as follows: if we know to which clusters the data points are assigned, then computing the maximum likelihood is easy. First, we need to introduce a latent (or hidden) variable for the assignments of data points to clusters:


• We define a K-dimensional binary random variable z having a 1-of-K representation with

∑_{k=1}^{K} z_k = 1 and z_k ∈ {0, 1}

• The marginal distribution over z is specified in terms of the mixing coefficients π_k, i.e.,

p(z_k = 1) = π_k

Because z uses a 1-of-K representation, we can write this distribution in the form

p(z) = ∏_{k=1}^{K} π_k^{z_k}

Similarly, the conditional distribution of x given a particular value for z is a Gaussian,

p(x | z_k = 1) = N(x | µ_k, Σ_k)

so the conditional probability can be written as

p(x | z) = ∏_{k=1}^{K} N(x | µ_k, Σ_k)^{z_k}

We can use Bayes' rule to get the probability of assigning a data point to a cluster:

γ(z_k) := p(z_k = 1 | x) = p(z_k = 1) p(x | z_k = 1) / ∑_{j=1}^{K} p(z_j = 1) p(x | z_j = 1) = π_k N(x | µ_k, Σ_k) / ∑_{j=1}^{K} π_j N(x | µ_j, Σ_j)

4.4 Expectation Maximization for GMM

Expectation maximization is a powerful method for finding maximum likelihood solutions for models with latent variables. Setting the derivative of ln p(X | π, µ, Σ) with respect to the means µ_k to zero and multiplying by Σ_k, we obtain

µ_k = (1/N_k) ∑_{n=1}^{N} γ(z_{k,n}) x_n    where    N_k = ∑_{n=1}^{N} γ(z_{k,n})

The mean µ_k is obtained by taking a weighted mean of all points in the data set. Furthermore,

π_k = N_k / N

Informally, we first choose some initial values for the means and mixing coefficients. Then, we alternate between two updates, called the E step and the M step:

1. In the expectation step, the current values of the parameters are used to evaluate the posterior probabilities γ(z_{k,n}).


2. In the maximization step, these probabilities are used to re-estimate the means and mixing coefficients.

More formally, given a GMM, the goal is to maximize the likelihood function with respect to the parameters.

1. Initialize the means µ_k and the mixing coefficients π_k. Set the Σ_k to the given covariances.

2. E-step. Evaluate the responsibilities using the current parameter values:

γ(z_{k,n}) = π_k N(x_n | µ_k, Σ_k) / ∑_{j=1}^{K} π_j N(x_n | µ_j, Σ_j)

3. M-step. Re-estimate the parameters using the current responsibilities:

µ_k^new = (1/N_k) ∑_{n=1}^{N} γ(z_{k,n}) x_n,    π_k^new = N_k / N,    N_k = ∑_{n=1}^{N} γ(z_{k,n})

4. Evaluate the log likelihood and check for convergence of either the parameters or the log likelihood.
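A compact sketch of the E and M steps above (following the notes, the covariances Σ_k are kept fixed; the Gaussian density is evaluated with scipy's multivariate_normal, and all names and the toy data are mine):

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, mu, Sigma, pi, n_iter=50):
    """EM for a GMM. X is N x D, mu is K x D, Sigma is K x D x D, pi has length K.
    Only the means and mixing coefficients are re-estimated; Sigma_k stays fixed."""
    N, D = X.shape
    K = len(pi)
    for _ in range(n_iter):
        # E-step: responsibilities gamma(z_{k,n}), arranged as an N x K matrix.
        dens = np.stack([pi[k] * multivariate_normal.pdf(X, mean=mu[k], cov=Sigma[k])
                         for k in range(K)], axis=1)
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate means and mixing coefficients.
        Nk = gamma.sum(axis=0)
        mu = (gamma.T @ X) / Nk[:, None]
        pi = Nk / N
    return mu, pi, gamma

# Toy usage with three components and fixed unit covariances.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
mu0 = X[rng.choice(200, size=3, replace=False)]
Sigma0 = np.stack([np.eye(2)] * 3)
mu, pi, gamma = em_gmm(X, mu0, Sigma0, pi=np.ones(3) / 3)
```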

5 Sampling

The basic idea is to randomly draw configurations from the probability distribution; probable configurations appear frequently. This method is convenient, as

• solutions are in the hypothesis space,

• solutions obey the probability distribution they are sampled from, and

• it works with difficult distributions (e.g., when normalization is infeasible).

5.1 Monte Carlo Methods

The first problem is to generate samples. That is, for a given probability distribution p(x), generate samples {x^{(r)} | 1 ≤ r ≤ R}. Instead of performing an integration we can then evaluate a sum:

(1/R) ∑_{r=1}^{R} f(x^{(r)}) → ∫_Ω f(x) p(x) d^D x    as R → ∞

The second problem is expectation: estimate the expectation of a given function f(x) under this distribution:

E[f(x)] = ∫ p(x) f(x) d^D x


Figure 5: Importance Sampling.

5.1.1 Expectation from Samples

The empirical distribution is given as

p̂(x) = (1/R) ∑_{r=1}^{R} δ(x − x^{(r)})

If we can solve the sampling problem, then we can solve the expectation problem by using the empirical mean

Ê[f(x)] = ∫ p̂(x) f(x) d^D x = (1/R) ∑_{r=1}^{R} f(x^{(r)})

If {x^{(r)} | 1 ≤ r ≤ R} are generated from p(x), then the expectation of Ê[f(x)] is E[f(x)] by the law of large numbers.

5.2 Importance Sampling

Importance sampling is a method for estimating the expectation of a function f(x). Assume we can evaluate p*(x) at any given point x:

p(x) = p*(x) / Z_p

where Z_p is unknown. Assume that we cannot sample from p(x), but we have a simpler density q(x) from which we can sample. Then

q(x) = q*(x) / Z_q

is called the sampler density.

Some samples are under-represented, others are over-represented, so we correct with weights

w_r = p*(x^{(r)}) / q*(x^{(r)})

which adjust the importance of each point. The expectation is thus

Ê[f(x)] = ( ∑_r w_r f(x^{(r)}) ) / ( ∑_r w_r )
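A small sketch of this estimator; the target p* and proposal q used here are illustrative choices of mine:

```python
import numpy as np

def importance_sampling_expectation(f, p_star, q_sample, q_star, R=10_000,
                                    rng=np.random.default_rng(0)):
    """Estimate E_p[f(x)] using samples from q and weights w_r = p*(x_r) / q*(x_r)."""
    x = q_sample(rng, R)                 # R samples from the sampler density q
    w = p_star(x) / q_star(x)            # unnormalized importance weights
    return np.sum(w * f(x)) / np.sum(w)  # self-normalized estimate

# Example: target p*(x) = exp(-x^2/2) (unnormalized standard normal), proposal q = N(0, 2^2).
p_star = lambda x: np.exp(-0.5 * x**2)
q_star = lambda x: np.exp(-0.5 * (x / 2.0)**2)            # unnormalized N(0, 4)
q_sample = lambda rng, R: rng.normal(0.0, 2.0, size=R)
print(importance_sampling_expectation(lambda x: x**2, p_star,
                                      q_sample=q_sample, q_star=q_star))   # ~1.0
```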


Figure 6: Rejection Sampling.

5.3 Rejection Sampling

Rejection sampling allows us to sample from relatively complex distributions. Assume that we can evaluate p*(x) at any point x,

p(x) = p*(x) / Z_p

where Zp is unknown. Assume we know a constant c such that

c · q∗(x) > p∗(x) for all x

The algorithm then is as follows:

1. Generate x from the proposal density q(x).

2. Generate a uniformly distributed random variable u from the interval [0, c · q∗(x)].

3. If u > p∗(x) then x is rejected, otherwise it is added to our samples x(r).

Rejection sampling works best if q is a good approximation of p. If q is very different from p, then c may be very large (especially in high dimensions). Estimating the value of c may be hard, since we may not know where the modes of p* are, or how high they may be.
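A sketch of the three steps above for a one-dimensional example; the target and proposal densities are illustrative choices of mine:

```python
import numpy as np

def rejection_sampling(p_star, q_sample, q_star, c, R, rng=np.random.default_rng(0)):
    """Draw R samples from p(x) = p*(x)/Z_p using proposal q and envelope c*q*(x) >= p*(x)."""
    samples = []
    while len(samples) < R:
        x = q_sample(rng)                       # step 1: propose from q
        u = rng.uniform(0.0, c * q_star(x))     # step 2: uniform on [0, c*q*(x)]
        if u <= p_star(x):                      # step 3: accept, otherwise reject
            samples.append(x)
    return np.array(samples)

# Example: sample a standard normal (unnormalized p*) using a wider normal as proposal.
p_star = lambda x: np.exp(-0.5 * x**2)
q_star = lambda x: np.exp(-0.5 * (x / 2.0)**2)
q_sample = lambda rng: rng.normal(0.0, 2.0)
samples = rejection_sampling(p_star, q_sample, q_star, c=2.5, R=1000)
```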


6 Role Mining and Multi-Assignment Clustering

A simple way to manage access control is called discretionary access control. Assume we have N users and D permissions; then we use a binary access control matrix X ∈ {0, 1}^{D×N} where x_{di} indicates whether user i has access to resource d.

This simple idea has the problem of being tedious and expensive to maintain, and irregularities (i.e., potential errors) are very difficult to detect. Such irregularities might be special permissions that are given to someone on special duty (and need to be revoked later) or simply mistakes.

Missing permissions are of course not a problem, but if a user has unnecessary permissions, someone might cause damage accidentally or on purpose. Thus, the principle of least privilege demands that users have only the privileges needed to fulfill their job. Detecting non-regular permissions and assessing their risk increases the security of IT systems.

6.1 Role-Based Access Control

Role-based access control (RBAC) simplifies access control and is, today, the method of choice. The basic entities of RBAC are:

• USERS, ROLES, OBJ, OPS, the sets of users, roles, objects, and operations.

• PRMS ⊆ {(obj, op) | obj ∈ OBJ ∧ op ∈ OPS}, the set of permissions.

• UA ⊆ USERS × ROLES, a user-role assignment relation.

• PA ⊆ PRMS × ROLES, a permission-role assignment relation.

• UPA ⊆ USERS × PRMS, a user-permission assignment relation.

• assigned_users(R) = {u ∈ USERS | (u, R) ∈ UA}

• assigned_permissions(R) = {p ∈ PRMS | (p, R) ∈ PA}

Each role defines a set of permissions. Users are assigned a set of roles and get all permissions of these roles.

We define the boolean matrix product, denoted by ⊗, as

X = U ⊗ Z  ⟺  x_{di} = ∨_k (u_{dk} ∧ z_{ki})
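In NumPy this boolean product can be computed with an ordinary integer matrix product followed by thresholding, e.g.:

```python
import numpy as np

def bool_matmul(U, Z):
    """Boolean matrix product: x_{di} = OR_k (u_{dk} AND z_{ki})."""
    return ((U.astype(int) @ Z.astype(int)) > 0).astype(int)

U = np.array([[1, 0], [1, 1], [0, 1]])   # D x K roles
Z = np.array([[1, 0, 1], [0, 1, 1]])     # K x N assignments
X = bool_matmul(U, Z)
```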

6.2 Problem Definition

We are interested in finding good roles and role assignments based on a given user-permission matrix X. That is, given X, find U and Z such that

X ≈ U ⊗ Z

with X ∈ {0, 1}^{D×N}, U ∈ {0, 1}^{D×K}, and Z ∈ {0, 1}^{K×N}. Note that X is typically noisy, i.e., it contains permission assignments without an underlying role.

There are three different definitions of the role-mining problem:


Figure 7: The matrix factorization landscape. Role mining is a multi-assignment clustering problem with x_{di}, u_{dk} ∈ {0, 1}.

1. Min-noise approximation. Given K, find matrices U ∈ {0, 1}^{D×K} and Z ∈ {0, 1}^{K×N} as

(Û, Ẑ) = arg min_{U,Z} ‖X − U ⊗ Z‖_1

2. δ-approximation. Given δ ≥ 0, find the minimal number of roles K and matrices U and Z such that

‖X − U ⊗ Z‖_1 < δ

3. Inference. Assume that the given matrix X has an underlying structure X_S = U ⊗ Z which was altered by a noise process, X_S ∼ P(X_S; Θ_S), X ∼ P(X; X_S, Θ_N). Find the structure encoded in U and Z.

6.3 Evaluation Criteria for Boolean Matrix Decomposition

For each role mining problem definition, there is a set of evaluation criteria:

1. Matrix reconstruction. How accurately does the solution reconstruct the input matrix X?

2. Number of sources. How many sources K do you need to obtain a given accuracy?

3. Inference quality. How accurately can you infer the underlying structure?

4. Generalization. How precisely can your roles U reconstruct a new matrix X′ (with the same distribution as X)?

5. Stability. How consistent are the solutions for X′ and X?


6.3.1 Evaluating a Matrix Reconstruction

To measure the discrepancy between X and its inferred solution, we have several possibilities:

• Deviation. Compute (used to define the min-noise approximation):

(1/(N · D)) ‖X − U ⊗ Z‖_1

• Coverage. The ratio of true 1's which are reconstructed as 1's:

Cov := |{(i, j) | x_{ij} = x̂_{ij} = 1}| / |{(i, j) | x_{ij} = 1}|

• Deviating ones (zeros). The number of deviating 1's (0's) is the number of matrix elements which are wrongly reconstructed divided by the total number of 1's (0's) in X:

d_ξ := |{(i, j) | x̂_{ij} = 1 − ξ ∧ x_{ij} = ξ}| / |{(i, j) | x_{ij} = ξ}|,    ξ ∈ {0, 1}

Note that these measures have trivial optimal solutions, and if X is noisy, one might not be interested in reconstructing every single bit.

6.3.2 Evaluating Inference Quality

Three more adequate (but more involved) measures are:

• Parameter accuracy. If you know the true parameters Θ describing the roles, compare them with your estimates Θ̂.

Compute the Hamming distance d_H between the estimated roles Û and the true roles U:

d_H(U, Û) := min_{π ∈ P_K} (1/(K · D)) ∑_{k=1}^{K} ∑_{d=1}^{D} |u_{d,k} − û_{d,π(k)}|

where P_K is the set of all permutations over K elements. Note that the true sources must be given, which is usually not the case.

• Stability of the clustering solution. Cluster two matrices X, X′ ∼ P(X; Θ). Do you get similar clusterings?

In unsupervised learning, there is no fixed labelling of the clusters. Assume we have two clustering solutions K_1 and K_2 with clusters numbered 1, . . . , K, defined by the elements they contain. Also, we know the cost c(k_1, k_2) of matching cluster k_1 from K_1 to cluster k_2 from K_2. To measure the match between K_1 and K_2, we need to find the optimal mapping π* between the clusters, or more formally

π* = arg min_π ∑_k c(k, π(k))

The permutation π* can be efficiently found using the Hungarian method.


• Generalization ability. How well can you explain new entries with your estimated roles?

To measure how well the inferred set of roles generalizes to a new data matrix X′, we can select a random subset D* ⊂ {1, . . . , D} of the permissions (usually approximately half of the permissions). For every column i of X′ (i.e., for every new user):

1. Compute ẑ_{·,i}, the optimal role combination of user i:

ẑ_{·,i} := arg min_{z_{·,i}} ∑_{d ∈ D*} |x′_{d,i} − û_{d,·} ⊗ z_{·,i}|

2. Compute the Hamming distance between the predicted and the actual permissions of user i:

g_i := (1/(D − |D*|)) ∑_{d ∉ D*} |x′_{d,i} − û_{d,·} ⊗ ẑ_{·,i}|

The average over all users is the generalization ability G:

G := (1/N) ∑_i g_i

6.4 SVD for Role Mining

Naively taking the SVD of X and then rounding the left and right singular vectors to boolean values is a bad idea and does not work well. However, the SVD can be used for denoising:

1. Compute the singular value decomposition:

X = U · S · V^T

2. Discard columns K + 1, . . . , D of U to get U_(K), discard rows and columns K + 1, . . . , N of S to get S_(K), and discard rows K + 1, . . . , N of V to get V_(K).

3. Round the reconstruction X_(K) := U_(K) · S_(K) · V_(K)^T to get

X̂_(K) = (X_(K) > t_X)

where t_X is a threshold.

For this to work, we actually need three values:

• a threshold tX to be chosen depending on the data

• a hyper-parameter K, the number of dimensions

• an estimate for the reconstruction error
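A sketch of the denoising procedure above (the threshold t_X and the rank K are assumed given here; they would be chosen by cross-validation, as described in the next section):

```python
import numpy as np

def svd_denoise_boolean(X, K, t_X=0.5):
    """Rank-K SVD reconstruction of a boolean matrix X, rounded with threshold t_X."""
    U, s, Vt = np.linalg.svd(X.astype(float), full_matrices=False)
    X_K = U[:, :K] @ np.diag(s[:K]) @ Vt[:K, :]    # rank-K reconstruction
    return (X_K > t_X).astype(int)                  # round back to boolean values

X = (np.random.rand(20, 30) > 0.7).astype(int)      # toy user-permission matrix
X_hat = svd_denoise_boolean(X, K=3)
```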


6.5 Cross-Validation

For choosing the right parameters, we need three disjoint data sets. If there is only one data set available, then one needs to split it into three parts:

• The training data is used to estimate the best value of the parameters θ. For the application above, θ = t_X, given a fixed K.

• The validation data is used to determine the best value of the hyper-parameters θ_H (for the best value of θ). In our setting, θ_H = K.

• The test data is then used to estimate the reconstruction error.

On a high level, the algorithm for choosing a hyper-parameter works as follows:

1. Get three disjoint data (sub)sets.

2. For all possible values of the hyper-parameter θH :

i) Fix θ_H, and

ii) train your method on the training data to get θ*, and

iii) validate your method on the validation data.

3. Set

θ_H* = (the value of θ_H that maximized the performance on the validation data, with the respective best value for θ)

Note that the test data is not used in this entire process.

Once the optimal hyper-parameter θ_H* has been found, the following procedure can be used to evaluate the model:

1. Fix θH∗.

2. Train your method to get the optimal value θ* on the training data and the validation data.

3. Test your method with θ_H* and θ* on the test data. This is the final result of your method.

It is (implicitly) assumed that the data in the three parts has the same distribution. The division into three parts is random and reduces the amount of training data. To avoid too much dependency on this choice, one often uses several splits. This is called cross-validation.

Cross-validation allows one to estimate how well a method will perform on a new, unseen data set. This procedure avoids overfitting, i.e., a too detailed adaptation to the training data which renders the method very specific to the training data. Therefore, the test data is only considered at the very end.

6.6 Simple Approach Using a Variant of K-means

We can use the idea of K-means and adapt the algorithm to boolean data:


• Adapt the distance to the Hamming distance (0-norm):

R(U, Z) = ‖X − U Z‖_0 = ∑_{i=1}^{N} ∑_{k=1}^{K} z_{ki} ‖x_i − u_k‖_0

• Restrict the centroids u_k to boolean values and adapt the centroid update step:

u_{kd} = median({x_{dn} | z_{kn} = 1}) for all k and d

However, this approach has several problems:

• The parameter K has to be given (use cross-validation)

• K-means yields disjoint clustering, but in RBAC a user can have more than one role.

6.7 The RoleMiner Algorithm

RoleMiner is one of the first (heuristic) role mining approaches and comes in two variants: CompleteMiner and FastMiner.

The algorithms use the observation that users are described by sets of permissions, and that roles are defined as sets of permissions. The key idea is that a user's permissions are generated by a combination of roles, and thus all possible intersections of the permission sets of all users should give the underlying components (the atomic roles, so to say). For instance, for users (represented as permission sets) {1, 2, 5} and {2, 3, 5} we would have the following candidate roles: {1, 2, 5}, {2, 3, 5}, and {2, 5}.

Algorithm 1 The CompleteMiner algorithm

Input: x
1: Extract all unique permission sets (columns) x′ from the input.
2: Compute all unique set intersections between the members of x′ and add them to the set of k candidate roles u.
3: for each candidate in u do
4:     Count how many users have the permission set of the candidate as a subset.
5: end for
6: return The prioritized set of k roles u, where the counts c = (c_1, . . . , c_k) give the priority.

A variant of CompleteMiner is FastMiner, where only the set of pairwise intersections is computed. If t is the number of initial roles (i.e., the number of users with unique permission sets), then CompleteMiner has to compute 2^t − t − 1 set intersections, and FastMiner only t(t−1)/2.

Clearly, there are various problems with the RoleMiner algorithm:

• The algorithm is very sensitive to noise:

  – If only a few individual bits are noisy, the number of candidate roles gets much larger.

  – Slightly different noise leads to a completely different solution.

• No assignments of users to roles are produced; only candidates for the columns of U are given (Z is neglected).


6.8 DBPsolver

DBPsolver approximately solves the discrete basis problem.

Definition 6.1. For a given boolean matrix X ∈ {0, 1}^{D×N} and a number K of basis vectors, find a boolean matrix U ∈ {0, 1}^{D×K} and a boolean matrix Z ∈ {0, 1}^{K×N} minimizing

‖X − U ⊗ Z‖_1

where ‖·‖_1 is the L1 norm: ‖M‖_1 = ∑_{i=1}^{m} ∑_{j=1}^{n} |M_{ij}|.

Note that the discrete basis problem and the min-noise role mining problem are identical.

The idea of DBPsolver is the following. To cover most of the 1's in X with as few basis vectors as possible, candidates should have 1's at dimensions (e.g., permissions) that are frequently 1 simultaneously.

DBPsolver computes a D × D association matrix A where a_{dd′} is the pairwise association between dimensions (e.g., permissions) d and d′:

a_{dd′} := a(d ⇒ d′) = 〈x_{d,·}, x_{d′,·}〉 / 〈x_{d,·}, x_{d,·}〉

We have a_{dd′} = p(d′ = 1 | d = 1). In role mining, a_{dd′} is the empirical probability of having permission d′ conditioned on having permission d.

The algorithm now works greedily as follows:

1. Compute the association matrix A.

2. Compute A′ = (A > τ) by thresholding A with a predefined τ. Each column in A′ is a candidate basis vector. In role mining, it represents a candidate role.

3. Initialize U = 0D×K and Z = 0K×N .

4. For k = 1, . . . , K: substitute column k in U with a column t from A′ and modify z_k such that

R(X, Z, U, w^+, w^−) = w^+ · |{(i, d) | x_{di} = 1 ∧ (U^{(t)} ⊗ Z^{(t)})_{di} = 1}| − w^− · |{(i, d) | x_{di} = 0 ∧ (U^{(t)} ⊗ Z^{(t)})_{di} = 1}|

is maximized.

w^+ and w^− are predefined parameters that reward covering 1's with a 1 and penalize covering 0's with a 1.

DBPsolver has the following problems:

• The parameters τ , w+, w− and K must be chosen (cross-validation).

• Greedy optimization does not find the best combination of basis vectors, but only picks a single best candidate at every step and ignores possible combinations with other vectors.

• Finding the best matrix approximation differs from discovering structure and noise! The best fit to X might include noise (overfitting). This cannot be avoided and is intrinsic to the problem being solved (i.e., we asked the wrong question).


7 Sparse Coding

A signal and its representation are not the same thing, and there is an infinite number of possible representations.

Given a signal f and an orthonormal basis u_1, . . . , u_L, we can represent the signal by the coefficients

z_l = 〈f, u_l〉

The signal f can then be reconstructed as

f = ∑_{l=1}^{L} z_l u_l = ∑_{l=1}^{L} 〈f, u_l〉 u_l

In sparse coding, we hope to get only few non-zero coefficients. We can compress by dropping some of the coefficients and using only the basis functions of a subset σ:

f̂ = ∑_{k ∈ σ} z_k u_k

In this case, the reconstruction error is

‖f − f̂‖^2 = ∑_{k ∉ σ} |〈f, u_k〉|^2

since 〈u_k, u_l〉 = 0 for k ≠ l.

Of course, the choice of basis is critical, and there is no transform that is better than all others: it depends on the type of the signal. For instance, the Fourier basis has global support and is good for sine-like signals, but performs poorly for localized signals. On the other hand, a wavelet basis has local support but is poor for non-vanishing signals.

7.1 Comparison to Principal Component Analysis

We can compress a set of vectors X = [x_1, . . . , x_n] by performing PCA. That is, we compute the (centered) covariance matrix

Σ = (1/n) (X − M)(X − M)^T

and compute its eigenvector decomposition, giving us U and Λ. We then choose the k eigenvectors corresponding to the largest eigenvalues as U_k, and compress a given signal x as

z = U_k^T x

PCA basis: Since U_k depends on the data, we need to transmit both the basis U_k and the elements of z. However, the basis is optimal for the given covariance.

On the other hand, using a fixed basis, both parties (compressor and decompressor) can agree upon a particular basis and then transmit only the non-zero elements of the transformed signal z.


7.2 Compressive Sensing

When we gather information (e.g., in photography, MRI technology, or other applications), we do not want to collect huge amounts of data if we compress most of it away right afterwards. Instead, compressing the data while gathering it is called compressive sensing and helps to decrease acquisition time, battery requirements, and storage.

Given an original signal x ∈ R^D which is assumed to be sparse in some orthonormal basis U with K large coefficients in its sparse representation z:

x = U z such that |I_{z ≠ 0}| = K

The main idea is to acquire a set of M linear combinations y_k = 〈w_k, x〉 of the initial signal instead of the signal itself, where M ≪ D and W = [w_1, . . . , w_M]^T is the matrix of linear combination coefficients:

y_k = 〈w_k, x〉,    k = 1, . . . , M

y = W x = W U z = Θ z,    where Θ = W U

Let all matrix elements w_{i,j} be i.i.d. random variables drawn from a Gaussian probability density function with mean zero and variance 1/D. Surprisingly, regardless of the choice of the orthonormal basis U, a number of measurements

M ≥ c K log(D/K)

where c is some constant, is enough to obtain a stable solution for any K-sparse and compressible signal.

Reconstructing the initial signal of length D from M measurements means inverting the following system:

y = W x = W U z = Θ z

This problem seems to be ill-posed: since M ≪ D, we have many more unknowns than equations. We can look for the sparsest solution:

z* = arg min_z ‖z‖_0 such that Θ z = y

This can be solved using the matching pursuit algorithm.

7.3 Summary of Sparse Coding

Sparse coding works by coding via orthogonal transforms. That is, given an original signal x and an orthogonal matrix A, we compute the linear transformation (change of basis) z = A x and truncate small values, giving us ẑ. The inverse is then computed as x̂ = A^T ẑ (since A^{−1} = A^T).

To measure the performance, we can measure the reconstruction error ‖x − x̂‖ or the sparsity of the coding vector ẑ, i.e., nnz(ẑ).

7.4 Overcomplete Dictionaries

No single basis (e.g., Fourier or wavelets) is optimally sparse for all signal classes. We could use more atoms (dictionary elements) than dimensions, i.e., make use of overcompleteness (L > D).

We have two possibilities:


1. We use a union of orthogonal bases [U_1, . . . , U_B] ∈ R^{D×B·D}:

x = [U_1, . . . , U_B] · z

The coding vector z is now also larger, but we hope that it is very sparsely populated.

2. We can also use a general overcomplete dictionary with some number L > D of atoms.

When we increase the overcompleteness factor L/D, we (potentially) increase the sparsity of the coding, but also increase the linear dependency between atoms. We measure this linear dependency using the notion of coherence:

m(U) = max_{i ≠ j} |u_i^T u_j|

We have m(B) = 0 for an orthogonal basis B, and m([B, u]) ≥ 1/√D if an atom u is added to B.

7.4.1 Signal Coding

Given an overcomplete dictionary U ∈ R^{D×L}, solving the equation

x = U · z

is an ill-posed problem, as we have more unknowns than equations. We can add the constraint that we want to find the sparsest z. We solve

z* = arg min_z ‖z‖_0 such that x = U z

This is the same formulation as for compressive sensing.

7.4.2 Noisy Observations

Additive noise. Assume each dimension is independently corrupted by zero-mean Gaussian noise with variance σ^2:

x = U z + n with n_d ∼ N(0, σ^2)

Then, we want to find

z* = arg min_z ‖z‖_0 such that ‖x − U z‖_2 < σ

That is, we maximize the sparsity of z while the residual (approximation error) remains below σ.

Alternatively, we can also solve

z* = arg min_z ‖x − U z‖_2 such that ‖z‖_0 ≤ K

That is, we minimize the residual while selecting at most K atoms from the dictionary.

Approximate sparse coding. We can also code the noise in terms of U:

x = U z + n = U z + U y = U(z + y)

where y is the coding of the noise n in U. The noise cannot be sparsely coded in any dictionary; therefore y has many small coefficients. However, the few large coefficients are still due to z.

Geometry of the sparse coding solution z*. As shown in Figure 8, the orthogonal projection of x onto the subspace spanned by the selected atoms {u_d | z*_d ≠ 0} minimizes ‖x − U z‖_2.


Figure 8: Geometry of sparse coding solution z∗.

7.4.3 Matching Pursuit (MP)

We consider a greedy algorithm that approximates the NP-hard problem iteratively, each time taking the action which is short-term optimal. Applied to sparse coding, we do the following:

1. Start with the zero vector z = 0 and residual r_0 = x.

2. At each iteration t, take a step in the direction of the atom u_{d*(t)} that maximally reduces the residual ‖x − U z‖_2.

The atom selection in iteration t is done as follows:

d*(t) = arg max_d |〈r_t, u_d〉|

The objective of the algorithm is (as seen before):

z* = arg min_z ‖x − U z‖_2 such that ‖z‖_0 ≤ K

Algorithm 2 The matching pursuit algorithm

Objective: z* = arg min_z ‖x − U z‖_2, such that ‖z‖_0 ≤ K
1: z ← 0; r ← x
2: while ‖z‖_0 < K do
3:     Select the atom with the maximum absolute correlation to the residual:
           d* = arg max_d |〈r, u_d〉|
4:     Update the coefficient vector and the residual:
           z_{d*} ← z_{d*} + u_{d*}^T r
           r ← r − (u_{d*}^T r) u_{d*}
5: end while
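A direct NumPy translation of Algorithm 2; for simplicity this sketch iterates exactly K times and assumes the columns of U are unit-norm atoms (names and the toy example are mine):

```python
import numpy as np

def matching_pursuit(U, x, K):
    """Greedy K-sparse coding of x in the dictionary U (columns = unit-norm atoms)."""
    L = U.shape[1]
    z = np.zeros(L)
    r = x.copy()
    for _ in range(K):
        corr = U.T @ r                       # correlation of every atom with the residual
        d_star = np.argmax(np.abs(corr))     # atom with maximum absolute correlation
        z[d_star] += corr[d_star]            # update the coefficient vector
        r = r - corr[d_star] * U[:, d_star]  # update the residual
    return z

# Example with a random overcomplete dictionary of L = 2D unit-norm atoms.
D, L = 16, 32
U = np.random.randn(D, L)
U /= np.linalg.norm(U, axis=0)
x = U[:, 3] * 2.0 + U[:, 17] * -1.5          # a 2-sparse signal
z = matching_pursuit(U, x, K=2)
```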

7.4.4 Support Recovery of MP

Assume x = U z has a K-sparse coding

x = ∑_{d ∈ Δ} z_d u_d


Under which condition is MP successful in recovering the true support, i.e., Δ_MP = Δ, where

Δ_MP = {d | z*_d ≠ 0}

is the set of coding atoms for the MP solution z*? The exact recovery condition is

K < (1/2) (1 + 1/m(U))

where m(U) is the coherence. The intuition is that if the coherence is small, then explaining a generating atom with other atoms is not sparse; therefore, sparse coding recovers the support. We have a trade-off when increasing L: this leads to sparser coding (smaller possible K) but increases the coherence m(U).

7.4.5 Sparse Coding for Inpainting

To use sparse coding for image inpainting, we can sparse-code the known parts of the image and then predict the missing parts by reconstructing from the sparse code. An algorithm by Elad et al. (2005) works as follows:

1. Define the diagonal masking matrix M where m_{d,d} = 1 if pixel d is known and m_{d,d} = 0 if it is missing.

2. Sparse coding of the known parts in an overcomplete dictionary U:

z* = arg min_z ‖z‖_0 such that ‖M(x − U z)‖_2 < σ    (1)

3. Image reconstruction using the mask:

x̂ = M x + (I − M) U z*

The question that remains is how we can modify MP to solve Equation 1. We can use an EM-style algorithm, as follows:

1. Assume an initial sparse coding z^0 of the image.

2. Reconstruct the image using the mask:

x^1 = M x + (I − M) U z^0

3. Sparse coding of the reconstructed image x^1:

z^1 = arg min_z ‖z‖_0 such that ‖x^1 − U z‖_2 < σ

4. Iterate steps two and three t times until convergence.

5. Image reconstruction using the mask:

x̂ = M x + (I − M) U z^t


7.5 Dictionary Learning

So far, we have seen two variants:

• Fixed orthonormal basis. The advantage is that coding is efficient and can be done by matrix multiplication z = U^T x. However, the result is only sparse for a single class of signals.

• Fixed overcomplete basis. Now, sparse coding is possible for many signal classes, but finding the sparsest code requires an approximation algorithm and is problematic if the dictionary size L and the coherence m(U) are large.

Instead, we now consider learning the dictionary. This allows us to adapt the dictionary to our signal's characteristics. However, we have to solve a matrix factorization problem

X_{D×N} ≈ U_{D×L} · Z_{L×N}

subject to a sparsity constraint on Z and an atom norm constraint on U.

We have the following objective:

arg min_{U,Z} ‖X − U · Z‖_F^2

where we use the Frobenius norm ‖R‖_F^2 = ∑_{i,j} r_{i,j}^2 (sum of squared errors). This objective is not jointly convex in U and Z (local minima), but it is convex in either U or Z alone (unique minimum). We can use an iterative greedy minimization:

1. Coding step: Z^{(t+1)} = arg min_Z ‖X − U^{(t)} · Z‖_F^2, subject to Z being sparse.

2. Dictionary update step: U^{(t+1)} = arg min_U ‖X − U · Z^{(t+1)}‖_F^2, subject to ‖u_d‖_2 = 1 for all d = 1, . . . , D.

7.5.1 Coding Step

The residual is column-separable, i.e.,

‖R‖_F^2 = ∑_{i,j} r_{i,j}^2 = ∑_j ‖r_j‖_2^2

Thus, we can perform N independent sparse coding steps. For all n = 1, . . . , N:

z_n^{(t+1)} = arg min_z ‖z‖_0 such that ‖x_n − U^{(t)} z‖_2 ≤ σ · ‖x_n‖_2

7.5.2 Dictionary Update

Unfortunately, the residual for the dictionary update step, i.e.,

U^{(t+1)} = arg min_U ‖X − U · Z^{(t+1)}‖_F^2

is not separable in the atoms (columns of U). Therefore, we approximate by updating one atom at a time. For all d = 1, . . . , D:


1. Fix all atoms except u_d:

U = [u_1^{(t)}, . . . , u_d, . . . , u_D^{(t)}]

2. Isolate R_d^{(t)}, the residual that is due to atom u_d (here z_d^{(t+1)} denotes the d-th row of Z^{(t+1)}):

‖X − [u_1^{(t)}, . . . , u_d, . . . , u_D^{(t)}] · Z^{(t+1)}‖_F^2
  = ‖X − ( ∑_{e ≠ d} u_e^{(t)} (z_e^{(t+1)})^T + u_d (z_d^{(t+1)})^T )‖_F^2
  = ‖R_d^{(t)} − u_d (z_d^{(t+1)})^T‖_F^2

3. Find the u*_d that minimizes this residual, subject to ‖u*_d‖_2 = 1:

• u_d (z_d^{(t+1)})^T is an outer product, i.e., a matrix.

• Minimizing the residual

‖R_d^{(t)} − u_d (z_d^{(t+1)})^T‖_F^2

amounts to approximating R_d^{(t)} with the rank-1 matrix u_d (z_d^{(t+1)})^T.

• This is achieved by the SVD of R_d^{(t)}:

R_d^{(t)} = U S V^T = ∑_i s_i u_i v_i^T

• u*_d = u_1 is the first left-singular vector.

• ‖u*_d‖_2 = 1 is naturally satisfied.
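A sketch of a single atom update following these steps (only the atom u_d is updated here, the corresponding row of Z is left unchanged; names and the toy data are mine):

```python
import numpy as np

def update_atom(X, U, Z, d):
    """Update atom d of the dictionary U by a rank-1 approximation of its residual."""
    # Residual due to atom d: remove the contribution of all other atoms.
    R_d = X - U @ Z + np.outer(U[:, d], Z[d, :])
    # Best rank-1 approximation of R_d: the first left-singular vector (unit norm by construction).
    u_left, _, _ = np.linalg.svd(R_d, full_matrices=False)
    U[:, d] = u_left[:, 0]
    return U

# Toy usage: random data, unit-norm dictionary, and sparse-ish codes.
D, L, N = 8, 12, 50
X = np.random.randn(D, N)
U = np.random.randn(D, L); U /= np.linalg.norm(U, axis=0)
Z = np.random.randn(L, N) * (np.random.rand(L, N) > 0.8)
U = update_atom(X, U, Z, d=0)
```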

7.5.3 Initialization

The approach is sensitive to the choice of U^{(0)}; the initial candidate solution is optimized locally and greedily until no progress is possible any more. Possible choices include:

1. Random atoms. Sample u_d^{(0)} on the unit sphere:

i) Sample from a D-dimensional standard normal distribution: u_d^{(0)} ∼ N(0, I_D).

ii) Scale to unit length:

u_d^{(0)} ← u_d^{(0)} / ‖u_d^{(0)}‖_2

2. Samples from X.

i) u_d^{(0)} ← x_n, where n ∼ U(1, N) is sampled uniformly.

ii) Scale to unit length.

3. Use a fixed overcomplete dictionary, e.g., an overcomplete DCT.
