

IEEE TRANSACTIONS ON EMERGING TOPICS IN COMPUTATIONAL INTELLIGENCE, VOL. 2, NO. 6, DECEMBER 2018

Learning Latent Features With Infinite Nonnegative Binary Matrix Trifactorization

Xi Yang, Kaizhu Huang, Rui Zhang, and Amir Hussain

Abstract—Nonnegative matrix factorization (NMF) has been widely exploited in many computational intelligence and pattern recognition problems. In particular, it can be used to extract latent features from data. However, previous NMF models often assume a fixed number of features, which is normally tuned and searched using a trial-and-error approach. Learning binary features is also difficult, since a binary matrix poses a more challenging optimization problem. In this paper, we propose a new Bayesian model, termed the infinite nonnegative binary matrix trifactorization (iNBMT) model. It can automatically learn both latent binary features and the number of features, based on the Indian buffet process (IBP). It exploits a trifactorization process that decomposes the nonnegative matrix into a product of three components: two binary matrices and a nonnegative real matrix. In contrast to traditional bifactorization, trifactorization can better reveal latent structures among samples and features. Specifically, an IBP prior is imposed on the two infinite binary matrices, while a truncated Gaussian distribution is assumed on the weight matrix. To optimize the model, we develop a modified variational-Bayesian algorithm, with iteration complexity one order lower than the recently proposed maximization-expectation IBP model [1] and the correlated IBP-IBP model [2]. A series of simulation experiments are carried out, both qualitatively and quantitatively, using benchmark feature extraction, reconstruction, and clustering tasks. Comparative results show that our proposed iNBMT model significantly outperforms state-of-the-art algorithms on a range of synthetic and real-world data. The new Bayesian model can thus serve as a benchmark technique for the computational intelligence research community.

Index Terms—Infinite non-negative binary matrix tri-factorization, infinite latent feature model, Indian buffet process prior.

Manuscript received August 13, 2017; revised October 23, 2017 and December 13, 2017; accepted January 11, 2018. Date of publication March 21, 2018; date of current version November 21, 2018. This work was supported in part by the National Natural Science Foundation of China under Grant 61473236, in part by the Natural Science Fund for Colleges and Universities in Jiangsu Province under Grant 17KJD520010, in part by the Suzhou Science and Technology Program under Grants SYG201712 and SZS201613, in part by the Jiangsu University Natural Science Research Programme under Grant 17KJB520041, in part by the Key Program Special Fund in XJTLU (KSF-A-01), and in part by the UK Engineering and Physical Sciences Research Council under Grant EP/M026981/1. (Corresponding author: Kaizhu Huang.)

X. Yang, K. Huang, and R. Zhang are with Xi'an Jiaotong-Liverpool University, Suzhou, Jiangsu 215123, China (e-mail: [email protected]; [email protected]; [email protected]).

A. Hussain is with the Division of Computing Science & Maths, School of Natural Sciences, University of Stirling, Stirling FK9 4LA, U.K. (e-mail: [email protected]).

Digital Object Identifier 10.1109/TETCI.2018.2806934

I. INTRODUCTION

NON-NEGATIVE matrix factorization (NMF) is a well-known matrix decomposition technique that has been widely applied in data analysis, computational intelligence, and machine learning [3]. Essentially, NMF decomposes the observation data to learn base vectors, which carry information about the latent features of the object. The human brain also recognises objects based on high-level latent features and structures; in other words, NMF can be exploited to reveal latent structures from observations, similar to the brain's cognitive capability [4]. Traditional NMF algorithms are designed for feature extraction or one-sided clustering. For instance, Ding et al. [5] proposed a relaxed K-means clustering approach based on NMF. In many real-world applications, however, one is interested in addressing objects of multiple types with features of much richer structure, also termed dyadic data analysis [6]. To cope with such data, it becomes important to learn the interaction among features and to leverage the interrelation between data and features. This is particularly the case in co-clustering or collaborative filtering [7]–[9]. In such cases, NMF with two factors can be restrictive and often provides a poor low-rank matrix approximation. A third factor is thus needed to absorb the additional scale.

To this end, an emerging technique based on Non-negative Matrix Tri-factorization (NMTF) has recently gained much attention [10]. For instance, Zhang et al. [11] introduced NMTF for biclustering structures, whilst Wang et al. [12] developed a Fast NMTF (FNMTF) method for large-scale data co-clustering. More recently, Wang et al. [13] proposed a Penalty NMTF (PNMT) based approach by introducing three penalty constraints. However, in all these feature-based models, a fixed number of latent features or clusters is generally assumed, and obtaining good results usually requires this number to be tuned or searched by trial and error. Further, in practice, factor matrices are often required in binary form, since binary features are cheaper to compute and more compact to store. Binary features also appear naturally in various types of data, such as binary images or word-occurrence counts in articles [14], [15]. In this scenario, effective NMF or NMTF formulations become more challenging, since binary matrices pose more demanding optimization problems.

To address the above problems, in this paper we extend the standard NMF to learn binary features with a novel Bayesian model, termed infinite non-negative binary matrix tri-factorization (iNBMT). In contrast to the traditional NMF, the novel iNBMT model can automatically select an optimal feature set from infinitely many latent features, by applying the Indian Buffet Process (IBP) prior to the factor matrices. Further, we decompose the input sample matrix Y into triple matrix factors, i.e., Y = ZWX^T, where Z and X are two binary matrices and the non-negative matrix W can be considered a weight matrix. In comparison with the bi-factorization typically used in NMF, tri-factorization can better capture latent features and reveal hidden structures underlying the samples [10]. For illustration, we plot our proposed iNBMT method as a graphical model in Fig. 1(b). Compared to the basic IBP-based model, which involves two factors (bi-factorization) Z and W in Fig. 1(a), the proposed model employs one more binary matrix X with an IBP prior. The binary matrix Z can be considered to learn latent features, whilst the other binary matrix X extracts hidden structures of objects. Furthermore, the interrelation is leveraged by a non-negative weight matrix W, which is also used to adjust the intensity of the features. Another well-known IBP-based method, termed correlated IBP-IBP (or IBP-IBP), is shown in Fig. 1(c). This model is also a tri-factorization based method, which reveals relationships between categories and features by defining a category assignment matrix U and a set of category-feature relations V. However, this approach requires a number of constraints and is consequently less flexible. More details are outlined in Section IV.

Fig. 1. Comparison of different models. Y is the observed data, and W is the real matrix (or non-negative matrix in the iNBMT and the ME-IBP). Z and X are binary matrices (similar to U and V in the IBP-IBP). θ, α, λ, μ, σ are fixed parameters. (a) ME-IBP: Bi-factorization with a non-negative matrix and a binary matrix. (b) Our proposed iNBMT method: Tri-factorization with a non-negative real matrix and two binary matrices. (c) IBP-IBP: Hierarchical structure with a real matrix and two binary matrices (with constraints).

Approximation of the posterior in IBP is usually realised using Gibbs sampling. For large-scale matrices, however, variational methods can attain better performance than Gibbs sampling, because the samplers used in the latter often suffer from mixing problems as the dimensionality grows [16]. Further, uncollapsed Gibbs samplers can easily get stuck in local optima. To circumvent this, we propose an efficient, modified variational-Bayes (VB) algorithm to fit the massive matrix decomposition, which can be thought of as a maximization-expectation (ME) algorithm [17]. More importantly, the time complexity of our proposed ME algorithm is proved to be one order lower than that of other state-of-the-art models, such as the Maximization-Expectation IBP (ME-IBP) [1] and the IBP-IBP model [2].

There are a number of other related methods reported in the literature for learning latent binary features. For example, Zhang et al. extended the standard NMF to Binary Matrix Factorization (BMF) for producing biclustering structures [11]; the correlated IBP-IBP model [2] is also able to generate binary features. However, both models are limited. The BMF requires the input data to be strictly binary, which is too strong an assumption in real cases. The IBP-IBP model [2] enforces a product of two binary matrices to be binary, an assumption that is invalid in general. Another recently-proposed model, the ME-IBP [1], is a bi-factorization approach that does not consider relational entities.

In summary, the contributions of this paper can be outlined below:

• An IBP based infinite NMF model is proposed, which is able to automatically determine an optimal number of features.

• The proposed model offers tri-factorization and is able to deliver latent binary features, addressing a more challenging optimization problem.

• No extra constraints are enforced, presenting more appealing features when compared with other models used to generate binary features, such as the BMF model [11] and the IBP-IBP model [2].

• The iteration complexity is one order lower than that of competitive IBP-based models.

• The proposed model outperforms state-of-the-art methods (such as the ME-IBP, correlated IBP-IBP, and PNMT) on both synthetic and real data.

• This paper is an extension of [18], with a reorganised and detailed method formulation, redesigned experiments, application to additional benchmark datasets, and comparative evaluation against more state-of-the-art methods.

The rest of this paper is organized as follows. The Indian buffet process and the maximization-expectation algorithm are introduced as preliminaries in Section II. A detailed description of our proposed iNBMT model is presented in Section III. In Section IV, related work is briefly reviewed; in particular, the difference between our proposed method and other relevant work is emphasised. In Section V, the experimental setup is presented, including visualization and quantitative results on seven datasets: one synthetic dataset and six real-world datasets. A complexity analysis is carried out and compared with two related IBP-based methods in Section VI. Finally, concluding remarks, limitations, and future work suggestions are presented in Section VII.

II. BACKGROUND

This section presents an overview of latent feature learning through matrix factorization, the infinite IBP prior and the ME algorithm, as well as the notation used throughout the paper.


A. Learning Latent Features via Matrix Factorization

A latent feature model refers to a model where each object is associated with a subset of possible latent features. In this family of methods, each latent feature can influence the observations, and multiple latent features can be activated simultaneously. In a probabilistic latent feature model, the marginal distribution is written as follows:

p(\mathbf{Y}) = \sum_{\mathbf{F}} p(\mathbf{Y}, \mathbf{F}) = \sum_{\mathbf{F}} p(\mathbf{Y} \mid \mathbf{F})\, p(\mathbf{F}), \qquad (1)

where Y represents the real-valued observed objects, the matrix F indicates the latent features, p(Y|F) determines the probability of the objects conditioned on these features, and p(F) is the probability that objects are associated with each feature. In methods based on matrix factorization, the latent features F are decomposed as F = Z ⊗ W, where ⊗ denotes the element-wise product, Z determines which features are assigned to each object, and W indicates the value of each latent feature for each object. By specifying priors for Z and W, the prior on F can be defined as p(F) = p(Z)p(W) when these two components are independent. The focus therefore becomes how to identify a prior on Z, and hence determine the effective dimensionality of the latent features. The IBP is often used as the prior on infinite binary matrices; an IBP prior on Z allows the number of features to be taken to the infinite limit. In the next subsection, we introduce the IBP and the IBP prior.
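To make this decomposition concrete, the following minimal sketch (our own illustration; the sizes and values are arbitrary and not taken from the paper) builds F = Z ⊗ W with numpy: the binary Z switches features on or off per object, W carries their values, and several features can be active for the same object at once.

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 5, 3                                   # 5 objects, 3 latent features
Z = (rng.random((N, K)) < 0.5).astype(float)  # which features each object possesses
W = rng.normal(size=(N, K))                   # value of each feature for each object
F = Z * W                                     # element-wise product: active features keep their values
print(Z[0], F[0])                             # object 0 may have several features active simultaneously
```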

B. Indian Buffet Process

The IBP is a sequential process that defines an infinitely exchangeable distribution over binary matrices [19]. It can be described as N customers entering, one after another, a buffet restaurant that offers infinitely many dishes. The first customer takes a Poisson(α) number of dishes. The ith customer chooses dish k with probability m_k / i, where m_k is the number of customers who have previously selected that dish, and then takes Poisson(α/i) new dishes after reaching the end of all previously sampled dishes. Note that the order of the customers is exchangeable. This process reveals that the IBP assumes an unbounded number of features, while the observed objects represent only a finite subset of them. The IBP can thus be considered a prior for the infinite binary matrices defined in these models, and is typically used to infer the number of latent features each observation possesses. We assume the observed objects (N objects with D attributes) Y ∈ R^{N×D} are generated by a linear combination of an assignment matrix Z ∈ R^{N×K} and a matrix W ∈ R^{K×D} (containing K latent factors). The latent feature model is then portrayed as Y = ZW + ε, where ε is zero-mean, independently distributed Gaussian noise.

Assume an element of the binary matrix z_{nk} = 1 indicates that object n has latent feature k, k = 1, ..., K. Each column of the binary feature matrix Z is assumed to follow an IBP prior, derived by independently placing Beta priors on Bernoulli distributions: feature k is assigned to each object with probability π_k, generated independently from a Bernoulli distribution, and the bias π_k is independently generated by a Beta prior over each column:¹

\pi_k \mid \alpha \sim \mathrm{Beta}(\alpha/K, 1), \qquad z_{nk} \mid \pi_k \sim \mathrm{Bernoulli}(\pi_k),

p([\mathbf{Z}]) = \prod_{k=1}^{K} \frac{\frac{\alpha}{K}\,\Gamma\!\left(m_k + \frac{\alpha}{K}\right)\,\Gamma(N - m_k + 1)}{\Gamma\!\left(N + 1 + \frac{\alpha}{K}\right)}. \qquad (2)

¹A Beta prior is placed on π_k with shape parameter α/K for all k and scale parameter 1.

Here Γ denotes the gamma function, α is the IBP strength parameter, and m_k = \sum_{n=1}^{N} z_{nk} counts the number of objects possessing feature k.

For infinite models, several classic matrix factorization models have been developed as IBP-inspired infinite-limit versions, for instance, infinite ICA models [20]. Intuitively, an infinite limit implies that the probability of Z is specified over infinitely many classes. To this end, Griffiths et al. made the number of attributes unbounded by defining equivalence classes over binary matrices, in order to take the IBP prior to the infinite limit [21]. More importantly, since customers and dishes are exchangeable under the IBP, equivalence classes can be used to permute the order of the columns and eliminate all null columns. Consequently, the number of active features K_+ is learned from the data: K is unbounded, yet K_+ remains finite with probability one. By rearranging the non-zero columns of Z, we can take K → ∞ and modify Eq. (2) as follows:

p([\mathbf{Z}]) = \frac{\alpha^{K_+}}{\prod_{h>0} K_h!}\, e^{-\alpha H_N} \prod_{k=1}^{K_+} \frac{(N - m_k)!\,(m_k - 1)!}{N!}, \qquad (3)

where H_N = \sum_{j=1}^{N} \frac{1}{j} denotes the Nth harmonic number and K_h represents the number of columns sharing the same non-zero pattern (history) h. Moreover, K_h and m_k are both independent of the ordering of the objects, which proves that p([Z]) is an infinitely exchangeable distribution. In line with this property, the IBP has been shown to be useful for binary factor analysis, such as modeling protein interactions and similarity judgments [22], [23]. It has also been applied in other fields such as choice behavior modeling, link prediction, and dictionary learning for correlated observations [24]–[27].
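The restaurant metaphor above translates directly into a sampler for the IBP prior. The sketch below (our own, assuming numpy; the function name sample_ibp and its arguments are hypothetical, not from the paper) generates a binary matrix Z whose number of columns is unbounded a priori yet finite with probability one.

```python
import numpy as np

def sample_ibp(N, alpha, rng=None):
    rng = np.random.default_rng(rng)
    Z = np.zeros((N, 0), dtype=int)
    m = np.zeros(0)                            # m_k: how many customers took dish k
    for i in range(1, N + 1):
        # existing dishes: customer i takes dish k with probability m_k / i
        if Z.shape[1] > 0:
            taken = rng.random(Z.shape[1]) < m / i
            Z[i - 1, taken] = 1
            m[taken] += 1
        # new dishes: Poisson(alpha / i) previously untried dishes
        k_new = rng.poisson(alpha / i)
        if k_new > 0:
            Z = np.hstack([Z, np.zeros((N, k_new), dtype=int)])
            Z[i - 1, -k_new:] = 1
            m = np.concatenate([m, np.ones(k_new)])
    return Z

Z = sample_ibp(N=10, alpha=2.0, rng=0)
print(Z.shape)   # (10, K_+), with K_+ varying from run to run
```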

C. Maximization-Expectation Algorithm

The variational Bayesian (VB) paradigm, which forms the basis of our proposed algorithm, has the ability to automatically select an optimal number of clusters from observations [28], [29]. The approximation process is an Expectation-Maximization (EM)-like method, alternating between estimates of the cluster assignments and of the stochastic parameters. Kurihara et al. further modified VB clustering with a fast implementation, termed the Maximization-Expectation (ME) algorithm [17]. The ME algorithm simply reverses the roles of the two steps in the classical EM algorithm, maximizing over the hidden variables and marginalizing over the random parameters. We consider a probabilistic model p(Y, Z, W) with observations Y and hidden random variables Z and W. To carry out approximate Maximum-a-Posteriori (MAP) inference, it is necessary to compute posterior or marginal probabilities such as p(Z|Y), p(W|Y), or p(Y). In the Mean-Field Variational Bayes (MFVB) approximation, the posterior p(Z, W|Y) cannot be computed analytically. Therefore, by assuming independent variational distributions, the posterior is factorized into variational distributions, p(Z, W|Y) ≈ q(Z)q(W) [30]. These are then iteratively updated as follows:

q(\mathbf{Z}) \propto \exp\!\big(\mathbb{E}_{q(\mathbf{W})}[\ln p(\mathbf{Z},\mathbf{W}\mid\mathbf{Y})]\big) \;\leftrightarrow\; q(\mathbf{W}) \propto \exp\!\big(\mathbb{E}_{q(\mathbf{Z})}[\ln p(\mathbf{Z},\mathbf{W}\mid\mathbf{Y})]\big).

Here, the symbol ↔ indicates that q(Z) and q(W) are updated alternately. The VB approximation is based on minimizing the Kullback-Leibler (KL) divergence KL[q(Z)q(W) ‖ p(Z, W|Y)]. As a special case of MFVB, the ME algorithm maximizes over the latent variables and then takes expectations over the parameters. The closed-form updates are as follows:

q(\mathbf{W}) = p(\mathbf{W} \mid \mathbf{Z}^*, \mathbf{Y}) \;\leftrightarrow\; \mathbf{Z}^* = \arg\max_{\mathbf{Z}} \mathbb{E}_{q(\mathbf{W})}[\ln p(\mathbf{Z},\mathbf{W}\mid\mathbf{Y})]. \qquad (4)
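The following is a minimal sketch (our own illustration, not the authors' code) of the alternation in Eq. (4) for the simple bi-factorization model Y ≈ ZW + ε: the expectation step keeps the Gaussian full conditional of W given the current Z*, and the maximization step updates Z* by greedy bit flips; the prior term on Z is omitted here for brevity.

```python
import numpy as np

def me_step(Y, Z, sigma_y=1.0, sigma_w=1.0):
    N, D = Y.shape
    K = Z.shape[1]
    # Expectation step: q(W) = p(W | Z*, Y) is Gaussian; its mean is a ridge solution.
    A = Z.T @ Z + (sigma_y ** 2 / sigma_w ** 2) * np.eye(K)
    W_mean = np.linalg.solve(A, Z.T @ Y)
    # Maximization step: greedily flip bits of Z to increase the expected log-joint
    # (only the Gaussian likelihood term is used in this sketch).
    def score(Zc):
        R = Y - Zc @ W_mean
        return -0.5 * np.sum(R * R) / sigma_y ** 2
    for n in range(N):
        for k in range(K):
            before = score(Z)
            Z[n, k] = 1 - Z[n, k]
            if score(Z) < before:          # revert the flip if it did not help
                Z[n, k] = 1 - Z[n, k]
    return Z, W_mean

# toy usage
rng = np.random.default_rng(0)
Y = rng.normal(size=(20, 8))
Z = rng.integers(0, 2, size=(20, 3)).astype(float)
for _ in range(5):
    Z, W = me_step(Y, Z)
```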

III. INFINITE NON-NEGATIVE BINARY MATRIX TRI-FACTORIZATION

In this section, we present our proposed iNBMT model, which exploits IBP priors to associate an item and its attributes with more than one cluster. We first describe the model and its formulation, and then show how to employ a modified, efficient ME algorithm to perform approximate MAP inference.

A. Model Description

The iNBMT model is applied to real-valued observation data Y ∈ R^{N×D} with exchangeable rows and columns. Considering the probabilistic latent feature model in Eq. (1), our focus is on the latent features p(F), the difference being that they are further decomposed into three components, F = Z ⊗ W ⊗ X. In this feature-based model, for the case of L latent features, X is an L × D binary feature matrix. Furthermore, the K-dimensional binary vector z_i denotes the feature vector corresponding to entity i, and the K × L interactive weight matrix W represents the primary parameters. Assuming the three components of F are independent, the prior on the features is defined by p(F) = p(Z)p(W)p(X). In Fig. 2, the iNBMT model is illustrated pictorially: the observations Y are represented by ZWX^T through a fixed observation distribution f(·). This process is equivalent to a factorization or approximation of the data,

\mathbf{Y} \mid \mathbf{Z}, \mathbf{W}, \mathbf{X} \sim f(\mathbf{Z}\mathbf{W}\mathbf{X}^T, \boldsymbol{\theta}),

where θ are hyperparameters specific to the model variant.

Fig. 2. Representation of the iNBMT model. The process f(·) is applied to the linear inner product of the three components. Here Z and X are infinite binary matrices, and W represents a non-negative matrix.

Our focus is to learn the latent features automatically by placing Bayesian non-parametric priors on the binary matrices. Unlike standard matrix factorization methods, our tri-factorization method does not need to place an upper bound on the number of features, or on the number of clusters. In infinite models, both binary matrices Z and X are assumed to have an unbounded number of columns. Specifically, IBP priors are imposed on the infinite binary matrices, with the property that a finite number of entities will possess only a finite number of non-zero features with probability one; thus, a binary matrix always has a positive probability under the IBP prior. Our basic generative model can then be written as:

\mathbf{Z} \sim \mathrm{IBP}(\alpha), \qquad \mathbf{X} \sim \mathrm{IBP}(\lambda), \qquad \mathbf{W} \sim \mathcal{F}(\mathbf{W}; \mu, \sigma_W^2).

Here, any non-negative prior F (e.g., exponential or truncated Gaussian) can be assumed on the weight matrix W. The IBP prior is stated in Eq. (2), and the hyperparameters θ are given conjugate gamma priors for inference. In our iNBMT model, each object possesses multiple latent features, and each latent feature is also assigned to numerous latent classes through the two IBP priors. Moreover, both latent features and latent classes are associated with a distribution over attributes. Thus, our proposed method can be used to discover the fundamental hidden structures in complex data.
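As a quick illustration of this generative process, the sketch below (our own, with arbitrary finite sizes; in the full model Z and X would be drawn from the IBP rather than fixed-size Bernoulli matrices) generates data as Y = ZWX^T plus Gaussian noise, with half-normal (zero-mean truncated Gaussian) weights.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, K, L = 50, 30, 4, 3                        # finite sizes for illustration only
Z = (rng.random((N, K)) < 0.3).astype(float)     # in the full model: Z ~ IBP(alpha)
X = (rng.random((D, L)) < 0.3).astype(float)     # in the full model: X ~ IBP(lambda)
W = np.abs(rng.normal(0.0, 1.0, size=(K, L)))    # half-normal, i.e. zero-mean Gaussian truncated at 0
Y = Z @ W @ X.T + rng.normal(0.0, 0.8, size=(N, D))   # linear-Gaussian observation f(.)
print(Y.shape)                                   # (50, 30)
```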

B. Linear-Gaussian iNBMT Model: Formulation

In this section, we adopt a linear-Gaussian observation distribution, with mean ZWX^T and covariance (1/θ)I, for capturing the latent features. The Gaussian distribution of Y given A = {Z, W, X} and σ_Y^2 is

p(\mathbf{Y} \mid \mathcal{A}, \sigma_Y^2) = \frac{1}{(2\pi\sigma_Y^2)^{ND/2}} \exp\!\left(-\frac{\operatorname{tr}\!\big((\mathbf{Y} - \mathbb{E}[\mathbf{Y}])^T(\mathbf{Y} - \mathbb{E}[\mathbf{Y}])\big)}{2\sigma_Y^2}\right),

where E[Y] = ZWX^T. The linear-Gaussian iNBMT model can be considered a two-sided version of the linear-Gaussian model. A truncated Gaussian (TN) prior is placed on the non-negative interactive matrix W, with mean zero and variance σ_W^2:

p(\mathbf{W} \mid 0, \sigma_W^2) = \prod_{k=1}^{K} \prod_{l=1}^{L} \mathcal{TN}(w_{kl}; 0, \sigma_W^2).

According to Eq. (3), the marginal probabilities p([Z]) and p([X]) are specified with the infinite IBP prior:

p(\mathbf{Z} \mid \alpha) = \frac{\alpha^{K_+}}{K_+!}\, e^{-\alpha H_N} \prod_{k=1}^{K_+} \frac{(N - m_k)!\,(m_k - 1)!}{N!}.

Here, m_k = \sum_{n=1}^{N} z_{nk}, and p(X|λ) follows the same formula with the substitutions N ← D, k ← l (l = 1, ..., L), and m_k ← s_l = \sum_{d=1}^{D} x_{dl} (← denotes a substitution). From Bayes' theorem, the joint likelihood can be written as

p(\mathbf{Y}, \mathcal{A} \mid \boldsymbol{\theta}) = p(\mathbf{Y} \mid \mathcal{A}, \sigma_Y^2)\, p(\mathbf{W} \mid 0, \sigma_W^2)\, p(\mathbf{Z} \mid \alpha)\, p(\mathbf{X} \mid \lambda).

We assume the hyperparameters θ = {α, λ, σ_Y, σ_W} are estimated from the data. By placing conjugate gamma hyper-priors on these parameters, we can employ a straightforward extension to infer their values [29]. Formally,²

p(\alpha) \sim \mathcal{G}(a_\alpha, b_\alpha), \quad p(\lambda) \sim \mathcal{G}(a_\lambda, b_\lambda), \quad p(\sigma_Y) \sim \mathcal{IG}(a_{\sigma_Y}, b_{\sigma_Y}), \quad p(\sigma_W) \sim \mathcal{IG}(a_{\sigma_W}, b_{\sigma_W}).

In the above, G denotes the Gamma prior and IG the inverse Gamma prior.

²A gamma-distributed random variable X ∼ Γ(α, β) ≡ Gamma(α, β) has shape parameter α and rate parameter β.
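For concreteness, a small sketch (our own, with assumed array shapes) of the two main terms of this joint, the linear-Gaussian log-likelihood and the zero-mean truncated-Gaussian (half-normal) log-prior on W, is given below; the IBP terms p(Z|α) and p(X|λ) are omitted.

```python
import numpy as np

def gaussian_loglik(Y, Z, W, X, sigma_y):
    # ln p(Y | Z, W, X, sigma_y^2) for the linear-Gaussian observation model
    N, D = Y.shape
    R = Y - Z @ W @ X.T
    return (-0.5 * N * D * np.log(2 * np.pi * sigma_y ** 2)
            - 0.5 * np.trace(R.T @ R) / sigma_y ** 2)

def truncnorm_logprior(W, sigma_w):
    # ln p(W | 0, sigma_w^2): product of zero-mean Gaussians truncated to [0, inf),
    # i.e. half-normal densities 2 * N(w; 0, sigma_w^2) for w >= 0
    return np.sum(np.log(2.0) - 0.5 * np.log(2 * np.pi * sigma_w ** 2)
                  - 0.5 * W ** 2 / sigma_w ** 2)
```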

C. Linear-Gaussian iNBMT Model: Variational Inference Procedure

Here, the variational inference procedure is presented for the linear-Gaussian iNBMT model. Consider a model with observations Y, hidden variables A = {Z, W, X}, and hyperparameters θ. Inference works with the log-marginal likelihood of the observations,

\log p(\mathbf{Y} \mid \boldsymbol{\theta}) = \log \int p(\mathbf{Y} \mid \mathcal{A}, \boldsymbol{\theta})\, d\mathcal{A}.

However, the log-marginal probability is difficult to compute, which implies that the true log-posterior is also intractable. In order to approximate the true posterior, a mean-field variational method is developed with a variational distribution q_ν(A) (where ν is the variational parameter). Inference is then performed on the variational distribution by optimizing the KL divergence; in particular, the aim is to minimize the KL divergence D(q ‖ p) between q_ν(A) and p(A|Y, θ):

D(q \,\|\, p) = \log p(\mathbf{Y} \mid \boldsymbol{\theta}) + \mathbb{E}_q[\log q_\nu(\mathcal{A})] - \mathbb{E}_q[\log p(\mathbf{Y}, \mathcal{A} \mid \boldsymbol{\theta})].

Equivalently, this amounts to maximizing a lower bound on the log-marginal likelihood:

\ln p(\mathbf{Y} \mid \boldsymbol{\theta}) = \mathbb{E}_q[\log p(\mathbf{Y}, \mathcal{A} \mid \boldsymbol{\theta})] - \mathbb{E}_q[\log q_\nu(\mathcal{A})] + D(q \,\|\, p) \;\geq\; \mathbb{E}_q[\ln p(\mathbf{Y}, \mathbf{Z}, \mathbf{W}, \mathbf{X} \mid \boldsymbol{\theta})] + H[q] \equiv \mathcal{T}, \qquad (5)

where H[q] is the entropy of q. Importantly, the approximate MAP inference is derived from the ME framework by following Eq. (4), i.e., maximizing over the latent variables and taking expectations over the variational parameters. In the linear-Gaussian iNBMT model, the evidence lower bound T is written as

\mathcal{T} \equiv \frac{1}{\sigma_Y^2}\left[-\frac{1}{2}(\mathbf{Z}\mathbb{E}[\mathbf{W}]\mathbf{X}^T)(\mathbf{Z}\mathbb{E}[\mathbf{W}]\mathbf{X}^T)^T + \mathbf{Z}(\mathbb{E}[\mathbf{W}]\mathbf{Y}^T + \mathbf{Z}\gamma)\mathbf{X}^T\right] + \log p(\mathbf{Z} \mid \alpha) + \log p(\mathbf{X} \mid \lambda) + \sum_{k=1}^{K_+}\sum_{l=1}^{L_+} \varphi_{kl} + \mathrm{const}, \qquad (6)

with

\gamma = \frac{1}{2} \sum_{k=1}^{K} \sum_{l=1}^{L} \big[\mathbb{E}[w_{kl}]^2 - \mathbb{E}[w_{kl}^2]\big]^T,

\varphi_{kl} = -\frac{KL}{2} \ln\!\left(\frac{\pi \sigma_W^2}{2}\right) - \frac{\mathbb{E}[w_{kl}^2]}{2\sigma_W^2} + H(q(w_{kl})).

Here E[W] denotes the matrix with elements E[w_{kl}]. The updating of the variational parameters is described in the next subsection.

D. Linear-Gaussian iNBMT Model: Parameter Updating

Our proposed modified ME algorithm uses the VB evidence to iteratively optimise a lower bound on the model marginal likelihood. This optimization algorithm is particularly tailored to tri-factorization, which has not been exploited previously in the literature. Specifically, the major improvement over the previous ME algorithm is that we design two loops in each iteration, with guaranteed convergence, which enables the algorithm to update two different binary matrices; the previous ME algorithm can only be used for bi-factorization. In addition, compared with the popular Gibbs-sampler optimization algorithms [21], our proposed algorithm uses the VB evidence for inference, which is theoretically more rigorous than Gibbs-sampler based algorithms and usually leads to better performance [16]. Algorithm 1 enumerates the steps of our proposed modified ME algorithm: the parameters associated with the two infinite variables Z and X are updated in turn. For completeness, the VB update equations are provided in the Appendix.

Algorithm 1: Parameter Updating.
Initial: Give upper bounds of K and L. Randomly generate the N × K binary matrix Z, the L × D binary matrix X, and the K × L non-negative matrix W.
Repeat
  for i = 1 to N do
    Optimize the binary matrix Z (Appendix A), reducing K by keeping Z in left-ordered form.
    Resize and update the matrix W based on the resized Z (Appendix B).
  end for
  for j = 1 to D do
    Optimize the binary matrix X (Appendix A), reducing L by keeping X in left-ordered form.
    Resize and update the matrix W based on the resized X (Appendix B).
  end for
  Calculate the log-likelihood LogL.
Until |LogL_new − LogL_old| < threshold.
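The structure of this two-loop alternation can be sketched as follows (our own simplified stand-in, not the authors' implementation: squared-error greedy bit flips and a ridge-plus-clipping step replace the actual VB updates of the appendices, and the automatic pruning of empty columns is omitted).

```python
import numpy as np

def update_W(Y, Z, X, ridge=1e-2):
    # Regularized least-squares update of the nonnegative weights W (K x L):
    # solve (X^T X kron Z^T Z + ridge I) vec(W) = vec(Z^T Y X), then clip to >= 0
    # as a crude surrogate for the truncated-Gaussian posterior mean.
    K, L = Z.shape[1], X.shape[1]
    A = np.kron(X.T @ X, Z.T @ Z) + ridge * np.eye(K * L)
    b = (Z.T @ Y @ X).reshape(K * L, order="F")
    W = np.linalg.solve(A, b).reshape((K, L), order="F")
    return np.clip(W, 0.0, None)

def greedy_binary_update(Y, B, M):
    # Flip each bit of the binary factor B if it lowers ||Y - B M||^2.
    for n in range(B.shape[0]):
        for k in range(B.shape[1]):
            err = np.sum((Y - B @ M) ** 2)
            B[n, k] = 1 - B[n, k]
            if np.sum((Y - B @ M) ** 2) > err:
                B[n, k] = 1 - B[n, k]      # revert the flip if it did not help
    return B

def inbmt_fit(Y, K, L, n_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    N, D = Y.shape
    Z = rng.integers(0, 2, size=(N, K)).astype(float)
    X = rng.integers(0, 2, size=(D, L)).astype(float)
    for _ in range(n_iter):
        W = update_W(Y, Z, X)
        Z = greedy_binary_update(Y, Z, W @ X.T)       # loop over the N objects
        X = greedy_binary_update(Y.T, X, W.T @ Z.T)   # loop over the D attributes
    return Z, update_W(Y, Z, X), X
```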

IV. RELATED WORK

The IBP is frequently used by the computational intelligence and machine learning communities; in particular, several feature-based models that exploit the IBP have been reported [20], [31]. In the following, we mainly review factorization methods utilizing the IBP that are closely related to our proposed model.

The first benchmark infinite latent-feature model based on IBP priors was proposed by Griffiths et al. [32]. Reed et al. introduced the linear-Gaussian IBP model, employing a maximization-expectation framework to approximate the MAP inference (ME-IBP) [1], [17]. This model can be regarded as a latent factor model in which the binary matrix Z linearly combines the latent factors W. The graphical structure of the ME-IBP is shown in Fig. 1(a): the ME-IBP draws Z from an IBP prior with strength parameter α_z (the full IBP prior is given in Eq. (2)), while the prior over the non-negative matrix W is an independent, identically-distributed truncated Gaussian with zero mean and variance σ_W^2. Further, Z can be optimized by maximizing a submodular cost function:

\mathbf{Z} \sim \mathrm{IBP}(\alpha_z), \qquad \mathbf{W} \sim \mathcal{T}(\mathbf{W}; 0, \sigma_W^2).

The ME-IBP is known to be significantly slower than many other methods reported in the literature, and it is limited to learning binary features in a bi-factorization setting.

Some researchers [33], [34] have attempted to extend the basic IBP model by considering the correlation between latent features and observations. In particular, the correlated IBP-IBP model employs a correlation framework for feature-based models [2]. This model is perhaps the most relevant benchmark for our proposed method. Its tri-factorization is illustrated as a hierarchical structure in Fig. 1(c): the feature assignment matrix Z is further decomposed into two binary matrices. In this IBP-IBP model, IBP priors are drawn on both binary variables, and a Bernoulli function is set as the link function; the latent feature matrix W follows an exponential prior:

\mathbf{U} \sim \mathrm{IBP}(\alpha_u), \qquad \mathbf{V} \sim \mathrm{IBP}(\alpha_v), \qquad z_{nk} \sim \mathrm{Bernoulli}\big(1 - q^{\mathbf{u}_n^T \mathbf{v}_k}\big).

Here, α_u and α_v are the strength parameters, and q ∈ [0, 1] is the noise parameter. In the IBP-IBP model, u_n is the nth row of the binary category matrix U, and v_k is the kth column of the binary matrix V, which encodes the category-feature relations; z_{nk} = 1 means feature k is present in observation n. It is worth noting that U, V, and Z are all constrained to be binary, forcing observations to be associated with only one category. This constraint limits the model's flexibility, so the weakness of the IBP-IBP model lies in the identifiability issues commonly associated with feature-based models.

There are also other NMTF methods that do not adopt the IBP. These models are usually based on a discriminative formulation [10] that minimizes the following objective:

\arg\min \|\mathbf{Y} - \mathbf{Z}\mathbf{W}\mathbf{X}^T\|^2, \qquad \text{s.t. } \mathbf{Z} \in \mathbb{R}_+^{D \times l},\; \mathbf{X} \in \mathbb{R}_+^{N \times k}.

Here, Y is a D × N observation matrix, and l and k represent the numbers of object clusters and feature clusters, respectively. To address co-clustering, this method was further developed by constraining the factors Z and X [12]. More recently, PNMT further decomposed each of the factors with three penalty-term constraints [13].

V. EXPERIMENTS

In this section, we experimentally evaluate our proposed iNBMT model on five datasets for the tasks of feature extraction, reconstruction, and pre-image restoration. We compare the performance of iNBMT with three state-of-the-art algorithms: Maximization-Expectation IBP (ME-IBP), correlated IBP-IBP (IBP-IBP), and Penalty NMTF (PNMT). In addition, the clustering performance of our proposed method is evaluated on two benchmark datasets and compared with four related methods, including K-means, ME-IBP, PNMT, and Fast NMTF (FNMTF).

TABLE I
DESCRIPTION OF SEVEN DATASETS USED IN THE EXPERIMENT

Datasets    Noise term   Training   Testing   Dimensions   Classes
Syn         σY = 0.8     4500       –         36           –
Com-USPS    σY = 0.8     2000       –         1024         –
Com-NIST    σY = 0.8     400        –         4096         –
Pre-USPS    –            400        20        256          –
Pre-NIST    –            600        15        1024         –
Coil-20     –            1440       –         4096         20
UMIST       –            575        –         644          20

A. Datasets

We first evaluate our method on a synthetic dataset and then conduct experiments on six datasets obtained from real tasks. For the synthetic data, we modify the dataset used by Griffiths [21]. Of the remaining six datasets, Com-USPS, Pre-USPS, Com-NIST, and Pre-NIST are derived from the well-known USPS and NIST datasets respectively, whereas Coil-20 and UMIST are two widely used benchmark clustering datasets. Information on these seven datasets is given in Table I, where Y denotes the observations and σY indicates the variance of the Gaussian noise. The first three datasets, i.e., Syn, Com-USPS, and Com-NIST, are used for evaluating the performance of the different methods on feature extraction and reconstruction tasks, while Pre-USPS and Pre-NIST are used to validate the various models on the pre-image restoration task. All five datasets are available online at https://github.com/zzy8989/Data-iNBMT. Finally, Coil-20 and UMIST are used to evaluate the clustering performance.

1) Synthetic Data: The synthetic dataset was generated by modifying the dataset used by Griffiths [21]. Specifically, our dataset comprises 6 × 6 grey images adapted via three different luminance levels, as illustrated in Fig. 6(a). Each row of the observations Y is generated using Z to linearly combine a subset of four binary factors X [see Fig. 3(a)]. In addition, W loads different luminance combinations. The modified dataset presents a more challenging problem and appears more appropriate for evaluating the different methods.

2) Com-USPS and Pre-USPS Data: We generated two datasets from USPS: Com-USPS and Pre-USPS. The digits used in these two datasets are sampled randomly from USPS, and the generated datasets are scaled to [0, 1]. Each row of the Com-USPS dataset is built up from 32 × 32 grey images, with various combinations of the digits 0, 1, 2, 3 in each sample, as illustrated in Fig. 7(a). The Pre-USPS dataset contains merely a single handwritten digit, also chosen randomly from 0, 1, 2, 3. In the training set, each digit has 100 samples. Furthermore, in order to see whether the various methods can restore an image, the test samples are bottom-halved from the original images [see Fig. 10(b)].

Fig. 3. Comparison of iNBMT, ME-IBP, IBP-IBP, and PNMT on the synthetic dataset. iNBMT perfectly matches the ground-truth features. (a) Ground-truth latent features. (b) Features learned by iNBMT. (c) Features learned by ME-IBP. (d) Features learned by IBP-IBP. (e) Features learned by PNMT.

Fig. 4. Comparison of iNBMT, ME-IBP, IBP-IBP, and PNMT on the Com-USPS dataset. The first sub-figure shows 9 examples of training data; the other four show the features learned by each method. iNBMT clearly shows the best performance. (a) 9 examples of training data without noise. (b) Features learned by iNBMT. (c) Features learned by ME-IBP. (d) Features learned by IBP-IBP. (e) Features learned by PNMT.

Fig. 5. Comparison of iNBMT, ME-IBP, IBP-IBP, and PNMT on the Com-NIST dataset. The first sub-figure shows 9 examples of training data and the others show the features learned by each method. iNBMT shows the best performance. (a) 9 examples of training data without noise. (b) Features learned by iNBMT. (c) Features learned by ME-IBP. (d) Features learned by IBP-IBP. (e) Features learned by PNMT.

Fig. 6. Comparison of reconstructions of the synthetic data. iNBMT matches the groundtruth better than ME-IBP, IBP-IBP, and PNMT. (a) Groundtruth without noise. (b) Groundtruth. (c) Reconstruction by iNBMT. (d) Reconstruction by ME-IBP. (e) Reconstruction by IBP-IBP. (f) Reconstruction by PNMT.

3) Com-NIST and Pre-NIST Data: The remaining two datasets were both generated from the NIST handprinted forms and characters database; it is worth mentioning that all images in this database are binary. In the same way as for Com-USPS, we generated the Com-NIST dataset so that each sample consists of 64 × 64 binary images combined from the letters a, b, c, d [see Fig. 8(a)]. In the Pre-NIST dataset, on the other hand, each sample contains a single handwritten letter chosen randomly from a, c, d. In the training set, each letter has 200 samples. Similarly, the test samples are top-halved from the original images so as to validate whether the various methods can restore them [see Fig. 11(b)].

Fig. 7. Comparison of iNBMT, ME-IBP, IBP-IBP, and PNMT on the Com-USPS dataset. The first sub-figure shows the groundtruth image without noise. iNBMT can be clearly seen to exhibit the best performance. (a) Groundtruth without noise. (b) Corrupted groundtruth. (c) Reconstruction by iNBMT. (d) Reconstruction by ME-IBP. (e) Reconstruction by IBP-IBP. (f) Reconstruction by PNMT.

Fig. 8. Comparison of iNBMT, ME-IBP, IBP-IBP, and PNMT on the Com-NIST dataset. The first sub-figure shows the groundtruth image without noise. The 3rd to 6th sub-figures show the reconstruction results. iNBMT clearly produces the best performance. (a) Groundtruth without noise. (b) Corrupted groundtruth. (c) Reconstruction by iNBMT. (d) Reconstruction by ME-IBP. (e) Reconstruction by IBP-IBP. (f) Reconstruction by PNMT.

Fig. 9. Illustration of the von-Neumann divergence measure, which is more consistent with human visual perception than MAE and MSE. (a) Image A. (b) Image I. (c) Image B.

TABLE II
RECONSTRUCTION RESULTS BY VON-NEUMANN DIVERGENCE (THE SMALLER THE VALUE, THE BETTER)

Methods   Syn        Com-NIST    Com-USPS
iNBMT     6.4530     310.2899    2598.5233
ME-IBP    33.4758    1556.6001   5027.6282
IBP-IBP   60.4408    2416.1324   16893.2936
PNMT      216.1765   1958.5674   3737.1268

Fig. 10. Comparison of iNBMT, ME-IBP, IBP-IBP, and PNMT on the Pre-USPS dataset for pre-image restoration. The first row shows 20 samples of training data and the test images, with five halved digits (0, 1, 2, 3) shown in each row respectively. The second row shows the features learned by each method. The third row shows the restoration results; digits with red boxes are incorrectly restored. (a) 20 samples of training data. (b) Data for testing. (c) Features learned by iNBMT. (d) Features learned by ME-IBP. (e) Features learned by IBP-IBP. (f) Features learned by PNMT. (g) Restored by iNBMT. (h) Restored by ME-IBP. (i) Restored by IBP-IBP. (j) Restored by PNMT.

4) Coil-20-Product and UMIST: Coil-20-product is an object recognition benchmark dataset consisting of grayscale images of 20 objects. These objects have diverse reflection properties and complex geometric shapes. Each object was rotated through 360 degrees on a turntable, and 72 images were taken per object (one every 5 degrees) [35].

UMIST is a face recognition benchmark dataset containing 575 images of 20 different subjects, each viewed from a range of angles. Each image is rescaled to 28 × 23.³

³https://www.sheffield.ac.uk/eee/research/iel/research/face

B. Feature Extraction

The three datasets Syn, Com-USPS, and Com-NIST are used to evaluate whether the different methods can extract the latent features.

Fig. 3 shows the results on the Syn dataset. Fig. 3(a) provides the ground-truth latent features. Fig. 3(c) shows the features inferred by the ME-IBP, which match the true features well; note, however, that each feature is repeated twice and contains some noise. The features learned by the IBP-IBP are shown in Fig. 3(d), where the features are also repeated. In Fig. 3(e), the features inferred by the PNMT match the true features, but with a lot of noise. In Fig. 3(b), it is evident that the iNBMT outperforms the above three: it perfectly matches the true features while identifying the number of features automatically.

We next report the performance on Com-USPS. The results are shown in Fig. 4. We present 9 input images without noise as examples for comparison [see Fig. 4(a)]. It is interesting to see from Fig. 4(b) that our proposed iNBMT not only captures the latent features, i.e., each of the clean digits, but also captures their image contours. In Fig. 4(e), PNMT also captures the single digits, but the learned features contain more noise. On the other hand, the features inferred by the ME-IBP look reasonable, but they are repeated many times, as shown in Fig. 4(c). In Fig. 4(d), the features learned by the IBP-IBP are provided; its performance is clearly not as good as our proposed iNBMT, and it also fails to obtain the feature for the digit 1.

We then present the evaluation results on the Com-NIST data in Fig. 5. Evidently, none of the algorithms can filter the noise perfectly. Nonetheless, iNBMT still learned the underlying features (the single letters), as clearly observed in Fig. 5(b). The ME-IBP and PNMT also extracted clear letters, but their results contain merely combinations of letters [see Fig. 5(c) and (e)]. It is usually tricky to cover all the possible combinations; consequently, their performance is often much worse than that of the proposed iNBMT model. In summary, from the three groups of results, iNBMT demonstrates the best performance and outperforms the ME-IBP, IBP-IBP, and PNMT on feature extraction.

Fig. 11. Comparison of iNBMT, ME-IBP, IBP-IBP, and PNMT on the Pre-NIST dataset for pre-image restoration. The first row shows 15 samples of training data and the test images, with five halved letters (a, c, d) shown in each row respectively. The second row shows the features learned by each method. The third row shows the restoration results; letters with red boxes are incorrectly restored. (a) 15 samples of training data. (b) Data for testing. (c) Features learned by iNBMT. (d) Features learned by ME-IBP. (e) Features learned by IBP-IBP. (f) Features learned by PNMT. (g) Restored by iNBMT. (h) Restored by ME-IBP. (i) Restored by IBP-IBP. (j) Restored by PNMT.

C. Reconstruction

In this section, we report experiments comparing the reconstruction performance of the different methods.

In the experimental setup, all three groups of observations are corrupted with Gaussian noise of σY = 0.8. Examples of randomly generated images and their corrupted versions are illustrated in sub-figures (a) and (b), respectively, of each figure (Figs. 6–8). The images reconstructed by the four algorithms are shown in sub-figures (c), (d), (e), and (f) of each figure. We can clearly see that the images reconstructed by the iNBMT are more similar to the groundtruth; in particular, on the Com-USPS dataset, the iNBMT almost perfectly recovers the images. Significantly, the denoising ability of the iNBMT is also superior to that of the other algorithms. Although several repeated features are extracted by iNBMT on the Com-NIST data (as seen in the feature extraction section), this does not prevent the iNBMT from producing reasonably good reconstructions of the data. In comparison, ME-IBP, IBP-IBP, and PNMT cannot clearly extract single digits or letters as latent features, and their reconstruction results are worse than those of the iNBMT. Moreover, in the iNBMT framework, WX^T can be considered a set of basis images, which are added together with the binary coefficients Z to recover the images. It is apparent that all digit combinations are correctly detected: by selecting the features that are non-zero in each row of Z, the reconstructed images are composed by adding the corresponding basis images together.

In order to quantitatively evaluate the reconstruction performance of the different algorithms, we adopt the von-Neumann divergence as a criterion to measure the similarity between the reconstructed images and the noise-free groundtruth [36]–[39]. The von-Neumann divergence is defined as

D_{vN}(\mathbf{A}_1, \mathbf{A}_2) = \operatorname{tr}(\mathbf{A}_1 \log \mathbf{A}_1 - \mathbf{A}_1 \log \mathbf{A}_2 - \mathbf{A}_1 + \mathbf{A}_2),

where A_1 and A_2 are two matrices. The von-Neumann divergence has been shown to preserve geometric information better when two images or matrices are compared, and is considered a measure closer to human visual perception [37], [38]. To verify this, we provide an illustrative example of the difference between the Mean Square Error (MSE), the Mean Absolute Error (MAE), and the von-Neumann divergence.

Specifically, we use the MSE and MAE to measure the discrepancy between two images (as shown in Fig. 9). Image I can be seen to be more similar to image A than to image B. However, MSE(A, I) = 0.0222 > MSE(B, I) = 0.0111 and MAE(A, I) = 0.0556 > MAE(B, I) = 0.0550, showing that MSE and MAE may not be suitable. In contrast, the von-Neumann divergence result D_{vN}(A, I) = 1.1325 < D_{vN}(B, I) = 6.7802 indicates that A is more similar to I, in agreement with human visual perception.
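A small sketch of this divergence (our own, assuming numpy and scipy are available; logm is the matrix logarithm, so the inputs here are regularized to be symmetric positive-definite) is given below.

```python
import numpy as np
from scipy.linalg import logm

def von_neumann_divergence(A1, A2):
    # D_vN(A1, A2) = tr(A1 log A1 - A1 log A2 - A1 + A2)
    return np.trace(A1 @ logm(A1) - A1 @ logm(A2) - A1 + A2).real

# toy usage with two random symmetric positive-definite matrices
rng = np.random.default_rng(0)
M1, M2 = rng.random((8, 8)), rng.random((8, 8))
A1 = M1 @ M1.T + 1e-3 * np.eye(8)
A2 = M2 @ M2.T + 1e-3 * np.eye(8)
print(von_neumann_divergence(A1, A2))
```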


In Table II, we report the von-Neumann divergence obtained by the various methods. The proposed iNBMT model clearly yields significantly smaller values than the other methods, showing that the images reconstructed by iNBMT are more similar to the groundtruth. This result coincides with our intuition, as observed in Figs. 6–8.

D. Pre-Image Restoration

In this section, we compare the performance of the different algorithms on pre-image restoration. Latent features are first obtained by each model using the training set, and the resulting features are then evaluated in terms of their ability to restore test pre-images, which are intentionally halved.

In the second rows of Figs. 10 and 11, we again illustrate the ability of the four models to extract hidden features. Unlike the previous experiments, here each sample (an image) consists of a single digit or letter rather than four combined digits or letters.

The various methods are first used to learn latent features, which are then exploited to restore the incomplete images in the test set. Specifically, the features are learned from the training set, and the binary matrix Z is updated with the test set; the new binary matrix Z contains z_{nk} = 1 if the nth test element is recognized as having the kth row feature. Twenty incomplete digits are used in testing [see Fig. 10(b)], and each row contains the same digit (0–3). The recovered images are illustrated in the bottom row of Fig. 10; the sub-figures with red boxes are incorrectly restored. Similarly, for Pre-NIST, each row of the test images contains the same letter in Fig. 11(b). To evaluate the result, we only need to determine whether the digit (or letter) of each row matches the label of the test image. From the results on both Pre-NIST and Pre-USPS, the iNBMT can be seen to have restored almost all the images correctly (with two errors on Pre-USPS and one error on Pre-NIST), while ME-IBP, IBP-IBP, and PNMT could not restore many of the images. This experiment demonstrates the advantages of our proposed iNBMT method.

Note that the above restorations were judged perceptually. Though subjective, it is sufficiently clear to the naked eye that the images restored by our iNBMT model are much closer to the groundtruth images.

E. Clustering

In this section, we evaluate the clustering performance of the proposed iNBMT method on the benchmark datasets Coil-20 and UMIST, compared with four related approaches: the classical k-means clustering method, the one-sided clustering ME-IBP method, and the state-of-the-art NMTF methods PNMT and FNMTF.⁴ The evaluation is performed on the basis of three standard clustering criteria, described below, where c and c′ denote the true labels and the resulting cluster labels respectively, and N is the total number of samples.

⁴Code can be downloaded from https://github.com/lucasbrunialti/nmtf-coclustering

TABLE III
CLUSTERING RESULTS (THE HIGHER, THE BETTER)

                     Coil-20                         UMIST
Methods    Accuracy   NMI      Purity     Accuracy   NMI      Purity
iNBMT      0.7911     0.8631   0.7917     0.5426     0.7028   0.5461
k-means    0.5563     0.7768   0.6201     0.4696     0.5994   0.4713
ME-IBP     0.5792     0.6997   0.6042     0.4713     0.6128   0.4713
FNMTF      0.6937     0.7851   0.6944     0.4887     0.6355   0.4904
PNMT       0.6111     0.7554   0.6146     0.5009     0.6019   0.5026

evaluation measures:

Accuracy =∑N

i=1 δ(ci,map(c′i))N

;

where c and c′ have been set to true labels and resulting clusterlabels, N is the total number of samples, and δ(·) denotes theDelta function, δ(x, y) = 1 if x = y and δ(x, y) = 0 otherwise.We also map each cluster to an original label. This is usedto measure the percentage of correctly clustered samples. Thenormalized mutual information (NMI) is used to measure themutual information between two sets of clusters c and c′. It isalso employed as an evaluation criterion.

NMI =I(c′, c)

(H[c]) + H[c′])/2, I(c′, c) =

i∈c;j∈c ′

pij log2pij

pipj.

Here, H[c] = −∑N

i=1 pi log2 pi and pij = nij /N refers to theprobability that a member in the cluster j belongs to class i,where nij is the number of members in cluster j belonging toclass i. The purity measures percentage of total number of datapoints that were classified correctly:

\mathrm{Purity} = \sum_{j \in c'} \frac{n_j}{N}\, p_j,

where n_j is the number of members in cluster j and p_j = \frac{1}{n_j} \max_i (n_{ij}) [40].
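Similarly, NMI and Purity can be computed with a short sketch such as the following (again our illustration, assuming integer labels and scikit-learn; in recent scikit-learn versions the default arithmetic averaging of normalized_mutual_info_score matches the (H[c] + H[c'])/2 normalization used above).

```python
import numpy as np
from sklearn.metrics import normalized_mutual_info_score

def purity(c_true, c_pred):
    """Purity = (1/N) * sum over clusters j of max_i n_ij."""
    total = 0
    for cl in np.unique(c_pred):
        members = c_true[c_pred == cl]       # true class labels inside cluster j
        total += np.bincount(members).max()  # max_i n_ij
    return total / len(c_true)

c_true = np.array([0, 0, 1, 1, 2, 2])
c_pred = np.array([1, 1, 0, 0, 2, 2])
print(normalized_mutual_info_score(c_true, c_pred), purity(c_true, c_pred))
```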

This clustering experiment reports the results obtained by our method and the four related approaches. K-means and ME-IBP are one-sided clustering methods, while FNMTF and PNMT are co-clustering methods, with the limitation that the number of clusters has to be specified; for FNMTF and PNMT, the number of clusters is set to the true number of classes and the best average result is reported. The results in Table III show that accuracy increases significantly when co-clustering methods are applied. Across all three criteria, the proposed iNBMT method can be seen to dramatically outperform the other benchmark methods.

VI. COMPLEXITY ANALYSIS

To measure inference efficiency, we calculate the time complexity per iteration under a linear-Gaussian likelihood model. We show that the per-iteration complexity of our IBP-based model is lower than that of other recently proposed latent feature models [1], [2].



TABLE IV
NUMBER OF PARAMETERS COMPARISON

Methods    Per-iteration complexity
iNBMT      O(αN + βD)
ME-IBP     O(γND)
IBP-IBP    O(δND)

The per-iteration time complexity of our proposed iNBMT is one order lower than that of the other two models.

In the ME framework, the updates of p(Y|Z) and p(Y|X) are independent of the remaining observations and only require the computation of T(·). Therefore, updating T(Z) needs O(NK² ln K) operations, and O(DL² ln L) operations are involved in updating T(X). In our proposed iNBMT model, updating q(W) yields a per-iteration complexity of O(NK²L + DL²K), which consists of two parts: O(K²L) operations for optimizing p(Z) and O(L²K) operations for updating p(X). Hence, the total per-iteration complexity of iNBMT is O(NK²L + DL²K). The latent feature model via IBP proposed in [1] uses similar ME inference over the latent factors; the total per-iteration complexity of the ME-IBP model is easily checked to be O(NK²(D + ln K)), consisting of O(K²D) operations on q(W) and O(NK² ln K) operations on the infinite variable p(Y|Z). Clearly, the operations of our model are mainly reduced when updating the parameters of the non-negative matrices. In the correlated IBP-IBP model, there are also two infinite variables, the category U and the latent features V; the per-iteration complexity of p(Y|Z) is O(NK²L ln N), and O(NK²L(LD)) operations are needed when updating p(W). Therefore, its total per-iteration complexity is about O(NK²L(LD + ln N)).

In practice, N and D are usually much larger than K and L. Hence, the per-iteration complexity of the proposed iNBMT can be written in the simple form O(αN + βD), while that of the ME-IBP model simplifies to O(γND) and that of the IBP-IBP model to O(δND), where α, β, γ, and δ are small coefficients. Clearly, our proposed iNBMT has a per-iteration complexity one order lower than that of the competing models. For easier comparison with the other feature-based models, we list the per-iteration complexities in Table IV.
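For intuition only, the sketch below plugs assumed example sizes into the quoted orders (N, D, K, and L are placeholder values we chose, not values from the paper); it merely illustrates that the iNBMT count grows with N and D separately rather than with their product.

```python
import math

# Hypothetical problem sizes (illustrative assumptions, not from the paper)
N, D, K, L = 10_000, 1_000, 20, 10

inbmt   = N * K**2 * L + D * L**2 * K            # O(N K^2 L + D L^2 K)
me_ibp  = N * K**2 * (D + math.log(K))           # O(N K^2 (D + ln K))
ibp_ibp = N * K**2 * L * (L * D + math.log(N))   # O(N K^2 L (L D + ln N))

for name, ops in [("iNBMT", inbmt), ("ME-IBP", me_ibp), ("IBP-IBP", ibp_ibp)]:
    print(f"{name:8s} ~ {ops:.2e} operations per iteration")
```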

VII. CONCLUSION

In this paper, we propose a new Bayesian model, termed infinite Non-negative Binary Matrix Tri-factorization (iNBMT). The proposed model is proven to be capable of automatically learning latent binary features along with the feature number, based on the Indian Buffet Process (IBP). The proposed iNBMT engages a tri-factorization process that decomposes a nonnegative matrix into the product of three components, including two binary matrices and a non-negative real matrix. We impose an IBP prior on the two infinite binary matrices, while a truncated Gaussian distribution is assumed on the weight matrix. To optimize the model, we develop an efficient modified variational Bayesian algorithm, with an iteration complexity one order lower than recently proposed IBP-based models. We have

carried out a series of experiments which demonstrate that our proposed iNBMT model significantly outperforms state-of-the-art algorithms on both benchmark and real-world data.

Despite the remarkable performance of our model, some limitations still need to be addressed. First, while our IBP-based method is faster than most other approaches, it is not as fast as traditional NMF. Second, as observed in the experiments, repetitive binary features can sometimes be extracted. Both issues will be addressed in future work to further optimise our iNBMT model. Finally, as with many factorization methods, the proposed model can be applied to a range of future applications, including gene expression clustering [41], [42], graph matching [43], and zero-shot learning [44].

APPENDIX A
OPTIMIZING LATENT FEATURES

We introduce the latent-feature optimisation in this appendix. Updating Z and X is relatively straightforward by computing Eq. (6). Similar to variational IBP methods, we split the expectation in Eq. (4) into terms depending on each of the latent variables [16], with the benefit that the binary-variable updates are not affected by inactive features. Thus, we decompose the relevant terms of X in Eq. (5); the terms depending on Z are decomposed in the same way during its update. First, to decompose \ln\frac{(D - s_l)!\,(s_l - 1)!}{D!}, we define a quadratic pseudo-Boolean function:

f(x_{dl}) = \begin{cases} 0, & \text{if } s_{l\backslash d} = 0 \text{ and } x_{dl} = 0, \\ \ln\frac{(D - s_{l\backslash d} - x_{dl})!\,(s_{l\backslash d} + x_{dl} - 1)!}{D!}, & \text{otherwise.} \end{cases}

Here s_{l\backslash d} indicates the sum of x_{dl} after removing the dth row from X. The terms of Eq. (6), \sum_{l=1}^{L_+} \ln\frac{(D - s_l)!\,(s_l - 1)!}{D!} + \ln L_+!, are then changed to

\sum_{l=1}^{L_+} f(x_{dl}) = \sum_{l=1}^{L_+} \left[ x_{dl}\left(f(x_{dl}=1) - f(x_{dl}=0)\right) + f(x_{dl}=0) \right],

\ln L_+! = \ln\!\left( L_{+\backslash d} + \sum_{l=1}^{L_+} \mathbf{1}_{\{s_{l\backslash d}=0\}}\, x_{dl} \right)!,

where \mathbf{1}_{\{\cdot\}} is the indicator function. Here we show that the lower bound of the evidence in Eq. (5) is well defined in the limit L \to \infty:

T(X_{dl}) = -\frac{1}{2\sigma_Y^2}\, (X_{dl}\Lambda_{nl}^T)(X_{dl}\Lambda_{nl}^T)^T + X_{dl}\,\omega_{dl}^T + \sum_{l=1}^{L_+} f(x_{dl}) - \ln L_+ + \ln p(Z|\alpha) + \sum_{k=1}^{K} \mathbf{1}_{\{s_{l\backslash d}=0\}}\, x_{dl}\, \phi_{kl} + \mathrm{const},

where \omega_{dl} = -\frac{1}{\sigma_Y^2}\left(\Lambda_{nl}^T Y_{nd} + \gamma\right) and \Lambda_{nl} = Z\,\mathbb{E}[W].
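As a small numerical illustration (our sketch; the function and variable names are assumptions), f(x_dl) can be evaluated with log-factorials to avoid overflow for large D, and the linear pseudo-Boolean form above reproduces it exactly for x_dl in {0, 1}:

```python
import numpy as np
from scipy.special import gammaln

def log_factorial(n):
    # ln(n!) via the log-gamma function, so large D does not overflow
    return gammaln(n + 1)

def f(x_dl, s_l_minus_d, D):
    """The quadratic pseudo-Boolean term f(x_dl) from Appendix A."""
    if s_l_minus_d == 0 and x_dl == 0:
        return 0.0
    # ln [ (D - s_{l\d} - x_dl)! (s_{l\d} + x_dl - 1)! / D! ]
    return (log_factorial(D - s_l_minus_d - x_dl)
            + log_factorial(s_l_minus_d + x_dl - 1)
            - log_factorial(D))

# The linearized form x*(f(1) - f(0)) + f(0) agrees with f on {0, 1}
D, s = 50, 7
for x in (0, 1):
    assert np.isclose(f(x, s, D), x * (f(1, s, D) - f(0, s, D)) + f(0, s, D))
```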

APPENDIX B
UPDATING VARIATIONAL PARAMETERS

The update of the variational parameters for the non-negative matrix W, over a truncated Gaussian distribution, is outlined



below:

q(W) = \prod_{k=1}^{K}\prod_{l=1}^{L} \mathrm{TN}(w_{kl};\, \mu_{kl}, \sigma_{kl}^2) = \prod_{k=1}^{K}\prod_{l=1}^{L} \frac{\mathcal{N}(\mu_{kl}, \sigma_{kl}^2)}{\Phi(\infty) - \Phi(0)},

where \Phi(a) = \frac{1}{2}\left(1 + \mathrm{erf}\!\left(\frac{a - \mu_{kl}}{\sqrt{2}\,\sigma_{kl}}\right)\right), \Phi(\infty) = 1, and \mathrm{erf}(\cdot) is the Gaussian error function. In accordance with the upper-tail truncation, the parameters are updated as follows:

\mathbb{E}[w_{kl}] = \mu_{kl} + \sigma_{kl}\,\lambda(t), \qquad \mathbb{E}[w_{kl}^2] = \mu_{kl}\,\mathbb{E}[w_{kl}] + \sigma_{kl}^2,

with \lambda(t) = \frac{\sqrt{2/\pi}}{e^{t^2}\left(1 - \mathrm{erf}(t)\right)} and t = -\mu_{kl}/(\sqrt{2}\,\sigma_{kl}). It is worth noting that the mean and variance of the truncated Gaussian distribution need to be computed twice per iteration.

If K \to \infty:

\mu_{k_+l} = \tau^2 \sum_{n=1}^{N} z_{nk}^T \Big( y_{nd} - \sum_{k' \neq k} z_{nk'}\, \mathbb{E}[w_{k'l}\, x_{dl}^T] \Big)\, x_{dl}, \qquad \sigma_{k_+l} = \tau\,\sigma_Y;

if L \to \infty:

\mu_{kl_+} = \tau^2 \sum_{d=1}^{D} x_{dl} \Big( y_{nd}^T - \sum_{l' \neq l} x_{dl}^T\, \mathbb{E}[w_{kl'}\, z_{nk'}] \Big)\, z_{nk}^T, \qquad \sigma_{kl_+} = \tau\,\sigma_Y,

where \tau = \left( m_k^T s_l + \frac{\sigma_Y^2}{\sigma_W^2} \right)^{-\frac{1}{2}}. Finally, the entropy of the truncated Gaussian distribution is given as:

H(q(w_{kl})) = \frac{1}{2\sigma_{kl}^2} \left\{ \mathbb{E}[w_{kl}]^2 - \mathbb{E}[w_{kl}^2] - \left(\mathbb{E}[w_{kl}] - \mu_{kl}\right)^2 - \left[ \frac{1}{2}\ln\frac{2}{\pi\sigma_{kl}^2} - \ln\left(1 - \mathrm{erf}(t)\right) \right] \right\}.
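For reference, the moment formulas above can be evaluated with a few lines of code; the sketch below is our illustration rather than the authors' implementation, follows the λ(t) expression directly, and uses SciPy's erf (for large |t| the scaled complementary error function erfcx would be numerically safer).

```python
import numpy as np
from scipy.special import erf

def truncated_gaussian_moments(mu, sigma):
    """E[w] and E[w^2] for w ~ N(mu, sigma^2) truncated to the nonnegative axis."""
    t = -mu / (np.sqrt(2.0) * sigma)
    # lambda(t) = sqrt(2/pi) / (exp(t^2) * (1 - erf(t)))
    lam = np.sqrt(2.0 / np.pi) / (np.exp(t ** 2) * (1.0 - erf(t)))
    e_w = mu + sigma * lam                 # E[w_kl]   = mu + sigma * lambda(t)
    e_w2 = mu * e_w + sigma ** 2           # E[w_kl^2] = mu * E[w_kl] + sigma^2
    return e_w, e_w2

# Example cell: mu = 0.5, sigma = 1.0
print(truncated_gaussian_moments(0.5, 1.0))
```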

ACKNOWLEDGEMENTS

The authors would like to thank the anonymous reviewers for their insightful comments and suggestions, which helped improve the quality of this paper.

REFERENCES

[1] C. Reed and Z. Ghahramani, "Scaling the Indian buffet process via submodular maximization," in Proc. 30th Int. Conf. Mach. Learn., 2013, pp. 1013–1021.

[2] F. Doshi-Velez and Z. Ghahramani, "Correlated non-parametric latent feature models," in Proc. 25th Conf. Uncertainty Artif. Intell., 2012, pp. 143–150.

[3] D. D. Lee and H. S. Seung, "Algorithms for non-negative matrix factorization," in Proc. Adv. Neural Inf. Process. Syst. Conf., 2001, pp. 556–562.

[4] D. D. Lee and H. S. Seung, "Learning the parts of objects by non-negative matrix factorization," Nature, vol. 401, no. 6755, pp. 788–791, 1999.

[5] T. Li and C. H. Q. Ding, "Nonnegative matrix factorizations for clustering: A survey," in Data Clustering: Algorithms Appl., 2013, pp. 149–176.

[6] K. Huang, H. Yang, I. King, and M. R. Lyu, Machine Learning: Modeling Data Locally and Globally. Berlin, Germany: Springer-Verlag, 2008, ISBN 3-5407-9451-4.

[7] J. Yoo and S. Choi, "Orthogonal nonnegative matrix tri-factorization for co-clustering: Multiplicative updates on Stiefel manifolds," Inf. Process. Manag., vol. 46, no. 5, pp. 559–570, 2010.

[8] R. G. Soares, H. Chen, and X. Yao, "A cluster-based semisupervised ensemble for multiclass classification," IEEE Trans. Emerg. Topics Comput. Intell., vol. 1, no. 6, pp. 408–420, Dec. 2017.

[9] X. Luo, M. Zhou, Y. Xia, and Q. Zhu, "An efficient non-negative matrix-factorization-based approach to collaborative filtering for recommender systems," IEEE Trans. Ind. Informat., vol. 10, no. 2, pp. 1273–1284, May 2014.

[10] C. Ding, T. Li, W. Peng, and H. Park, "Orthogonal nonnegative matrix tri-factorizations for clustering," in Proc. 20th Int. Conf. Knowl. Discovery Data Mining, 2006, pp. 126–135.

[11] Z. Zhang, T. Li, C. H. Q. Ding, X. Ren, and X. Zhang, "Binary matrix factorization for analyzing gene expression data," Data Mining Knowl. Discovery, vol. 20, no. 1, pp. 28–52, 2010.

[12] H. Wang, F. Nie, H. Huang, and F. Makedon, "Fast nonnegative matrix tri-factorization for large-scale data co-clustering," in Proc. 22nd Int. Joint Conf. Artif. Intell., 2011, pp. 1553–1558.

[13] S. Wang and A. Huang, "Penalized nonnegative matrix tri-factorization for co-clustering," Expert Syst. Appl., vol. 78, pp. 64–73, 2017.

[14] B. Fan, Q. Kong, T. Trzcinski, Z. Wang, C. Pan, and P. Fua, "Receptive fields selection for binary feature description," IEEE Trans. Image Process., vol. 23, no. 6, pp. 2583–2595, Jun. 2014.

[15] B. Fan et al., "Do we need binary features for 3D reconstruction?" in Proc. IEEE Conf. Comput. Vis. Pattern Recog. Workshops, 2016, pp. 53–62.

[16] F. Doshi-Velez, K. T. Miller, J. V. Gael, and Y. W. Teh, "Variational inference for the Indian buffet process," in Proc. Int. Conf. Artif. Intell. Statist., 2009, pp. 137–144.

[17] S. Hasler, H. Wersing, and E. Korner, "Combining reconstruction and discrimination with class-specific sparse coding," Neural Comput., vol. 19, no. 7, pp. 1897–1918, 2007.

[18] X. Yang, K. Huang, R. Zhang, and A. Hussain, "Learning latent features with infinite non-negative binary matrix tri-factorization," in Proc. Int. Conf. Neural Inf. Process., 2016, pp. 587–596.

[19] K. T. Miller, T. L. Griffiths, and M. I. Jordan, "The phylogenetic Indian buffet process: A non-exchangeable nonparametric prior for latent features," in Proc. 24th Conf. Uncertainty Artif. Intell., 2008, pp. 403–410.

[20] D. A. Knowles and Z. Ghahramani, "Infinite sparse factor analysis and infinite independent components analysis," in Proc. Int. Conf. Independent Compon. Anal. Signal Separation, 2007, pp. 381–388.

[21] T. L. Griffiths and Z. Ghahramani, "Infinite latent feature models and the Indian buffet process," in Proc. Adv. Neural Inf. Process. Syst., 2005, vol. 18, pp. 475–482.

[22] R. Krause and D. L. Wild, "Identifying protein complexes in high-throughput protein interaction screens using an infinite latent feature model," in Proc. Pacific Symp. Biocomputing, 2006, vol. 11, pp. 231–242.

[23] D. J. Navarro and T. L. Griffiths, "Latent features in similarity judgments: A nonparametric Bayesian approach," Neural Comput., vol. 20, no. 11, pp. 2597–2628, 2008.

[24] D. Gorur, F. Jakel, and C. E. Rasmussen, "A choice model with infinitely many latent features," in Proc. 23rd Int. Conf. Mach. Learn., 2006, pp. 361–368.

[25] K. Miller, M. I. Jordan, and T. L. Griffiths, "Nonparametric latent feature models for link prediction," in Proc. Adv. Neural Inf. Process. Syst., 2009, vol. 22, pp. 1276–1284.

[26] H. P. Dang and P. Chainais, "Indian buffet process dictionary learning: Algorithms and applications to image processing," Int. J. Approx. Reasoning, vol. 83, pp. 1–20, 2017.

[27] Q. Pan, D. Kong, C. H. Q. Ding, and B. Luo, "Robust non-negative dictionary learning," in Proc. 28th Conf. Artif. Intell., 2014, pp. 2027–2033.

[28] Z. Ghahramani and M. J. Beal, "Variational inference for Bayesian mixtures of factor analysers," in Proc. Adv. Neural Inf. Process. Syst., 1999, vol. 12, pp. 449–455.

[29] H. Attias, "A variational Bayesian framework for graphical models," in Proc. Adv. Neural Inf. Process. Syst., 1999, vol. 12, pp. 209–215.

[30] Z. Ghahramani and M. J. Beal, "Propagation algorithms for variational Bayesian learning," in Proc. Adv. Neural Inf. Process. Syst., 2000, vol. 13, pp. 507–513.

[31] S. Gershman, P. I. Frazier, and D. M. Blei, "Distance dependent infinite latent feature models," IEEE Trans. Pattern Anal. Mach. Intell., vol. 37, no. 2, pp. 334–345, Feb. 2015.

[32] T. L. Griffiths and Z. Ghahramani, "The Indian buffet process: An introduction and review," J. Mach. Learn. Res., vol. 12, pp. 1185–1224, 2011.

[33] P. Rai and H. Daumé III, "Multi-label prediction via sparse infinite CCA," in Proc. Adv. Neural Inf. Process. Syst. Conf., 2009, pp. 1518–1526.


[34] S. Williamson, P. Orbanz, and Z. Ghahramani, "Dependent Indian buffet processes," in Proc. 20th Int. Conf. Artif. Intell. Statist., 2010, pp. 924–931.

[35] S. A. Nene, S. K. Nayar, and H. Murase, "Columbia object image library (COIL-20)," Dept. Comput. Sci., Columbia University, Tech. Rep. CUCS-005-96, Feb. 1996.

[36] I. S. Dhillon and J. A. Tropp, "Matrix nearness problems with Bregman divergences," SIAM J. Matrix Anal. Appl., vol. 29, no. 4, pp. 1120–1146, 2007.

[37] B. Kulis, M. A. Sustik, and I. S. Dhillon, "Low-rank kernel learning with Bregman matrix divergences," J. Mach. Learn. Res., vol. 10, pp. 341–376, 2009.

[38] P. Yang, K. Huang, and C. Liu, "Geometry preserving multi-task metric learning," Mach. Learn., vol. 92, no. 1, pp. 133–175, 2013.

[39] P. Yang, K. Huang, and C.-L. Liu, "A multi-task framework for metric learning with common subspace," Neural Comput. Appl., vol. 22, no. 7/8, pp. 1337–1347, 2013.

[40] C. H. Q. Ding, T. Li, W. Peng, and H. Park, "Orthogonal nonnegative matrix t-factorizations for clustering," in Proc. 20th Int. Conf. Knowl. Discovery Data Mining, 2006, pp. 126–135.

[41] H. Wang, H. Huang, C. Ding, and F. Nie, "Predicting protein–protein interactions from multimodal biological data sources via nonnegative matrix tri-factorization," J. Comput. Biol., vol. 20, no. 4, pp. 344–358, 2013.

[42] C. Li, Z.-Y. Liu, X. Yang, J. Su, and H. Qiao, "Stitching contaminated images," Neurocomputing, vol. 214, pp. 829–836, 2016.

[43] Z.-Y. Liu and H. Qiao, "GNCCP – Graduated nonconvexity and concavity procedure," IEEE Trans. Pattern Anal. Mach. Intell., vol. 36, no. 6, pp. 1258–1267, Jun. 2014.

[44] X. Xu, F. Shen, Y. Yang, D. Zhang, H. T. Shen, and J. Song, "Matrix tri-factorization with manifold regularizations for zero-shot learning," in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2017, pp. 2007–2016.

Xi Yang received the B.Sc. degree in mathematics with finance from the University of Liverpool, Liverpool, U.K., and the M.Sc. degree with distinction from the Department of Computer Science, University of Liverpool. She is currently working toward the Ph.D. degree in the Department of Electrical and Electronic Engineering, Xi'an Jiaotong-Liverpool University, Suzhou, China. Her research interests include statistical machine learning, including dimensionality reduction, supervised and unsupervised latent variable methods, and Bayesian nonparametric methods.

Kaizhu Huang received the B.Sc. degree in engineering in 1997, the M.Sc. degree in engineering from the Institute of Automation, Chinese Academy of Sciences (CASIA), Beijing, China, in July 2000, and the Ph.D. degree from The Chinese University of Hong Kong (CUHK) in 2004. He is currently the Head of the Department of Electrical and Electronic Engineering and a Professor with Xi'an Jiaotong-Liverpool University. Previously, he was an Associate Professor with the National Laboratory of Pattern Recognition, CASIA. He was a student of the Special Class for Gifted Youth at Xi'an Jiaotong University. From 2004 to 2007, he worked as a Research Scientist with Fujitsu R&D Centre. From 2008 to 2009, he was a Research Fellow with CUHK and a Researcher with the University of Bristol, U.K. He has published more than 120 papers, including over 50 international journal papers. His research interests include machine learning, pattern recognition, and neural information processing. He has received many awards, such as the Asia Pacific Neural Network Society Young Investigator Award in 2011.

Rui Zhang received the First-Class (Hons.) degree in telecommunication engineering from Jilin University, Changchun, China, and the Ph.D. degree in computer science and mathematics from the University of Ulster, Coleraine, U.K., in 2001 and 2007, respectively. She worked as a Research Associate with the University of Bradford and the University of Bristol, U.K., for five years. In 2012, she joined Xi'an Jiaotong-Liverpool University, where she is currently an Associate Professor. Her research interests include machine learning, data mining, and statistical analysis.

Amir Hussain (SM'97) received the B.Eng. (Highest First-Class Hons.) and Ph.D. degrees in novel neural network architectures and algorithms from the University of Strathclyde, Glasgow, U.K., in 1992 and 1997, respectively. From 1996 to 1998, he was a Research Fellow with the University of Paisley (currently, the University of the West of Scotland), Paisley, U.K. From 1998 to 2000, he was a Research Lecturer with the University of Dundee, Dundee, U.K. In 2000, he joined the University of Stirling, Stirling, U.K., where he is currently a Professor of cognitive computing and the Founding Director of the Cognitive Big Data Informatics Laboratory. He has published nearly 300 papers, including over a dozen books and 80 journal papers. His research interests include next-generation brain-inspired multimodal cognitive technology for solving complex real-world problems, such as medical and social multimedia big data analytics, learning, visualization, sentiment and opinion mining, multilingual natural language processing, personalized and preventative (e- and m-) healthcare, cognitive agent-based complex autonomous systems, cognitive hearing systems, natural multimodal human-computer interaction, and assistive technology and related clinical research. Prof. Hussain is the Founding Editor-in-Chief of Cognitive Computation and Big Data Analytics, the Chief Editor of the Springer Book Series on Socio-Affective Computing and Springer Briefs on Cognitive Computation, an Associate Editor of the IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, a Member of several Technical Committees of the IEEE Computational Intelligence Society, the Founding Publications Co-Chair of the International Neural Network Society Big Data Section, and the Chapter Chair of the IEEE U.K. and RI Industry Applications Society.