
Coclustering—a useful tool for chemometrics

Rasmus Bro a*, Evangelos E. Papalexakis b, Evrim Acar a and Nicholas D. Sidiropoulos c

Nowadays, chemometric applications in biology can readily deal with tens of thousands of variables, for instance, in omics and environmental analysis. Other areas of chemometrics also deal with distilling relevant information in highly information-rich data sets. Traditional tools such as principal component analysis or hierarchical clustering are often not optimal for providing succinct and accurate information from high rank data sets. A relatively little known approach that has shown significant potential in other areas of research is coclustering, where a data matrix is simultaneously clustered in its rows and columns (objects and variables usually). Coclustering is the tool of choice when only a subset of variables is related to a specific grouping among objects. Hence, coclustering allows a select number of objects to share a particular behavior on a select number of variables. In this paper, we describe the basics of coclustering and use three different example data sets to show the advantages and shortcomings of coclustering. Copyright © 2012 John Wiley & Sons, Ltd.

Keywords: clustering; coclustering; L1 norm; sparsity

1. INTRODUCTION

The chemometric field is dealing with increasingly complex data, for instance, in omics, quantitative structure–activity relationships, and environmental analysis. It is not uncommon to use hyphenated methods for measuring thousands of chemical compounds. This is quite different from traditional chemometric applications, for instance, in spectroscopy where the number of variables (wavelengths) may be high but the actual number of chemicals reflected in the data—the chemical rank—is typically low. Approaches such as principal component analysis (PCA) are very well suited for analyzing fairly low rank data, especially when the gathered data are known to be relevant to the problem being investigated.

Traditional clustering techniques are more useful for exploratory analyses of "classical" data. However, with the increasing number of variables being measured nowadays, there is an interesting opposite trend toward not being interested in modeling the full data. Instead, the focus is often on finding a few so-called biomarkers. A biomarker can be a specific chemical compound indicative of a pathological condition or of the intake of certain foodstuffs. Thus, even though the actual amount of data and "information" increases, the need for simplifying the visualization, interpretation, and understanding increases at the same time.

In coclustering, a data matrix is simultaneously clustered in its rows and columns (objects and variables usually). Coclustering is by no means new [11], but it has attracted considerable interest in recent years because of some algorithmic developments and its promising performance in various applications—particularly in bioinformatics [15].

One of the main advantages of coclustering is that it clusters both objects (samples) and variables simultaneously. Suppose we have a data set that shows the food intake of various items for a group of people from Belgium and Korea. In order to find the clusters in this data set, we may use a simple approach where the samples are clustered first, and subsequently, the variables are clustered. It is conceivable that the main clusters could be exactly Asian and European because, overall, the main difference in intake relates to cultural differences. Hence, clustering among samples would split the samples into these two groups. It is also conceivable that there could be another grouping because of, for example, some people preferring fish. However, because fish-related items are only a small part of the variables and fish lovers appear in both populations, such a cluster cannot be realized. On the other hand, coclustering could capture both a country and a fish cluster because it considers which samples are related with which variables at the same time rather than one modality at a time.

Hence, coclustering is the tool of choice when subsets of subjects are related with respect to corresponding subsets of variables. For some coclustering methods, it also holds that an individual subject (or variable) can belong to several (or no) clusters. This is so-called overlapping coclustering, as opposed to non-overlapping coclustering, where each variable is assigned to at most one cluster.

In the following, we describe the theory behind coclustering and subsequently exemplify coclustering on a toy data set reflecting different kinds of animals, on a data set of chromatographic measurements of olive oils, as well as on cancer gene expression data.

* Correspondence to: R. Bro, Department of Food Science, Faculty of Life Sciences, University of Copenhagen, DK-1958 Frederiksberg, Denmark. E-mail: [email protected]

a R. Bro, E. Acar: Department of Food Science, Faculty of Life Sciences, University of Copenhagen, DK-1958 Frederiksberg, Denmark

b E. E. Papalexakis: School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA

c N. D. Sidiropoulos: Department of Electrical and Computer Engineering, University of Minnesota, Minneapolis, MN, USA

Special Issue Article

Received: 21 September 2011; Revised: 3 January 2012; Accepted: 3 January 2012; Published online in Wiley Online Library: 2012 (wileyonlinelibrary.com). DOI: 10.1002/cem.1424

Page 2: Coclustering a useful tool for chemometricsepapalex/papers/cem1424.pdf · Coclustering—a useful tool for chemometrics Rasmus Broa*, Evangelos E. Papalexakisb, Evrim Acara and Nicholas

2. THEORY

We assume that our data forms a matrix X of dimensions I × J.

2.1. Coclustering with sparse matrix regression

Coclustering can be formulated as a constrained outer product decomposition of the data matrix, with sparsity on the latent factors of the bilinear model [17]. Each cocluster is represented by a rank-1 component of the decomposition. Instead of using a plain bilinear model, sparsity on the latent factors is imposed. Intuitively, latent sparsity selects the appropriate rows and columns that belong to each cocluster, rendering all other coefficients that do not belong to a certain cocluster exactly zero. Hence, each bilinear component represents a cocluster.

Mathematically, this coclustering scheme may be stated as the minimization of the following loss function:

$$\left\|\mathbf{X}-\mathbf{A}\mathbf{B}^{\mathsf T}\right\|_F^{2}+\lambda\sum_{i,k}\lvert A_{ik}\rvert+\lambda\sum_{j,k}\lvert B_{jk}\rvert$$

where A and B are matrices of size I × K and J × K, respectively; K corresponds to the number of extracted coclusters. The sum of absolute values is used as a sparsity-inducing surrogate for the number of nonzero elements, for example, see Ref. [19], and λ is a sparsity-controlling parameter.

The loss function can be interpreted as a constrained version of a bilinear model such as PCA. Rotations such as varimax [12] also aim at simplicity and sparsity, but they do so in a lossless manner, where the actual bilinear approximation of the data is left unchanged. It is merely rotated toward a simpler view that will not usually lead to real sparsity.

Doubly sparse matrix factorization as shown has been proposed earlier [13,20]. Witten et al. [20] proposed adding sparsity-inducing hard one-norm constraints on both left and right latent vectors, as a variation of sparse singular value decomposition and sparse canonical correlation analysis. Although their model was not developed with coclustering in mind, it is similar to sparse matrix regression (SMR), which uses soft one-norm penalties instead of hard constraints (and possibly non-negativity when appropriate). Algorithmically, Witten et al. [20] use a deflation algorithm that extracts one rank-1 component at a time, instead of alternating optimization across rank-1 components as in SMR.

Lee et al. [13] proposed a similar approach specifically for coclustering. However, their algorithm is not guaranteed to converge because the penalties are not kept fixed during iterations. As a result, the algorithm in Lee et al. [13] does not monotonically reduce a tangible cost function, and instabilities are not uncommon.

In Papalexakis et al. [18], a coordinate descent algorithm is proposed in order to solve the given optimization problem. More specifically, one may solve this problem in an alternating fashion, where each subproblem is basically a least absolute shrinkage and selection operator (lasso) problem [16,19]. We have to note that a global minimum for the bilinear problem may not be attained; the existing algorithms guarantee a local minimum or saddle point solution only.
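To make the alternating scheme concrete, the following is a minimal Python sketch written for illustration here; the names soft, lasso_cd, and smr_cocluster are ours, not from Ref. [18], and the authors' reference implementation remains the one available online (Section 3). With one factor fixed, every row of the other factor is a lasso problem, which cyclic coordinate descent solves by soft thresholding:

    import numpy as np

    def soft(z, t):
        # Soft thresholding: the proximal operator of the L1 penalty.
        return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

    def lasso_cd(D, Y, W0, lam, n_sweeps=50):
        # Solve min_W ||Y - D @ W||_F^2 + lam * sum(|W|) by cyclic coordinate descent.
        # D: (n, K) design; Y: (n, m) responses; W0: (K, m) warm start.
        W = W0.copy()
        R = Y - D @ W                        # running residual
        d2 = (D ** 2).sum(axis=0)            # squared column norms of D
        for _ in range(n_sweeps):
            for k in range(D.shape[1]):
                if d2[k] == 0.0:
                    continue
                Rk = R + np.outer(D[:, k], W[k])    # residual without coordinate k
                W[k] = soft(D[:, k] @ Rk, lam / 2) / d2[k]
                # Non-negative variant: additionally clip, W[k] = np.maximum(W[k], 0.0)
                R = Rk - np.outer(D[:, k], W[k])
        return W

    def smr_cocluster(X, K, lam, n_iter=100, seed=0):
        # Sketch of SMR coclustering: X ~ A @ B.T with an L1 penalty on A and B.
        # Only a local minimum or saddle point is guaranteed, as noted above.
        rng = np.random.default_rng(seed)
        I, J = X.shape
        A = rng.standard_normal((I, K))
        B = rng.standard_normal((J, K))
        for _ in range(n_iter):
            A = lasso_cd(B, X.T, A.T, lam).T    # rows of A: lasso with design B
            B = lasso_cd(A, X, B.T, lam).T      # rows of B: lasso with design A
        return A, B

In practice, one would monitor the loss for convergence and try several random starts, because only a local minimum is guaranteed.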

The SMR coclustering algorithm [18] may be characterized as a soft or fuzzy coclustering algorithm, in the sense that cocluster membership is not merely zero or one, but can be any value in between. Some rows and columns may not be assigned to any cocluster, and overlapping coclusters are allowed and can be extracted.

It follows that when sparsity is imposed to such an extent that rows and columns are completely left out, the concept of assessing residual sums of squares or fit values is not meaningful, or at least not meaningful in the same sense as for ordinary least squares fitting. Therefore, other means for evaluating the usefulness of a model are needed. Such means are described in the following section on metaparameters. Also, interpreting why certain samples or variables are left "orphan" may be useful for understanding the coclustering. This is usually an application-specific problem.

One may add non-negativity constraints to the given loss function formulation, which can be readily applied within the existing coordinate descent algorithm with minor modifications.

Although our focus here will be on SMR coclustering because of its appropriateness for chemometric applications, there are several types of coclustering models and algorithms that are popular in other areas and worth mentioning. Banerjee et al. [1,3,8] have introduced a class of coclustering algorithms that use Bregman divergences, unified in an abstract framework. Bregman coclustering is a hard coclustering technique, in the sense that it seeks to locate a non-overlapping "checkerboard" structure in the data. This type of coclustering is typically not of interest in chemometrics, where one often deals with data that contain large numbers of potentially irrelevant variables. Dhillon [7] has formulated coclustering as a bipartite graph partitioning problem, originally in the context of coclustering documents and words from a document corpus. This algorithm can also be classified as hard coclustering. In addition, this algorithm works for non-negative data only. Initial testing of various algorithms has shown that the appearance of local minima is a common problem. In fact, most hard coclustering algorithms seem to have much more pronounced problems with local minima than soft coclustering ones. Furthermore, the possible local minima in soft coclustering are often distinct (e.g., rank deficient) and hence easier to spot. Other approaches that are more distantly related are the methods presented by Damian et al. [4] and Friedman and Meulman [9], which do not account for sparsity, and the hard coclustering method of Hageman et al. [10], which uses a genetic algorithm that is sensitive to local minima.

2.2. Metaparameters

For SMR, there are certain metaparameters, that is, the penalty λ and the number of coclusters, that need to be chosen. The number of coclusters must be selected in most coclustering methods, but for SMR, which is not based on hard clustering, it is found that in many cases, the clusters are exactly or approximately nested as we increase the number of clusters. Hence, for example, for a solution with five coclusters, it is often found that the first three coclusters approximately equal the solution found using only three coclusters. The reason for this approximate nestedness is currently being investigated further. In any case, it greatly simplifies the use of the method. For hard coclustering methods, a similar behavior is naturally not observed.

In practice, the metaparameters are mostly determined in the following way: the penalty for a given number of components is chosen so that it is active. Choosing a λ that is too small would give an inactive penalty, and choosing a λ that is too big would lead to some components/coclusters with all zero values. A simple line search can be implemented to find a value of λ that is active without leading to all zeros. It is generally seen that the specific setting of λ is not critical, but of course, any automatically determined value of λ can be further refined. This has not been pursued here.


Table 1. Animal data set used to illustrate coclustering.

Animal        Eyes  Legs  Carn  Fthr  Wing  Dom  Eatn  >100kg  >2m  H2O  Ext  Dang  Life  Rand  Beak  2Leg  Speed
Giraffe        1     4     0     0     0     0    0      1     1    0    0     0     30    1     0     0     32
Cow            1     4     0     0     0     1    1      1     1    0    0     0     15    3     0     0     30
Lion           1     4     1     0     0     0    0      1     0    0    0     1     15    6     0     0     50
Gorilla        1     4     0     0     0     0    0      1     0    0    0     1     30    2     0     1     25
Fly            1     6     0     0     1     0    0      0     0    0    0     0     0.1   7     0     0      5
Spider         1     8     1     0     0     0    0      0     0    0    0     0      1    8     0     0      1
Shark          1     0     1     0     0     0    0      1     0    1    0     1     50    4     0     0     30
House          0     0     0     0     0     0    0      1     1    0    0     0    100    9     0     0      0
Horse          1     4     0     0     0     1    1      1     1    0    0     0     15    2     0     0     40
Elephant       1     4     0     0     0     0    0      1     1    0    0     0     35    6     0     0     25
Mammoth        1     4     0     0     0     0    0      1     1    0    1     0     35    5     0     0     25
Sabre Tiger    1     4     1     0     0     0    0      1     0    0    1     1     15    7     0     0     40
Pig            1     4     0     0     0     1    1      1     0    0    0     0     25    8     0     0     11
Cod            1     0     1     0     0     0    1      0     0    1    0     0     40    9     0     0      2
Eel            1     0     1     0     0     0    1      0     0    1    0     0     55    1     0     0     20
Jellyfish      1     0     0     0     0     0    0      0     0    1    0     0     0.7   3     0     0      1
Dolphin        1     0     1     0     0     0    0      1     1    1    0     0     30    5     0     0     35
Nemo           1     0     0     0     0     0    0      0     0    1    0     0      1    6     0     0      4
Shrimp         1     0     0     0     0     0    1      0     0    1    0     0      1    2     0     0    0.5
Dog            1     4     1     0     0     1    0      0     0    0    0     0     13    8     0     0     35
Cat            1     4     1     0     0     1    0      0     0    0    0     0     25    9     0     0     30
Fox            1     4     1     0     0     0    0      0     0    0    0     0     14    4     0     0     42
Wolf           1     4     1     0     0     0    0      0     0    0    0     1     18    3     0     0     25
Rabbit         1     4     0     0     0     1    1      0     0    0    0     0      9    8     0     0     35
Chicken        1     2     0     1     1     1    1      0     0    0    0     0     15    1     1     1      9
Eagle          1     2     1     1     1     0    0      0     0    0    0     0     55    3     1     1     60
Seagull        1     2     1     1     1     0    0      0     0    0    0     0     10    6     1     1     25
Blackbird      1     2     1     1     1     0    0      0     0    0    0     0     18    0     1     1     25
Bat            1     2     1     0     1     0    0      0     0    0    0     0     24    4     0     0      8
T. Rex.        1     4     1     0     0     0    0      1     1    0    1     1     40    9     0     1     25
Neanderthal    1     4     1     0     0     0    0      0     0    0    1     0     50    8     0     1     18
Triceratops    1     4     1     0     0     0    0      1     1    0    1     1     30    5     0     0     10
Man            1     4     1     0     0     0    0      0     0    0    0     0     80    2     0     1     28
Penguin        1     2     1     1     1     0    0      0     0    0    0     0     15    4     1     1     25

Legend: Eyes = has eyes; Legs = number of legs/arms; Carn = carnivore; Fthr = feather; Wing = wings; Dom = domesticized; Eatn = eaten by Caucasians; H2O = breathe under water; Ext = extinct; Dang = dangerous; Life = life expectancy; Rand = random; Beak = has a beak; 2Leg = walk on two legs; Speed = speed (MPH).


In order to determine the number of coclusters, a fairly ad hoc approach has been used. Because coclustering is used for exploratory analysis and the solution is nested, we simply extract sufficiently many components to explain the main clusters. More rigorous approaches such as cross-validation could be implemented, but we do not see the predictive ability of coclustering as a very meaningful criterion to optimize. Rather, we find that interpretability of clusters is what is often sought and what we focus on here.
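As an illustration of the λ heuristic just described, a simple sweep over candidate penalties might be sketched as follows (our illustration; choose_lambda is a hypothetical helper, and smr_cocluster refers to the sketch in Section 2.1):

    import numpy as np

    def choose_lambda(X, K, candidates):
        # Keep the largest penalty that is active (yields some exact zeros)
        # without collapsing an entire component/cocluster to all zeros.
        best = None
        for lam in sorted(candidates):
            A, B = smr_cocluster(X, K, lam)
            dead = ((np.abs(A).sum(axis=0) == 0) |
                    (np.abs(B).sum(axis=0) == 0)).any()
            if dead:
                break                      # penalty too large: a cocluster vanished
            if (A == 0).any() or (B == 0).any():
                best = lam                 # active and admissible; try larger
        return best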

3. MATERIALS AND METHODS

A toy data set is constructed for illustrating the behavior of coclustering in general. This data set shows attributes of different animals, and the data were not made particularly meticulously. Several variables are not well defined, but this is of moderate consequence in this context. Also, the data were made from the authors' point of view, for example, in terms of which animals are domesticized. In Table 1, the data set is tabulated. Note that the data also include an outlying sample (house) and an outlying variable (random).

As another example, data from Refs [5,6] are analyzed. One hundred twenty-six oil samples are analyzed by HPLC coupled to a charged aerosol detector. Of the oil samples, 68 were various types and grades of olive oils, and the remaining were either non-olive vegetable oils or non-olive vegetable oils mixed with olive oil. The HPLC method is aimed at providing a triacylglyceride profile of the oils. The triacylglycerides are known to have a distinct pattern for olive oils. The data were baseline corrected and aligned as described in the original work, and the resulting data after removal of a few outliers are shown in Figure 1.

As a final data set, we looked at a typical gene expression data set. A total of 56 samples were selected from a cohort of lung cancer patients assayed by using the Affymetrix 95av2 GeneChip brand oligonucleotide array. The 56 patients represent four distinct histological types: normal lung, pulmonary carcinoid tumors, colon metastases, and small cell carcinoma. The data have been described in several publications [2,14] and also using coclustering [13]. The original data set contains 12 625 genes. Unlike most publications, no pre-selection to reduce the number of genes is performed here. Rather, coclustering is applied directly on the data. The data set holds information on 56 patients, of which 20 are pulmonary carcinoid samples, 13 colon cancer metastasis samples, 17 normal lung samples, and 6 small cell carcinoma samples. The data set is fairly easy to cluster into these four groups.

The data and the algorithm can be found at www.models.life.ku.dk (January 2012).

4. RESULTS

4.1. Looking at the animal data set

It is interesting to investigate the outcome of a simple PCA model on the auto-scaled animal data. In Figure 2, a score plot of the first two components of a PCA model is shown. Component 1 seems to reflect birds, which is verified from the loading vector that has high values for the variables feather, wings, has a beak, and walk on two legs. Component 2, though, is difficult to interpret and seems to reflect a mix of different properties. This is also apparent from the loading plot.

Looking at components 3 and 4 (Figure 3), similar complications arise in interpreting the meaning of different components. All but the first component reflect several phenomena in a contrast fashion, and often, it is difficult to extract and distinguish the important variation.

Turning to SMR, a model is fitted using six coclusters. Similar results are obtained with different numbers of coclusters, but we chose six here to exemplify the results. The data are scaled, not centered, and non-negativity is imposed. It is possible to plot the resulting components/clusters as ordinary PCA components in scatter or line plots. However, the semi-discrete nature of the clusters sometimes makes such visualizations less efficient. Instead, we have developed a plot where each cluster is shown by labels of all samples and variables larger than a threshold. This threshold was set to 20% of maximum but was inactive here because all elements smaller than 20% of maximum were exactly zero. Furthermore, the size of the label indicates the size of the element. This provides an intuitive visualization, as shown in Figure 4 for the six-cocluster SMR model.

It is striking how easy it is to assess the meaning of this model compared with the PCA model. Looking at the coclusters one at a time, it is observed that cocluster 1 is a bird cocluster. Cocluster 2 is given by one variable (extinct) and is evident. Cocluster 3 comprises big animals. Note how several samples in coclusters 2 and 3 coincide. Animals in cocluster 4 are "grown" and eaten by people, and cocluster 5 captures animals living in water. Finally, cocluster 6 is too dense to allow an easy interpretation. It is apparently a cocluster relating to the overall variation and is in this sense taking care of the offsets induced by the lack of centering.
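The label-selection rule behind Figure 4 is straightforward to express in code; the following sketch shows the thresholding only, and cluster_members is our own, hypothetical name:

    import numpy as np

    def cluster_members(weights, names, frac=0.20):
        # Keep labels whose weight exceeds 20% of the cluster maximum,
        # mirroring the (here inactive) threshold used for Figure 4;
        # the returned weight would set the label font size.
        cut = frac * np.abs(weights).max()
        return [(n, w) for n, w in zip(names, weights) if abs(w) > cut]

For cocluster k of an SMR model (A, B), cluster_members(A[:, k], animal_names) and cluster_members(B[:, k], variable_names) list the retained samples and variables.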

Figure 1. Preprocessed chromatographic data (detector response versus time [au]).


There is a dramatic difference in how easy it is to visualize the results of PCA and SMR, but the data set is simple in the sense that there are no significant amounts of irrelevant variation. In order to see how SMR can deal with irrelevant variation, 30 random variables (uniformly distributed) were added to the original 17 variables. The data were scaled such that each variable had unit variance, and SMR was performed.
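This augmentation amounts to the following (a sketch, assuming the animal data are held in a NumPy array X and that smr_cocluster is the sketch from Section 2.1):

    import numpy as np

    rng = np.random.default_rng(0)
    X_aug = np.hstack([X, rng.uniform(size=(X.shape[0], 30))])  # add 30 irrelevant variables
    X_aug = X_aug / X_aug.std(axis=0, ddof=1)                   # unit variance, no centering
    A, B = smr_cocluster(X_aug, K=7, lam=1.0)                   # lam found by the line search above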

In Figure 5, it is seen that the method very nicely distinguishes between the animal-related information and the random variables. All coclusters but cocluster 7 are easy to interpret. Cocluster 7 is not sparse at all—it comprises almost all variables and all samples. Also, note that the remaining coclusters are not identical to the coclusters found before, but they are indeed fairly similar.

4.2. Olive oils

For the olive oil data set, a nice separation is achieved with three coclusters. Adding more does not seem to change the coclusters obtained in the three-cocluster model, and the added coclusters are not immediately meaningful. In Figure 6, it is seen that cocluster 1 reflects olive oils, whereas cocluster 2 reflects non-olive oils. The mixed samples containing some olive oil are placed in between. The third cocluster seems to reflect only a fraction of the olive oils. This is likely related to the olive oils being a very diverse class of samples, spanning from pomace to extra virgin oil. The corresponding elution profiles of each cluster are meaningful. The first (olive oil) cocluster has peaks around 300 and 400 (arbitrary units), and those peaks represent the main olive oil triacylglycerides (triolein, 1,2-olein-3-palmitin, and 1,2-olein-3-linolein). Likewise, the non-olive oil cocluster represents trilinolein, 1,2-linolein-3-olein, and 1,2-linolein-3-palmitin, which are frequent in non-olive oils. It is satisfying to see that the olive oil samples are clustered together, as desired, even though SMR is an unsupervised approach that does not use any prior or side information.

The results obtained with coclustering are not too different from what would be obtained with PCA. In fact, it is somewhat disturbing that there is a distinct lack of sparsity. Although the model makes sense from a chemical point of view, little sparsity is seen, for example, in loadings 1 and 2 (on the other hand, loading 3 is sparse, and so are scores 2 and 3 to a certain extent). As described in the theory section, the magnitude of the L1 penalty is automatically chosen, but it turns out that it is not possible to obtain more sparsity than shown here.

Figure 2. Top: score plot of PCA model (PC 1, 30.26%, versus PC 2, 18.30%), with samples labeled by animal. Bottom: corresponding loading plot with variables labeled.

Figure 3. Scores 3 (13.15%) and 4 (8.87%) from a PCA model, plotted against sample number.


Manually increasing λ leads to a model where one component/cocluster turns all zero and hence rank deficient. This points to a problem with the current coclustering approach. Because λ is the same for both the row and the column mode, problems or lack of sparsity may occur when the modes are quite different in dimension. The lack of sparsity is likely caused by the strong collinearity as well as by the lack of intrinsic sparsity in this type of data. It is questionable if coclustering as defined here is a suitable model for spectral-like data such as these. A more suitable approach could be an elastic net-type coclustering [21], which would allow the natural collinearities to be represented in the clusters; a possible form of such an objective is sketched below. This seems like an interesting research direction.
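One way such an elastic net-type objective could be written, following Ref. [21] (our sketch, not a formulation given in this paper), is:

$$\left\|\mathbf{X}-\mathbf{A}\mathbf{B}^{\mathsf T}\right\|_F^{2}+\lambda_1\Big(\sum_{i,k}\lvert A_{ik}\rvert+\sum_{j,k}\lvert B_{jk}\rvert\Big)+\lambda_2\Big(\left\|\mathbf{A}\right\|_F^{2}+\left\|\mathbf{B}\right\|_F^{2}\Big)$$

where the additional ridge term weighted by $\lambda_2$ lets groups of strongly collinear variables enter a cocluster together, rather than the lasso picking a single representative.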

Note that for this particular data set, it would be possible to integrate the chromatographic peaks and thereby obtain discrete data that would be more suitable for coclustering.

Figure 4. Sparse matrix regression coclusters of animal data. Font size indicates "belongingness" to the cluster. The six panels list the samples and variables retained in each cocluster:
Cluster 1: Chicken, Eagle, Seagull, Blackbird, Penguin; variables: Feather, Wings, Has a beak, Walk on two legs.
Cluster 2: Mammoth, Sabre Tiger, T. Rex., Neanderthal, Triceratops; variable: Extinct.
Cluster 3: Giraffe, Cow, House, Horse, Elephant, Mammoth, Dolphin, T. Rex., Triceratops; variables: >100 kg, >2 m.
Cluster 4: Cow, Horse, Pig, Cod, Eel, Shrimp, Dog, Cat, Rabbit, Chicken; variables: Domesticized, Eaten by Caucasians.
Cluster 5: Shark, Cod, Eel, Jellyfish, Dolphin, Nemo, Shrimp; variable: Breathe under water.
Cluster 6: nearly all animals; variables: Has eyes, Number of legs/arms, Carnivore, >100 kg, Dangerous, Life expectancy, Random, Walk on two legs, Speed (MPH).

Figure 5. Sparse matrix regression coclusters with 30 random variables added to the data. The seven panels list the samples and variables retained in each cocluster:
Cluster 1: Chicken, Eagle, Seagull, Blackbird, Penguin; variables: Feather, Wings, Has a beak, Walk on two legs, Random.
Cluster 2: Shark, Cod, Eel, Jellyfish, Dolphin, Nemo, Shrimp; variables: Eaten by Caucasians, Breathe under water.
Cluster 3: Lion, Gorilla, Shark, Mammoth, Sabre Tiger, Wolf, T. Rex., Neanderthal, Triceratops; variables: >100 kg, Extinct, Dangerous.
Cluster 4: Cow, Horse, Pig, Cod, Shrimp, Dog, Cat, Rabbit, Chicken; variables: Domesticized, Eaten by Caucasians.
Cluster 5: Fly, Bat; variables: Wings, Random.
Cluster 6: Giraffe, Cow, House, Horse, Elephant, Mammoth, Dolphin, T. Rex., Triceratops; variables: >100 kg, >2 m.
Cluster 7: all animals; variables: Has eyes, Number of legs/arms, Carnivore, >100 kg, Life expectancy, Speed (MPH), and most of the random variables.


The intention, though, with the given example is to illustrate the behavior of coclustering on continuous data.

4.3. Cancer

When analyzing the gene expression data, the four different cancer types come out immediately when we fit a four-cocluster model, as shown in Figure 7, where the four cancer classes are color coded. It is apparent that the four cancer classes are perfectly clustered, but it is also apparent that the gene mode shows little sparsity in comparison with the patient mode. Hence, coclustering does not provide the sparsity desired in order to be able to talk meaningfully of specific biomarkers.

Performing a PCA on the same data (auto-scaled) provides a very clear grouping into the four cancer types (not shown). The separation is not perfect as in Figure 7, but the tendency is very clear. Lee et al. [13] also performed coclustering with an algorithm similar to the SMR algorithm. The coclustering in the sample space that they obtained resembles the one obtained using PCA more than the distinct coclustering obtained in Figure 7. This, however, can be explained by the fact that penalties are chosen differently by Lee et al., using a Bayesian information criterion.

Figure 7. Sparse matrix regression coclusters of cancer data color-coded according to cancer class. Each of the four panels lists the samples and the genes assigned to one cocluster.

Figure 6. Three sparse matrix regression clusters are shown. Top plots show sample clusters (cluster belongingness versus sample number; samples marked as non-olive, olive, or mix), and bottom plots show elution time clusters (cluster belongingness versus retention time).


Regardless, as also observed with the SMR algorithm, the algorithm of Lee et al. produces solutions that are not as sparse as expected in the gene mode.

5. CONCLUSION

The basic principles behind coclustering have been explained, and a new model and algorithm have been favorably compared with common methods such as PCA. It is shown that coclustering can provide meaningful and easily interpretable results on both fairly simple and complex data compared with more traditional approaches. Limitations were encountered when the number of irrelevant variables grew too high and when spectral-like data are analyzed. More elaborate algorithms need to be developed for handling such situations.

Acknowledgements

N. Sidiropoulos was supported in part by ARO grant W911NF-11-1-0500.

REFERENCES

1. Banerjee A, Merugu S, Dhillon IS, Ghosh J. Clustering with Bregman divergences. J. Mach. Learn. Res. 2005; 6: 1705–1749.
2. Bhattacharjee A, Richards WG, Staunton J, Li C, Monti S, Vasa P, Ladd C, Beheshti J, Bueno R, Gillette M, Loda M, Weber G, Mark EJ, Lander ER, Wong W, Johnson BE, Golub TR, Sugarbaker DJ, Meyerson M. Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses. Proc. Natl. Acad. Sci. U.S.A. 2001; 98: 13790–13795.
3. Cho H, Dhillon IS, Guan Y, Sra S. Minimum sum-squared residue co-clustering of gene expression data. Proceedings of the Fourth SIAM International Conference on Data Mining 2004; 114–125.
4. Damian D, Oresic M, Verheij E, Meulman J, Friedman J, Adourian A, Morel N, Smilde A, van der Greef J. Applications of a new subspace clustering algorithm (COSA) in medical systems biology. Metabolomics 2007; 3: 69–77.
5. de la Mata-Espinosa P, Bosque-Sendra JM, Bro R, Cuadros-Rodriguez L. Discriminating olive and non-olive oils using HPLC-CAD and chemometrics. Anal. Bioanal. Chem. 2011; 399: 2083–2092.
6. de la Mata-Espinosa P, Bosque-Sendra JM, Bro R, Cuadros-Rodriguez L. Olive oil quantification of edible vegetable oil blends using triacylglycerols chromatographic fingerprints and chemometric tools. Talanta 2011; 85: 177–182.
7. Dhillon IS. Co-clustering documents and words using bipartite spectral graph partitioning. Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 2001; 269–274.
8. Dhillon IS, Mallela S, Modha DS. Information-theoretic co-clustering. Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 2003; 89–98.
9. Friedman JH, Meulman JJ. Clustering objects on subsets of attributes. J. Roy. Stat. Soc. B Stat. Meth. 2004; 66: 815–849.
10. Hageman JA, van den Berg RA, Westerhuis JA, van der Werf MJ, Smilde AK. Genetic algorithm based two-mode clustering of metabolomics data. Metabolomics 2008; 4: 141–149.
11. Hartigan JA. Direct clustering of a data matrix. J. Am. Stat. Assoc. 1972; 67: 123–129.
12. Kaiser HF. The varimax criterion for analytic rotation in factor analysis. Psychometrika 1958; 23: 187–200.
13. Lee M, Shen H, Huang JZ, Marron JS. Biclustering via sparse singular value decomposition. Biometrics 2010; 66: 1087–1095.
14. Liu Y, Hayes DN, Nobel A, Marron JS. Statistical significance of clustering for high-dimension, low-sample size data. J. Am. Stat. Assoc. 2008; 103: 1281–1293.
15. Madeira SC, Oliveira AL. Biclustering algorithms for biological data analysis: a survey. IEEE/ACM Trans. Comput. Biol. Bioinform. 2004; 1: 24–45.
16. Osborne MR, Presnell B, Turlach BA. On the LASSO and its dual. J. Comput. Graph. Stat. 2000; 9: 319–337.
17. Papalexakis EE, Sidiropoulos ND. Co-clustering as multilinear decomposition with sparse latent factors. 2011 IEEE International Conference on Acoustics, Speech and Signal Processing, Prague, Czech Republic, 2011.
18. Papalexakis EE, Sidiropoulos ND, Garofalakis MN. Reviewer profiling using sparse matrix regression. 2010 IEEE International Conference on Data Mining Workshops, 2010; 1214–1219.
19. Tibshirani R. Regression shrinkage and selection via the lasso. J. Roy. Stat. Soc. B 1996; 58: 267–288.
20. Witten DM, Tibshirani R, Hastie T. A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics 2009; 10: 515–534.
21. Zou H, Hastie T. Regularization and variable selection via the elastic net. J. Roy. Stat. Soc. B 2005; 67: 301–320.
