Regression Clustering

In regression clustering, we assume a model of the form y = f_g(x, θ_g) + ε_g for observations y and x in the gth group.

Usually, of course, we assume linear models of the form y = xᵀβ_g + ε_g, and assume ε_g ∼ N(0, σ_g²) and that observations are mutually independent.

The distribution of the error term allows us to formulate a likelihood, and this provides the necessary quantities for the EM method.
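
To make the setup concrete, the following minimal R sketch (with purely illustrative parameter values, not taken from the lecture) simulates data from a two-group mixture of linear regressions of this form.

  ## Simulate data from a two-group mixture of linear regressions:
  ## y = x * beta_g + eps_g,  eps_g ~ N(0, sigma_g^2),  g = 1, 2.
  set.seed(1)
  n <- 200
  g <- sample(1:2, n, replace = TRUE, prob = c(0.6, 0.4))  # true group labels
  x <- runif(n, 0, 10)
  beta  <- c(1.0, 3.0)       # slope in each group (illustrative)
  sigma <- c(0.5, 1.0)       # error s.d. in each group (illustrative)
  y <- beta[g] * x + rnorm(n, sd = sigma[g])
  dat <- data.frame(x = x, y = y)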

EM Methods

While an EM method is relatively easy to program, the R package flexmix developed by Friedrich Leisch (2004) provides a simple interface for an EM method for various kinds of regression models.

The package allows models of different forms for each group. It uses the classes and methods of R and so is very flexible.

The M-step is viewed as a fitting step, and the structure of the package makes it a relatively simple matter to use a different fitting method, such as constrained or penalized regression.
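
A minimal sketch of the flexmix interface, assuming the simulated data frame dat from the sketch above; the default model driver fits Gaussian linear regressions by EM.

  library(flexmix)

  ## Fit a two-component mixture of linear regressions by EM.
  fit <- flexmix(y ~ x, data = dat, k = 2)

  summary(fit)          # component sizes and log-likelihood
  parameters(fit)       # fitted coefficients and sigma in each component
  head(clusters(fit))   # hard cluster assignments
  head(posterior(fit))  # group-inclusion probabilities (E-step weights)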

Variable Selection within the Groups

As a practical matter, it is generally convenient to fit a model of the same form and with the same covariates within each group. A slight modification is to select different covariates within each group. Under the usual setup with models of the form y = xᵀβ_g + ε_g, this has no effect on the formulation of the likelihood, but it does introduce the additional step of variable selection in the M-step.

One way of approaching this is to substitute a lasso fit for the usual LS (i.e., ML) fit. The R package lasso2, developed by Lokhorst, Venables, and Turlach (2006), provides a tuning parameter for the lasso fit that drives insignificant coefficients to zero. Alternatively, a lars approach coupled with use of the L-curve could be used to determine a fit.

Other Variations on the M-Step

Rather than viewing the M-step as part of a procedure to maximize a likelihood, we can view it as a step to fit a model using any criterion.

This, of course, changes the basic approach to finding groups in data based on regression models, so that it is no longer based on MLE.

Even while using an EM method, the approach now is based on a heuristic notion of good fits of individual models and clustering the observations based on best fits.

Clustering Based on Closeness

Elements within a cluster are close to each other.

If we define a distance as a dissimilarity of a given element to some overall measure of a given cluster, the clustering problem is to minimize the dissimilarities within clusters.

In some cases it is useful to consider fuzzy cluster membership, but in the following we will assume that each observation is in exactly one cluster.

We denote the dissimilarity of the observation y_i to the other members of the gth cluster as d_g(y_i), and we define d_g(y_i) = 0 if y_i is not in the gth cluster.

Clustering Based on Closeness

A given clustering is effectively a partitioning of the dataset. We denote the partition by P, which is a collection of disjoint sets of indices whose union is the set of all indices in the sample, P = {P_1, . . . , P_k}.

The sum of the discrepancies,

f(P) = ∑_{g=1}^{k} ∑_{i=1}^{n} d_g(y_i),

is a function of the clustering.

For a fixed number of clusters, k, this is the objective function to be minimized with respect to the partitioning P.

This is the basic idea of k-means clustering.
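
A small R sketch of this objective function, assuming squared Euclidean distance to the cluster mean as the discrepancy d_g; the cluster_objective helper is introduced here only for illustration.

  ## Objective f(P) = sum_g sum_{i in P_g} d_g(y_i), here with
  ## d_g(y_i) = (y_i - mean of cluster g)^2, the k-means discrepancy.
  cluster_objective <- function(y, labels) {
    sum(sapply(unique(labels), function(g) {
      yg <- y[labels == g]
      sum((yg - mean(yg))^2)
    }))
  }

  ## Example: objective value for a k-means partition of the simulated responses.
  labels <- kmeans(dat$y, centers = 2)$cluster
  cluster_objective(dat$y, labels)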

Clustering Based on Closeness

Singleton clusters need special consideration, as does the number of clusters, k.

Depending on how we define the discrepancies, and how we measure the “discrepancy” attributable to a singleton cluster, we could incorporate choice of k into the objective function.

Clusters of Models

For y in the gth group, the discrepancy is a function of the observed y and its predicted or fitted value,

d_g(y_i) = h_g(y_i, f_g(x_i, θ_g)),

where h_g(y_i, ·) = 0 if y_i is not in the gth cluster.

In many cases,

h_g(y_i, f_g(x_i, θ_g)) = h_g(y_i − f_g(x_i, θ_g));

that is, the discrepancy is a function of the difference between the observed y and its fitted value.

Measures of Dissimilarity

The measure of dissimilarity is a measure of the distance of a given observation to the “center” of the group of which it is a member.

There are two aspects to measures of dissimilarity:

• the type of centers — mean, median, harmonic mean

• the type of distance measure

The distance measure is usually a norm, and most often an Lp norm — L1, L2, L∞.

K-Means Type Methods

In K-means clustering, the objective function is

f(P) = ∑_{g=1}^{k} ∑_{i∈P_g} ‖y_i − ȳ_g‖²,

where ȳ_g is the mean of the observations in the gth group.

In K-models clustering, ȳ_g is replaced by f_g(x, θ̂_g).
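
A bare-bones sketch of a K-models iteration under these definitions: observations are reassigned to the model with the smallest squared residual, and each group's regression is then refit. The k_models helper is hypothetical, with no handling of empty groups or ties.

  ## Alternate hard assignment by smallest squared residual with per-group LS refits.
  k_models <- function(x, y, k, iters = 20) {
    labels <- sample(1:k, length(y), replace = TRUE)        # random start
    for (it in seq_len(iters)) {
      fits <- lapply(1:k, function(g) lm(y ~ x, subset = labels == g))
      ## squared residual of every observation under every group's model
      resid2 <- sapply(fits, function(f) (y - predict(f, data.frame(x = x)))^2)
      labels <- apply(resid2, 1, which.min)                 # reassign to best-fitting model
    }
    list(labels = labels, fits = fits)
  }

  km <- k_models(dat$x, dat$y, k = 2)
  table(km$labels)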

The data shown are similar to some astronomical data from the Sloan Digital Sky Survey (SDSS). The data are two measures of absolute brightness of a large sample of celestial objects. An astronomer had asked for our help in analyzing the data. (The data in the figure are not the original data; that dataset was massive and would not show well in a simple plot.)

The astronomer wanted to fit a regression of one measure on the other.

We could fit some model to the data, of course, but the question is what kind of model? Four possibilities are

• straight line

• curved line (polynomial? exponential?)

• segmented straight lines

• overlapping functions

Objectives

As in any data analysis, we must identify and focus on the objective. If the objective is prediction of one variable given another, some kind of single model would be desirable.

Adopting a more appropriate attitude toward the problem, however, we see that there is something more fundamental going on.

It is clear that if we are to have any kind of effective regression model, we need another independent variable. We might ask whether there are groups of different types of objects, as suggested by the different models for different subsets of the data.

We could perhaps cluster the data based on model fits. Then, if we really want a single regression model, a cluster identifier variable could allow us to have one.

Clusters

We can take a purely data-driven approach to defining clusters. From this standpoint, clusters are clusters because

• the elements within a cluster are closer to one another, or they are dense,

• the elements within a cluster follow a common distribution, or

• the variables (attributes) in all elements of a cluster have similar relationships among each other.

In an extension of a data-driven approach, we may identify clusters based on some relationship among the variables. The relationship is expressed as a model, perhaps a linear regression model. In this sense, the clusters are “conceptual clusters”. The clusters are clusters because a common model fits their elements.

Issues in Clustering

Although we may define a clustering problem in terms of a finite mixture distribution, clustering problems are often not built on a probability model. The clustering problem is usually defined in terms of an objective function to minimize, or in terms of the algorithm that solves the problem.

In most mixture problems we have an issue of identifiability. The meanings of the group labels cannot be determined from the data, so any solution can be unique only up to permutations of the labels.

Another type of identifiability problem arises if the groups are not distinct (or, in practice, not sufficiently distinct). This is similar to an over-parameterized model.

Clusters of Models

In regression modeling, we treat one variable as special and treat the other variables as covariates; that is, in addition to the variable of interest y, there is an associated vector x of all other relevant variables. (The variable of interest may also be vector-valued, of course.) The regression models have the general form y = f(x, θ) + ε.

To allow the models to be different in different clusters, we may denote the systematic component of the model in the gth group as f_g(x, θ_g). This notation allows retention of the original labels of the dataset.

Approaches

There are essentially two ways of approaching the problem. They arise from slightly different considerations of why clusters are clusters. These are based on combining the notion of similar relationships among the variables with

• the property of a common probability distribution,

or else with

• the property of closeness or density of the elements.

If we assume a common probability distribution for the random component of the models, we can write a likelihood, conditional on knowing the class of each observation.

From the standpoint of clusters defined by closeness, we have an objective function that involves norms of residuals.

Clustering Based on a Probability Distribution Model

If the number of clusters is fixed to be k, say, and if the data in each cluster are considered to be a random sample from a given family of probability distributions, we can formulate the clustering problem as a maximum likelihood problem.

For a mixture of k distributions, if the PDF of the jth distribution is p_j(x; θ_j), the PDF of the mixture is

p(x; θ) = ∑_{j=1}^{k} π_j p_j(x; θ_j),

where π_j ≥ 0 and ∑_{j=1}^{k} π_j = 1.

The k-vector π gives the unconditional probabilities of the components for a random variable from the mixture.
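
A one-function R sketch of such a mixture density for normal components, with illustrative weights and parameters.

  ## Density of a k-component normal mixture: p(y) = sum_j w_j * N(y; mu_j, sigma_j^2)
  dmix <- function(y, w, mu, sigma) {
    rowSums(sapply(seq_along(w), function(j) w[j] * dnorm(y, mu[j], sigma[j])))
  }

  ## Illustrative parameter values.
  dmix(c(-1, 0, 4), w = c(0.3, 0.7), mu = c(0, 5), sigma = c(1, 2))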

EM Methods

If we consider each observation to have an additional variable that is not observed, we are led to the classic EM formulation of the mixture problem. We define k 0-1 dummy variables to indicate the group to which an observation belongs. These dummy variables are the missing data in the EM formulation of the mixture problem. The complete data in each observation are C = (Y, U, x), where Y is an observed random variable, U is an unobserved random variable, and x is a (vector) observed covariate.

The E-step yields conditional expectations of the dummy variables. For each observation, the conditional expectation of a given dummy variable can be interpreted as the provisional probability that the observation is from the population represented by that dummy variable.

The M-step yields an optimal fit of the model in each group, using the group-inclusion probabilities as weights.
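
A bare-bones EM sketch along these lines for a k-component mixture of linear regressions: the E-step computes the provisional membership probabilities, and the M-step does weighted least-squares fits. The em_regclust helper is hypothetical (not the flexmix implementation) and omits convergence checks and safeguards against degenerate components.

  em_regclust <- function(x, y, k, iters = 50) {
    n <- length(y)
    w <- matrix(runif(n * k), n, k)
    w <- w / rowSums(w)                      # random initial group-inclusion weights
    for (it in seq_len(iters)) {
      ## M-step: weighted LS fit and error s.d. within each component
      fits <- lapply(1:k, function(g) lm(y ~ x, weights = w[, g]))
      sig  <- sapply(1:k, function(g)
        sqrt(sum(w[, g] * residuals(fits[[g]])^2) / sum(w[, g])))
      prior <- colMeans(w)                   # mixing proportions
      ## E-step: provisional probabilities of group membership
      dens <- sapply(1:k, function(g)
        prior[g] * dnorm(y, mean = fitted(fits[[g]]), sd = sig[g]))
      w <- dens / rowSums(dens)
    }
    list(weights = w, fits = fits, prior = prior, sigma = sig)
  }

  res <- em_regclust(dat$x, dat$y, k = 2)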

Classification Variables

The conditional expectations of the 0-1 dummy variables can be viewed as probabilities that each observation is in the group represented by the dummy variable. There are two possible ways of treating the dummy classification variables viewed as probabilities.

One way is to use these values as weights in fitting the model at each step. This way usually results in less variability across the EM steps.

The other way is to assign each observation to a single group; this may be necessary if it is not practical to use a weighted fit in the M-step. If, at the conclusion of the EM computations, each observation is assigned to the group corresponding to the dummy variable with the largest associated conditional expectation, we can view this as maximizing a “classification likelihood” (see Fraley and Raftery, 2002).

We could also use the conditional expectations of the dummy variables as probabilities for a random assignment of each observation to a single group if a weighted fit is not practical.
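
In code, with the posterior matrix from a flexmix fit such as the one sketched earlier, the two hard-assignment options look roughly like this.

  post <- posterior(fit)            # n x k matrix of group-inclusion probabilities
  hard <- apply(post, 1, which.max) # assign to the most probable component
  table(hard)

  ## Alternatively, random assignment with the posteriors as probabilities.
  rand <- apply(post, 1, function(p) sample(seq_along(p), 1, prob = p))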

Fuzzy Membership

Interpreting the conditional expected values of the classification variables as probabilities naturally leads to the idea of fuzzy group membership. In the case of only two groups, we may separate the observations into three sets: two sets corresponding to the two groups, and one set that is not classified.

This would be based on some threshold value, α > 0.5. If the conditional expected value of a classification variable is greater than α, the observation is put in the cluster corresponding to that variable; otherwise, the observation is not put in either cluster.
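
A small sketch of this thresholding rule, with an illustrative value of α and the posterior matrix from above.

  alpha <- 0.8                                  # illustrative threshold, alpha > 0.5
  top   <- apply(post, 1, which.max)
  ptop  <- apply(post, 1, max)
  fuzzy <- ifelse(ptop > alpha, top, NA)        # NA = left unclassified
  table(fuzzy, useNA = "ifany")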

In the case of more than two clusters, the interpretation of the classification variables can be extended to represent likely membership in some given cluster, membership in two given clusters, or in some combination of any number of clusters. If the likely cluster membership is dispersed among more than two or three clusters, however, it is probably best just to leave that observation unclustered. There are other situations, such as with outliers from all models, in which it may be best to leave an observation unclustered.

Issues with an EM Method

The EM method is based on a rather strong model assumption so that a likelihood can be formulated. We can take a more heuristic approach, however, and merely view the M-step as model fitting using any reasonable objective function. Instead of maximizing an identified likelihood, we could perform a model fit by minimizing some norm of the residuals, whether or not this corresponds to maximization of a likelihood.

There are other problems that often occur in the use of EM methods. A common one is that the method may be very slow to converge. Another major problem in applications such as mixtures is that there are local optima. This particular problem has nothing to do with EM per se, but rather with any method we may use to solve the problem. Whenever local optima may be present, there are two standard ways of addressing the problem. One is to use multiple starting points, and the other is to allow an iteration to go in a suboptimal direction. The only one of these approaches that is applicable in model-based clustering is the use of multiple starting points. We did not explore this approach in the present research.
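
With flexmix, multiple starting points can be tried through stepFlexmix, which reruns the EM several times from random starts and keeps the run with the largest log-likelihood; a minimal sketch, assuming the data frame dat from earlier.

  library(flexmix)

  ## nrep random starting points; the best run (by log-likelihood) is returned.
  em_best <- stepFlexmix(y ~ x, data = dat, k = 2, nrep = 10)
  summary(em_best)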

Regression Clustering

In regression clustering, we assume a model of the form y = f_g(x, θ_g) + ε_g for observations y and x in the gth group.

Usually we assume linear models of the form y = xᵀβ_g + ε_g, and assume that ε_g ∼ N(0, σ_g²) and that observations are mutually independent.

The distribution of the error term allows us to formulate a likelihood, and this provides the necessary quantities for the EM method.

EM Methods

While an EM method is relatively easy to program, the R package flexmix developed by Leisch (2004) provides a simple interface for an EM method for various kinds of regression models. In our experience with the EM methods as implemented in this package, we rarely had problems with the EM methods being slow to converge in the clustering applications. We also did not find that they were particularly sensitive to the starting values (see Li and Gentle, 2007).

The M-step is viewed as a fitting step, and the structure of the package makes it a relatively simple matter to use a different fitting method, such as constrained or penalized regression.

Models with Many Covariates

Models with many covariates are more interesting. In such cases, however, it is likely that different sets of covariates are appropriate for different groups.

Use of all covariates would lead to overparametrized models, and hence to fits with larger variance. While this may still result in an effective clustering, it would seriously degrade the performance of any classification scheme based on the fits.

Variable Selection within the Groups

As a practical matter, it is generally convenient to fit a model of the same form and with the same covariates within each group. A slight modification is to select different covariates within each group. Under the usual setup with models of the form y = xᵀβ_g + ε_g, this has no effect on the formulation of the likelihood, but it does introduce the additional step of variable selection in the M-step.

Although models with different sets of independent variables can be incorporated in the likelihood, the additional step of variable selection can present problems of computational convergence, as well as major analytic problems.

For variable selection in regression clustering, we need a procedure that is automatic.

Penalized Likelihood for Variable Selection within the Groups

A lasso fit for variable selection can be inserted naturally in the M-step of the EM method; that is, instead of the usual least squares fit, which corresponds to maximum likelihood in the case of a known model with normally distributed errors, we minimize

‖u_ig (y_i − x_iᵀ b_g)‖² + λ ‖b_g‖₁,

where the u_ig are the group-inclusion weights from the E-step. We could interpret this as maximizing a penalized likelihood.

Rather than viewing the M-step as part of a procedure to maximize a likelihood, we can view it as a step to fit a model using any reasonable criterion. This, of course, changes the basic approach to finding groups in data based on regression models, so that it is no longer based on maximum likelihood estimation, but the same upper-level computational methods can be used.
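
A sketch of such a penalized M-step. The slides mention lasso2; here glmnet is used only as a stand-in for the lasso fit, with the group-inclusion weights supplied as observation weights and an illustrative fixed λ.

  library(glmnet)

  ## One penalized M-step: weighted residual sum of squares plus an L1 penalty.
  m_step_lasso <- function(X, y, u_g, lambda = 0.1) {
    ## X: matrix of candidate covariates (two or more columns)
    ## u_g: group-inclusion weights for group g
    fit <- glmnet(X, y, family = "gaussian", alpha = 1,
                  weights = u_g, lambda = lambda)
    coef(fit)   # some coefficients are driven exactly to zero
  }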

Alternatively, a lars approach coupled with use of the L-curve could be used to determine a fit, but either way, the lasso fit often yields models with some fitted coefficients exactly 0.

The use of the lasso of course biases the estimators of the selected variables downward. The overall statistical properties of the variable selection procedure are not fully understood. Lasso fitting seems useful within the EM iterations, however.

At the end, the variables selected within the individual groups can be fitted by regular, that is, nonpenalized, least squares.

Even while using an EM method, the approach now is based on a heuristic notion of good fits of individual models and clustering of the observations based on best fits.

Clustering Based on Closeness

The idea of forming clusters based on model fits leads us to the general idea of clustering based on closeness to a model “center”. Elements within a cluster are close to each other.

If we define a distance as a dissimilarity of a given element to some overall measure of a given cluster, the clustering problem is to minimize the dissimilarities within clusters.

In some cases it is useful to consider fuzzy cluster membership, but in the following we will assume that each observation is in exactly one cluster.

We denote the dissimilarity of the observation y_i to the other members of the gth cluster as d_g(y_i), and we define d_g(y_i) = 0 if y_i is not in the gth cluster.

A given clustering is effectively a partitioning of the dataset. We denote the partition by P, which is a collection of disjoint sets of indices whose union is the set of all indices in the sample, P = {P_1, . . . , P_k}.

The sum of the discrepancies,

f(P) = ∑_{g=1}^{k} ∑_{i=1}^{n} d_g(y_i),

is a function of the clustering.

For a fixed number of clusters, k, this is the objective function to be minimized with respect to the partitioning P. This of course is the basic idea of k-means clustering.

In any kind of clustering method, singleton clusters need special consideration. Such clusters may be more properly considered as outliers, and their numbers do not contribute to the total count of the number of clusters, k.

The number of clusters is itself an important characteristic of the problem. In some cases our knowledge of the application may lead to a known number of clusters, or at least it may lead to an appropriate choice of k. Depending on how we define the discrepancies, and how we measure the “discrepancy” attributable to a singleton cluster, we could incorporate choice of k into the objective function.

Clusters of Models

For y in the gth group, the discrepancy is a function of the observed y and its predicted or fitted value,

d_g(y_i) = h_g(y_i, f_g(x_i, θ_g)),

where h_g(y_i, ·) = 0 if y_i is not in the gth cluster.

In many cases,

h_g(y_i, f_g(x_i, θ_g)) = h_g(y_i − f_g(x_i, θ_g));

that is, the discrepancy is a function of the difference between the observed y and its fitted value.

Measures of Dissimilarity

The measure of dissimilarity is a measure of the distance of a given observation to the “center” of the group of which it is a member.

There are two aspects to measures of dissimilarity:

• the type of centers — mean, median, harmonic mean; this is the f_g(x_i, θ_g) above.

• the type of distance measure; this is the h_g(y_i − f_g(x_i, θ_g)) above.

The type of center, for example whether it is based on a least squares criterion such as a mean or based on a least absolute values criterion such as a median, affects the robustness of the clustering procedure.

Zhang and Hsu (1999) showed that if harmonic means are used instead of means in k-means clustering, the clusters are less sensitive to the starting values.

Zhang (2003) used a harmonic average for the regression clustering problem; that is, instead of using the within-groups residual norms, he used a harmonic mean of the within-groups residuals. The insensitivity of a harmonic average to outlying values may cause problems when the groups are not tightly clustered within the model predictions. Nevertheless, this approach seems promising, but more studies under different configurations are needed.

The type of distance measure is usually a norm of the coordinate differences of a given observation and the center. Most often this is an Lp norm — L1, L2, L∞. It may seem natural that the distance of an observation to the center be based on the same type of measure as the measure used to define the center, but this is not necessary.
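
A sketch of a harmonic-mean objective in this spirit; the exact form below is an assumption for illustration, not taken from Zhang's paper.

  ## Each observation contributes the harmonic average of its squared residuals
  ## over all k models, rather than only the residual from its own group.
  ## (Zero residuals would need special handling.)
  harmonic_objective <- function(resid2) {   # resid2: n x k matrix of squared residuals
    k <- ncol(resid2)
    sum(k / rowSums(1 / resid2))
  }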

K-Means Type Methods

In k-means clustering, the objective function is

f(P) = ∑_{g=1}^{k} ∑_{i∈P_g} ‖y_i − ȳ_g‖²,

where ȳ_g is the mean of the observations in the gth group.

In k-models clustering, ȳ_g is replaced by f_g(x, θ̂_g). When the model predictions are used as the centers, this is the same as the substitution method used in a univariate k-means clustering algorithm.

K-means clustering is a combinatorial problem, and the methods are computationally complex. The most efficient methods currently are based on simulated annealing with substitution rules. These methods can allow the iterations to escape from local optima. Because of the local optima, however, any algorithm for k-means clustering is likely to be sensitive to the starting values.

As in any combinatorial optimization problem, the performance depends on the method of choosing a new trial point, and the cooling schedule. We are currently investigating these steps in a simulated annealing method for regression clustering, but don't have any useful results yet.
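
A toy sketch of one annealing substitution step for a partition, reusing the cluster_objective function sketched earlier; the proposal rule and the cooling schedule (for example, temp <- 0.95 * temp after each sweep) are illustrative assumptions.

  ## Propose moving a single observation to another cluster; accept a worse
  ## partition with probability exp(-delta / temp).
  anneal_step <- function(y, labels, k, temp) {
    i <- sample(seq_along(y), 1)                         # observation to move
    others <- setdiff(1:k, labels[i])
    new_labels <- labels
    new_labels[i] <- others[sample.int(length(others), 1)]
    delta <- cluster_objective(y, new_labels) - cluster_objective(y, labels)
    if (delta < 0 || runif(1) < exp(-delta / temp)) new_labels else labels
  }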

K-Models Clustering Following Clustering of the Covariates

When the covariates have clusters among themselves, a simple clustering method applied only to them may yield good starting values for either an EM method or a k-means method for the regression clustering problem.

There may be other types of prior information about the group membership of the individual observations. Any such information, either from group assignments based on clustering of the covariates or from prior assumptions, can be used in the computation of the expected values of the classification variables.

Clearly, clustering of covariates has limited effectiveness. We will be trying to characterize distributional patterns to be able to tell when preliminary clustering of covariates is useful in the regression clustering problem.
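
A minimal sketch of this initialization, assuming the simulated data from earlier: k-means on the covariate alone provides starting assignments, which flexmix accepts through its cluster argument.

  ## Cluster the covariate(s) only, then start the EM from those assignments.
  init <- kmeans(dat$x, centers = 2)$cluster
  fit0 <- flexmix(y ~ x, data = dat, cluster = init)
  summary(fit0)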
