
Page 1: Clustering I - Sharif

Machine Learning

Clustering I

Hamid R. Rabiee, Jafar Muhammadi, Nima Pourdamghani

Spring 2015

http://ce.sharif.edu/courses/93-94/2/ce717-1

Page 2: Clustering I - Sharif


Agenda

Unsupervised Learning

Quality Measurement

Similarity Measures

Major Clustering Approaches

Distance Measuring

Partitioning Methods

Hierarchical Methods

Density Based Methods

Spectral Clustering

Other Methods

Constraint Based Clustering

Clustering as Optimization


Page 3: Clustering I - Sharif


Unsupervised Learning

Clustering or unsupervised classification is aimed at discovering natural groupings in a set

of data.

Note: All samples in the training set are unlabeled.

Applications for clustering:

Spatial data analysis: Create thematic maps in GIS by clustering feature space

Image processing: Segmentation

Economic science: Discover distinct groups in customer bases

Internet: Document classification

To gain insight into the structure of the data prior to classifier design

Page 4: Clustering I - Sharif


Quality Measurement

High quality clusters must have

high intra-class similarity

low inter-class similarity

Some other measures

Ability to discover hidden patterns

Judged by the user

Purity

Suppose we know the labels of the data, assign to each cluster its most frequent class

Purity is the number of correctly assigned points divided by the total number of data points (a small computation sketch follows)
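As a concrete illustration (an addition, not from the slides), a minimal NumPy sketch of the purity computation; the function name `purity` and the toy arrays are hypothetical:

```python
import numpy as np

def purity(cluster_ids, class_labels):
    """Assign each cluster its most frequent class, then count correctly assigned points."""
    cluster_ids = np.asarray(cluster_ids)
    class_labels = np.asarray(class_labels)
    correct = 0
    for c in np.unique(cluster_ids):
        members = class_labels[cluster_ids == c]
        # most frequent true class inside this cluster
        correct += np.bincount(members).max()
    return correct / len(class_labels)

# toy example: 6 points, 2 clusters, 2 classes
print(purity([0, 0, 0, 1, 1, 1], [0, 0, 1, 1, 1, 0]))  # 4/6 ~= 0.67
```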

Page 5: Clustering I - Sharif


Similarity Measures

Distances are normally used to measure the similarity or dissimilarity between two

data objects

Some popular distances are Minkowski and Mahalanobis.

Distance between binary strings

$$d(S_1, S_2) = \left| \{ (s_{1,i}, s_{2,i}) : s_{1,i} \neq s_{2,i} \} \right|$$

Distance between vector objects (the cosine measure):

$$d(X, Y) = \frac{X^{\mathrm{T}} Y}{\|X\|\,\|Y\|}$$
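A small sketch (an addition, not from the slides) of the measures just listed, using NumPy; the helper names are hypothetical:

```python
import numpy as np

def minkowski(x, y, p=2):
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

def mahalanobis(x, y, cov):
    d = x - y
    return float(np.sqrt(d @ np.linalg.inv(cov) @ d))

def hamming(s1, s2):
    # number of positions where the binary strings differ
    return int(np.sum(np.asarray(s1) != np.asarray(s2)))

def cosine_similarity(x, y):
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

x, y = np.array([1.0, 2.0]), np.array([2.0, 0.0])
print(minkowski(x, y, p=2))                 # Euclidean distance
print(mahalanobis(x, y, np.eye(2)))         # equals Euclidean for identity covariance
print(hamming([1, 0, 1, 1], [1, 1, 1, 0]))  # 2
print(cosine_similarity(x, y))              # ~0.447
```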

Page 6: Clustering I - Sharif


Major Clustering Approaches

Partitioning approach

Construct various partitions and then evaluate them by some criterion (ex. k-means, c-means, k-medoids)

Hierarchical approach

Create a hierarchical decomposition of the set of data using some criterion (ex. Agnes)

Density-based approach

Based on connectivity and density functions (ex. DBSCAN, OPTICS)

Graph-based approach (Spectral Clustering)

approximately optimizing the normalized cut criterion

Grid-based approach

based on a multiple-level granularity structure (ex. STING, WaveCluster, CLIQUE)

Model-based

A model is hypothesized for each of the clusters, and we try to find the best fit of the models to the data (ex. EM, SOM)

Page 7: Clustering I - Sharif


Distance Measuring

Single link

smallest distance between an element in one cluster

and an element in the other

Complete link

largest distance between an element in one cluster

and an element in the other

Average

avg distance between an element in one cluster and an element in the other

Centroid

distance between the centroids of two clusters

Used in k-means

Medoid

distance between the medoids of two clusters

Medoid: A representative object whose average dissimilarity to all the objects in the cluster is minimal
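A brief sketch (an addition, not from the slides) comparing these cluster-distance definitions on two toy clusters; Euclidean point distances and the helper names are assumptions:

```python
import numpy as np

def pairwise(A, B):
    # matrix of Euclidean distances between every point of A and every point of B
    return np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)

def single_link(A, B):    return pairwise(A, B).min()
def complete_link(A, B):  return pairwise(A, B).max()
def average_link(A, B):   return pairwise(A, B).mean()
def centroid_link(A, B):  return np.linalg.norm(A.mean(axis=0) - B.mean(axis=0))

A = np.array([[0.0, 0.0], [1.0, 0.0]])
B = np.array([[3.0, 0.0], [5.0, 0.0]])
print(single_link(A, B), complete_link(A, B), average_link(A, B), centroid_link(A, B))
# 2.0 5.0 3.5 3.5
```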

Page 8: Clustering I - Sharif


Partitioning Methods

Construct a partition of n data points into a set of k clusters that minimizes the sum of squared distances

$$\min \sum_{m=1}^{k} \sum_{x \in \text{Cluster}_m} (x - C_m)^2$$

where the $C_m$ are the cluster representatives.

Given a k, find a partition of k clusters that optimizes the chosen partitioning criterion

Global optimum: exhaustively enumerate all partitions

Heuristic methods: the k-means, c-means and k-medoids algorithms

k-means: Each cluster is represented by the center of the cluster

c-means: The fuzzy version of k-means

k-medoids: Each cluster is represented by one of the samples in the cluster

Page 9: Clustering I - Sharif


Partitioning Methods: k-means

k-means

Suppose we know there are K categories and each category is represented by its

sample mean

Given a set of unlabeled training samples, how to estimate the means?

Algorithm k-means (k) (a minimal code sketch follows the steps below)

1. Partition samples into k non-empty subsets (random initialization)

2. Compute mean points of the clusters of the current partition

3. Assign each sample to the cluster with the nearest mean point

4. Go back to Step 2, stop when no more new assignment
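A minimal NumPy sketch of the steps above (an illustration, not the course's reference implementation); initializing the means with k randomly chosen samples is one common variant of step 1:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # step 1: initialize means with k randomly chosen samples
    means = X[rng.choice(len(X), size=k, replace=False)].copy()
    labels = None
    for _ in range(n_iter):
        # step 3: assign each sample to the cluster with the nearest mean
        dists = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=-1)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # step 4: no assignment changed, stop
        labels = new_labels
        # step 2: recompute the mean point of every non-empty cluster
        for j in range(k):
            if np.any(labels == j):
                means[j] = X[labels == j].mean(axis=0)
    return labels, means

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
labels, means = kmeans(X, k=2)
```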

Page 10: Clustering I - Sharif


Partitioning Methods: k-means

Some notes on k-means

Need to specify k, the number of clusters, in advance

Unable to handle noisy data and outliers (Why?)

Not suitable to discover clusters with non-convex shapes (Why?)

Algorithm is sensitive to

number of cluster centers,

choice of initial cluster centers

sequence in which data are processed (Why?)

Convergence to the global optimum is not guaranteed, but results are acceptable if there are well-separated clusters

Page 11: Clustering I - Sharif


Partitioning Methods: c-means

The membership function $\mu_{il}$ expresses to what degree $x_l$ belongs to class $C_i$.

Crisp clustering: $x_l$ can belong to one class only

$$\mu_{il} = \begin{cases} 1 & \text{if } x_l \in C_i \\ 0 & \text{if } x_l \notin C_i \end{cases}$$

Fuzzy clustering: $x_l$ belongs to all classes simultaneously with varying degrees of membership

$$\mu_{il} = \frac{\left( 1 / d(z_i^{(m)}, x_l) \right)^{1/(q-1)}}{\sum_{j=1}^{k} \left( 1 / d(z_j^{(m)}, x_l) \right)^{1/(q-1)}}$$

where the $z^{(m)}$ are the cluster means

q is a fuzziness index with $1 < q < 2$

Fuzzy clustering becomes crisp clustering when $q \to 1$

Observe that

$$\sum_{i=1}^{k} \mu_{il} = 1, \qquad l = 1, 2, \ldots, N$$

c-means minimizes

$$J_e^f = \sum_{i=1}^{k} J_i^f = \sum_{i=1}^{k} \sum_{l=1}^{N} (\mu_{il})^q \, \| z_i^{(m)} - x_l \|^2$$
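A short sketch (an addition, not from the slides) that alternates the membership update reconstructed above with the weighted-mean update; the function and variable names are hypothetical:

```python
import numpy as np

def fuzzy_c_means(X, k, q=1.5, n_iter=100, seed=0, eps=1e-9):
    """Alternate the fuzzy membership update and the weighted-mean update."""
    rng = np.random.default_rng(seed)
    z = X[rng.choice(len(X), size=k, replace=False)]                      # cluster means z_i
    for _ in range(n_iter):
        d = np.linalg.norm(X[None, :, :] - z[:, None, :], axis=-1) + eps  # d(z_i, x_l), shape (k, N)
        w = (1.0 / d) ** (1.0 / (q - 1.0))
        mu = w / w.sum(axis=0, keepdims=True)                             # memberships mu_il, columns sum to 1
        weights = mu ** q
        z = (weights @ X) / weights.sum(axis=1, keepdims=True)            # weighted cluster means
    return mu, z
```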

Page 12: Clustering I - Sharif


Partitioning Methods: k-medoids

k-medoids

Instead of taking the mean value of the samples in a cluster as a reference point, medoids

can be used

Note that choosing the new medoids is slightly different from choosing the new means in the k-means algorithm

Algorithm k-medoids (k) (a code sketch follows the steps below)

1. Select k representative samples arbitrarily

2. Associate each data point to the closest medoid

3. For each medoid m and data point o

Swap m and o and compute the total cost of configuration

4. Select the configuration with the lowest cost

5. Repeat steps 2-4 until there is no change
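A compact PAM-style sketch (an illustration, not the course code) of the swap-based procedure above:

```python
import numpy as np

def k_medoids(X, k, n_iter=100, seed=0):
    """Try swapping each medoid with each data point; keep the cheapest configuration."""
    rng = np.random.default_rng(seed)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # all pairwise distances
    medoids = rng.choice(len(X), size=k, replace=False)

    def cost(meds):
        # total distance of every point to its closest medoid
        return D[:, meds].min(axis=1).sum()

    for _ in range(n_iter):
        best_meds, best_cost = medoids.copy(), cost(medoids)
        for i in range(k):                    # each medoid m
            for o in range(len(X)):           # each candidate data point o
                if o in medoids:
                    continue
                trial = medoids.copy()
                trial[i] = o                  # swap m and o
                c = cost(trial)
                if c < best_cost:
                    best_meds, best_cost = trial, c
        if np.array_equal(best_meds, medoids):
            break                             # no improving swap: stop
        medoids = best_meds
    labels = D[:, medoids].argmin(axis=1)
    return medoids, labels
```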

Page 13: Clustering I - Sharif


Partitioning Methods: k-medoids

Some notes on k-medoids

k-medoids is more robust than k-means in the presence of noise and outliers (Why?)

works effectively for small data sets, but does not scale well for large data sets

For large data sets, we can use sampling-based methods (How?)

Page 14: Clustering I - Sharif


Hierarchical Methods

Clusters have sub-clusters and sub-clusters can have sub-sub-clusters, …

Use distance matrix as clustering criteria.

This method does not require the number of clusters k as an input, but needs a

termination condition

[Figure: agglomerative clustering (AGNES) merges the points a, b, c, d, e bottom-up from Step 0 to Step 3, while divisive clustering (DIANA) splits the full set a b c d e top-down in the reverse order, Step 3 to Step 0.]

Page 15: Clustering I - Sharif


Hierarchical Methods

Agglomerative Hierarchical Clustering

AGNES (Agglomerative Nesting)

Uses the Single-Link method

Merge nodes (clusters) that have the maximum similarity

Divisive Hierarchical Clustering

DIANA (Divisive Analysis)

Inverse order of AGNES

Eventually each node forms a cluster on its own
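A small bottom-up sketch (an addition, not from the slides) in the spirit of AGNES with single-link merging; it stops at a requested number of clusters instead of building the full dendrogram:

```python
import numpy as np

def agnes_single_link(X, n_clusters):
    """Start with singletons, repeatedly merge the two closest clusters (single link)."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    clusters = [[i] for i in range(len(X))]        # every sample starts as its own cluster
    while len(clusters) > n_clusters:
        best = (np.inf, None, None)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # single link: smallest pairwise distance between the two clusters
                d = D[np.ix_(clusters[a], clusters[b])].min()
                if d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]     # merge the most similar pair
        del clusters[b]
    return clusters
```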

Page 16: Clustering I - Sharif


Hierarchical Methods

Dendrogram

Shows How the Clusters are Merged

Decomposes samples into several levels of nested partitionings (a tree of clusters), called a dendrogram.

A clustering of the samples is obtained by cutting the dendrogram at the desired level, then

each connected component forms a cluster.

Page 17: Clustering I - Sharif


Density Based Methods

Clustering based on density (local cluster criterion), such as density-connected

points

Major features:

Discover clusters of arbitrary shapes

Handle noise

Need density parameters as termination condition

Page 18: Clustering I - Sharif


Density Based Methods

Main Concepts:

parameters:

Eps: Maximum radius of the neighborhood

MinPts: Minimum number of points in an Eps-neighbourhood

of that point

Sample q is directly density-reachable from sample p if d(p,q) ≤ Eps and p has at least MinPts points in its Eps-neighborhood.

Sample q is density-reachable from a sample p if there is a chain

of points p1, …, pn, p1 = p, pn = q such that pi+1 is directly density-

reachable from pi

Sample p is density-connected to sample q if there is a sample o

such that both, p and q are density-reachable from o.


Page 19: Clustering I - Sharif


Density Based Methods: DBSCAN

DBSCAN (Density Based Spatial Clustering of Applications with Noise)

Relies on a density-based notion of cluster: A cluster is defined as a maximal set of

density-connected points

Discovers clusters of arbitrary shape in spatial data with noise

Algorithm DBSCAN (Eps, MinPts) (a code sketch follows the steps below)

Arbitrarily select a sample p

Retrieve all samples density-reachable from p w.r.t. Eps and MinPts.

If p is a core sample (its Eps-neighborhood contains at least MinPts samples), a cluster is formed.

If p is a border sample (no samples are density-reachable from it), DBSCAN visits the next sample of the database.

Continue the process until all of the samples have been processed.
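A simplified DBSCAN sketch (an illustration, not the reference algorithm); it uses a brute-force distance matrix and leaves unreached points labelled -1 (noise):

```python
import numpy as np

def dbscan(X, eps, min_pts):
    """Label each sample with a cluster id; -1 marks points never reached from a core sample."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    neighbors = [np.flatnonzero(D[i] <= eps) for i in range(len(X))]
    labels = np.full(len(X), -1)
    cluster = 0
    for p in range(len(X)):
        if labels[p] != -1 or len(neighbors[p]) < min_pts:
            continue                                # already assigned, or p is not a core sample
        # p is a core sample: grow a cluster from everything density-reachable from p
        labels[p] = cluster
        frontier = list(neighbors[p])
        while frontier:
            q = frontier.pop()
            if labels[q] == -1:
                labels[q] = cluster
                if len(neighbors[q]) >= min_pts:
                    frontier.extend(neighbors[q])   # q is also a core sample, keep expanding
        cluster += 1
    return labels
```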

Page 20: Clustering I - Sharif


Graph-based Clustering

Represent data points as the vertices V of a graph G.

All pairs of vertices are connected by an edge E.

Edges have weights W.

Large weights mean that the adjacent vertices are very similar; small weights imply

dissimilarity.

Page 21: Clustering I - Sharif


Graph-based Clustering

Clustering on a graph is equivalent to partitioning the vertices of the graph.

A loss function for a partition of V into sets A and B

In a good partition, vertices in different partitions will be dissimilar.

Mincut criterion: Find a partition A, B that minimizes cut(A,B)

Mincut criterion ignores the size of the subgraphs formed

$$\mathrm{cut}(A,B) = \sum_{u \in A,\; v \in B} W_{u,v}$$

Page 22: Clustering I - Sharif


Graph-based Clustering

Normalized cut criterion favors balanced partitions.

Minimizing the normalized cut criterion exactly is NP-hard.

One way of approximately optimizing the normalized cut criterion leads to

spectral clustering.

$$\mathrm{Ncut}(A,B) = \frac{\mathrm{cut}(A,B)}{\sum_{u \in A,\; v \in V} W_{u,v}} + \frac{\mathrm{cut}(A,B)}{\sum_{u \in B,\; v \in V} W_{u,v}}$$
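A tiny sketch (an addition, not from the slides) that evaluates cut(A, B) and Ncut(A, B) for a given weight matrix and a candidate partition:

```python
import numpy as np

def cut_value(W, A, B):
    # cut(A, B) = sum of edge weights between the two vertex sets
    return W[np.ix_(A, B)].sum()

def ncut_value(W, A, B):
    assoc_A = W[A, :].sum()      # total weight from A to all vertices V
    assoc_B = W[B, :].sum()
    c = cut_value(W, A, B)
    return c / assoc_A + c / assoc_B

# two dense blocks weakly connected by one edge
W = np.array([[0, 1, 1, 0.1, 0, 0],
              [1, 0, 1, 0,   0, 0],
              [1, 1, 0, 0,   0, 0],
              [0.1, 0, 0, 0, 1, 1],
              [0, 0, 0, 1,   0, 1],
              [0, 0, 0, 1,   1, 0]])
A, B = [0, 1, 2], [3, 4, 5]
print(cut_value(W, A, B), ncut_value(W, A, B))   # 0.1 and a small Ncut value
```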

Page 23: Clustering I - Sharif


Spectral Clustering

Spectral clustering

Looks for a new representation of the original data points, such that

Preserves the edge weights.

Convex cluster shapes in the new space represent non-convex ones in the original space.

Cluster the points in the new space using any clustering scheme (say k-means).

We only describe the resulting algorithm here.

For more information about the derivations, refer to U. von Luxburg, “A Tutorial on Spectral Clustering”.

Page 24: Clustering I - Sharif


Spectral Clustering

Inputs

Set of points $S = \{s_1, \ldots, s_n\}$, $s_i \in \mathbb{R}^l$, and the number of clusters k

Algorithm

Form the edge weights matrix $W \in \mathbb{R}^{n \times n}$, for example

$$W_{ij} = \begin{cases} e^{-\|s_i - s_j\|^2 / 2\sigma^2} & \text{if } i \neq j \\ 0 & \text{else} \end{cases}$$

(scaling parameter $\sigma$ chosen by the user)

Define D, a diagonal matrix whose (i,i) element is the sum of W's row i

Form the matrix $L = D^{-1/2} W D^{-1/2}$

Find the k largest eigenvectors of L to form the matrix $X_{n \times k}$

Page 25: Clustering I - Sharif


Spectral Clustering

Algorithm (cont.)

Normalize the rows of the matrix $X_{n \times k}$ to form the matrix $Y_{n \times k}$

Treat each row of Y as a point in Rk (data dimensionality reduction from n to k)

Cluster the new data into k clusters via K-means

$$Y_{ij} = \frac{X_{ij}}{\left( \sum_{j} X_{ij}^2 \right)^{1/2}}$$
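A sketch (an addition, not from the slides) of the algorithm on the last two slides, with a plain k-means loop run on the rows of Y; the value of σ and the initialization scheme are arbitrary choices:

```python
import numpy as np

def spectral_clustering(S, k, sigma=1.0, n_iter=50, seed=0):
    """Spectral clustering sketch for points S (n x d), following the slides' recipe."""
    n = len(S)
    # edge weight matrix W with Gaussian similarities, zero diagonal
    sq = np.sum((S[:, None, :] - S[None, :, :]) ** 2, axis=-1)
    W = np.exp(-sq / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    # L = D^{-1/2} W D^{-1/2}
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
    L = d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    # k largest eigenvectors of the symmetric matrix L form the columns of X
    vals, vecs = np.linalg.eigh(L)
    X = vecs[:, -k:]
    # row-normalize X to get Y, then run plain k-means on the rows of Y
    Y = X / np.linalg.norm(X, axis=1, keepdims=True)
    rng = np.random.default_rng(seed)
    means = Y[rng.choice(n, size=k, replace=False)]
    for _ in range(n_iter):
        labels = np.linalg.norm(Y[:, None, :] - means[None, :, :], axis=-1).argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                means[j] = Y[labels == j].mean(axis=0)
    return labels
```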

Page 26: Clustering I - Sharif


Spectral Clustering

Example

simple edge weights matrix (d(xi,xj) denotes Euclidean distance between points

xi, xj and θ=1)

$$W(i,j) = W(j,i) = \begin{cases} 1 & \text{if } d(x_i, x_j) \le \theta \\ 0 & \text{otherwise} \end{cases}$$

With the points ordered (a, b, c, d), and with the reordering (a, c, b, d):

W =            W~ =
1 1 0 0        1 0 1 0
1 1 0 0        0 1 0 1
0 0 1 1        1 0 1 0
0 0 1 1        0 1 0 1

Leading eigenvectors: $e_1 = (0.7, 0.7, 0, 0)^T$, $e_2 = (0, 0, 0.7, 0.7)^T$ for W; $e_1 = (0.7, 0, 0.7, 0)^T$, $e_2 = (0, 0.7, 0, 0.7)^T$ for W~.

Page 27: Clustering I - Sharif


Spectral Clustering

Another example

Page 28: Clustering I - Sharif


Other Methods

Grid based methods

Using a multi-resolution grid data structure.

1. Create the grid structure, i.e., partition the data space into a finite number of cells

2. Calculate the cell density for each cell

3. Sort the cells according to their densities

4. Identify cluster centers

5. Traverse the neighbor cells

Page 29: Clustering I - Sharif


Other Methods

Model based methods

Attempt to optimize the fit between the given data and some mathematical

model

Based on the assumption that data are generated by a mixture of underlying probability distributions

Typical methods

Statistical approach: EM (Expectation maximization) – will be discussed later

Neural network approach: SOM (Self-Organizing Feature Map)

Page 30: Clustering I - Sharif


Constraint Based Clustering

Why constraint based clustering?

Need user feedback: Users know their applications the best

Fewer parameters but more user-desired constraints, e.g., an ATM allocation problem: obstacles & desired clusters

Different constraints in cluster analysis:

Constraints on individual samples (do selection first)

Cluster on samples which …

Constraints on distance or similarity functions

Weighted functions, obstacles

Constraints on the selection of clustering parameters

Number of clusters, limitation of each cluster size

User-specified constraints

Some samples must be in a given cluster and some others must not!

Semi-supervised: giving small training sets as “constraints” or hints

Page 31: Clustering I - Sharif


Constraint Based Clustering

A sample data set and two answers (one taking the constraints into account and one not)

Constraints: The data in different sides of each “wall” should be in different

clusters

Page 32: Clustering I - Sharif


Clustering as Optimization

Clustering can be posed as the optimization of a criterion function

The sum-of-squared-error criterion

Scatter criteria

The given criterion function is optimized through iterative optimization

Page 33: Clustering I - Sharif


Any Question?

End of Lecture 19

Thank you!

Spring 2015

http://ce.sharif.edu/courses/93-94/2/ce717-1

Page 34: Clustering I - Sharif

Machine Learning

Clustering II

Hamid R. Rabiee

[Slides are based on Bishop Book]

Spring 2015

http://ce.sharif.edu/courses/93-94/2/ce717-1

Page 35: Clustering I - Sharif

Problem of identifying groups, or clusters, of data points in a

multidimensional space

Partitioning the data set into some number K of clusters

Cluster: a group of data points whose inter-point distances are small

compared with the distances to points outside of the cluster

Goal: an assignment of data points to clusters such that the sum of the squares of the distances from each data point to its closest prototype vector (the center of its cluster) is a minimum

Objective function, called the distortion measure:

$$J = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \, \| x_n - \mu_k \|^2$$

K-means Clustering

Page 36: Clustering I - Sharif

K-means Clustering

Two-stage optimization

In the 1st stage: minimizing J with respect to the rnk, keeping the μk fixed

In the 2nd stage: minimizing J with respect to the μk, keeping rnk fixed

The mean of all of the data points assigned to cluster k
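The two update equations themselves are not in this transcript (they were slide images); assuming they are the standard ones from Bishop Section 9.1, they read:

$$r_{nk} = \begin{cases} 1 & \text{if } k = \arg\min_j \| x_n - \mu_j \|^2 \\ 0 & \text{otherwise} \end{cases} \qquad\qquad \mu_k = \frac{\sum_n r_{nk} \, x_n}{\sum_n r_{nk}}$$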

Page 37: Clustering I - Sharif

K-means Clustering

Page 38: Clustering I - Sharif

Mixtures of Gaussians

Gaussian mixture distribution can be written as a linear superposition of Gaussians:

$$p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k) \qquad (*)$$

An equivalent formulation of the Gaussian mixture involves an explicit latent variable (see the graphical representation of the mixture model)

A binary random variable z having a 1-of-K representation

The marginal distribution of x is a Gaussian mixture of the form (*) (for every observed data point x_n, there is a corresponding latent variable z_n)

Page 39: Clustering I - Sharif

Mixtures of Gaussians

γ(zk) can also be viewed as the responsibility that component k takes for

explaining the observation x
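The equation for γ(z_k) is missing from the transcript; assuming the standard definition from Bishop (Eq. 9.13), it is:

$$\gamma(z_k) \equiv p(z_k = 1 \mid x) = \frac{\pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x \mid \mu_j, \Sigma_j)}$$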


Page 40: Clustering I - Sharif

Mixtures of Gaussians

Generating random samples distributed according to the Gaussian mixture

model

Generate a value for z (denote it ẑ) from the marginal distribution p(z), and then generate a value for x from the conditional distribution p(x | ẑ)

Page 41: Clustering I - Sharif

Mixtures of Gaussians

a. The three states of z, corresponding to the three components of the mixture, are

depicted in red, green, blue

b. The corresponding samples from the marginal distribution p(x)

c. The same samples in which the colors represent the value of the responsibilities γ(z_nk) associated with each data point

Illustrating the responsibilities by evaluating the posterior probability for each component in the mixture distribution from which this data set was generated

Page 42: Clustering I - Sharif


Maximum likelihood

Graphical representation of a

Gaussian mixture model for

a set of N i.i.d. data points

{xn}, with corresponding

latent points {zn}

The log of the likelihood function

….(*1)
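The equation labelled (*1) is not in the transcript; assuming it is the Gaussian-mixture log likelihood from Bishop (Eq. 9.14), it reads:

$$\ln p(X \mid \pi, \mu, \Sigma) = \sum_{n=1}^{N} \ln \left\{ \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \right\} \qquad (*1)$$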

Page 43: Clustering I - Sharif

Maximum likelihood

For simplicity, consider a Gaussian mixture whose components have covariance matrices given by Σ_k = σ_k² I

Suppose that one of the components of the mixture model has its mean μj exactly equal to one of the data points so that μj = xn

This data point will contribute a term to the likelihood function of the form N(x_n | x_n, σ_j² I), which grows without bound as σ_j → 0

Once there are at least two components in the mixture, one of the components can have a finite variance and therefore assign finite probability to all of the data points, while the other component can shrink onto one specific data point and thereby contribute an ever-increasing additive value to the log likelihood: an over-fitting problem

Page 44: Clustering I - Sharif

Maximum likelihood

Over-fitting problem

In applying maximum likelihood to Gaussian mixture models, there should be steps to avoid finding such pathological solutions and instead seek local maxima of the likelihood function that are well behaved

Identifiability problem

A K-component mixture will have a total of K! equivalent solutions

corresponding to the K! ways of assigning K sets of parameters to K

components

Difficulty of maximizing the log likelihood function: the presence of the summation over k inside the logarithm means there is no closed-form solution, unlike the single-Gaussian case

Page 45: Clustering I - Sharif

EM for Gaussian mixtures

I. Assign some initial values for the means, covariances, and mixing coefficients

II. Expectation or E step

• Using the current value for the parameters to evaluate the posterior probabilities

or responsibilities

III. Maximization or M step

• Using the result of II to re-estimate the means, covariances, and mixing coefficients

It is common to run the K-means algorithm in order to find suitable initial values

• The covariance matrices can be initialized to the sample covariances of the clusters found by the K-means algorithm

• The mixing coefficients can be set to the fractions of data points assigned to the respective clusters

Page 46: Clustering I - Sharif

EM for Gaussian mixtures

Given a Gaussian mixture model, the goal is to maximize the likelihood

function with respect to the parameters

1. Initialize the means μk, covariance Σk and mixing coefficients πk

2. E step

3. M step
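The E-step and M-step equations are not in the transcript; assuming the standard EM updates for Gaussian mixtures from Bishop (Eqs. 9.23-9.26), they are:

E step:

$$\gamma(z_{nk}) = \frac{\pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)}$$

M step (with $N_k = \sum_{n=1}^{N} \gamma(z_{nk})$):

$$\mu_k^{\text{new}} = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) \, x_n, \qquad \Sigma_k^{\text{new}} = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) (x_n - \mu_k^{\text{new}})(x_n - \mu_k^{\text{new}})^{\mathrm{T}}, \qquad \pi_k^{\text{new}} = \frac{N_k}{N}$$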

Page 47: Clustering I - Sharif

EM for Gaussian mixtures

Given a Gaussian mixture model, the goal is to maximize the likelihood

function with respect to the parameters

4: Evaluate the log likelihood

…(*2)
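A self-contained NumPy sketch (an illustration, not the course code) of steps 1-4, evaluating the log likelihood (*2) with the parameters used in each E step:

```python
import numpy as np

def gaussian_pdf(X, mean, cov):
    d = X.shape[1]
    diff = X - mean
    inv = np.linalg.inv(cov)
    norm = np.sqrt(((2 * np.pi) ** d) * np.linalg.det(cov))
    return np.exp(-0.5 * np.sum(diff @ inv * diff, axis=1)) / norm

def em_gmm(X, K, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    N, d = X.shape
    # 1. initialize means, covariances and mixing coefficients
    mu = X[rng.choice(N, size=K, replace=False)]
    sigma = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(K)])
    pi = np.full(K, 1.0 / K)
    for _ in range(n_iter):
        # 2. E step: responsibilities gamma(z_nk)
        dens = np.stack([pi[k] * gaussian_pdf(X, mu[k], sigma[k]) for k in range(K)], axis=1)
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # 3. M step: re-estimate means, covariances and mixing coefficients
        Nk = gamma.sum(axis=0)
        mu = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(d)
        pi = Nk / N
        # 4. log likelihood (*2) under the parameters used in the E step above
        log_lik = np.sum(np.log(dens.sum(axis=1)))
    return mu, sigma, pi, log_lik
```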

Page 48: Clustering I - Sharif

EM for Gaussian mixtures

Setting the derivatives of (*2) with respect to the means of the

Gaussian components to zero

Setting the derivatives of (*2) with respect to the covariance of

the Gaussian components to zero

Responsibility γ(z_nk)

A weighted mean of all of the points in the data set

- Each data point is weighted by the corresponding posterior probability

- The denominator is given by the effective number of points associated with the corresponding component

Page 49: Clustering I - Sharif

EM for Gaussian mixtures

Page 50: Clustering I - Sharif

An Alternative View of EM

In maximizing the log likelihood function

the summation prevents the logarithm from acting directly on the

joint distribution

Instead, the log likelihood function for the complete data set {X, Z} is

straightforward.

In practice, since we are not given the complete data set, we consider instead the expected value Q of the complete-data log likelihood under the posterior distribution p(Z | X, Θ) of the latent variables

Page 51: Clustering I - Sharif

An Alternative View of EM

General EM

1. Choose an initial setting for the parameters Θold

2. E step: Evaluate p(Z | X, Θ^old)

3. M step: Evaluate Θ^new given by

$$\Theta^{\text{new}} = \arg\max_{\Theta} Q(\Theta, \Theta^{\text{old}}), \qquad Q(\Theta, \Theta^{\text{old}}) = \sum_{Z} p(Z \mid X, \Theta^{\text{old}}) \ln p(X, Z \mid \Theta)$$

4. If the convergence criterion is not satisfied, then let Θ^old ← Θ^new and return to step 2

Page 52: Clustering I - Sharif

Gaussian mixtures revisited

Maximizing the likelihood for the complete data {X, Z}

The logarithm acts directly on the Gaussian distribution, which gives a much simpler solution to the maximum likelihood problem

the maximization with respect to a mean or a covariance is exactly as for a single Gaussian (closed form)

Page 53: Clustering I - Sharif

Gaussian mixtures revisited

Since the latent variables are unknown, we consider the expectation of the complete-data log likelihood with respect to the posterior distribution of the latent variables.

Posterior distribution:

$$p(Z \mid X, \mu, \Sigma, \pi) \propto \prod_{n=1}^{N} \prod_{k=1}^{K} \left[ \pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \right]^{z_{nk}} \qquad \text{(ref. Bishop Eqs. (9.10), (9.11))}$$

The expected value of the indicator variable under this posterior distribution:

$$\gamma(z_{nk}) = \mathbb{E}[z_{nk}] = \frac{p(z_{nk}=1)\, p(x_n \mid z_{nk}=1)}{p(x_n)} = \frac{\pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)}$$

The expected value of the complete-data log likelihood function:

$$\mathbb{E}_Z[\ln p(X, Z \mid \mu, \Sigma, \pi)] = \sum_{n=1}^{N} \sum_{k=1}^{K} \gamma(z_{nk}) \left\{ \ln \pi_k + \ln \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \right\} \qquad (*3)$$

Page 54: Clustering I - Sharif

Relation to K-means

K-means performs a hard assignment of data points to the clusters (each data point is associated uniquely with one cluster)

EM makes a soft assignment based on the posterior probabilities

K-means can be derived as a particular limit of EM for Gaussian mixtures with shared covariances $\epsilon I$:

As $\epsilon$ gets smaller, the terms for which $\|x_n - \mu_j\|^2$ is largest go to zero most quickly. Hence the responsibilities all go to zero except for the component k with the smallest $\|x_n - \mu_k\|^2$, whose responsibility goes to unity (a hard assignment).

Page 55: Clustering I - Sharif

Relation to K-means

Thus maximizing the expected complete data log-likelihood is equivalent to

minimizing the distortion measure J for the K-means

(In Elliptical K-means, the covariance is estimated also.)

$$p(x \mid \mu_k, \Sigma_k) = \frac{1}{(2\pi\epsilon)^{1/2}} \exp\left\{ -\frac{1}{2\epsilon} \| x - \mu_k \|^2 \right\}, \qquad \Sigma_k = \epsilon I$$

$$\gamma(z_{nk}) = \frac{\pi_k \exp\{ -\| x_n - \mu_k \|^2 / 2\epsilon \}}{\sum_{j} \pi_j \exp\{ -\| x_n - \mu_j \|^2 / 2\epsilon \}} \;\longrightarrow\; r_{nk} \quad (\epsilon \to 0)$$

$$\mathbb{E}_Z[\ln p(X, Z \mid \mu, \Sigma, \pi)] \;\longrightarrow\; -\frac{1}{2} \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \, \| x_n - \mu_k \|^2 + \text{const}$$

Page 56: Clustering I - Sharif

Mixtures of Bernoulli distributions

Page 57: Clustering I - Sharif

Mixtures of Bernoulli distributions

Single component:

$$p(x \mid \mu) = \prod_{i=1}^{D} \mu_i^{x_i} (1 - \mu_i)^{1 - x_i}, \qquad \mathbb{E}[x] = \mu, \qquad \mathrm{Cov}[x] = \mathrm{diag}\{\mu_i (1 - \mu_i)\}$$

(Note that the individual variables $x_i$ are independent, given $\mu$.)

Mixture of K components ($\mu = \{\mu_1, \ldots, \mu_K\}$, $\pi = \{\pi_1, \ldots, \pi_K\}$):

$$p(x \mid \mu, \pi) = \sum_{k=1}^{K} \pi_k \, p(x \mid \mu_k), \qquad p(x \mid \mu_k) = \prod_{i=1}^{D} \mu_{ki}^{x_i} (1 - \mu_{ki})^{1 - x_i}$$

$$\mathbb{E}[x] = \sum_{k=1}^{K} \pi_k \mu_k, \qquad \mathrm{Cov}[x] = \sum_{k=1}^{K} \pi_k \left( \Sigma_k + \mu_k \mu_k^{\mathrm{T}} \right) - \mathbb{E}[x]\,\mathbb{E}[x]^{\mathrm{T}}, \quad \text{where } \Sigma_k = \mathrm{diag}\{\mu_{ki}(1 - \mu_{ki})\}$$

* Because the covariance matrix Cov[x] is no longer diagonal, the mixture distribution can capture correlations between the variables, unlike a single Bernoulli distribution.

Page 58: Clustering I - Sharif

Mixtures of Bernoulli distributions

Log likelihood:

$$\ln p(X \mid \mu, \pi) = \sum_{n=1}^{N} \ln \left\{ \sum_{k=1}^{K} \pi_k \, p(x_n \mid \mu_k) \right\}$$

Latent-variable formulation ($z = (z_1, \ldots, z_K)^{\mathrm{T}}$ is a binary indicator variable):

$$p(x \mid z, \mu) = \prod_{k=1}^{K} p(x \mid \mu_k)^{z_k}, \qquad p(z \mid \pi) = \prod_{k=1}^{K} \pi_k^{z_k}$$

Complete-data log likelihood function:

$$\ln p(X, Z \mid \mu, \pi) = \sum_{n=1}^{N} \sum_{k=1}^{K} z_{nk} \left[ \ln \pi_k + \sum_{i=1}^{D} \left\{ x_{ni} \ln \mu_{ki} + (1 - x_{ni}) \ln (1 - \mu_{ki}) \right\} \right]$$

$$\mathbb{E}_Z[\ln p(X, Z \mid \mu, \pi)] = \sum_{n=1}^{N} \sum_{k=1}^{K} \gamma(z_{nk}) \left[ \ln \pi_k + \sum_{i=1}^{D} \left\{ x_{ni} \ln \mu_{ki} + (1 - x_{ni}) \ln (1 - \mu_{ki}) \right\} \right]$$

(E-step)

$$\gamma(z_{nk}) = \mathbb{E}[z_{nk}] = \frac{\pi_k \, p(x_n \mid \mu_k)}{\sum_{j=1}^{K} \pi_j \, p(x_n \mid \mu_j)}, \qquad N_k = \sum_{n=1}^{N} \gamma(z_{nk}), \qquad \bar{x}_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) \, x_n$$

(M-step)

$$\mu_k = \bar{x}_k, \qquad \pi_k = \frac{N_k}{N}$$

* In contrast to the mixture of Gaussians, there are no singularities in which the likelihood goes to infinity
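A short NumPy sketch (an addition, not from the slides) of the E and M steps above for a Bernoulli mixture on binary data X of shape N x D; the function name and initialization are hypothetical:

```python
import numpy as np

def em_bernoulli_mixture(X, K, n_iter=20, seed=0, eps=1e-9):
    """EM for a mixture of Bernoulli distributions on binary data X (N x D)."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    pi = np.full(K, 1.0 / K)
    mu = rng.uniform(0.25, 0.75, size=(K, D))      # keep away from 0/1 so logs stay finite
    for _ in range(n_iter):
        # E step: responsibilities gamma(z_nk) proportional to pi_k * p(x_n | mu_k)
        log_p = X @ np.log(mu + eps).T + (1 - X) @ np.log(1 - mu + eps).T   # (N, K)
        log_w = np.log(pi + eps) + log_p
        log_w -= log_w.max(axis=1, keepdims=True)                           # numerical stability
        gamma = np.exp(log_w)
        gamma /= gamma.sum(axis=1, keepdims=True)
        # M step: mu_k = weighted mean of the data, pi_k = N_k / N
        Nk = gamma.sum(axis=0)
        mu = (gamma.T @ X) / Nk[:, None]
        pi = Nk / N
    return pi, mu, gamma
```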

Page 59: Clustering I - Sharif

Mixtures of Bernoulli distributions

N = 600 digit images, K = 3 mixture components

A mixture of K = 3 Bernoulli distributions fitted by 10 EM iterations

[Figure: the fitted parameters μ_k for each of the three components, and for a single multivariate Bernoulli distribution fitted to the whole data set]

The analysis of Bernoulli mixtures can be extended to the case of discrete (multinomial) variables having M > 2 states (Ex. 9.19)

Page 60: Clustering I - Sharif

References

Slides of Chapter 9 of Bishop book adapted from Biointelligence

Laboratory, Seoul National University http://bi.snu.ac.kr/

Page 61: Clustering I - Sharif


Any Question?

End of Lecture 20

Thank you!

Spring 2015

http://ce.sharif.edu/courses/93-94/2/ce717-1