
  • Cluster Analysis

    Berlin Chen 2004

    References:
    1. Foundations of Statistical Natural Language Processing, Chapter 14
    2. Data Mining: Concepts, Models, Methods and Algorithms, Chapter 6
    3. Modern Information Retrieval, Chapters 5, 7

  • 2

    Clustering

    • Place similar objects in the same group and assign dissimilar objects to different groups
      – Word clustering
        • Neighbor overlap: words that occur with similar left and right neighbors (such as in and on)
      – Document clustering
        • Documents with similar topics or concepts are put together

    • But clustering cannot give a comprehensive description of the objects
      – E.g., how to label the objects shown on a visual display

    • Clustering is a way of learning

  • 3

    Clustering vs. Classification

    • Classification is supervised and requires a set of labeled training instances for each group (class)

    • Clustering is unsupervised and learns without a teacher to provide the labeling information of the training data set

    – Also called automatic or unsupervised classification

  • 4

    Types of Clustering Algorithms

    • Two types of structures are produced by clustering algorithms
      – Flat (non-hierarchical) clustering
      – Hierarchical clustering

    • Flat clustering
      – Simply consists of a certain number of clusters; the relation between clusters is often undetermined

    • Hierarchical clustering
      – A hierarchy with the usual interpretation that each node stands for a subclass of its mother node
        • The leaves of the tree are the single objects
        • Each node represents the cluster that contains all the objects of its descendants

  • 5

    Hard Assignment vs. Soft Assignment

    • Another important distinction between clustering algorithms is whether they perform soft or hard assignment

    • Hard Assignment
      – Each object is assigned to one and only one cluster

    • Soft Assignment
      – Each object may be assigned to multiple clusters
      – An object $\vec{x}_i$ has a probability distribution $P(\cdot \mid \vec{x}_i)$ over clusters $c_j$, where $P(c_j \mid \vec{x}_i)$ is the probability that $\vec{x}_i$ is a member of $c_j$
      – Soft assignment is somewhat more appropriate in many tasks such as NLP, IR, … (a small sketch follows below)
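    As an illustration (an assumed example, not from the slides), a minimal sketch of the two assignment styles, given a posterior distribution P(c_j | x_i) that is already available:

```python
import numpy as np

# Hard vs. soft assignment of one object x_i, given its posterior over K = 3 clusters.
posterior = np.array([0.7, 0.2, 0.1])   # P(c_1|x_i), P(c_2|x_i), P(c_3|x_i); sums to 1

hard_assignment = int(np.argmax(posterior))   # hard: one and only one cluster
soft_assignment = posterior                   # soft: keep the whole distribution

print(hard_assignment)    # 0  (cluster c_1)
print(soft_assignment)    # [0.7 0.2 0.1]
```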

  • 6

    Hard Assignment vs. Soft Assignment

    • Hierarchical clustering usually adopts hard assignment

    • In flat clustering, both types of assignment are common

  • 7

    Summarized Attributes of Clustering Algorithms

    • Hierarchical Clustering
      – Preferable for detailed data analysis
      – Provides more information than flat clustering
      – No single best algorithm (each algorithm is optimal only for some applications)
      – Less efficient than flat clustering (minimally has to compute an n × n matrix of similarity coefficients)

  • 8

    Summarized Attributes of Clustering Algorithms

    • Flat Clustering
      – Preferable if efficiency is a consideration or the data sets are very large
      – K-means is the conceptually simplest method and should probably be used first on new data because its results are often sufficient
      – K-means assumes a simple Euclidean representation space, and so cannot be used for many data sets, e.g., nominal data like colors
      – The EM algorithm is the most flexible choice: it can accommodate definitions of clusters and allocations of objects based on complex probabilistic models

  • 9

    Hierarchical Clustering

  • 10

    Hierarchical Clustering

    • Can proceed in either a bottom-up or a top-down manner
      – Bottom-up (agglomerative, 凝集的)
        • Start with individual objects and group the most similar ones
          – E.g., those with the minimum distance apart, using $sim(\vec{x}, \vec{y}) = \frac{1}{1 + d(\vec{x}, \vec{y})}$ (distance measures will be discussed later on)
        • The procedure terminates when one cluster containing all objects has been formed
      – Top-down (divisive, 分裂的)
        • Start with all objects in one group and divide them into groups so as to maximize within-group similarity

  • 11

    Hierarchical Agglomerative Clustering (HAC)

    • A bottom-up approach

    • Assume a similarity measure for determining the similarity of two objects

    • Start with each object in a separate cluster and then repeatedly join the two clusters that have the greatest similarity, until only one cluster remains

    • The history of merging/clustering forms a binary tree or hierarchy

  • 12

    Hierarchical Agglomerative Clustering (HAC)

    • Algorithm (figure: HAC pseudocode, annotated as follows; a code sketch follows below)
      – Initialization (for tree leaves): each object is a separate cluster, so the cluster number starts at n
      – At each step, the two most similar clusters are merged as a new cluster and the original two clusters are removed
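    A minimal Python sketch of the HAC loop described by the algorithm figure, assuming a precomputed n × n object-similarity matrix; the function and argument names are illustrative, and the linkage options anticipate the cluster-similarity measures discussed a few slides below:

```python
import numpy as np

def hac(similarity, link="single"):
    """Sketch of hierarchical agglomerative clustering (HAC).

    similarity: n x n numpy array of pairwise object similarities.
    Returns the merge history as (cluster_a, cluster_b, similarity) triples.
    """
    clusters = {i: {i} for i in range(len(similarity))}   # each object starts as its own cluster
    history = []
    while len(clusters) > 1:
        # naive search for the pair of clusters with the greatest similarity
        best, best_sim = None, -np.inf
        ids = list(clusters)
        for a in range(len(ids)):
            for b in range(a + 1, len(ids)):
                pair_sims = [similarity[i, j]
                             for i in clusters[ids[a]] for j in clusters[ids[b]]]
                s = max(pair_sims) if link == "single" else min(pair_sims)
                if s > best_sim:
                    best, best_sim = (ids[a], ids[b]), s
        a, b = best
        clusters[a] |= clusters[b]        # the two clusters are merged as a new cluster
        del clusters[b]                   # the original second cluster is removed
        history.append((a, b, best_sim))
    return history
```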

  • 13

    Distance Metrics

    • Euclidean Distance (L2 norm)

      $L_2(\vec{x}, \vec{y}) = \sqrt{\sum_{i=1}^{m} (x_i - y_i)^2}$

    • L1 Norm

      $L_1(\vec{x}, \vec{y}) = \sum_{i=1}^{m} |x_i - y_i|$

    • Cosine Similarity (transformed to a distance by subtracting from 1)

      $1 - \frac{\vec{x} \cdot \vec{y}}{|\vec{x}|\,|\vec{y}|}$, ranged between 0 and 1

    (a code sketch of these measures follows below)
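    A minimal numpy sketch of the three measures above (function names are illustrative); note that the cosine distance lies in [0, 1] for non-negative vectors such as term-frequency vectors:

```python
import numpy as np

def l2_distance(x, y):
    return np.sqrt(np.sum((x - y) ** 2))    # Euclidean distance

def l1_distance(x, y):
    return np.sum(np.abs(x - y))            # L1 (Manhattan) distance

def cosine_distance(x, y):
    cos = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))
    return 1.0 - cos                        # in [0, 1] for non-negative vectors

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 0.0, 1.0])
print(l2_distance(x, y), l1_distance(x, y), cosine_distance(x, y))
```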

  • 14

    Measures of Cluster Similarity

    • Especially for the bottom-up approaches

    • Single-link clustering
      – The similarity between two clusters is the similarity of the two closest objects in the clusters
      – Search over all pairs of objects that are from the two different clusters and select the pair with the greatest similarity

      $sim(c_i, c_j) = \max_{\vec{x} \in c_i, \vec{y} \in c_j} sim(\vec{x}, \vec{y})$

      (figure: two clusters $c_u$ and $c_v$ joined at their pair of greatest similarity)

  • 15

    Measures of Cluster Similarity

    • Complete-link clustering
      – The similarity between two clusters is the similarity of their two most dissimilar members
      – Sphere-shaped clusters are achieved
      – Preferable for most IR and NLP applications

      $sim(c_i, c_j) = \min_{\vec{x} \in c_i, \vec{y} \in c_j} sim(\vec{x}, \vec{y})$

      (figure: two clusters $c_u$ and $c_v$ joined at their pair of least similarity; a code sketch of both linkage criteria follows below)
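    A minimal sketch of the two linkage criteria (single-link and complete-link), assuming a pairwise object-similarity function; names are illustrative:

```python
import numpy as np

def single_link_sim(cluster_i, cluster_j, sim):
    return max(sim(x, y) for x in cluster_i for y in cluster_j)   # closest pair

def complete_link_sim(cluster_i, cluster_j, sim):
    return min(sim(x, y) for x in cluster_i for y in cluster_j)   # most dissimilar pair

# Usage with cosine similarity over small example clusters.
cos = lambda x, y: float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))
ci = [np.array([1.0, 0.0]), np.array([0.9, 0.1])]
cj = [np.array([0.0, 1.0]), np.array([0.2, 0.8])]
print(single_link_sim(ci, cj, cos), complete_link_sim(ci, cj, cos))
```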

  • 16

    Measures of Cluster Similarity

    (figure: the clusterings produced by single link vs. complete link on the same data)

  • 17

    Measures of Cluster Similarity

    • Group-average agglomerative clustering
      – A compromise between single-link and complete-link clustering
      – The similarity between two clusters is the average similarity between their members
      – If the objects are represented as length-normalized vectors and the similarity measure is the cosine, there exists a fast algorithm for computing the average similarity

      $sim(\vec{x}, \vec{y}) = \cos(\vec{x}, \vec{y}) = \frac{\vec{x} \cdot \vec{y}}{|\vec{x}|\,|\vec{y}|} = \vec{x} \cdot \vec{y}$  (for length-normalized vectors)

  • 18

    Measures of Cluster Similarity

    • Group-average agglomerative clustering (cont.)

      – The average similarity SIM between vectors in a cluster $c_j$ is defined as

        $SIM(c_j) = \frac{1}{|c_j|(|c_j| - 1)} \sum_{\vec{x} \in c_j} \sum_{\substack{\vec{y} \in c_j \\ \vec{y} \neq \vec{x}}} sim(\vec{x}, \vec{y}) = \frac{1}{|c_j|(|c_j| - 1)} \sum_{\vec{x} \in c_j} \sum_{\substack{\vec{y} \in c_j \\ \vec{y} \neq \vec{x}}} \vec{x} \cdot \vec{y}$

      – The sum of the members in a cluster $c_j$: $\vec{s}(c_j) = \sum_{\vec{x} \in c_j} \vec{x}$

      – Express $SIM(c_j)$ in terms of $\vec{s}(c_j)$, using $\vec{x} \cdot \vec{x} = 1$ for length-normalized vectors:

        $\vec{s}(c_j) \cdot \vec{s}(c_j) = \sum_{\vec{x} \in c_j} \vec{x} \cdot \vec{s}(c_j) = \sum_{\vec{x} \in c_j} \sum_{\vec{y} \in c_j} \vec{x} \cdot \vec{y} = |c_j|(|c_j| - 1)\, SIM(c_j) + \sum_{\vec{x} \in c_j} \vec{x} \cdot \vec{x} = |c_j|(|c_j| - 1)\, SIM(c_j) + |c_j|$

        $\therefore\ SIM(c_j) = \frac{\vec{s}(c_j) \cdot \vec{s}(c_j) - |c_j|}{|c_j|(|c_j| - 1)}$

  • 19

    Measures of Cluster Similarity

    • Group-average agglomerative clustering (cont.)

      – As two clusters $c_i$ and $c_j$ are merged, the cluster sum vectors $\vec{s}(c_i)$ and $\vec{s}(c_j)$ are known in advance

      – The average similarity for their union will be

        $SIM(c_i \cup c_j) = \frac{\left( \vec{s}(c_i) + \vec{s}(c_j) \right) \cdot \left( \vec{s}(c_i) + \vec{s}(c_j) \right) - \left( |c_i| + |c_j| \right)}{\left( |c_i| + |c_j| \right)\left( |c_i| + |c_j| - 1 \right)}$

        $\vec{s}(c_{New}) = \vec{s}(c_i) + \vec{s}(c_j), \qquad |c_{New}| = |c_i| + |c_j|$

        (a code sketch follows below)
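    A minimal sketch of the fast group-average computation, assuming length-normalized object vectors; names are illustrative:

```python
import numpy as np

def group_average_sim(sum_vec, size):
    """SIM(c) computed from the cluster sum vector s(c) and the cluster size |c|."""
    return (np.dot(sum_vec, sum_vec) - size) / (size * (size - 1))

def merged_sim(sum_i, size_i, sum_j, size_j):
    """Average similarity of the union: only the sum vectors and sizes are needed."""
    return group_average_sim(sum_i + sum_j, size_i + size_j)

# Usage: three unit vectors split into clusters of sizes 2 and 1.
v = np.array([[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]])
s_i, s_j = v[0] + v[1], v[2]
print(merged_sim(s_i, 2, s_j, 1))   # equals the average pairwise cosine of the union
```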

  • 20

    Example: Word Clustering

    • Words (objects) are described and clustered using a set of features and values
      – E.g., the left and right neighbors of tokens of the words

    (figure: a word-clustering tree; "be" has the least similarity with the other 21 words, and higher nodes correspond to decreasing similarity)

  • 21

    Divisive Clustering

    • A top-down approach

    • Start with all objects in a single cluster

    • At each iteration, select the least coherent cluster and split it

    • Continue the iterations until a predefined criterion (e.g., the cluster number) is achieved

    • The history of clustering forms a binary tree or hierarchy

  • 22

    Divisive Clustering

    • To select the least coherent cluster, the measures used in bottom-up clustering can be used again here
      – Single-link measure
      – Complete-link measure
      – Group-average measure

    • How to split a cluster
      – Splitting is itself a clustering task (finding two sub-clusters)
      – Any clustering algorithm can be used for the splitting operation, e.g.,
        • Bottom-up (agglomerative) algorithms
        • Non-hierarchical clustering algorithms (e.g., K-means)

  • 23

    Divisive Clustering

    • Algorithm (figure: divisive-clustering pseudocode, annotated as follows; a code sketch follows below)
      – At each iteration, split the least coherent cluster
      – Generate two new clusters and remove the original one
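    A minimal sketch of the divisive loop, using average similarity to the centroid as the coherence measure and a tiny 2-means call as the splitting operation; both are illustrative choices under the assumption of length-normalized vectors, not prescribed by the slides:

```python
import numpy as np

def coherence(cluster):
    """Average similarity of members to the cluster centroid (dot product ~ cosine)."""
    centroid = cluster.mean(axis=0)
    return float(np.mean(cluster @ centroid))

def split_in_two(cluster, n_iter=10, seed=0):
    """A tiny 2-means call used as the splitting operation."""
    rng = np.random.default_rng(seed)
    centers = cluster[rng.choice(len(cluster), size=2, replace=False)]
    for _ in range(n_iter):
        labels = ((cluster[:, None, :] - centers) ** 2).sum(-1).argmin(1)
        centers = np.array([cluster[labels == k].mean(0) if np.any(labels == k)
                            else centers[k] for k in (0, 1)])
    return cluster[labels == 0], cluster[labels == 1]

def divisive(data, target_k):
    clusters = [data]
    while len(clusters) < target_k:
        splittable = [j for j, c in enumerate(clusters) if len(c) >= 2]
        i = min(splittable, key=lambda j: coherence(clusters[j]))   # least coherent cluster
        a, b = split_in_two(clusters.pop(i))                        # remove it, keep its two parts
        clusters += [a, b]
    return clusters
```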

  • 24

    Non-Hierarchical Clustering

  • 25

    Non-hierarchical Clustering

    • Start out with a partition based on randomly selected seeds (one seed per cluster) and then refine the initial partition
      – In a multi-pass manner

    • Problems associated with non-hierarchical clustering
      – When to stop (e.g., monitor MI, group-average similarity, or the likelihood)
      – What is the right number of clusters (k-1 → k → k+1)?
        • Hierarchical clustering also has to face this problem

    • Algorithms introduced here
      – The K-means algorithm
      – The EM algorithm

  • 26

    The K-means Algorithm

    • A hard clustering algorithm

    • Defines clusters by the center of mass of their members

    • Initialization
      – A set of initial cluster centers is needed

    • Recursion
      – Assign each object to the cluster whose center is closest
      – Then, re-compute the center of each cluster as the centroid or mean (average) of its members
        • Alternatively, use the medoid as the cluster center (a medoid is one of the objects in the cluster)

  • 27

    The K-means Algorithm

    • Algorithm (figure: K-means pseudocode, annotated with: cluster centroid, cluster assignment, calculation of new centroids; a code sketch follows below)
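    A minimal numpy sketch of the K-means recursion (cluster assignment followed by centroid re-computation); function and argument names are illustrative:

```python
import numpy as np

def kmeans(data, k, n_iter=100, seed=0):
    """data: (n, m) array of objects; k: number of clusters. Returns (centers, labels)."""
    rng = np.random.default_rng(seed)
    centers = data[rng.choice(len(data), size=k, replace=False)]   # initial seeds
    for _ in range(n_iter):
        # cluster assignment: each object goes to the closest center (Euclidean)
        dists = ((data[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # calculation of new centroids: mean of each cluster's members
        new_centers = np.array([data[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels
```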

  • 28

    The K-means Algorithm

    • Example 1

  • 29

    The K-means Algorithm

    • Example 2

    (figure: example document clusters labeled government, finance, sports, research, and name, grouped using the cosine distance $1 - \frac{\vec{x} \cdot \vec{y}}{|\vec{x}|\,|\vec{y}|}$)

  • 30

    The K-means Algorithm

    • The choice of initial cluster centers (seeds) is important
      – Pick them at random
      – Or use another method, such as running a hierarchical clustering algorithm on a subset of the objects
        • E.g., the buckshot algorithm uses group-average agglomerative clustering on a random sample of the data whose size is the square root of the complete set (a seeding sketch follows below)
      – Poor seeds will result in sub-optimal clustering

    • How to break ties in case there are several centers with the same distance from an object
      – Randomly assign the object to one of the candidate clusters
      – Or, perturb the objects slightly
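    A minimal sketch of buckshot-style seeding as described above: hierarchically cluster a random sample of size √n and use the resulting group centroids as K-means seeds. The `hac_cluster` callable is an assumed stand-in (any group-average agglomerative routine that returns k groups of vectors), not an API from the slides:

```python
import numpy as np

def buckshot_seeds(data, k, hac_cluster, seed=0):
    """Return k seed centroids obtained by clustering a sqrt(n)-sized sample."""
    rng = np.random.default_rng(seed)
    n = len(data)
    sample = data[rng.choice(n, size=int(np.sqrt(n)), replace=False)]
    groups = hac_cluster(sample, k)            # assumed: returns k lists/arrays of vectors
    return np.array([np.mean(g, axis=0) for g in groups])   # group centroids as seeds
```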

  • 31

    The K-means Algorithm

    • E.g., the LBG algorithm
      – By Linde, Buzo, and Gray
      – The number of clusters doubles at each iteration (M → 2M)

    (figure: the global mean is split into the Cluster 1 and Cluster 2 means, which are split in turn into the Gaussian clusters {µ11, Σ11, ω11}, {µ12, Σ12, ω12}, {µ13, Σ13, ω13}, {µ14, Σ14, ω14})

  • 32

    The EM Algorithm

    • A soft version of the K-means algorithm
      – Each object could be a member of multiple clusters
      – Clustering as estimating a mixture of (continuous) probability distributions (a mixture of Gaussians)

      $P(\vec{x}_i \mid \Theta) = \sum_{k=1}^{K} P(\vec{x}_i \mid c_k; \Theta)\, P(c_k \mid \Theta)$, with mixture weights $\pi_k = P(c_k)$

      Continuous case:

      $P(\vec{x}_i \mid c_k; \Theta) = \frac{1}{(2\pi)^{m/2} |\Sigma_k|^{1/2}} \exp\!\left( -\frac{1}{2} (\vec{x}_i - \vec{\mu}_k)^T \Sigma_k^{-1} (\vec{x}_i - \vec{\mu}_k) \right)$

      Likelihood function for the data samples $\mathcal{X} = \{\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_n\}$, where the $\vec{x}_i$'s are independently, identically distributed (i.i.d.):

      $P(\mathcal{X} \mid \Theta) = \prod_{i=1}^{n} P(\vec{x}_i \mid \Theta) = \prod_{i=1}^{n} \sum_{k=1}^{K} P(\vec{x}_i \mid c_k; \Theta)\, P(c_k \mid \Theta)$

      Prediction for a new instance $\vec{x}_i$:

      $\arg\max_k P(c_k \mid \vec{x}_i, \Theta) = \arg\max_k \frac{P(\vec{x}_i \mid c_k; \Theta)\, P(c_k \mid \Theta)}{P(\vec{x}_i \mid \Theta)} = \arg\max_k P(\vec{x}_i \mid c_k; \Theta)\, P(c_k \mid \Theta)$

      (a code sketch of the mixture model follows below)
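    A minimal numpy/scipy sketch of the mixture model above: the data density P(x | Θ) and the prediction rule argmax_k P(x | c_k; Θ)P(c_k | Θ); parameter names are illustrative:

```python
import numpy as np
from scipy.stats import multivariate_normal

def mixture_density(x, weights, means, covs):
    """P(x | Theta) = sum_k P(x | c_k; Theta) P(c_k | Theta)."""
    return sum(w * multivariate_normal.pdf(x, mean=m, cov=S)
               for w, m, S in zip(weights, means, covs))

def predict_cluster(x, weights, means, covs):
    """argmax_k P(x | c_k; Theta) P(c_k | Theta)."""
    scores = [w * multivariate_normal.pdf(x, mean=m, cov=S)
              for w, m, S in zip(weights, means, covs)]
    return int(np.argmax(scores))

# Usage with two 2-D Gaussian clusters.
weights = [0.6, 0.4]
means = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
covs = [np.eye(2), np.eye(2)]
x = np.array([2.5, 2.8])
print(mixture_density(x, weights, means, covs), predict_cluster(x, weights, means, covs))
```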

  • 33

    The EM Algorithm

  • 34

    The EM Algorithm

    • E-step (Expectation)
      – Derive the complete data likelihood function

        $\begin{aligned}
        P(\mathcal{X} \mid \hat\Theta)
        &= \prod_{i=1}^{n} P(\vec{x}_i \mid \hat\Theta)
         = \prod_{i=1}^{n} \sum_{k=1}^{K} P(\vec{x}_i \mid c_k; \hat\Theta)\, P(c_k \mid \hat\Theta) \\
        &= \left[ \sum_{k_1=1}^{K} P(\vec{x}_1 \mid c_{k_1}; \hat\Theta)\, P(c_{k_1} \mid \hat\Theta) \right] \times \cdots \times \left[ \sum_{k_n=1}^{K} P(\vec{x}_n \mid c_{k_n}; \hat\Theta)\, P(c_{k_n} \mid \hat\Theta) \right] \\
        &= \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} \prod_{i=1}^{n} P(\vec{x}_i \mid c_{k_i}; \hat\Theta)\, P(c_{k_i} \mid \hat\Theta) \\
        &= \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} P(\mathcal{X}, \mathcal{C} \mid \hat\Theta)
         = \sum_{\mathcal{C}} P(\mathcal{X}, \mathcal{C} \mid \hat\Theta)
        \end{aligned}$

        where $P(\mathcal{X} \mid \hat\Theta)$ is the likelihood function, $P(\mathcal{X}, \mathcal{C} \mid \hat\Theta)$ is the complete data likelihood function, and

        $\mathcal{X} = \{\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_n\}, \qquad \mathcal{C} = \{c_{k_1}, c_{k_2}, \ldots, c_{k_n}\}$

        (How many kinds of $\mathcal{C}$ are there? $K^n$ kinds)

        Note:

        $\prod_{t=1}^{T} \sum_{k=1}^{K} a_{tk} = (a_{11} + a_{12} + \cdots + a_{1K})(a_{21} + a_{22} + \cdots + a_{2K}) \cdots (a_{T1} + a_{T2} + \cdots + a_{TK}) = \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_T=1}^{K} \prod_{t=1}^{T} a_{t k_t}$

  • 35

    The EM Algorithm

    • E-step (Expectation)
      – Define the auxiliary function $\Phi(\Theta, \hat\Theta)$ as the expectation of the log complete data likelihood function $L_{CM}$ with respect to the hidden/latent variable $\mathcal{C}$, conditioned on the known data $(\mathcal{X}, \Theta)$:

        $\Phi(\Theta, \hat\Theta) = E_{\mathcal{C}}\!\left[ L_{CM} \mid \mathcal{X}, \Theta \right] = E_{\mathcal{C}}\!\left[ \log P(\mathcal{X}, \mathcal{C} \mid \hat\Theta) \mid \mathcal{X}, \Theta \right] = \sum_{\mathcal{C}} P(\mathcal{C} \mid \mathcal{X}, \Theta)\, \log P(\mathcal{X}, \mathcal{C} \mid \hat\Theta)$

      – Maximize the log likelihood function $\log P(\mathcal{X} \mid \hat\Theta)$ by maximizing the expectation of the log complete likelihood function, $\Phi(\Theta, \hat\Theta)$
        • We have shown this when deriving the HMM-based retrieval model

  • 36

    The EM Algorithm

    • E-step (Expectation)
      – The auxiliary function $\Phi(\Theta, \hat\Theta)$:

        $\begin{aligned}
        \Phi(\Theta, \hat\Theta)
        &= \sum_{\mathcal{C}} P(\mathcal{C} \mid \mathcal{X}, \Theta)\, \log P(\mathcal{X}, \mathcal{C} \mid \hat\Theta) \\
        &= \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} \left[ \prod_{i=1}^{n} P(c_{k_i} \mid \vec{x}_i, \Theta) \right] \log \left[ \prod_{j=1}^{n} P(\vec{x}_j, c_{k_j} \mid \hat\Theta) \right] \\
        &= \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} \left[ \prod_{i=1}^{n} P(c_{k_i} \mid \vec{x}_i, \Theta) \right] \sum_{j=1}^{n} \log P(\vec{x}_j, c_{k_j} \mid \hat\Theta) \\
        &= \sum_{j=1}^{n} \sum_{k=1}^{K} \sum_{k_1=1}^{K} \cdots \sum_{k_n=1}^{K} \delta(k_j, k) \left[ \prod_{i=1}^{n} P(c_{k_i} \mid \vec{x}_i, \Theta) \right] \log P(\vec{x}_j, c_k \mid \hat\Theta) \\
        &= \sum_{j=1}^{n} \sum_{k=1}^{K} P(c_k \mid \vec{x}_j, \Theta)\, \log P(\vec{x}_j, c_k \mid \hat\Theta) \\
        &= \sum_{k=1}^{K} \sum_{i=1}^{n} P(c_k \mid \vec{x}_i, \Theta)\, \log \left[ P(\vec{x}_i \mid c_k; \hat\Theta)\, P(c_k \mid \hat\Theta) \right] \\
        &= \sum_{k=1}^{K} \sum_{i=1}^{n} P(c_k \mid \vec{x}_i, \Theta)\, \log P(c_k \mid \hat\Theta) + \sum_{k=1}^{K} \sum_{i=1}^{n} P(c_k \mid \vec{x}_i, \Theta)\, \log P(\vec{x}_i \mid c_k; \hat\Theta)
        \end{aligned}$

        where $\delta(k_j, k) = \begin{cases} 1 & \text{if } k_j = k \\ 0 & \text{otherwise} \end{cases}$

        (Each datum is aligned to a specific mixture component; equivalently, each mixture component collects its corresponding training data.)

  • 37

    The EM Algorithm

      – Note that, for a given $i$ and $k$ ($\vec{x}_i$ can only be aligned to $c_k$),

        $\begin{aligned}
        \sum_{k_1=1}^{K} \cdots \sum_{k_n=1}^{K} \delta(k_i, k) \left[ \prod_{j=1}^{n} P(c_{k_j} \mid \vec{x}_j, \Theta) \right]
        &= P(c_k \mid \vec{x}_i, \Theta) \sum_{k_1=1}^{K} \cdots \sum_{k_{i-1}=1}^{K} \sum_{k_{i+1}=1}^{K} \cdots \sum_{k_n=1}^{K} \left[ \prod_{j=1, j \neq i}^{n} P(c_{k_j} \mid \vec{x}_j, \Theta) \right] \\
        &= P(c_k \mid \vec{x}_i, \Theta) \prod_{j=1, j \neq i}^{n} \left[ \sum_{k_j=1}^{K} P(c_{k_j} \mid \vec{x}_j, \Theta) \right] \\
        &= P(c_k \mid \vec{x}_i, \Theta)
        \end{aligned}$

        using $\delta(k_i, k) = \begin{cases} 1 & \text{if } k_i = k \\ 0 & \text{otherwise} \end{cases}$ and, as before,

        $\prod_{t=1}^{T} \sum_{k=1}^{K} a_{tk} = (a_{11} + \cdots + a_{1K})(a_{21} + \cdots + a_{2K}) \cdots (a_{T1} + \cdots + a_{TK}) = \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_T=1}^{K} \prod_{t=1}^{T} a_{t k_t}$

  • 38

    The EM Algorithm

    • E-step (Expectation)
      – The auxiliary function can also be divided into two parts:

        $\Phi(\Theta, \hat\Theta) = \Phi_a(\Theta, \hat\Theta) + \Phi_b(\Theta, \hat\Theta)$, where

        $\Phi_a(\Theta, \hat\Theta) = \sum_{i=1}^{n} \sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \Theta)\, \log P(c_k \mid \hat\Theta) = \sum_{i=1}^{n} \sum_{k=1}^{K} \frac{P(\vec{x}_i \mid c_k; \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l; \Theta)\, P(c_l \mid \Theta)} \log P(c_k \mid \hat\Theta)$

        (the auxiliary function for the mixture weights)

        $\Phi_b(\Theta, \hat\Theta) = \sum_{i=1}^{n} \sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \Theta)\, \log P(\vec{x}_i \mid c_k; \hat\Theta) = \sum_{i=1}^{n} \sum_{k=1}^{K} \frac{P(\vec{x}_i \mid c_k; \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l; \Theta)\, P(c_l \mid \Theta)} \log P(\vec{x}_i \mid c_k; \hat\Theta)$

        (the auxiliary function for the cluster distributions)

  • 39

    The EM Algorithm

    • M-step (Maximization)
      – Remember that a function of the following form can be maximized by applying a Lagrange multiplier

        Suppose that $F = \sum_{j=1}^{N} w_j \log y_j$, subject to the constraint $\sum_{j=1}^{N} y_j = 1$.

        By applying a Lagrange multiplier $\ell$:

        $F = \sum_{j=1}^{N} w_j \log y_j + \ell \left( 1 - \sum_{j=1}^{N} y_j \right)$

        $\frac{\partial F}{\partial y_j} = \frac{w_j}{y_j} - \ell = 0 \;\Rightarrow\; w_j = \ell\, y_j \ \ \forall j \;\Rightarrow\; \sum_{j=1}^{N} w_j = \ell \sum_{j=1}^{N} y_j = \ell$

        $\therefore\ y_j = \frac{w_j}{\sum_{j'=1}^{N} w_{j'}}$

        (Note: $\frac{\partial \log y_j}{\partial y_j} = \frac{1}{y_j}$)

  • 40

    The EM Algorithm

    • M-step (Maximization)
      – Maximize $\Phi_a(\Theta, \hat\Theta)$, the auxiliary function for the mixture weights (or the priors of the Gaussians), subject to $\sum_{k=1}^{K} \hat P(c_k) = 1$:

        $\Phi_a(\Theta, \hat\Theta) + \ell \left( 1 - \sum_{k=1}^{K} \hat P(c_k) \right) = \sum_{k=1}^{K} \sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k; \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l; \Theta)\, P(c_l \mid \Theta)} \log \hat P(c_k) + \ell \left( 1 - \sum_{k=1}^{K} \hat P(c_k) \right)$

        This has the form of the previous slide with $w_k = \sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k; \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l; \Theta)\, P(c_l \mid \Theta)}$ and $y_k = \hat P(c_k)$, so

        $\Rightarrow\ \hat\pi_k = \hat P(c_k) = \frac{w_k}{\sum_{k'=1}^{K} w_{k'}} = \frac{\sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k; \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l; \Theta)\, P(c_l \mid \Theta)}}{\sum_{k'=1}^{K} \sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_{k'}; \Theta)\, P(c_{k'} \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l; \Theta)\, P(c_l \mid \Theta)}} = \frac{1}{n} \sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k; \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l; \Theta)\, P(c_l \mid \Theta)}$

  • 41

    The EM Algorithm

    • M-step (Maximization)
      – Maximize $\Phi_b(\Theta, \hat\Theta)$, the auxiliary function for the Gaussian means and variances:

        $\Phi_b(\Theta, \hat\Theta) = \sum_{i=1}^{n} \sum_{k=1}^{K} \frac{P(\vec{x}_i \mid c_k; \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l; \Theta)\, P(c_l \mid \Theta)} \log P(\vec{x}_i \mid c_k; \hat\Theta)$

        where $P(\vec{x}_i \mid c_k; \hat\Theta) = \frac{1}{(2\pi)^{m/2} |\hat\Sigma_k|^{1/2}} \exp\!\left( -\frac{1}{2} (\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat\Sigma_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k) \right)$

        Let $w_{i,k} = \frac{P(\vec{x}_i \mid c_k; \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l; \Theta)\, P(c_l \mid \Theta)}$ and

        $\log P(\vec{x}_i \mid c_k; \hat\Theta) = -\frac{m}{2} \log 2\pi - \frac{1}{2} \log |\hat\Sigma_k| - \frac{1}{2} (\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat\Sigma_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k)$

        Then

        $\Phi_b(\Theta, \hat\Theta) = \sum_{i=1}^{n} \sum_{k=1}^{K} w_{i,k} \left[ -\frac{1}{2} \log |\hat\Sigma_k| - \frac{1}{2} (\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat\Sigma_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k) \right] + D$, where $D$ is a constant.

  • 42

    The EM Algorithm

    • M-step (Maximization)
      – Maximize $\Phi_b(\Theta, \hat\Theta)$ with respect to $\hat{\vec{\mu}}_k$, using $\frac{\partial}{\partial \vec{x}} (\vec{x}^T C \vec{x}) = (C + C^T)\vec{x}$ and the fact that $\hat\Sigma_k^{-1}$ is symmetric:

        $\frac{\partial \Phi_b(\Theta, \hat\Theta)}{\partial \hat{\vec{\mu}}_k} = \sum_{i=1}^{n} w_{i,k}\, \hat\Sigma_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k) = 0$

        $\Rightarrow\ \hat{\vec{\mu}}_k = \frac{\sum_{i=1}^{n} w_{i,k}\, \vec{x}_i}{\sum_{i=1}^{n} w_{i,k}} = \frac{\sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k; \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l; \Theta)\, P(c_l \mid \Theta)}\, \vec{x}_i}{\sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k; \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l; \Theta)\, P(c_l \mid \Theta)}}$

        where $\Phi_b(\Theta, \hat\Theta) = \sum_{i=1}^{n} \sum_{k=1}^{K} w_{i,k} \left[ -\frac{1}{2} \log |\hat\Sigma_k| - \frac{1}{2} (\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat\Sigma_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k) \right] + D$

  • 43

    The EM Algorithm

    • M-step (Maximization)
      – Maximize $\Phi_b(\Theta, \hat\Theta)$ with respect to $\hat\Sigma_k$, using $\frac{d}{dX} \det(X) = \det(X) \cdot (X^{-1})^T$ and $\frac{d}{dX} \left( \vec{a}^T X^{-1} \vec{b} \right) = -X^{-1} \vec{a}\, \vec{b}^T X^{-1}$ (for symmetric $X$):

        $\frac{\partial \Phi_b(\Theta, \hat\Theta)}{\partial \hat\Sigma_k} = \sum_{i=1}^{n} w_{i,k} \left[ -\frac{1}{2} \hat\Sigma_k^{-1} + \frac{1}{2} \hat\Sigma_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat\Sigma_k^{-1} \right] = 0$

        $\Rightarrow\ \sum_{i=1}^{n} w_{i,k}\, \hat\Sigma_k^{-1} = \sum_{i=1}^{n} w_{i,k}\, \hat\Sigma_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat\Sigma_k^{-1}$

        $\Rightarrow\ \hat\Sigma_k = \frac{\sum_{i=1}^{n} w_{i,k}\, (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T}{\sum_{i=1}^{n} w_{i,k}} = \frac{\sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k; \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l; \Theta)\, P(c_l \mid \Theta)}\, (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T}{\sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k; \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l; \Theta)\, P(c_l \mid \Theta)}}$

        where $\Phi_b(\Theta, \hat\Theta) = \sum_{i=1}^{n} \sum_{k=1}^{K} w_{i,k} \left[ -\frac{1}{2} \log |\hat\Sigma_k| - \frac{1}{2} (\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat\Sigma_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k) \right] + D$

        (a code sketch of a full EM iteration follows below)
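    Tying the pieces together, a minimal numpy/scipy sketch of one EM iteration using the closed-form updates derived above (responsibilities in the E-step; mixture weights, means, and covariances in the M-step), plus the log-likelihood used as the convergence check on the next slide; function and variable names are illustrative:

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_step(data, weights, means, covs):
    """One EM iteration for a Gaussian mixture. data: (n, m) numpy array."""
    n, K = len(data), len(weights)
    # E-step: responsibilities w_{i,k} = P(c_k | x_i, Theta)
    w = np.array([[weights[k] * multivariate_normal.pdf(x, mean=means[k], cov=covs[k])
                   for k in range(K)] for x in data])
    w /= w.sum(axis=1, keepdims=True)
    # M-step: closed-form re-estimates
    new_weights = w.sum(axis=0) / n                                   # pi_k = (1/n) sum_i w_{i,k}
    new_means = [w[:, k] @ data / w[:, k].sum() for k in range(K)]    # mu_k
    new_covs = []
    for k in range(K):
        diff = data - new_means[k]
        new_covs.append((w[:, k, None] * diff).T @ diff / w[:, k].sum())  # Sigma_k
    return new_weights, new_means, new_covs

def log_likelihood(data, weights, means, covs):
    """log P(X | Theta), monitored to decide when to stop iterating."""
    dens = [sum(weights[k] * multivariate_normal.pdf(x, mean=means[k], cov=covs[k])
                for k in range(len(weights))) for x in data]
    return float(np.sum(np.log(dens)))
```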

  • 44

    The EM Algorithm

    • The initial cluster distributions can be estimated using the K-means algorithm

    • The procedure terminates when the likelihood function $P(\mathcal{X} \mid \Theta)$ has converged or the maximum number of iterations is reached
