

Clustering

Adapted from: Doug Downey and Bryan Pardo, Northwestern University


Bagging

- Use bootstrapping to generate L training sets and train one base-learner with each (Breiman, 1996)
- Use voting to combine the base-learners' predictions
- Unstable algorithms (e.g., decision trees) profit most from bagging
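
As a concrete sketch of the idea (assuming scikit-learn decision trees as the unstable base learner and integer class labels; L = 25 is an arbitrary choice):

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def bagging_fit(X, y, L=25, seed=0):
        # Train L trees, each on a bootstrap sample (drawn with replacement).
        rng = np.random.default_rng(seed)
        learners = []
        for _ in range(L):
            idx = rng.integers(0, len(X), size=len(X))
            learners.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
        return learners

    def bagging_predict(learners, X):
        # Combine by voting: the most common prediction wins.
        votes = np.stack([m.predict(X) for m in learners]).astype(int)
        return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)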


Boosting

- Given a large training set, randomly divide it into 3 sets (X1, X2, and X3)
- Use X1 to train D1
- Test D1 with X2
- Training set for D2 = all instances from X2 misclassified by D1 (and also as many instances from X2 on which D1 is correct)
- Test D1 and D2 with X3
- Training set for D3 = the instances from X3 on which D1 and D2 disagree (see the sketch below)
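
A minimal sketch of this scheme (often called boosting by filtering), assuming scikit-learn decision stumps as base learners and that X2 contains enough correctly classified instances to sample from; deferring to D3 on disagreement is one common way to combine the three:

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def boost_three(X1, y1, X2, y2, X3, y3, seed=0):
        rng = np.random.default_rng(seed)
        d1 = DecisionTreeClassifier(max_depth=1).fit(X1, y1)

        # D2: the X2 instances D1 misclassifies, plus as many that D1 gets right.
        wrong = d1.predict(X2) != y2
        right = rng.choice(np.flatnonzero(~wrong), size=wrong.sum(), replace=False)
        idx = np.concatenate([np.flatnonzero(wrong), right])
        d2 = DecisionTreeClassifier(max_depth=1).fit(X2[idx], y2[idx])

        # D3: the X3 instances on which D1 and D2 disagree.
        disagree = d1.predict(X3) != d2.predict(X3)
        d3 = DecisionTreeClassifier(max_depth=1).fit(X3[disagree], y3[disagree])
        return d1, d2, d3

    def boost_predict(d1, d2, d3, X):
        p1, p2 = d1.predict(X), d2.predict(X)
        return np.where(p1 == p2, p1, d3.predict(X))  # D3 breaks disagreements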


AdaBoost

Generate a sequence of base-learners, each focusing on the previous one's errors (Freund and Schapire, 1996).
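
The slide compresses the whole algorithm into one sentence; below is a minimal NumPy/scikit-learn sketch of binary AdaBoost (labels assumed in {-1, +1}; decision stumps are an assumption here, the algorithm accepts any weak learner):

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def adaboost(X, y, rounds=20):
        n = len(X)
        w = np.full(n, 1.0 / n)          # instance weights, initially uniform
        learners, alphas = [], []
        for _ in range(rounds):
            stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
            pred = stump.predict(X)
            err = w[pred != y].sum()
            if err >= 0.5:               # weak learner no better than chance
                break
            alpha = 0.5 * np.log((1 - err) / max(err, 1e-10))
            learners.append(stump)
            alphas.append(alpha)
            if err == 0:
                break
            w *= np.exp(-alpha * y * pred)  # up-weight misclassified instances
            w /= w.sum()
        return learners, alphas

    def adaboost_predict(learners, alphas, X):
        scores = sum(a * m.predict(X) for a, m in zip(alphas, learners))
        return np.sign(scores)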


Mixture of Experts

Voting where the weights are input-dependent (gating) (Jacobs et al., 1991):

  $y = \sum_{j=1}^{L} w_j d_j$
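
As a sketch of the idea with linear experts and a softmax gating network (the parameterization is an assumption for illustration; the original paper trains experts and gate jointly):

    import numpy as np

    def mixture_output(x, expert_W, gate_V):
        # y = sum_j w_j(x) d_j(x), where the gating weights w_j depend on x.
        d = expert_W @ x                        # d_j(x): each expert's output
        g = gate_V @ x
        w = np.exp(g - g.max()); w /= w.sum()   # softmax gating weights w_j(x)
        return w @ d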


Stacking

The combiner f() is another learner (Wolpert, 1992).
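
A minimal scikit-learn sketch: the base learners' out-of-fold predictions become the training features for the combiner (logistic regression and these two base learners are arbitrary choices; numeric class labels assumed):

    import numpy as np
    from sklearn.model_selection import cross_val_predict
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.naive_bayes import GaussianNB
    from sklearn.linear_model import LogisticRegression

    def stack_fit(X, y):
        bases = [DecisionTreeClassifier(max_depth=3), GaussianNB()]
        # Out-of-fold predictions avoid training the combiner on fitted outputs.
        Z = np.column_stack([cross_val_predict(b, X, y, cv=5) for b in bases])
        combiner = LogisticRegression().fit(Z, y)
        bases = [b.fit(X, y) for b in bases]   # refit bases on all the data
        return bases, combiner

    def stack_predict(bases, combiner, X):
        Z = np.column_stack([b.predict(X) for b in bases])
        return combiner.predict(Z)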


Cascading

- Use d_j only if the preceding learners are not confident
- Cascade learners in order of complexity, simplest first (see the sketch below)
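
A minimal sketch, assuming probabilistic classifiers ordered from cheapest to most complex and an arbitrary confidence threshold of 0.9:

    import numpy as np

    def cascade_predict(models, x, threshold=0.9):
        # Return the first sufficiently confident prediction in the cascade.
        x = np.asarray(x).reshape(1, -1)
        for model in models[:-1]:
            proba = model.predict_proba(x)[0]
            if proba.max() >= threshold:          # confident enough: stop here
                return model.classes_[proba.argmax()]
        return models[-1].predict(x)[0]           # the last learner always answers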


Clustering

- Grouping data into (hopefully useful) sets.

[Figure: a toy dataset grouped into "Things on the left" and "Things on the right"]


Clustering

- Unsupervised learning: no labels
- Why do clustering?
  - Labeling is costly
  - Data pre-processing
  - Text classification (e.g., search engines, Google Sets)
  - Hypothesis generation / data understanding: clusters might suggest natural groups
  - Visualization


Some definitions

- Let X be the dataset:

  $X = \{x_1, x_2, x_3, \ldots, x_n\}$

- An m-clustering of X is a partition of X into m sets (clusters) C_1, ..., C_m such that:

  1. Clusters are non-empty: $C_i \neq \emptyset, \; i = 1, \ldots, m$
  2. Clusters cover all of X: $\bigcup_{i=1}^{m} C_i = X$
  3. Clusters do not overlap: $C_i \cap C_j = \emptyset \text{ if } i \neq j$


How many possible clusterings?

The number of ways to partition a dataset of size n into m clusters is given by the Stirling numbers of the second kind:

  $S(n, m) = \frac{1}{m!} \sum_{i=0}^{m} (-1)^{m-i} \binom{m}{i} i^n$

where n is the size of the dataset and m is the number of clusters. For example:

  $S(15, 3) = 2{,}375{,}101$
  $S(20, 4) = 45{,}232{,}115{,}901$
  $S(100, 5) \approx 10^{68}$
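
Since Python integers are arbitrary-precision, these values can be checked exactly with a direct transcription of the formula (a quick verification sketch):

    from math import comb, factorial

    def stirling2(n, m):
        # Number of ways to partition n items into m non-empty sets.
        return sum((-1) ** (m - i) * comb(m, i) * i ** n
                   for i in range(m + 1)) // factorial(m)

    print(stirling2(15, 3))             # 2375101
    print(stirling2(20, 4))             # 45232115901
    print(len(str(stirling2(100, 5))))  # 68 digits, i.e., roughly 10^68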


What does this mean?

- We can't try all possible clusterings.
- Clustering algorithms look at only a small fraction of all partitions of the data.
- The exact partitions tried depend on the kind of clustering used.


Who is right?

- Different techniques cluster the same dataset DIFFERENTLY.
- Who is right? Is there a "right" clustering?


Classic Example: Half Moons

From Batra et al., http://www.cs.cmu.edu/~rahuls/pub/bmvc2008-clustering-rahuls.pdf  


Steps in Clustering

- Select features
- Define a proximity measure
- Define a clustering criterion
- Define a clustering algorithm
- Validate the results
- Interpret the results


Kinds of Clustering

- Sequential: fast
- Cost optimization: fixed number of clusters (typically)
- Hierarchical: start with many clusters, join clusters at each step


A Sequential Clustering Method

- Basic Sequential Algorithmic Scheme (BSAS)
  (S. Theodoridis and K. Koutroumbas, Pattern Recognition, Academic Press, London, England, 1999)
- Assumption: the number of clusters is not known in advance.

    m = 1
    C_1 = {x_1}
    For i = 2 to n
        Find C_k : d(x_i, C_k) = min_j d(x_i, C_j)
        If (d(x_i, C_k) > Θ) and (m < q)
            m = m + 1
            C_m = {x_i}
        Else
            C_k = C_k ∪ {x_i}
        End
    End

where:
- d(x, C) = the distance between feature vector x and cluster C
- Θ = the threshold of dissimilarity
- q = the maximum number of clusters
- n = the number of data points
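
A minimal NumPy sketch of BSAS, assuming d(x, C) is the Euclidean distance from x to the cluster mean (the book also allows other point-to-cluster distances):

    import numpy as np

    def bsas(X, theta, q):
        # Basic Sequential Algorithmic Scheme: a single, order-dependent pass.
        clusters = [[0]]                     # member indices; x_1 seeds cluster C_1
        means = [X[0].astype(float)]
        for i in range(1, len(X)):
            dists = [np.linalg.norm(X[i] - m) for m in means]
            k = int(np.argmin(dists))
            if dists[k] > theta and len(clusters) < q:
                clusters.append([i])         # x_i is dissimilar: start a new cluster
                means.append(X[i].astype(float))
            else:
                clusters[k].append(i)        # otherwise merge x_i into the nearest cluster
                means[k] = X[clusters[k]].mean(axis=0)
        return clusters, means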


A Cost-Optimization Method

- K-means clustering
  (J. B. MacQueen (1967): "Some Methods for Classification and Analysis of Multivariate Observations", Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, Berkeley: University of California Press, 1:281-297)
- A greedy algorithm
- Partitions n examples into k clusters
- Minimizes the sum of the squared distances to the cluster centers


The K-means Algorithm

1. Place K points into the space represented by the objects being clustered. These points represent the initial group centroids (means).
2. Assign each object to the group that has the closest centroid (mean).
3. When all objects have been assigned, recalculate the positions of the K centroids (means).
4. Repeat steps 2 and 3 until the centroids no longer move.
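
A minimal NumPy sketch of these four steps (initializing from k randomly chosen data points is one common choice; the slides leave initialization open):

    import numpy as np

    def kmeans(X, k, max_iter=100, seed=0):
        rng = np.random.default_rng(seed)
        centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)  # step 1
        for _ in range(max_iter):
            # Step 2: assign each point to its closest centroid.
            dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # Step 3: move each centroid to the mean of its assigned points.
            new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                            else centroids[j] for j in range(k)])
            if np.allclose(new, centroids):  # step 4: stop once the means settle
                break
            centroids = new
        return labels, centroids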


K-means Clustering

- The way to initialize the mean values is not specified.
  - Randomly choose k samples?
- Results depend on the initial means.
  - Try multiple starting points?
- Assumes K is known.
  - How do we choose it?


k-Means Clustering

- Find k reference vectors (centroids) which best represent the data
- Reference vectors: $m_j, \; j = 1, \ldots, k$
- Use the nearest (most similar) reference:

  $x^t \Rightarrow m_i \text{ where } \|x^t - m_i\| = \min_j \|x^t - m_j\|$


Encoding/Decoding


Reconstruction Error

  $E(\{m_i\}_{i=1}^{k} \mid X) = \sum_t \sum_i b_i^t \, \|x^t - m_i\|^2$

  $b_i^t = \begin{cases} 1 & \text{if } \|x^t - m_i\| = \min_j \|x^t - m_j\| \\ 0 & \text{otherwise} \end{cases}$
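
In code this is the quantity k-means minimizes; a short NumPy sketch (labels and centroids as returned by a k-means pass, e.g., the kmeans sketch above):

    import numpy as np

    def reconstruction_error(X, labels, centroids):
        # Sum of squared distances from each point to its assigned centroid.
        return float(((X - centroids[labels]) ** 2).sum())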


k-means Clustering


Leader Cluster Algorithm

- An instance far away from all centroids (dist > threshold) becomes a new centroid.
- A cluster that covers a large number of instances (num > threshold) is split into 2 clusters.
- A cluster that covers too few instances (num < threshold) can be removed (and perhaps re-seeded at another random data point). A sketch of these rules follows below.
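
A minimal sketch of these heuristics as one adjustment pass over a k-means-style assignment (all three thresholds are placeholders; the slide does not fix them or the exact split/re-seed rules):

    import numpy as np

    def leader_adjust(X, labels, centroids, dist_thresh, max_size, min_size, seed=0):
        rng = np.random.default_rng(seed)
        new_centroids = []
        for j, c in enumerate(centroids):
            members = np.flatnonzero(labels == j)
            if len(members) < min_size:
                continue                         # rule 3: drop a too-small cluster
            if len(members) > max_size:          # rule 2: split a too-large cluster
                a, b = X[rng.choice(members, size=2, replace=False)]
                new_centroids += [a, b]
            else:
                new_centroids.append(c)
        for x in X:                              # rule 1: a far-away point seeds a new cluster
            if all(np.linalg.norm(x - c) > dist_thresh for c in new_centroids):
                new_centroids.append(x)
        return np.array(new_centroids)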


Choosing K

- Defined by the application, e.g., image quantization
- PCA
- Incremental (leader-cluster) algorithm: add one cluster at a time until the reconstruction error shows an "elbow" (see the sketch below)
- Manual check for meaning
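
A minimal sketch of the elbow heuristic on synthetic data, reusing the kmeans() and reconstruction_error() sketches above (the blob data and the range of k are assumptions for illustration):

    import numpy as np

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(c, 0.5, size=(50, 2)) for c in (0, 4, 8)])  # 3 blobs

    for k in range(1, 9):
        labels, centroids = kmeans(X, k)  # from the sketch above
        print(k, round(reconstruction_error(X, labels, centroids), 1))
    # The error drops steeply up to k = 3 (the true number of blobs), then
    # flattens: that bend in the curve is the "elbow".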


Supervised Learning After Clustering


Naïve Bayes Mood Classifier


Training Data


Human Powered Compression

Label each of the following moods with one of the following seven categories: happy, sad, angry, fearful, disgusted, surprised, or none of the above.

- Pleased
- Jubilant
- Recumbent
- Ditzy
- Weird
- Geeky
- Blank
- Dirty
- Thirsty
- Guilty
- Hot
- Worried
- Nervous
- Hungry
- Nostalgic
- Artistic
- Crushed
- Giggly


LiveJournal Mood Hierarchy

- angry (#2)
  - aggravated (#1)
  - annoyed (#3)
  - bitchy (#110)
  - cranky (#8)
  - cynical (#104)
  - enraged (#12)
  - frustrated (#47)
  - grumpy (#95)
  - infuriated (#19)
  - irate (#20)
  - irritated (#112)
  - moody (#23)
  - pissed off (#24)
  - stressed (#28)
    - rushed (#100)
- awake (#87)
- confused (#6)
  - curious (#56)
- determined (#45)
  - predatory (#118)
  - devious (#130)
- energetic (#11)
  - bouncy (#59)
  - hyper (#52)
  - enthralled (#13)
- happy (#15)
  - amused (#44)
  - cheerful (#125)
  - chipper (#99)
  - ecstatic (#98)
  - excited (#41)


K-Means Clustering

Happy       Sad          Angry
----------  -----------  ----------
Energetic   Confused     Aggravated
Bouncy      Crappy       Angry
Happy       Crushed      Bitchy
Hyper       Depressed    Enraged
Cheerful    Distressed   Infuriated
Ecstatic    Envious      Irate
Excited     Gloomy       Pissed off
Jubilant    Guilty
Giddy       Intimidated
Giggly      Jealous
            Lonely
            Rejected
            Sad
            Scared


K-Means Clustering

[Bar chart: number of posts per mood (y-axis 0 to 18,000), with moods along the x-axis grouped by cluster: enraged, pissed off, infuriated, angry, irate, bitchy, aggravated, giddy, hyper, energetic, ecstatic, giggly, cheerful, bouncy, jubilant, excited, happy, envious, lonely, depressed, intimidated, scared, guilty, confused, rejected, crappy, distressed, gloomy, jealous, sad, crushed]
