158 9(clustering)
TRANSCRIPT
8/9/2019 158 9(Clustering)
http://slidepdf.com/reader/full/158-9clustering 1/36
Clustering
adapted from:
Doug Downey and Bryan Pardo, Northwestern University
Bagging
! Use bootstrapping to generate L training sets and train one base-learner with each (Breiman, 1996)
! Use voting
! Unstable algorithms profit from bagging
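A minimal sketch of this scheme, assuming caller-supplied `train`/`predict` callables as stand-ins for whatever base-learner is used (these names are illustrative, not from the slides):

```python
import random
from collections import Counter

def bagging(data, train, L=11, seed=0):
    """Sketch of bagging (Breiman, 1996): L bootstrap samples, one base-learner each.

    data: list of (x, y) pairs; train(sample) -> model. Returns the list of models.
    """
    rng = random.Random(seed)
    models = []
    for _ in range(L):
        # Bootstrap: draw n examples with replacement from the training set.
        sample = [rng.choice(data) for _ in range(len(data))]
        models.append(train(sample))
    return models

def vote(models, predict, x):
    """Combine the ensemble's predictions by plurality voting."""
    return Counter(predict(m, x) for m in models).most_common(1)[0][0]
```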
Boosting
! Given a large training set, randomly divide it into 3 sets (X1, X2, and X3)
! Use X1 to train D1
! Test D1 with X2
! Training set for D2 = all instances from X2 misclassified by D1 (and also as many instances on which D1 is correct from X2)
! Test D1 and D2 with X3
! Training set for D3 = the instances from X3 on which D1 and D2 disagree
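The set-construction steps above can be sketched as follows; `train` and `predict` are caller-supplied stand-ins for the base-learner (an assumption for illustration):

```python
import random

def boosting_splits(X1, X2, X3, train, predict, seed=0):
    """Sketch of the three-learner boosting scheme described above.

    X1, X2, X3: lists of (x, y) pairs; train(data) -> model; predict(model, x) -> label.
    Returns the trained learners D1, D2, D3.
    """
    rng = random.Random(seed)
    D1 = train(X1)
    # D2 trains on the X2 instances D1 misclassifies, plus as many that D1 gets right.
    wrong = [(x, y) for x, y in X2 if predict(D1, x) != y]
    right = [(x, y) for x, y in X2 if predict(D1, x) == y]
    D2 = train(wrong + rng.sample(right, min(len(wrong), len(right))))
    # D3 trains on the X3 instances where D1 and D2 disagree.
    disagree = [(x, y) for x, y in X3 if predict(D1, x) != predict(D2, x)]
    D3 = train(disagree) if disagree else D2
    return D1, D2, D3
```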
AdaBoost
! Generate a sequence of base-learners, each focusing on the previous one's errors (Freund and Schapire, 1996)
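A minimal AdaBoost sketch using 1-D threshold stumps as the weak learner (the stump learner is an illustrative choice; AdaBoost itself only requires some weak learner that respects example weights):

```python
import numpy as np

def adaboost_stumps(X, y, T=10):
    """Minimal AdaBoost (Freund & Schapire, 1996) with 1-D threshold stumps.

    X: (n,) feature values; y: (n,) labels in {-1, +1}.
    Returns a list of (threshold, polarity, alpha) weak learners.
    """
    n = len(X)
    w = np.full(n, 1.0 / n)                 # uniform example weights
    learners = []
    for _ in range(T):
        best = None
        # Exhaustively try each threshold and both polarities on the weighted data.
        for thr in np.unique(X):
            for pol in (1, -1):
                pred = np.where(X < thr, pol, -pol)
                err = w[pred != y].sum()
                if best is None or err < best[0]:
                    best = (err, thr, pol, pred)
        err, thr, pol, pred = best
        err = min(max(err, 1e-10), 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)
        # Re-weight: boost the examples this stump got wrong.
        w *= np.exp(-alpha * y * pred)
        w /= w.sum()
        learners.append((thr, pol, alpha))
    return learners

def adaboost_predict(learners, X):
    """Weighted vote of the stumps."""
    score = np.zeros(len(X))
    for thr, pol, alpha in learners:
        score += alpha * np.where(X < thr, pol, -pol)
    return np.sign(score)
```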
Mixture of Experts
! Voting where weights are input-dependent (gating) (Jacobs et al., 1991)

y = Σ_{j=1}^{L} w_j d_j
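The equation above as a tiny sketch; the experts and the gating function here are made-up stand-ins (in a real mixture of experts the gate would itself be a learned network):

```python
import numpy as np

def mixture_output(x, experts, gate):
    """y = sum_{j=1}^{L} w_j d_j, where the gate makes the weights depend on the input x."""
    w = np.asarray(gate(x))                       # input-dependent weights, summing to 1
    d = np.array([expert(x) for expert in experts])
    return float(w @ d)
```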
Stacking
! Combiner f() is another learner (Wolpert, 1992)
Cascading
! Use d_j only if preceding ones are not confident
! Cascade learners in order of complexity
Clustering
! Grouping data into (hopefully useful) sets.

[Figure: example dataset split into two clusters, "things on the left" and "things on the right"]
Clustering
! Unsupervised learning
! No labels
! Why do clustering?
! Labeling is costly
! Data pre-processing
! Text classification (e.g., search engines, Google Sets)
! Hypothesis generation / data understanding
! Clusters might suggest natural groups.
! Visualization
Some definitions
! Let X be the dataset: X = {x_1, x_2, x_3, …, x_n}
! An m-clustering of X is a partition of X into m sets (clusters) C_1, …, C_m such that:
1. Clusters are non-empty: C_i ≠ {}, i = 1, …, m
2. Clusters cover all of X: ∪_{i=1}^{m} C_i = X
3. Clusters do not overlap: C_i ∩ C_j = {} if i ≠ j
How many possible clusterings?
(Stirling numbers)

S(n, m) = (1/m!) Σ_{i=0}^{m} (−1)^{m−i} (m choose i) i^n

where n is the size of the dataset and m is the number of clusters.

S(15, 3) = 2,375,101
S(20, 4) = 45,232,115,901
S(100, 5) ≈ 10^68
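The counts above can be verified with the standard recurrence S(n, m) = S(n−1, m−1) + m·S(n−1, m), a minimal sketch:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def stirling2(n, m):
    """Stirling number of the second kind: ways to partition n items into m non-empty clusters."""
    if m == 0:
        return 1 if n == 0 else 0
    if m > n:
        return 0
    # Item n either starts its own cluster, or joins one of the m existing clusters.
    return stirling2(n - 1, m - 1) + m * stirling2(n - 1, m)
```

For example, `stirling2(15, 3)` reproduces the 2,375,101 partitions quoted above.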
What does this mean?
! We can’t try all possible clusterings.
! Clustering algorithms look at a small fraction of all partitions of the data.
! The exact partitions tried depend on the kind of clustering used.
Who is right?
! Different techniques cluster the same dataset DIFFERENTLY.
! Who is right? Is there a “right” clustering?
Classic Example: Half Moons
From Batra et al., http://www.cs.cmu.edu/~rahuls/pub/bmvc2008-clustering-rahuls.pdf
Steps in Clustering
! Select Features
! Define a Proximity Measure
! Define a Clustering Criterion
! Define a Clustering Algorithm
! Validate the Results
! Interpret the Results
Kinds of Clustering
! Sequential
! Fast
! Cost Optimization
! Fixed number of clusters (typically)
! Hierarchical
! Start with many clusters
! Join clusters at each step
A Sequential Clustering Method
! Basic Sequential Algorithmic Scheme (BSAS)
S. Theodoridis and K. Koutroumbas, Pattern Recognition, Academic Press, London, England, 1999
! Assumption: The number of clusters is not known in advance.

m = 1
C_1 = {x_1}
For i = 2 to n
    Find C_k : d(x_i, C_k) = min_j d(x_i, C_j)
    If (d(x_i, C_k) > Θ) and (m < q)
        m = m + 1
        C_m = {x_i}
    Else
        C_k = C_k ∪ {x_i}
    End
End

d(x, C) = the distance between feature vector x and cluster C
Θ = the threshold of dissimilarity
q = the maximum number of clusters
n = the number of data points
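A sketch of BSAS in code, using distance to the cluster mean for d(x, C); that is one common choice, the scheme above leaves the cluster distance open:

```python
import numpy as np

def bsas(points, theta, q):
    """Basic Sequential Algorithmic Scheme: one pass, clusters created on the fly.

    points: (n, d) array; theta: dissimilarity threshold; q: max number of clusters.
    Returns a list of index lists, one per cluster.
    """
    clusters = [[0]]                       # C_1 = {x_1}
    means = [points[0].astype(float)]
    for i in range(1, len(points)):
        dists = [np.linalg.norm(points[i] - mu) for mu in means]
        k = int(np.argmin(dists))
        if dists[k] > theta and len(clusters) < q:
            clusters.append([i])           # too dissimilar: start a new cluster
            means.append(points[i].astype(float))
        else:
            clusters[k].append(i)          # otherwise merge into the nearest cluster
            means[k] = points[clusters[k]].mean(axis=0)
    return clusters
```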
A Cost-Optimization Method
! K-means clustering
! J. B. MacQueen (1967): “Some Methods for Classification and Analysis of Multivariate Observations”, Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, University of California Press, 1:281-297
! A greedy algorithm
! Partitions n examples into k clusters
! Minimizes the sum of the squared distances to the cluster centers
The K-means algorithm
1. Place K points into the space represented by the objects that are being clustered. These points represent initial group centroids (means).
2. Assign each object to the group that has the closest centroid (mean).
3. When all objects have been assigned, recalculate the positions of the K centroids (means).
4. Repeat Steps 2 and 3 until the centroids no longer move.
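The four steps above as a minimal sketch; random-sample initialization is an assumption here (the next slide notes that initialization is not specified):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain K-means: assign each point to its nearest centroid, then re-average.

    X: (n, d) data; k: number of clusters. Returns (centroids, labels).
    """
    rng = np.random.default_rng(seed)
    # Step 1: initialize the K centroids (here: k random examples).
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Step 2: assign each object to the closest centroid.
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Step 3: recompute centroid positions; keep the old one if a cluster went empty.
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        # Step 4: stop when the centroids no longer move.
        if np.allclose(new, centroids):
            break
        centroids = new
    return centroids, labels
```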
K-means clustering
! The way to initialize the mean values is not specified.
! Randomly choose k samples?
! Results depend on the initial means.
! Try multiple starting points?
! Assumes K is known.
! How do we choose this?
k-Means Clustering
! Find k reference vectors (centroids) which best represent the data
! Reference vectors: m_j, j = 1, …, k
! Use the nearest (most similar) reference:

assign x^t to m_i where i = argmin_j ||x^t − m_j||
Reconstruction Error

E({m_i}_{i=1}^{k} | X) = Σ_t Σ_i b_i^t ||x^t − m_i||²

b_i^t = 1 if ||x^t − m_i|| = min_j ||x^t − m_j||, and 0 otherwise
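The error above computed directly, with b_i^t realized as a nearest-centroid argmin:

```python
import numpy as np

def reconstruction_error(X, centroids):
    """E({m_i} | X): each x^t contributes its squared distance to the closest m_i,
    i.e. b_i^t = 1 only for the nearest reference vector."""
    sq = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)  # (n, k)
    return float(sq.min(axis=1).sum())
```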
Leader Cluster Algorithm
! Instance far away from all centroids (dist > threshold) => becomes a new centroid
! Cluster that covers a large number of instances (num > threshold) => split into 2 clusters
! Cluster that covers too few instances (num < threshold) can be removed (and its instances perhaps randomly assigned to other clusters)
Choosing K
! Defined by the application, e.g., image quantization
! PCA
! Incremental (leader-cluster) algorithm: add one at a time until “elbow” (reconstruction error)
! Manual check for meaning
Supervised Learning After Clustering
Naïve Bayes Mood Classifier
Human Powered Compression

Label each of the following moods with one of the following seven categories: happy, sad, angry, fearful, disgusted, surprised, or none of the above.

! Pleased
! Jubilant
! Recumbent
! Ditzy
! Weird
! Geeky
! Blank
! Dirty
! Thirsty
! Guilty
! Hot
! Worried
! Nervous
! Hungry
! Nostalgic
! Artistic
! Crushed
! Giggly
LiveJournal Mood Hierarchy
! angry (#2)
! aggravated (#1)
! annoyed (#3)
! bitchy (#110)
! cranky (#8)
! cynical (#104)
! enraged (#12)
! frustrated (#47)
! grumpy (#95)
! infuriated (#19)
! irate (#20)
! irritated (#112)
! moody (#23)
! pissed off (#24)
! stressed (#28)
! rushed (#100)
! awake (#87)
! confused (#6)
! curious (#56)
! determined (#45)
! predatory (#118)
! devious (#130)
! energetic (#11)
! bouncy (#59)
! hyper (#52)
! enthralled (#13)
! happy (#15)
! amused (#44)
! cheerful (#125)
! chipper (#99)
! ecstatic (#98)
! excited (#41)
K-Means Clustering

Happy      Sad          Angry
Energetic  Confused     Aggravated
Bouncy     Crappy       Angry
Happy      Crushed      Bitchy
Hyper      Depressed    Enraged
Cheerful   Distressed   Infuriated
Ecstatic   Envious      Irate
Excited    Gloomy       Pissed off
Jubilant   Guilty
Giddy      Intimidated
Giggly     Jealous
           Lonely
           Rejected
           Sad
           Scared
K-Means Clustering: Number of Posts per Mood

[Figure: bar chart of post counts per mood (y-axis 0–18,000), moods grouped roughly by cluster: enraged, pissed off, infuriated, angry, irate, bitchy, aggravated, giddy, hyper, energetic, ecstatic, giggly, cheerful, bouncy, jubilant, excited, happy, envious, lonely, depressed, intimidated, scared, guilty, confused, rejected, crappy, distressed, gloomy, jealous, sad, crushed]