UIC - CS 594 1
Chapter 5: Clustering
UIC - CS 594 2
Searching for groups
Clustering is unsupervised or undirected.
Unlike classification, clustering has no pre-classified data.
Search for groups or clusters of data points (records) that are similar to one another.
Similar points may mean similar customers or products that will behave in similar ways.
UIC - CS 594 3
Group similar points together
Group points into classes using some distance measures: within-cluster distance and between-cluster distance.
Applications:
As a stand-alone tool to get insight into data distribution
As a preprocessing step for other algorithms
UIC - CS 594 4
An Illustration
UIC - CS 594 5
Examples of Clustering Applications
Marketing: Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs
Insurance: Identifying groups of motor insurance policy holders with some interesting characteristics.
City-planning: Identifying groups of houses according to their house type, value, and geographical location
UIC - CS 594 6
Concepts of Clustering
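Clusters
Different ways of representing clusters:
  Division with boundaries
  Spheres
  Probabilistic
  Dendrograms
  …
[Figure: probabilistic representation shown as a table of instances I1, I2, …, In against clusters 1, 2, 3, with example membership probabilities 0.5, 0.2, 0.3]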
UIC - CS 594 7
Clustering
Clustering quality:
  Inter-cluster distance maximized
  Intra-cluster distance minimized
The quality of a clustering result depends on both the similarity measure used by the method and its application.
The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns
Clustering vs. classification: Which one is more difficult? Why?
There are a huge number of clustering techniques.
UIC - CS 594 8
Dissimilarity/Distance Measure
Dissimilarity/Similarity metric: Similarity is expressed in terms of a distance function d(i, j), which is typically a metric.
The definitions of distance functions are usually very different for interval-scaled, boolean, categorical, ordinal and ratio variables.
Weights should be associated with different variables based on applications and data semantics.
It is hard to define “similar enough” or “good enough”. The answer is typically highly subjective.
UIC - CS 594 9
Types of data in clustering analysis
Interval-scaled variables
Binary variables
Nominal, ordinal, and ratio variables
Variables of mixed types
UIC - CS 594 10
Interval-valued variables
Continuous measurements on a roughly linear scale, e.g., weight, height, temperature, etc.
Standardize data (depending on applications)
  Calculate the mean absolute deviation:
    $s_f = \frac{1}{n}\left(|x_{1f} - m_f| + |x_{2f} - m_f| + \dots + |x_{nf} - m_f|\right)$
  where
    $m_f = \frac{1}{n}\left(x_{1f} + x_{2f} + \dots + x_{nf}\right)$
  Calculate the standardized measurement (z-score):
    $z_{if} = \frac{x_{if} - m_f}{s_f}$
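The standardization of one interval-scaled variable can be sketched as follows. This is a minimal illustration, not part of the original slides: the function name `standardize` and the sample values are made up, and the sketch assumes the values are not all identical (otherwise s_f would be zero).

```python
def standardize(values):
    """Return z-scores of one variable f, scaled by its mean absolute deviation."""
    n = len(values)
    m_f = sum(values) / n                         # mean of variable f
    s_f = sum(abs(x - m_f) for x in values) / n   # mean absolute deviation
    return [(x - m_f) / s_f for x in values]      # z_if = (x_if - m_f) / s_f


heights = [150.0, 160.0, 170.0, 180.0]  # hypothetical sample values
z = standardize(heights)
print(z)
```

Here m_f = 165 and s_f = 10, so the values map to -1.5, -0.5, 0.5 and 1.5.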
UIC - CS 594 11
Similarity Between Objects
Distance: Measure the similarity or dissimilarity between two data objects
Some popular ones include the Minkowski distance:
  $d(i,j) = \left(|x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + \dots + |x_{ip} - x_{jp}|^q\right)^{1/q}$
where $(x_{i1}, x_{i2}, \dots, x_{ip})$ and $(x_{j1}, x_{j2}, \dots, x_{jp})$ are two p-dimensional data objects, and q is a positive integer
If q = 1, d is the Manhattan distance:
  $d(i,j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \dots + |x_{ip} - x_{jp}|$
UIC - CS 594 12
Similarity Between Objects (Cont.)
If q = 2, d is the Euclidean distance:
  $d(i,j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \dots + |x_{ip} - x_{jp}|^2}$
Properties:
  d(i,j) ≥ 0
  d(i,i) = 0
  d(i,j) = d(j,i)
  d(i,j) ≤ d(i,k) + d(k,j)
Also, one can use weighted distance, and many other similarity/distance measures.
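The Minkowski family can be sketched in a few lines. This is an illustrative sketch only: the function name `minkowski` and the two sample objects are assumptions, not taken from the slides.

```python
def minkowski(x, y, q):
    """Minkowski distance of order q between two p-dimensional objects."""
    if len(x) != len(y):
        raise ValueError("objects must have the same number of variables")
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)


i = (1.0, 2.0)  # hypothetical object i
j = (4.0, 6.0)  # hypothetical object j
manhattan = minkowski(i, j, 1)  # |1-4| + |2-6|
euclidean = minkowski(i, j, 2)  # sqrt(3^2 + 4^2)
print(manhattan, euclidean)
```

For these two points the Manhattan distance is 7 and the Euclidean distance is 5.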
UIC - CS 594 13
Binary Variables
A contingency table for binary data:

                      Object j
                      1        0        sum
  Object i     1      a        b        a + b
               0      c        d        c + d
               sum    a + c    b + d    p

Simple matching coefficient (invariant, if the binary variable is symmetric):
  $d(i,j) = \frac{b + c}{a + b + c + d}$
Jaccard coefficient (noninvariant if the binary variable is asymmetric):
  $d(i,j) = \frac{b + c}{a + b + c}$
UIC - CS 594 14
Dissimilarity of Binary Variables
Example
gender is a symmetric attribute (not used below)
the remaining attributes are asymmetric attributes
let the values Y and P be set to 1, and the value N be set to 0

  Name  Gender  Fever  Cough  Test-1  Test-2  Test-3  Test-4
  Jack  M       Y      N      P       N       N       N
  Mary  F       Y      N      P       N       P       N
  Jim   M       Y      P      N       N       N       N

  $d(jack, mary) = \frac{0 + 1}{2 + 0 + 1} = 0.33$
  $d(jack, jim) = \frac{1 + 1}{1 + 1 + 1} = 0.67$
  $d(jim, mary) = \frac{1 + 2}{1 + 1 + 2} = 0.75$
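The three values in this example can be checked with a short sketch of the Jaccard dissimilarity. The function name is illustrative, the vectors encode Fever, Cough and Test-1 to Test-4 with Y/P as 1 and N as 0, and the sketch assumes at least one of the two objects has a 1 (otherwise a + b + c is zero).

```python
def jaccard_dissimilarity(x, y):
    """(b + c) / (a + b + c) for two 0/1 vectors of asymmetric binary variables."""
    a = sum(1 for u, v in zip(x, y) if u == 1 and v == 1)  # both 1
    b = sum(1 for u, v in zip(x, y) if u == 1 and v == 0)  # only x is 1
    c = sum(1 for u, v in zip(x, y) if u == 0 and v == 1)  # only y is 1
    return (b + c) / (a + b + c)


# Fever, Cough, Test-1, Test-2, Test-3, Test-4 (gender is not used)
jack = [1, 0, 1, 0, 0, 0]
mary = [1, 0, 1, 0, 1, 0]
jim = [1, 1, 0, 0, 0, 0]

d_jack_mary = jaccard_dissimilarity(jack, mary)
d_jack_jim = jaccard_dissimilarity(jack, jim)
d_jim_mary = jaccard_dissimilarity(jim, mary)
print(round(d_jack_mary, 2), round(d_jack_jim, 2), round(d_jim_mary, 2))
```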
UIC - CS 594 15
Nominal Variables
A generalization of the binary variable in that it can take more than 2 states, e.g., red, yellow, blue, green, etc.
Method 1: Simple matching
  m: # of matches, p: total # of variables
  $d(i,j) = \frac{p - m}{p}$
Method 2: use a large number of binary variables
  creating a new binary variable for each of the M nominal states
UIC - CS 594 16
Ordinal Variables
An ordinal variable can be discrete or continuous
Order is important, e.g., rank
Can be treated like interval-scaled (f is a variable)
  replace $x_{if}$ by their ranks $r_{if} \in \{1, \dots, M_f\}$
  map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by
    $z_{if} = \frac{r_{if} - 1}{M_f - 1}$
  compute the dissimilarity using methods for interval-scaled variables
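The rank mapping for an ordinal variable can be sketched as below. The function name and the bronze/silver/gold states are hypothetical, chosen only to illustrate the formula; the sketch assumes the variable has at least two states.

```python
def ordinal_to_unit_interval(values, order):
    """Map ordinal states to [0, 1] via z = (r - 1) / (M - 1)."""
    rank = {state: r for r, state in enumerate(order, start=1)}  # r in {1, ..., M}
    m_f = len(order)                                             # number of states M_f
    return [(rank[v] - 1) / (m_f - 1) for v in values]


order = ["bronze", "silver", "gold"]  # hypothetical ordered states
z = ordinal_to_unit_interval(["silver", "gold", "bronze"], order)
print(z)
```

The resulting z values can then be fed to any interval-scaled distance such as the Manhattan or Euclidean distance.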
UIC - CS 594 17
Ratio-Scaled Variables
Ratio-scaled variable: a measurement on a nonlinear scale, approximately at exponential scale, such as $Ae^{Bt}$ or $Ae^{-Bt}$, e.g., growth of a bacteria population.
Methods:
  treat them like interval-scaled variables: not a good idea! (why? the scale can be distorted)
  apply logarithmic transformation: $y_{if} = \log(x_{if})$
  treat them as continuous ordinal data and then treat their ranks as interval-scaled
UIC - CS 594 18
Variables of Mixed Types
A database may contain all six types of variables:
  symmetric binary, asymmetric binary, nominal, ordinal, interval and ratio
One may use a weighted formula to combine their effects:
  $d(i,j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}$
where $\delta_{ij}^{(f)}$ is the weight of variable f for the pair (i, j)
f is binary or nominal: $d_{ij}^{(f)} = 0$ if $x_{if} = x_{jf}$, or $d_{ij}^{(f)} = 1$ otherwise
f is interval-based: use the normalized distance
f is ordinal or ratio-scaled:
  compute ranks $r_{if}$ and $z_{if} = \frac{r_{if} - 1}{M_f - 1}$, and treat $z_{if}$ as interval-scaled
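A rough sketch of the combined formula follows. Several details are assumptions rather than statements from the slides: the weight is taken as 1 when both values are present and 0 when either is missing (`None`), interval variables are normalized by a caller-supplied range, and ordinal or ratio-scaled variables are expected to be already mapped to z values in [0, 1]. All names are illustrative.

```python
def mixed_dissimilarity(x, y, kinds, ranges):
    """Weighted average of per-variable dissimilarities for mixed-type objects.

    kinds[f] is "nominal" (also used for binary) or "interval";
    ranges[f] is the value range (max - min) used to normalize interval variables.
    """
    num = 0.0
    den = 0.0
    for f, kind in enumerate(kinds):
        if x[f] is None or y[f] is None:
            continue                          # weight 0: variable f is skipped
        if kind == "nominal":
            d = 0.0 if x[f] == y[f] else 1.0
        else:
            d = abs(x[f] - y[f]) / ranges[f]  # assumed range normalization
        num += d                              # weight 1 times d_ij^(f)
        den += 1.0
    return num / den                          # assumes at least one usable variable


kinds = ("nominal", "interval", "interval")
ranges = (None, 40.0, 1.0)                    # hypothetical value ranges
obj_i = ("red", 170.0, 0.5)
obj_j = ("blue", 180.0, 1.0)
d = mixed_dissimilarity(obj_i, obj_j, kinds, ranges)
print(d)
```

In this made-up example the three per-variable dissimilarities are 1, 0.25 and 0.5, so d is their average.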
UIC - CS 594 19
Major Clustering Techniques
Partitioning algorithms: Construct various partitions and then evaluate them by some criterion
Hierarchy algorithms: Create a hierarchical decomposition of the set of data (or objects) using some criterion
Density-based: based on connectivity and density functions
Model-based: A model is hypothesized for each of the clusters and the idea is to find the best fit of the data to the hypothesized models.
UIC - CS 594 20
Partitioning Algorithms: Basic Concept
Partitioning method: Construct a partition of a database D of n objects into a set of k clusters
Given a k, find a partition of k clusters that optimizes the chosen partitioning criterion
  Global optimal: exhaustively enumerate all partitions
  Heuristic methods: k-means and k-medoids algorithms
    k-means: Each cluster is represented by the center of the cluster
    k-medoids or PAM (Partition around medoids): Each cluster is represented by one of the objects in the cluster
UIC - CS 594 21
The K-Means Clustering
Given k, the k-means algorithm is as follows:
  1) Choose k cluster centers to coincide with k randomly-chosen points
  2) Assign each data point to the closest cluster center
  3) Recompute the cluster centers using the current cluster memberships.
  4) If a convergence criterion is not met, go to 2).
Typical convergence criteria are: no (or minimal) reassignment of data points to new cluster centers, or minimal decrease in squared error
  $E = \sum_{i=1}^{k} \sum_{p \in C_i} |p - m_i|^2$
where p is a point and $m_i$ is the mean of cluster $C_i$
UIC - CS 594 22
Example
For simplicity, 1-dimensional data and k = 2.
data: 1, 2, 5, 6, 7
K-means:
  Randomly select 5 and 6 as initial centroids;
  => Two clusters {1,2,5} and {6,7}; meanC1 = 8/3, meanC2 = 6.5
  => {1,2}, {5,6,7}; meanC1 = 1.5, meanC2 = 6
  => no change.
  Aggregate dissimilarity = 0.5^2 + 0.5^2 + 1^2 + 1^2 = 2.5
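The k-means loop and this 1-dimensional example can be reproduced with a small sketch. The function name `kmeans_1d` is illustrative, the initial centers are passed in rather than drawn at random so the run is repeatable, and keeping the old center for an empty cluster is an assumption the slides do not address.

```python
def kmeans_1d(points, centers, max_iter=100):
    """Plain k-means on 1-D data, starting from the given initial centers."""
    clusters = []
    for _ in range(max_iter):
        # Step 2: assign each point to the closest center (ties go to the first center)
        new_clusters = [[] for _ in centers]
        for p in points:
            idx = min(range(len(centers)), key=lambda c: abs(p - centers[c]))
            new_clusters[idx].append(p)
        # Convergence criterion: no reassignment of points
        if new_clusters == clusters:
            break
        clusters = new_clusters
        # Step 3: recompute each center as the mean of its cluster
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    # Squared error E: sum over clusters of squared distances to the cluster mean
    sse = sum((p - m) ** 2 for c, m in zip(clusters, centers) for p in c)
    return clusters, centers, sse


clusters, centers, sse = kmeans_1d([1, 2, 5, 6, 7], [5, 6])
print(clusters, centers, sse)
```

Starting from 5 and 6, the run passes through {1,2,5} and {6,7} and settles on {1,2} and {5,6,7} with means 1.5 and 6 and squared error 2.5, matching the example.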
UIC - CS 594 23
Comments on K-Means
Strength: efficient: O(tkn), where n is # data points, k is # clusters, and t is # iterations. Normally, k, t << n.
Comment: Often terminates at a local optimum. The global optimum may be found using techniques such as deterministic annealing and genetic algorithms
Weakness
  Applicable only when mean is defined, difficult for categorical data
  Need to specify k, the number of clusters, in advance
  Sensitive to noisy data and outliers
  Not suitable to discover clusters with non-convex shapes
  Sensitive to initial seeds
UIC - CS 594 24
Variations of the K-Means Method
A few variants of the k-means which differ in
  Selection of the initial k seeds
  Dissimilarity measures
  Strategies to calculate cluster means
Handling categorical data: k-modes
  Replacing means of clusters with modes
  Using new dissimilarity measures to deal with categorical objects
  Using a frequency based method to update modes of clusters
UIC - CS 594 25
k-Medoids clustering method
k-Means algorithm is sensitive to outliers
  Since an object with an extremely large value may substantially distort the distribution of the data.
Medoid: the most centrally located point in a cluster, as a representative point of the cluster.
In contrast, a centroid is not necessarily inside a cluster.
[Figure: an example showing the initial medoids]
UIC - CS 594 26
Partition Around Medoids
PAM:
  1. Given k
  2. Randomly pick k instances as initial medoids
  3. Assign each data point to the nearest medoid x
  4. Calculate the objective function
       the sum of dissimilarities of all points to their nearest medoids. (squared-error criterion)
  5. Randomly select a point y
  6. Swap x by y if the swap reduces the objective function
  7. Repeat (3-6) until no change
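A simplified sketch of the swap loop is given below. It departs from the steps above in one labeled way: instead of picking the candidate point y at random, it tries every medoid and non-medoid pair in turn and keeps any swap that lowers the objective, which makes the run deterministic. The names `total_cost` and `pam` are illustrative.

```python
def total_cost(points, medoids, dist):
    """Objective: sum of dissimilarities of all points to their nearest medoid."""
    return sum(min(dist(p, m) for m in medoids) for p in points)


def pam(points, k, dist, initial=None):
    """Swap-based k-medoids; returns the sorted medoids and the final objective."""
    medoids = list(initial) if initial is not None else list(points[:k])
    best = total_cost(points, medoids, dist)
    improved = True
    while improved:                      # repeat until no swap helps
        improved = False
        for i in range(len(medoids)):
            for y in points:
                if y in medoids:
                    continue
                # Try replacing medoid i by the non-medoid y
                candidate = medoids[:i] + [y] + medoids[i + 1:]
                cost = total_cost(points, candidate, dist)
                if cost < best:          # keep the swap only if it reduces the objective
                    medoids, best, improved = candidate, cost, True
    return sorted(medoids), best


data = [1, 2, 5, 6, 7]
medoids, cost = pam(data, 2, lambda a, b: abs(a - b), initial=[5, 6])
print(medoids, cost)
```

On this small data set, starting from medoids 5 and 6, the sketch ends with medoids 1 and 6 and an objective of 3.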
UIC - CS 594 27
Comments on PAM
PAM is more robust than k-means in the presence of noise and outliers because a medoid is less influenced by outliers or other extreme values than a mean (why?)
PAM works well for small data sets but does not scale well for large data sets.
  O(k(n-k)^2) for each change, where n is # of data, k is # of clusters
[Figure: an outlier 100 units away]
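One way to see why a medoid is less affected by an outlier than a mean is to compute both on a tiny data set. The data values and the helper name `medoid` are made up for illustration.

```python
def medoid(points):
    """Return the point with the smallest total distance to all other points."""
    return min(points, key=lambda c: sum(abs(c - p) for p in points))


data = [1, 2, 3, 4, 100]          # hypothetical data with one far-away outlier
mean = sum(data) / len(data)      # pulled strongly towards the outlier
med = medoid(data)                # stays among the bulk of the points
print(mean, med)
```

The mean moves to 22, a value far from every regular point, while the medoid remains 3 because it must be one of the actual data points and the single outlier adds roughly the same cost to every candidate.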
UIC - CS 594 28
CLARA: Clustering Large Applications
CLARA: Built in statistical analysis packages, such as S+
It draws multiple samples of the data set, applies PAM on each sample, and gives the best clustering as the output
Strength: deals with larger data sets than PAM
Weakness:
  Efficiency depends on the sample size
  A good clustering based on samples will not necessarily represent a good clustering of the whole data set if the sample is biased
There are other scale-up methods, e.g., CLARANS
UIC - CS 594 29
Hierarchical Clustering
Use distance matrix for clustering. This method does not require the number of clusters k as an input, but needs a termination condition
[Figure: five objects a, b, c, d, e. Agglomerative clustering (Step 0 to Step 4) merges them into ab, de, cde and finally abcde; divisive clustering runs the same steps in reverse (Step 4 to Step 0).]
UIC - CS 594 30
Agglomerative Clustering
At the beginning, each data point forms a cluster (also called a node).
Merge nodes/clusters that have the least dissimilarity.
Go on merging.
Eventually all nodes belong to the same cluster.
[Figure: three scatter plots, both axes from 0 to 10, showing successive merges]
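The merging procedure can be sketched as follows. The slides do not say how the dissimilarity between two clusters is measured, so this sketch assumes single link (the smallest distance between any two members); the function name and the sample data are illustrative.

```python
def agglomerative(points, dist, num_clusters=1):
    """Single-link agglomerative clustering; returns the merge history and final clusters."""
    clusters = [[p] for p in points]          # each point starts as its own cluster
    merges = []
    while len(clusters) > num_clusters:       # termination condition
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single link: closest pair of members across the two clusters
                d = min(dist(p, q) for p in clusters[a] for q in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)] + [merged]
    return merges, clusters


merges, clusters = agglomerative([1, 2, 6, 8, 20], lambda p, q: abs(p - q), num_clusters=2)
print(merges)
print(clusters)
```

With these values the merges happen at distances 1, 2 and 4, leaving {1, 2, 6, 8} and {20}. The nested search over all cluster pairs in every round is what makes the method at least quadratic in the number of objects.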
UIC - CS 594 31
A Dendrogram Shows How the Clusters are Merged Hierarchically
Decompose data objects into several levels of nested partitioning (tree of clusters), called a dendrogram.
A clustering of the data objects is obtained by cutting the dendrogram at the desired level, then each connected component forms a cluster.
UIC - CS 594 32
Divisive Clustering
Inverse order of agglomerative clustering
Eventually each node forms a cluster on its own
[Figure: three scatter plots, both axes from 0 to 10, showing successive splits]
UIC - CS 594 33
More on Hierarchical Methods
Major weakness of agglomerative clustering methods
  do not scale well: time complexity at least O(n^2), where n is the total number of objects
  can never undo what was done previously
Integration of hierarchical with distance-based clustering to scale up these clustering methods
  BIRCH (1996): uses CF-tree and incrementally adjusts the quality of sub-clusters
  CURE (1998): selects well-scattered points from the cluster and then shrinks them towards the center of the cluster by a specified fraction
UIC - CS 594 34
Summary
Cluster analysis groups objects based on their similarity and has wide applications
Measure of similarity can be computed for various types of data
Clustering algorithms can be categorized into partitioning methods, hierarchical methods, density-based methods, etc.
Clustering can also be used for outlier detection, which is useful for fraud detection
What is the best clustering algorithm?
UIC - CS 594 35
Other Data Mining Methods
UIC - CS 594 36
Sequence analysis
Market basket analysis analyzes things that happen at the same time.
How about things that happen over time?
  E.g., if a customer buys a bed, he/she is likely to come to buy a mattress later
Sequential analysis needs
  A time stamp for each data record
  Customer identification
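A minimal sketch of why both a time stamp and a customer identification are needed: with them, one can count for each ordered pair of items how many customers bought the first strictly before the second. The record layout, the function name and the sample purchases are hypothetical.

```python
from collections import defaultdict


def sequential_pairs(records):
    """Count, for each ordered pair (a, b), the customers who bought a strictly before b."""
    by_customer = defaultdict(list)
    for customer, timestamp, item in records:
        by_customer[customer].append((timestamp, item))
    support = defaultdict(int)
    for purchases in by_customer.values():
        purchases.sort()                      # order each customer's purchases by time
        seen = set()
        for idx, (t1, a) in enumerate(purchases):
            for t2, b in purchases[idx + 1:]:
                if t2 > t1 and a != b:
                    seen.add((a, b))          # count each pair once per customer
        for pair in seen:
            support[pair] += 1
    return dict(support)


records = [                                   # (customer id, time stamp, item)
    ("c1", 1, "bed"), ("c1", 5, "mattress"),
    ("c2", 2, "bed"), ("c2", 3, "lamp"), ("c2", 9, "mattress"),
    ("c3", 4, "mattress"),
]
support = sequential_pairs(records)
print(support)
```

In this made-up data, two of the three customers bought a bed and later a mattress.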
UIC - CS 594 37
Sequence analysis (cont …)
The analysis shows which items come before, after or at the same time as other items.
Sequential patterns can be used for analyzing cause and effect.
Other applications
  Finding cycles in association rules
    Some association rules hold strongly in certain periods of time
    E.g., every Monday people buy items X and Y together
  Stock market prediction
  Predicting possible failures in networks, etc.
UIC - CS 594 38
Discovering holes in data
Holes are empty (sparse) regions in the data space that contain few or no data points. Holes may represent impossible value combinations in the application domain.
E.g., in a disease database, we may find that certain test values and/or symptoms do not go together, or when a certain medicine is used, some test values never go beyond a certain range.
Such information could lead to significant discovery: a cure to a disease or some biological law.
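One very simple way to look for holes, shown here only as an illustration and not necessarily the method the slides have in mind, is to lay a grid over a 2-D data space and report the cells that contain no points. The function name, the grid size and the sample points are assumptions.

```python
def empty_cells(points, bins, lo, hi):
    """Return the cells of a bins x bins grid over [lo, hi] x [lo, hi] with no points."""
    width = (hi - lo) / bins
    occupied = set()
    for x, y in points:
        col = min(int((x - lo) / width), bins - 1)   # clamp points lying on the upper edge
        row = min(int((y - lo) / width), bins - 1)
        occupied.add((col, row))
    return sorted((c, r) for c in range(bins) for r in range(bins)
                  if (c, r) not in occupied)


points = [(0.5, 0.5), (1.5, 0.5), (0.5, 1.5)]        # hypothetical 2-D records
holes = empty_cells(points, bins=2, lo=0.0, hi=2.0)
print(holes)
```

Here the only empty cell is the one where both values are high, which is the kind of value combination a domain expert might then examine.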
UIC - CS 594 39
Data and pattern visualization
Data visualization: Use computer graphics effects to reveal the patterns in data
  2-D, 3-D scatter plots, bar charts, pie charts, line plots, animation, etc.
Pattern visualization: Use good interface and graphics to present the results of data mining.
  Rule visualizer, cluster visualizer, etc.
UIC - CS 594 40
Scaling up data mining algorithms
Adapt data mining algorithms to work on very large databases.
  Data reside on hard disk (too large to fit in main memory)
  Make fewer passes over the data
Quadratic algorithms are too expensive
  Many data mining algorithms are quadratic, especially clustering algorithms.