Unsupervised Learning & Cluster Analysis: Basic Concepts and Algorithms
Assaf Gottlieb
Some of the slides are taken from Introduction to Data Mining, by Tan, Steinbach, and Kumar.
What is unsupervised learning & cluster analysis?
"Learning without a priori knowledge about the classification of samples; learning without a teacher."
Kohonen (1995), “Self-Organizing Maps”
“Cluster analysis is a set of methods for constructing a (hopefully) sensible and informative classification of an initially unclassified set of data, using the variable values observed on each individual.”
B. S. Everitt (1998), “The Cambridge Dictionary of Statistics”
What do we cluster?
Samples/Instances
Features/Variables
Applications of Cluster Analysis
Understanding: group related documents for browsing, group genes and proteins that have similar functionality, or group stocks with similar price fluctuations.
Data exploration: get insight into the data distribution; understand patterns in the data.
Summarization: reduce the size of large data sets; a preprocessing step.
Objectives of Cluster Analysis
Find groups of objects such that the objects in a group are similar (or related) to one another and different from (or unrelated to) the objects in other groups. There are two competing objectives: inter-cluster distances are maximized while intra-cluster distances are minimized.
Notion of a Cluster Can Be Ambiguous
How many clusters? The same points can plausibly be seen as two, four, or six clusters. It depends on the "resolution"!
Prerequisites
Understand the nature of your problem, the type of features, etc.
The similarity metric that you choose (for example, Euclidean distance or Pearson correlation) often impacts the clusters you recover.
Similarity/Distance Measures
Euclidean distance (highly depends on the scale of the features; may require normalization):

$$d_{euc}(x, y) = \sqrt{\sum_{n=1}^{N} (x_n - y_n)^2}$$

City block (Manhattan) distance:

$$D_M(x, w) = \sum_{i=1}^{d} |x_i - w_i|$$
[Figure: pairs of expression profiles with d_euc = 0.5846, d_euc = 1.1345, and d_euc = 2.6115.] These examples of Euclidean distance match our intuition of dissimilarity pretty well…
[Figure: pairs of expression profiles with d_euc = 1.41 and d_euc = 1.22.] …But what about these? What might be going on with the expression profiles on the left? On the right?
Similarity/Distance Measures
Cosine similarity:

$$C_{cosine}(x, y) = \frac{\sum_{i=1}^{N} x_i y_i}{\|x\| \, \|y\|}$$

Pearson correlation:

$$C_{pearson}(x, y) = \frac{\sum_{i=1}^{N} (x_i - m_x)(y_i - m_y)}{\sqrt{\left[\sum_{i=1}^{N} (x_i - m_x)^2\right]\left[\sum_{i=1}^{N} (y_i - m_y)^2\right]}}$$

Both are invariant to scaling (Pearson also to addition). Use Spearman correlation for ranks.
Similarity/Distance Measures
Jaccard similarity, used when interested in intersection size:

$$JSim(X, Y) = \frac{|X \cap Y|}{|X \cup Y|}$$
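As a concrete illustration of these measures, here is a minimal sketch in Python; the vectors and sets are made up for illustration, and the distance routines come from NumPy and SciPy:

```python
import numpy as np
from scipy.spatial import distance

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.0])   # y = 2x: a scaled version of x

print(distance.euclidean(x, y))      # Euclidean: sensitive to scale
print(distance.cityblock(x, y))      # city block (Manhattan)
print(distance.cosine(x, y))         # cosine *distance* = 1 - similarity; ~0 here (scaling-invariant)
print(np.corrcoef(x, y)[0, 1])       # Pearson correlation; exactly 1 here

# Jaccard similarity for two sets: |X ∩ Y| / |X ∪ Y|
X, Y = {1, 2, 3}, {2, 3, 4}
print(len(X & Y) / len(X | Y))       # 0.5
```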
Types of Clusterings
An important distinction is between hierarchical and partitional sets of clusters.
Partitional clustering: a division of the data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset.
Hierarchical clustering: a set of nested clusters organized as a hierarchical tree.
Partitional Clustering
[Figure: the original points and a partitional clustering of them.]
Hierarchical Clustering
[Figure: the same four points p1–p4 shown as nested clusters and as two alternative dendrograms, Dendrogram 1 and Dendrogram 2.]
Other Distinctions Between Sets of Clustering Methods
Exclusive versus non-exclusive: in non-exclusive clusterings, points may belong to multiple clusters; this can represent multiple classes or 'border' points.
Fuzzy versus non-fuzzy: in fuzzy clustering, a point belongs to every cluster with some weight between 0 and 1, and the weights must sum to 1.
Partial versus complete: in some cases, we only want to cluster some of the data.
Heterogeneous versus homogeneous: clusters of widely different sizes, shapes, and densities.
Clustering Algorithms Hierarchical clustering K-means Bi-clustering
Hierarchical Clustering
Produces a set of nested clusters organized as a hierarchical tree. Can be visualized as a dendrogram: a tree-like diagram that records the sequences of merges or splits.
[Figure: a dendrogram over points 1–6, with merge heights between 0 and 0.2, alongside the corresponding nested clusters.]
Strengths of Hierarchical Clustering
You do not have to assume any particular number of clusters: any desired number of clusters can be obtained by 'cutting' the dendrogram at the proper level.
The clusters may correspond to meaningful taxonomies, as in the biological sciences (e.g., the animal kingdom, phylogeny reconstruction, …).
Hierarchical Clustering
Two main types of hierarchical clustering:
Agglomerative (bottom up): start with the points as individual clusters; at each step, merge the closest pair of clusters until only one cluster (or k clusters) is left.
Divisive (top down): start with one, all-inclusive cluster; at each step, split a cluster until each cluster contains a single point (or there are k clusters).
Traditional hierarchical algorithms use a similarity or distance matrix and merge or split one cluster at a time.
Agglomerative Clustering Algorithm
The more popular hierarchical clustering technique. The basic algorithm is straightforward:
1. Compute the proximity matrix
2. Let each data point be a cluster
3. Repeat
4.   Merge the two closest clusters
5.   Update the proximity matrix
6. Until only a single cluster remains
The key operation is the computation of the proximity of two clusters; different approaches to defining the distance between clusters distinguish the different algorithms.
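A minimal sketch of this loop using SciPy, whose linkage routine implements exactly this agglomerative procedure; the 2-D points are made up for illustration:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

points = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                   [4.0, 4.0], [4.1, 4.2], [4.2, 3.9]])

# linkage() starts from singleton clusters and repeatedly merges the
# closest pair, recording each merge and its height.
Z = linkage(points, method="single")   # MIN / single link (see below)

# Cut the tree to obtain k = 2 flat clusters.
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)                          # e.g. [1 1 1 2 2 2]

# dendrogram(Z) draws the merge tree (requires matplotlib).
```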
Starting Situation
Start with clusters of individual points p1, p2, p3, p4, p5, … and a proximity matrix. [Figure: the points and the p1–p5 proximity matrix.]
Intermediate Situation
After some merging steps, we have some clusters C1–C5. [Figure: the current clusters and the C1–C5 proximity matrix.]
Intermediate Situation
We want to merge the two closest clusters (C2 and C5) and update the proximity matrix. [Figure: the clusters before the merge and the C1–C5 proximity matrix.]
After Merging
The question is: how do we update the proximity matrix? [Figure: the clusters after merging C2 and C5, with the rows and columns for C2 ∪ C5 in the proximity matrix marked "?".]
How to Define Inter-Cluster Similarity
[Figure: the p1–p5 proximity matrix, asking "Similarity?" between two clusters.]
MIN
MAX
Group Average
Distance Between Centroids
Other methods driven by an objective function (Ward's Method uses squared error)
Cluster Similarity: MIN or Single Link
Similarity of two clusters is based on the two most similar (closest) points in the different clusters. It is determined by one pair of points, i.e., by one link in the proximity graph.

     I1   I2   I3   I4   I5
I1  1.00 0.90 0.10 0.65 0.20
I2  0.90 1.00 0.70 0.60 0.50
I3  0.10 0.70 1.00 0.40 0.30
I4  0.65 0.60 0.40 1.00 0.80
I5  0.20 0.50 0.30 0.80 1.00
Hierarchical Clustering: MIN
[Figure: nested clusters over points 1–6 and the corresponding single-link dendrogram, with merge heights between 0 and 0.2.]
Strength of MIN
[Figure: original points vs. the two clusters found.]
• Can handle non-elliptical shapes

Limitations of MIN
[Figure: original points vs. the two clusters found.]
• Sensitive to noise and outliers
Cluster Similarity: MAX or Complete Linkage
Similarity of two clusters is based on the two least similar (most distant) points in the different clusters. It is determined by all pairs of points in the two clusters.

     I1   I2   I3   I4   I5
I1  1.00 0.90 0.10 0.65 0.20
I2  0.90 1.00 0.70 0.60 0.50
I3  0.10 0.70 1.00 0.40 0.30
I4  0.65 0.60 0.40 1.00 0.80
I5  0.20 0.50 0.30 0.80 1.00
Hierarchical Clustering: MAX
[Figure: nested clusters over points 1–6 and the corresponding complete-link dendrogram, with merge heights between 0 and 0.4.]
Strength of MAX
[Figure: original points vs. the two clusters found.]
• Less susceptible to noise and outliers

Limitations of MAX
[Figure: original points vs. the two clusters found.]
• Tends to break large clusters
• Biased towards globular clusters
Cluster Similarity: Group Average
Proximity of two clusters is the average of pairwise proximity between points in the two clusters:

$$proximity(Cluster_i, Cluster_j) = \frac{\sum_{p_i \in Cluster_i,\; p_j \in Cluster_j} proximity(p_i, p_j)}{|Cluster_i| \cdot |Cluster_j|}$$

Need to use average connectivity for scalability, since total proximity favors large clusters.

     I1   I2   I3   I4   I5
I1  1.00 0.90 0.10 0.65 0.20
I2  0.90 1.00 0.70 0.60 0.50
I3  0.10 0.70 1.00 0.40 0.30
I4  0.65 0.60 0.40 1.00 0.80
I5  0.20 0.50 0.30 0.80 1.00
Hierarchical Clustering: Group Average
[Figure: nested clusters over points 1–6 and the corresponding group-average dendrogram, with merge heights between 0 and 0.25.]
Hierarchical Clustering: Group Average
A compromise between single and complete link.
Strengths: less susceptible to noise and outliers.
Limitations: biased towards globular clusters.
Hierarchical Clustering: Comparison
[Figure: the same six points clustered with MIN, MAX, and Group Average, showing the different groupings each method produces.]
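A hedged sketch comparing these cluster-similarity definitions on the same data with SciPy; the method names "single", "complete", and "average" correspond to MIN, MAX, and Group Average, and the two-blob data are synthetic:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
points = np.vstack([rng.normal(0, 0.3, (20, 2)),   # blob around (0, 0)
                    rng.normal(3, 0.3, (20, 2))])  # blob around (3, 3)

for method in ("single", "complete", "average"):   # MIN, MAX, Group Average
    Z = linkage(points, method=method)
    labels = fcluster(Z, t=2, criterion="maxclust")
    print(method, labels[:5], labels[-5:])          # well-separated blobs: all agree
```

On messier data (noise, touching clusters) the three methods start to disagree, in the ways the preceding slides illustrate.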
Hierarchical Clustering: Problems and Limitations
Once a decision is made to combine two clusters, it cannot be undone.
Different schemes have problems with one or more of the following: sensitivity to noise and outliers; difficulty handling clusters of different sizes and convex shapes; breaking large clusters (divisive).
The dendrogram corresponding to a given hierarchical clustering is not unique, since for each merge one needs to specify which subtree goes on the left and which on the right.
They impose structure on the data, instead of revealing structure in the data.
How many clusters? (some suggestions later)
K-means Clustering
A partitional clustering approach. Each cluster is associated with a centroid (center point), and each point is assigned to the cluster with the closest centroid. The number of clusters, K, must be specified. The basic algorithm is very simple.
K-means Clustering – Details
Initial centroids are often chosen randomly, so the clusters produced vary from one run to another. The centroid is (typically) the mean of the points in the cluster. 'Closeness' is measured mostly by Euclidean distance (the typical choice), cosine similarity, correlation, etc. K-means will converge for the common similarity measures mentioned above, and most of the convergence happens in the first few iterations; often the stopping condition is changed to 'until relatively few points change clusters'. Complexity is O(n · K · I · d), where n = number of points, K = number of clusters, I = number of iterations, and d = number of attributes.
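A minimal from-scratch sketch of these iterations (assignment step, then centroid update, until the centroids stop moving); the function name and data handling are mine, and empty clusters are not handled:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Random initialization: pick k distinct points as initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point goes to its closest centroid (Euclidean).
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned points
        # (an empty cluster would yield NaN in this sketch).
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):   # converged
            break
        centroids = new_centroids
    sse = ((X - centroids[labels]) ** 2).sum()      # SSE, defined on the next slide
    return labels, centroids, sse
```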
Evaluating K-means Clusters
The most common measure is the Sum of Squared Error (SSE). For each point, the error is the distance to the nearest cluster centroid; to get the SSE, we square these errors and sum them:

$$SSE = \sum_{i=1}^{K} \sum_{x \in C_i} dist(m_i, x)^2$$

Here x is a data point in cluster C_i and m_i is the representative point for cluster C_i; one can show that m_i corresponds to the center (mean) of the cluster. Given two clusterings, we can choose the one with the smaller error. One easy way to reduce SSE is to increase K, the number of clusters, yet a good clustering with smaller K can have a lower SSE than a poor clustering with higher K.
Issues and Limitations for K-means
How to choose initial centers? How to choose K? How to handle outliers? Clusters differing in shape, density, and size.
K-means assumes clusters are spherical in vector space and is sensitive to coordinate changes.
Two Different K-means Clusterings
[Figure: the same original points partitioned two ways: an optimal clustering and a sub-optimal clustering.]
Importance of Choosing Initial Centroids
[Figure: six successive K-means iterations (Iteration 1–6) from one choice of initial centroids, converging to a clustering.]
Importance of Choosing Initial Centroids …
[Figure: five successive K-means iterations (Iteration 1–5) from a different choice of initial centroids, converging to a different final clustering.]
Solutions to the Initial Centroids Problem
Multiple runs.
Sample and use hierarchical clustering to determine initial centroids.
Select more than k initial centroids and then select among these the most widely separated ones.
Bisecting K-means: not as susceptible to initialization issues.
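A hedged sketch of the "multiple runs" remedy using scikit-learn, whose KMeans supports repeated restarts via n_init and a spread-out initialization via init="k-means++" (a later refinement of the "select widely separated centroids" idea); the three-blob data are synthetic:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.5, (50, 2)) for c in (0, 5, 10)])

# 10 independent runs; the fit with the lowest SSE (inertia) is kept.
km = KMeans(n_clusters=3, n_init=10, init="k-means++", random_state=0).fit(X)
print(km.inertia_)   # SSE of the best of the 10 runs
```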
Bisecting K-means
The bisecting K-means algorithm is a variant of K-means that can produce a partitional or a hierarchical clustering. [Figure: bisecting K-means example.]
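A minimal sketch of bisecting K-means under its usual formulation: repeatedly split the cluster with the largest SSE using 2-means. The helper names are mine, not from the slides, and each cluster is assumed to have at least two points:

```python
import numpy as np
from sklearn.cluster import KMeans

def bisecting_kmeans(X, k):
    clusters = [X]
    while len(clusters) < k:
        # Pick the cluster with the largest SSE around its own mean.
        sse = [((c - c.mean(axis=0)) ** 2).sum() for c in clusters]
        target = clusters.pop(int(np.argmax(sse)))
        # Split it in two with ordinary K-means.
        labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(target)
        clusters += [target[labels == 0], target[labels == 1]]
    return clusters   # the sequence of splits also yields a hierarchy
```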
Issues and Limitations for K-means
How to choose initial centers? (see the solutions above)
How to choose K? Depends on the problem + some suggestions later.
How to handle outliers? Preprocessing.
Clusters differing in shape, density, and size: illustrated below.
Limitations of K-means: Differing Sizes
[Figure: original points vs. K-means (3 clusters).]
Limitations of K-means: Differing Density
[Figure: original points vs. K-means (3 clusters).]
Limitations of K-means: Non-globular Shapes
[Figure: original points vs. K-means (2 clusters).]
Overcoming K-means Limitations
[Figures: for each of the three cases above, the original points vs. the K-means clusters when many clusters are used.]
One solution is to use many clusters: this finds parts of clusters, which then need to be put together.
K-means
Pros: simple; fast for low-dimensional data; it can find pure sub-clusters if a large number of clusters is specified.
Cons: K-means cannot handle non-globular data of different sizes and densities; K-means will not identify outliers; K-means is restricted to data which has the notion of a center (centroid).
Biclustering/Co-clustering
Two genes can have similar expression patterns only under some conditions; similarly, in two related conditions, some genes may exhibit different expression patterns. [Figure: an N-genes × M-conditions expression matrix.]
As a result, each cluster may involve only a subset of genes and a subset of conditions, which form a "checkerboard" structure.
Biclustering
In general a hard task (NP-hard). Heuristic algorithms described briefly:
Cheng & Church – deletion of rows and columns; biclusters are discovered one at a time.
Order-Preserving Submatrices (OPSM) – Ben-Dor et al.
Coupled Two-Way Clustering (CTWC) – Getz et al.
Spectral Co-clustering
Cheng and Church
Objective function for the heuristic method (to minimize):

$$H(I, J) = \frac{1}{|I||J|} \sum_{i \in I,\; j \in J} (a_{ij} - a_{iJ} - a_{Ij} + a_{IJ})^2$$

Greedy method:
Initialization: the bicluster contains all rows and columns.
Iteration: 1. Compute all a_Ij, a_iJ, a_IJ and H(I, J) for reuse. 2. Remove the row or column that gives the maximum decrease of H.
Termination: when no action will decrease H, or H ≤ δ. Mask this bicluster and continue.
Problem: removing "trivial" biclusters.
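A minimal NumPy sketch of the mean squared residue H(I, J) defined above, for a candidate bicluster given by row indices I and column indices J of an expression matrix A (all names are mine):

```python
import numpy as np

def msr(A, I, J):
    sub = A[np.ix_(I, J)]                         # the candidate bicluster
    row_means = sub.mean(axis=1, keepdims=True)   # a_iJ
    col_means = sub.mean(axis=0, keepdims=True)   # a_Ij
    overall = sub.mean()                          # a_IJ
    residue = sub - row_means - col_means + overall
    return (residue ** 2).mean()                  # H(I, J)
```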
Ben-Dor et al. (OPSM)
Model: for a condition set T and a gene g, the conditions in T can be ordered so that the expression values are sorted in ascending order (suppose the values are all unique). A submatrix A is a bicluster if there is an ordering (permutation) of T such that the expression values of all genes in G are sorted in ascending order.
Idea of the algorithm: grow partial models until they become complete models.

     t1  t2  t3  t4  t5
g1    7  13  19   2  50
g2   19  23  39   6  42
g3    4   6   8   2  10

Induced permutation: 2 3 4 1 5
Getz et al. (CTWC)
Idea: repeatedly perform one-way clustering on genes/conditions. Stable clusters of genes are used as the attributes for condition clustering, and vice versa.
Spectral Co-clustering
Main idea: normalize the two dimensions (rows and columns); use the SVD of the normalized matrix to embed the m rows and n columns together as m + n points; use k-means to cluster both types of data at once.
http://adios.tau.ac.il/SpectralCoClustering/
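A hedged usage sketch: scikit-learn ships an implementation of this scheme (sklearn.cluster.SpectralCoclustering, after Dhillon's algorithm); the random nonnegative matrix below stands in for a genes × conditions table:

```python
import numpy as np
from sklearn.cluster import SpectralCoclustering

A = np.abs(np.random.default_rng(0).normal(size=(30, 20)))  # genes x conditions

model = SpectralCoclustering(n_clusters=3, random_state=0).fit(A)
print(model.row_labels_)      # bicluster label per gene (row)
print(model.column_labels_)   # bicluster label per condition (column)
```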
Evaluating Cluster Quality
Use known classes (pairwise F-measure, best-class F-measure). Clusters can be evaluated with "internal" as well as "external" measures: internal measures are related to the inter/intra-cluster distance, while external measures are related to how representative the current clusters are of the "true" classes.
Inter/Intra Cluster Distances
Intra-cluster distance: the sum/min/max/avg of the (absolute/squared) distances between
- all pairs of points in the cluster, OR
- the centroid and all points in the cluster, OR
- the "medoid" and all points in the cluster.
Inter-cluster distance: sum the (squared) distances between all pairs of clusters, where the distance between two clusters is defined as
- the distance between their centroids/medoids (suits spherical clusters), OR
- the distance between the closest pair of points belonging to the clusters (suits chain-shaped clusters).
Davies-Bouldin Index
A function of the ratio of the sum of within-cluster (i.e., intra-cluster) scatter to between-cluster (i.e., inter-cluster) separation. Let C = {C1, …, Ck} be a clustering of a set of N objects:

$$DB = \frac{1}{k} \sum_{i=1}^{k} R_i$$

with

$$R_i = \max_{j=1..k,\; j \neq i} R_{ij}$$

and

$$R_{ij} = \frac{var(C_i) + var(C_j)}{\|c_i - c_j\|}$$

where C_i is the i-th cluster and c_i is the centroid of cluster i.
Davies-Bouldin Index Example
For the clusters shown, compute var(C1) = 0, var(C2) = 4.5, var(C3) = 2.33. The centroid is simply the mean here, so c1 = 3, c2 = 8.5, c3 = 18.33. So R12 = 1, R13 = 0.152, R23 = 0.797. Now compute R1 = 1 (max of R12 and R13), R2 = 1 (max of R21 and R23), R3 = 0.797 (max of R31 and R32). Finally, compute DB = 0.932.
Davies-Bouldin Index Example (ctd.)
For the clusters shown, there are only 2 clusters: var(C1) = 12.33 while var(C2) = 2.33; c1 = 6.67 while c2 = 18.33; so R12 = 1.26. Since we have only 2 clusters, R1 = R12 = 1.26 and R2 = R21 = 1.26. Finally, compute DB = 1.26.
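For checking such computations, scikit-learn provides davies_bouldin_score; note that its definition uses mean distances to the centroid rather than variances, so its numbers need not match the hand computation above exactly. A minimal sketch on made-up 1-D data:

```python
import numpy as np
from sklearn.metrics import davies_bouldin_score

X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
labels = np.array([0, 0, 0, 1, 1, 1])
print(davies_bouldin_score(X, labels))   # lower is better
```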
Other Criteria
Dunn method:

$$V = \min_{1 \le i \le c} \; \min_{\substack{1 \le j \le c,\; j \neq i}} \left( \frac{\delta(X_i, X_j)}{\max_{1 \le k \le c} \Delta(X_k)} \right)$$

where δ(X_i, X_j) is the inter-cluster distance between clusters X_i and X_j, and Δ(X_k) is the intra-cluster distance of cluster X_k.
Silhouette method: also useful for identifying outliers.
C-index: compare the sum of distances S over all pairs from the same cluster against the same number of smallest and largest pairs.
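A minimal sketch of using the silhouette criterion to choose the number of clusters with scikit-learn; the three-blob data are synthetic:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.4, (40, 2)) for c in (0, 4, 8)])

for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, silhouette_score(X, labels))   # higher is better; peaks at k = 3 here
```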
Example Dataset: AML/ALL (Golub et al.)
[Figure: genes/features × samples expression matrix.]
Leukemia: 72 patients (samples), 7129 genes, 4 groups. The two major types are ALL & AML; T and B cells within ALL; with/without treatment within AML.
AML/ALL Dataset
Davies-Bouldin index: C = 4
Dunn method: C = 2
Silhouette method: C = 2
Validity Index

Index    c = 2  c = 3  c = 4  c = 5  c = 6
V11       0.18   0.25   0.20   0.19   0.19
V21       1.21   0.69   0.58   0.49   0.47
V31       0.68   0.37   0.39   0.32   0.33
V41       0.55   0.22   0.27   0.21   0.23
V51       0.61   0.32   0.33   0.27   0.28
V61       0.93   0.52   0.53   0.42   0.40
V12       0.62   0.56   0.63   0.59   0.61
V22       4.27   2.20   1.87   1.56   1.49
V32       2.40   1.18   1.24   1.03   1.06
V42       1.95   0.70   0.85   0.66   0.73
V52       2.15   1.01   1.07   0.86   0.90
V62       3.26   1.66   1.69   1.36   1.28
V13       0.23   0.20   0.23   0.22   0.23
V23       1.55   0.80   0.69   0.57   0.55
V33       0.87   0.43   0.46   0.37   0.39
V43       0.71   0.26   0.31   0.24   0.27
V53       0.78   0.32   0.39   0.31   0.33
V63       1.18   0.61   0.63   0.50   0.47
Average   1.34   0.68   0.69   0.57   0.57
Visual Evaluation – Coherency
Cluster quality example: do you see clusters? Silhouette scores for two example data sets:

C    Silhouette (1)   Silhouette (2)
2       0.4922           0.4863
3       0.5739           0.5762
4       0.4773           0.5957
5       0.4991           0.5351
6       0.5404           0.5701
7       0.541            0.5487
8       0.5171           0.5083
9       0.5956           0.5311
10      0.6446           0.5229
Dimensionality Reduction
Map points in high-dimensional space to a lower number of dimensions, preserving structure (pairwise distances, etc.). Useful for further processing: less computation, fewer parameters; easier to understand and visualize.
Dimensionality Reduction: Feature Selection vs. Feature Extraction
Feature selection: select important features.
Pros: meaningful features; less work acquiring them.
Unsupervised examples: variance, fold change, UFF.
Dimensionality Reduction: Feature Extraction
Transforms the entire feature set to a lower dimension.
Pros: uses an objective function to select the best projection; sometimes single features are not good enough.
Unsupervised examples: PCA, SVD.
Principal Components Analysis (PCA)
Approximates a high-dimensional data set with a lower-dimensional linear subspace.
[Figure: a 2-D cloud of data points with the original axes and the first and second principal components overlaid.]
Singular Value Decomposition
Principal Components Analysis (PCA)
Rule of thumb for selecting the number of components: look for a "knee" in the screeplot, or keep components up to a target cumulative percentage of variance.
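A minimal PCA sketch with scikit-learn illustrating both rules of thumb; the correlated data are synthetic:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10)) @ rng.normal(size=(10, 10))  # correlated features

pca = PCA().fit(X)
print(pca.explained_variance_ratio_)             # screeplot values: look for a knee
print(np.cumsum(pca.explained_variance_ratio_))  # keep components up to, e.g., 90%

X_reduced = PCA(n_components=2).fit_transform(X)  # project onto the first 2 PCs
```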
Tools for Clustering
Matlab – COMPACT: http://adios.tau.ac.il/compact/
Cluster + TreeView (Eisen et al.): http://rana.lbl.gov/eisen/?page_id=42
Summary
Clustering is ill-defined and considered an "art". In practice, this means you need to understand your data beforehand and know how to interpret the clusters afterwards. The problem determines the best solution (which measure, which clustering algorithm), so try to experiment with different options.