Object Oriented Data Analysis, Last Time

Post on 12-Jan-2016

Object Oriented Data Analysis, Last Time

• Finished Q-Q Plots

– Assess variability with Q-Q Envelope Plot

• SigClust

– When is a cluster “really there”?

– Statistic: 2-means Cluster Index

– Gaussian null distribution

– Fit to data (for HDLSS data, using invariance)

– P-values by simulation

– Breast Cancer Data
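For reference, the 2-means Cluster Index used as the SigClust statistic is usually written as the ratio of the within-cluster sum of squares to the total sum of squares. The notation below is an assumption, since the slides do not spell it out: $C_1, C_2$ are the two clusters, $\bar{x}^{(k)}$ is the mean of cluster $k$, and $\bar{x}$ is the overall mean.

```latex
\mathrm{CI}
  = \frac{\sum_{k=1}^{2} \sum_{j \in C_k} \left\| x_j - \bar{x}^{(k)} \right\|^2}
         {\sum_{j=1}^{n} \left\| x_j - \bar{x} \right\|^2}
```

Under this form CI lies between 0 and 1, and smaller values indicate a tighter split into two clusters.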

More on K-Means Clustering

Classical Algorithm (from MacQueen, 1967)

• Start with initial means

• Cluster: assign each data point to the closest mean

• Recompute each class mean

• Stop when assignments no longer change

Demo from: http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/AppletKM.html
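The four steps above can be sketched in a few lines of Python. This is a minimal illustration, not the code behind the demo: the function name `k_means`, the toy data, and the choice to leave an empty class at its old mean are all assumptions made for the example.

```python
import numpy as np

def k_means(data, start_means, max_iter=100):
    """Batch k-means: alternate assignment and mean updates until stable."""
    means = np.asarray(start_means, dtype=float).copy()
    labels = None
    for _ in range(max_iter):
        # Cluster: squared Euclidean distance from every point to every mean.
        dist2 = ((data[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        new_labels = dist2.argmin(axis=1)
        # Stop when no assignment changes.
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Recompute each class mean (an empty class keeps its old mean).
        for k in range(len(means)):
            if np.any(labels == k):
                means[k] = data[labels == k].mean(axis=0)
    return labels, means

# Hypothetical toy data: two tight groups of three points each.
data = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
                 [3.0, 3.0], [3.2, 2.9], [2.9, 3.1]])

# Deliberately poor start: both initial means sit in the first group.
labels, means = k_means(data, start_means=data[[0, 1]])
print(labels)  # [0 0 0 1 1 1]
print(means)
```

Even from this poor start, the first reassignment moves the second mean toward the far group and the iteration settles on the obvious split.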

More on K-Means Clustering

Raw Data

2 Starting Centers

More on K-Means Clustering

Assign Each Data Point To Nearest Center

Recompute Mean

Re-assign

More on K-Means Clustering

Recompute Mean

Re-Assign Data Points To Nearest Center

More on K-Means Clustering

Recompute Mean

Re-Assign Data Points To Nearest Center

More on K-Means Clustering

Recompute Mean

Final Assignment

More on K-Means Clustering

New Example

Raw Data

Deliberately Strange Starting Centers

More on K-Means Clustering

Assign Clusters To Given Means

Note poor clustering

More on K-Means Clustering

Recompute Mean

Re-assign

Shows Improvement

More on K-Means Clustering

Recompute Mean

Re-assign

Shows Improvement

Now very good

More on K-Means Clustering

Different Example

Best 2-means Cluster?

Local Minima?

More on K-Means Clustering

Assign

Recompute Mean

Re-assign

Note poor clustering

More on K-Means Clustering

Recompute Mean

Final Assignment

Stuck in Local Min

More on K-Means Clustering

Same Data

But slightly different starting points

Impact???

More on K-Means Clustering

Assign

Recompute Mean

Re-assign

Note poor clustering

More on K-Means Clustering

Recompute Mean

Final Assignment

Now get Global Min
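The contrast between the last two runs (one stuck in a local minimum, one reaching the global minimum) can be reproduced on a toy data set. The sketch below is an assumption-laden illustration, not the data from the slides: four hypothetical points at the corners of a 4-by-1 rectangle, with starting means taken from the data points, and helper names `k_means` and `within_ss` invented for the example.

```python
import numpy as np

def k_means(data, start_means, max_iter=100):
    """Batch k-means: assign to nearest mean, recompute means, repeat."""
    means = np.asarray(start_means, dtype=float).copy()
    labels = None
    for _ in range(max_iter):
        dist2 = ((data[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        new_labels = dist2.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        for k in range(len(means)):
            if np.any(labels == k):
                means[k] = data[labels == k].mean(axis=0)
    return labels, means

def within_ss(data, labels, means):
    """Within-cluster sum of squares, the criterion k-means tries to reduce."""
    return float(((data - means[labels]) ** 2).sum())

# Hypothetical data: corners of a rectangle, 4 wide and 1 tall.
data = np.array([[0.0, 0.0], [0.0, 1.0], [4.0, 0.0], [4.0, 1.0]])

# Start A: both starting means on the left edge -> splits bottom vs. top.
labels_a, means_a = k_means(data, data[[0, 1]])
# Start B: one starting mean on each side -> splits left vs. right.
labels_b, means_b = k_means(data, data[[0, 2]])

print(labels_a, within_ss(data, labels_a, means_a))  # [0 1 0 1] 16.0
print(labels_b, within_ss(data, labels_b, means_b))  # [0 0 1 1] 1.0
```

Both runs satisfy the stopping rule, since no point changes class, yet start A ends with a within-cluster sum of squares sixteen times larger than start B. That is the sense in which the algorithm can be stuck in a local minimum.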

More on K-Means Clustering

??? Next time:

Redo above, using my own Matlab calculations

That way can show each step

And get right answers.

More on K-Means Clustering

Now explore starting values:

• Approach: randomly choose 2 data points as starting means

• Does this give stable solutions?

• Explore for different point configurations

• And try 100 random choices

• Do 2-d examples for easy visualization
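The experiment outlined above can be sketched as follows. This is a hedged illustration rather than the Matlab code behind the slides: the normal-mixture parameters, the seed, and the helper names `k_means` and `cluster_index` are assumptions, and the Cluster Index is taken to be the within-cluster sum of squares divided by the total sum of squares.

```python
import numpy as np

def k_means(data, start_means, max_iter=100):
    """Batch k-means: assign to nearest mean, recompute means, repeat."""
    means = np.asarray(start_means, dtype=float).copy()
    labels = None
    for _ in range(max_iter):
        dist2 = ((data[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        new_labels = dist2.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        for k in range(len(means)):
            if np.any(labels == k):
                means[k] = data[labels == k].mean(axis=0)
    return labels, means

def cluster_index(data, labels):
    """Within-cluster sum of squares divided by total sum of squares."""
    total = ((data - data.mean(axis=0)) ** 2).sum()
    within = sum(((data[labels == k] - data[labels == k].mean(axis=0)) ** 2).sum()
                 for k in np.unique(labels))
    return float(within / total)

rng = np.random.default_rng(0)
# Hypothetical 2-d normal mixture: two clusters of 50 points each.
data = np.vstack([rng.normal([-3.0, 0.0], 1.0, size=(50, 2)),
                  rng.normal([3.0, 0.0], 1.0, size=(50, 2))])

cis = []
for _ in range(100):
    # Starting values: 2 distinct data points chosen at random.
    start = rng.choice(len(data), size=2, replace=False)
    labels, _ = k_means(data, data[start])
    cis.append(cluster_index(data, labels))

cis = np.sort(cis)
print("smallest CI:", cis[0], "largest CI:", cis[-1])
```

Sorting the 100 final CI values makes stability easy to judge: if every start reaches the same solution the sorted values are flat, while distinct local minima appear as steps.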

More on K-Means Clustering

2 Clusters: Raw Data (Normal mixture)

More on K-Means Clustering

2 Clusters: Cluster Index, based on 100 Random Starts

More on K-Means Clustering

2 Clusters: Chosen Clustering

More on K-Means Clustering

2 Clusters Results

• All starts end up with good answer

• Answer is very good (CI = 0.03)

• No obvious local minima

More on K-Means Clustering

Stretched Gaussian: Raw Data

More on K-Means Clustering

Stretched Gaussian: CI, based on 100 Random Starts

More on K-Means Clustering

Stretched Gaussian: Chosen Clustering

More on K-Means Clustering

Stretched Gaussian Results

• All starts end up with same answer

• Answer is less good (CI = 0.35)

• No obvious local minima

More on K-Means Clustering

Standard Gaussian: Raw Data

More on K-Means Clustering

Standard Gaussian: CI, based on 100 Random Starts

More on K-Means Clustering

Standard Gaussian: Chosen Clustering

More on K-Means Clustering

Standard Gaussian Results

• All starts end up with same answer

• Answer even less good (CI = 0.62)

• No obvious local minima

• So still stable, despite poor CI

More on K-Means Clustering

4 Balanced Clusters: Raw Data (Normal mixture)

More on K-Means Clustering

4 Balanced Clusters: CI, based on 100 Random Starts

More on K-Means Clustering

4 Balanced Clusters 100 Random Starts

• Many different solutions appear

• I.e. there are many local minima

• Sorting on CI (bottom) shows how many

• 2 seem smaller than others

• What are other local minima?

Understand with deeper visualization

More on K-Means Clustering

4 Balanced Clusters: Class Assignment Image Plot

More on K-Means Clustering

4 Balanced Clusters: Vertically Regroup (better view?)

More on K-Means Clustering

4 Balanced Clusters: Choose cases to “flip” – color cases

More on K-Means Clustering

4 Balanced Clusters: Choose cases to “flip” – color cases

More on K-Means Clustering

4 Balanced Clusters: “flip”, shows local min clusters

More on K-Means Clustering

4 Balanced Clusters: sort columns, for better visualization

More on K-Means Clustering

4 Balanced Clusters: CI, based on 100 Random Starts

More on K-Means Clustering

4 Balanced Clusters: Color according to local minima

More on K-Means Clustering

4 Balanced Clusters: Chosen Clustering, smallest CI

More on K-Means Clustering

4 Balanced Clusters: Chosen Clustering, 2nd small CI

More on K-Means Clustering

4 Balanced Clusters: Chosen Clustering, larger 3rd CI

More on K-Means Clustering

4 Balanced Clusters: Chosen Clustering, larger 4th CI

More on K-Means Clustering

4 Balanced Clusters: Chosen Clustering, larger 5th CI

More on K-Means Clustering

4 Balanced Clusters: Chosen Clustering, larger 6th CI

More on K-Means Clustering

4 Balanced Clusters Results

• Many Local Minima

• Two good ones appear often (2-2 splits)

• 4 worse ones (1-3 splits less common)

• 1 with single strange point

• Overall very unstable

• Raises concern over starting values
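One way to tally the local minima summarized above is to group the 100 random starts by their final CI. The sketch below is only an illustration under stated assumptions: the data are four hypothetical tight normal clusters at the corners of a unit square (not the data in the slides), 2-means is run from 2 randomly chosen data points, and the helper names `k_means` and `cluster_index` are invented for the example.

```python
from collections import Counter

import numpy as np

def k_means(data, start_means, max_iter=100):
    """Batch k-means: assign to nearest mean, recompute means, repeat."""
    means = np.asarray(start_means, dtype=float).copy()
    labels = None
    for _ in range(max_iter):
        dist2 = ((data[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        new_labels = dist2.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        for k in range(len(means)):
            if np.any(labels == k):
                means[k] = data[labels == k].mean(axis=0)
    return labels, means

def cluster_index(data, labels):
    """Within-cluster sum of squares divided by total sum of squares."""
    total = ((data - data.mean(axis=0)) ** 2).sum()
    within = sum(((data[labels == k] - data[labels == k].mean(axis=0)) ** 2).sum()
                 for k in np.unique(labels))
    return float(within / total)

rng = np.random.default_rng(1)
# Hypothetical balanced data: 25 points around each corner of a unit square.
corners = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
data = np.vstack([rng.normal(c, 0.05, size=(25, 2)) for c in corners])

tally = Counter()
for _ in range(100):
    start = rng.choice(len(data), size=2, replace=False)
    labels, _ = k_means(data, data[start])
    # Runs that end at the same solution share the same (rounded) CI.
    tally[round(cluster_index(data, labels), 3)] += 1

for ci, count in sorted(tally.items()):
    print(f"CI = {ci:.3f}: reached by {count} of 100 starts")
```

For data laid out like this, a 2-2 split of adjacent corners gives a CI near 0.5 and a 1-3 split gives a CI near 0.67, so the sorted tally separates the two kinds of solution. Grouping by rounded CI is a shortcut: two different clusterings with nearly equal CI would need the label vectors compared directly.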

More on K-Means Clustering

4 Unbalanced Clusters: Raw Data (try for stability)

More on K-Means Clustering

4 Unbalanced Clusters: CI, based on 100 Random Starts

More on K-Means Clustering

4 Unbalanced Clusters: Recolor by CI

More on K-Means Clustering

4 Unbalanced Clusters: Chosen Clustering, smallest CI

More on K-Means Clustering

4 Unbalanced Clusters: Chosen Clustering, 2nd small CI

More on K-Means Clustering

4 Unbalanced Clusters: Chosen Clustering, larger 3rd CI

More on K-Means Clustering

4 Unbalanced Clusters Results

• Fewer Local Minima (more stable)

• Two good ones appear often (2-2 splits)

• Single 1-3 split less common

• Previous instability caused by balance?

• Maybe stability OK after all?

More on K-Means Clustering

Data on Circle: Raw Data (maximal instability?)

More on K-Means Clustering

Data on Circle: CI, based on 100 Random Starts

More on K-Means Clustering

Data on Circle: Recolor by CI

More on K-Means Clustering

Data on Circle: Chosen Clustering, smallest CI

More on K-Means Clustering

Data on Circle: Chosen Clustering, 2nd small CI

More on K-Means Clustering

Data on Circle: Chosen Clustering, 3rd small CI

More on K-Means Clustering

Data on Circle Results

• Seems many local minima

• Several are the same?

• Could be programming error?

• But clear this is an unstable example

K-Means Clustering Caution

• This is all a personal view

• Others would present different aspects

• E.g. replace Euclidean distance by other distances

• E.g. other types of clustering

• E.g. heat-map dendrogram views

SigClust Breast Cancer Data

K-means Clustering & Starting Values

Try 100 random starts

For full data set: Study Final CIs

• Shows just two solutions

Study changes in data, with image view

• Shows little difference between these

Overall: Typical for clusters that can be split. When the split is clear, it is easily found

SigClust Random Restarts, Full Data

SigClust Random Restarts, Full Data

SigClust Breast Cancer Data

For full Chuck Class (e.g. Luminal B): Study Final CIs

• Shows several solutions

Study changes in data, with image view

• Shows multiple, divergent minima

Overall: Typical for “terminal” clusters. When there is no clear split, many local optima appear

Could base test on number of local optima???

SigClust Random Restarts, Luminal B

SigClust Random Restarts, Luminal B

SigClust Breast Cancer Data

??? Next time: show many more of these, to better build this case.
