Support Vector Machines



TRANSCRIPT

Page 1: Support Vector Machines

Support Vector Machines
Graphical View, using Toy Example:

Page 2: Support Vector Machines

Distance Weighted Discrim’n

Graphical View, using Toy Example:

Page 3: Support Vector Machines

HDLSS Discrim’n Simulations

Wobble Mixture:

Page 4: Support Vector Machines

HDLSS Discrim’n Simulations

Wobble Mixture: 80% dim. 1, other dims 0; 20% dim. 1 ±0.1, rand dim ±100, others 0
• MD still very bad, driven by outliers
• SVM & DWD are both very robust
• SVM loses (affected by margin push)
• DWD slightly better (by w’ted influence)
• Methods converge for higher dimension??
Ignore RLR (a mistake)

Page 5: Support Vector Machines

Melanoma Data
• Study Differences Between (Malignant) Melanoma & (Benign) Nevi
• Use Image Features as Before (Recall from Transformation Discussion)

Paper: Miedema et al (2012)

Page 6: Support Vector Machines

Clinical diagnosis

Background / Introduction

Page 7: Support Vector Machines

ROC Curve

Slide Cutoff To Trace Out Curve
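A minimal sketch (with made-up scores and labels) of the point this slide illustrates: sliding a cutoff along 1-d scores traces out the ROC curve, and the area under the traced curve gives the AUC.

```python
import numpy as np

# Hypothetical 1-d scores (e.g. projections on a DWD direction) and labels.
rng = np.random.default_rng(0)
scores = np.concatenate([rng.normal(0.0, 1.0, 100), rng.normal(1.5, 1.0, 100)])
labels = np.concatenate([np.zeros(100), np.ones(100)])

# Slide the cutoff through all observed scores to trace out the curve.
cutoffs = np.sort(scores)[::-1]
tpr = [(scores[labels == 1] >= c).mean() for c in cutoffs]  # true positive rate
fpr = [(scores[labels == 0] >= c).mean() for c in cutoffs]  # false positive rate

# Area under the traced curve, via the trapezoidal rule.
print(f"AUC = {np.trapz(tpr, fpr):.3f}")
```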

Page 8: Support Vector Machines

Melanoma Data

Subclass DWD Direction

Page 9: Support Vector Machines

Melanoma Data

Full Data DWD Direction

Page 10: Support Vector Machines

Melanoma Data

Full Data ROC Analysis

AUC = 0.93

Page 11: Support Vector Machines

Melanoma Data

SubClass ROC Analysis

AUC = 0.95

Better, Makes Intuitive Sense

Page 12: Support Vector Machines

Clustering
Idea: Given data $X_1, \dots, X_n$
• Assign each object to a class
• Of similar objects
• Completely data driven
• I.e. assign labels to data
• “Unsupervised Learning”

Contrast to Classification (Discrimination)
• With predetermined classes
• “Supervised Learning”

Page 13: Support Vector Machines

K-means Clustering

Clustering Goal:
• Given data $X_1, \dots, X_n$
• Choose classes $C_1, \dots, C_K$
• To minimize the cluster index

$$CI(C_1, \dots, C_K) = \frac{\sum_{j=1}^{K} \sum_{i \in C_j} \|X_i - \bar{X}_j\|^2}{\sum_{i=1}^{n} \|X_i - \bar{X}\|^2}$$
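A minimal sketch of the cluster index computation (the function name and toy data are mine, not the lecture’s):

```python
import numpy as np

def cluster_index(X, labels):
    """Cluster index CI(C_1, ..., C_K): within-class sum of squares
    about the class means, divided by the total sum of squares about
    the overall mean."""
    X, labels = np.asarray(X, dtype=float), np.asarray(labels)
    total = np.sum((X - X.mean(axis=0)) ** 2)
    within = sum(np.sum((X[labels == k] - X[labels == k].mean(axis=0)) ** 2)
                 for k in np.unique(labels))
    return within / total

# Toy example: two well-separated classes give a CI well below 1.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
print(f"CI = {cluster_index(X, np.repeat([0, 1], 50)):.3f}")
```

K-means chooses the class labels to make this ratio as small as possible.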

Page 14: Support Vector Machines

2-means Clustering

Page 15: Support Vector Machines

2-means Clustering

Study CI, using simple 1-d examples

• Over changing Classes (moving b’dry)

• Multi-modal data: interesting effects

– Local mins can be hard to find

– i.e. iterative procedures can “get stuck”

(even in 1 dimension, with K = 2)
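A minimal 1-d demonstration of the “get stuck” point above: plain Lloyd iterations for K = 2, started from two different center pairs, settle into different local minima of the within-class sum of squares (all data settings here are invented):

```python
import numpy as np

def lloyd_1d(x, centers, n_iter=50):
    """Plain Lloyd (k-means) iterations in one dimension, K = 2."""
    c = np.array(centers, dtype=float)
    for _ in range(n_iter):
        labels = np.abs(x[:, None] - c[None, :]).argmin(axis=1)
        c = np.array([x[labels == k].mean() for k in range(2)])
    return c, labels

# Trimodal 1-d data with unequal mode sizes (made-up settings).
rng = np.random.default_rng(3)
x = np.concatenate([rng.normal(0, 0.2, 120),
                    rng.normal(2, 0.2, 50),
                    rng.normal(4, 0.2, 80)])

# Two starting points converge to two different local minima of the
# within-class sum of squares: the iteration "gets stuck".
for start in ([0.6, 4.0], [0.0, 3.2]):
    c, labels = lloyd_1d(x, start)
    within = sum(((x[labels == k] - c[k]) ** 2).sum() for k in range(2))
    print(f"start={start} -> centers={np.round(c, 2)}, within-SS={within:.1f}")
```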

Page 16: Support Vector Machines

SWISS Score
Another Application of CI (Cluster Index)

Cabanski et al (2010)

Page 17: Support Vector Machines

SWISS Score
Another Application of CI (Cluster Index)

Cabanski et al (2010)

Idea: Use CI in bioinformatics to “measure quality of data preprocessing”

Page 18: Support Vector Machines

SWISS Score
Another Application of CI (Cluster Index)

Cabanski et al (2010)

Idea: Use CI in bioinformatics to “measure quality of data preprocessing”

Philosophy: Clusters Are Scientific Goal
So Want to Accentuate Them

Page 19: Support Vector Machines

SWISS Score
Toy Examples (2-d): Which are “More Clustered?”

Page 20: Support Vector Machines

SWISS Score
Toy Examples (2-d): Which are “More Clustered?”

Page 21: Support Vector Machines

SWISS Score

SWISS =

Standardized Within class Sum of Squares

$$SWISS(C_1, \dots, C_K) = \frac{\sum_{j=1}^{K} \sum_{i \in C_j} \|X_i - \bar{X}_j\|^2}{\sum_{i=1}^{n} \|X_i - \bar{X}\|^2}$$

Page 22: Support Vector Machines

SWISS Score

SWISS =

Standardized Within class Sum of Squares

$$SWISS(C_1, \dots, C_K) = \frac{\sum_{j=1}^{K} \sum_{i \in C_j} \|X_i - \bar{X}_j\|^2}{\sum_{i=1}^{n} \|X_i - \bar{X}\|^2}$$

Page 23: Support Vector Machines

SWISS Score

SWISS =

Standardized Within class Sum of Squares

$$SWISS(C_1, \dots, C_K) = \frac{\sum_{j=1}^{K} \sum_{i \in C_j} \|X_i - \bar{X}_j\|^2}{\sum_{i=1}^{n} \|X_i - \bar{X}\|^2}$$

Page 24: Support Vector Machines

SWISS Score

SWISS =

Standardized Within class Sum of Squares

$$SWISS(C_1, \dots, C_K) = \frac{\sum_{j=1}^{K} \sum_{i \in C_j} \|X_i - \bar{X}_j\|^2}{\sum_{i=1}^{n} \|X_i - \bar{X}\|^2}$$

Page 25: Support Vector Machines

SWISS Score
Nice Graphical Introduction:

$$SWISS(C_1, \dots, C_K) = \frac{\sum_{j=1}^{K} \sum_{i \in C_j} \|X_i - \bar{X}_j\|^2}{\sum_{i=1}^{n} \|X_i - \bar{X}\|^2}$$

Page 26: Support Vector Machines

SWISS Score
Nice Graphical Introduction:

$$SWISS(C_1, \dots, C_K) = \frac{\sum_{j=1}^{K} \sum_{i \in C_j} \|X_i - \bar{X}_j\|^2}{\sum_{i=1}^{n} \|X_i - \bar{X}\|^2}$$

Page 27: Support Vector Machines

SWISS Score
Nice Graphical Introduction:

$$SWISS(C_1, \dots, C_K) = \frac{\sum_{j=1}^{K} \sum_{i \in C_j} \|X_i - \bar{X}_j\|^2}{\sum_{i=1}^{n} \|X_i - \bar{X}\|^2}$$

Page 28: Support Vector Machines

SWISS Score
Revisit Toy Examples (2-d): Which are “More Clustered?”

Page 29: Support Vector Machines

SWISS Score
Toy Examples (2-d): Which are “More Clustered?”

Page 30: Support Vector Machines

SWISS Score
Toy Examples (2-d): Which are “More Clustered?”

Page 31: Support Vector Machines

SWISS Score

Now Consider K > 2:

How well does it work?

$$SWISS(C_1, \dots, C_K) = \frac{\sum_{j=1}^{K} \sum_{i \in C_j} \|X_i - \bar{X}_j\|^2}{\sum_{i=1}^{n} \|X_i - \bar{X}\|^2}$$

Page 32: Support Vector Machines

SWISS Score

K = 3, Toy Example:

Page 33: Support Vector Machines

SWISS Score

K = 3, Toy Example:

Page 34: Support Vector Machines

SWISS Score

K = 3, Toy Example:

Page 35: Support Vector Machines

SWISS Score

K = 3, Toy Example:

But C Seems “Better Clustered”

Page 36: Support Vector Machines

SWISS Score

K-Class SWISS:

Instead of using K-Class CI

Use Average of Pairwise SWISS Scores

Page 37: Support Vector Machines

SWISS Score

K-Class SWISS:

Instead of using K-Class CI

Use Average of Pairwise SWISS Scores

(Preserves [0,1] Range)
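A minimal sketch of one plausible reading of the K-class SWISS: compute the two-class SWISS on each pair of classes, then average (function names and toy data are mine; the actual definition is in Cabanski et al (2010)):

```python
import numpy as np
from itertools import combinations

def swiss_pair(X, labels, a, b):
    """Two-class SWISS for classes a, b: within-class sum of squares over
    total sum of squares, computed on the subset from the two classes."""
    mask = np.isin(labels, [a, b])
    Xs, ls = X[mask], labels[mask]
    total = np.sum((Xs - Xs.mean(axis=0)) ** 2)
    within = sum(np.sum((Xs[ls == k] - Xs[ls == k].mean(axis=0)) ** 2)
                 for k in (a, b))
    return within / total

def swiss_k_class(X, labels):
    """K-class SWISS: average of all pairwise SWISS scores (stays in [0, 1])."""
    classes = np.unique(labels)
    return np.mean([swiss_pair(X, labels, a, b)
                    for a, b in combinations(classes, 2)])

# Toy K = 3 example with made-up class centers.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(m, 1, (40, 2)) for m in (-3, 0, 3)])
print(f"avg pairwise SWISS = {swiss_k_class(X, np.repeat([0, 1, 2], 40)):.3f}")
```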

Page 38: Support Vector Machines

SWISS Score

Avg. Pairwise SWISS – Toy Examples

Page 39: Support Vector Machines

SWISS Score

Additional Feature:

Ǝ Hypothesis Tests:
H1: SWISS_1 < 1
H1: SWISS_1 < SWISS_2

Permutation Based

See Cabanski et al (2010)

Page 40: Support Vector Machines

Clustering

• A Very Large Area

• K-Means is Only One Approach

Page 41: Support Vector Machines

Clustering

• A Very Large Area

• K-Means is Only One Approach

• Has its Drawbacks

(Many Toy Examples of This)

Page 42: Support Vector Machines

Clustering

• A Very Large Area

• K-Means is Only One Approach

• Has its Drawbacks

(Many Toy Examples of This)

• Ǝ Many Other Approaches

Page 43: Support Vector Machines

Clustering

Recall Important References (monographs):

• Hartigan (1975)

• Gersho and Gray (1992)

• Kaufman and Rousseeuw (2005)

See Also Wikipedia

Page 44: Support Vector Machines

Clustering

• A Very Large Area

• K-Means is Only One Approach

• Has its Drawbacks

(Many Toy Examples of This)

• Ǝ Many Other Approaches

• Important (And Broad) Class

Hierarchical Clustering

Page 45: Support Vector Machines

Hierarchical Clustering
Idea: Consider Either:

Bottom Up Aggregation:

One by One Combine Data

Top Down Splitting:

All Data in One Cluster & Split

Through Entire Data Set, to get Dendrogram
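A minimal bottom-up sketch with SciPy (toy data; the metric and linkage choices here are illustrative, not the lecture’s):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

# Toy data: three loose groups in 2-d (made-up settings).
rng = np.random.default_rng(4)
X = np.vstack([rng.normal(m, 0.3, (10, 2)) for m in (0.0, 2.0, 4.0)])

# Bottom-up aggregation: 'ward' linkage on Euclidean distance; other
# metrics (L1, L-infinity, ...) and linkages (single, complete, average)
# change the vertical axis of the dendrogram.
Z = linkage(X, method="ward", metric="euclidean")
dendrogram(Z)
plt.ylabel("merge height (linkage value)")
plt.show()
```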

Page 46: Support Vector Machines

Hierarchical Clustering

Aggregate or Split, to get Dendrogram

Thanks to US EPA: water.epa.gov

Page 47: Support Vector Machines

Hierarchical Clustering

Aggregate or Split, to get Dendrogram

Aggregate: Start With Individuals, Move Up

Page 48: Support Vector Machines

Hierarchical Clustering

Aggregate or Split, to get Dendrogram

Aggregate: Start With Individuals

Page 49: Support Vector Machines

Hierarchical Clustering

Aggregate or Split, to get Dendrogram

Aggregate: Start With Individuals, Move Up

Page 50: Support Vector Machines

Hierarchical Clustering

Aggregate or Split, to get Dendrogram

Aggregate: Start With Individuals, Move Up

Page 51: Support Vector Machines

Hierarchical Clustering

Aggregate or Split, to get Dendrogram

Aggregate: Start With Individuals, Move Up

Page 52: Support Vector Machines

Hierarchical Clustering

Aggregate or Split, to get Dendrogram

Aggregate: Start With Individuals, Move Up, End Up With All in 1 Cluster

Page 53: Support Vector Machines

Hierarchical Clustering

Aggregate or Split, to get Dendrogram

Split: Start With All in 1 Cluster

Page 54: Support Vector Machines

Hierarchical Clustering

Aggregate or Split, to get Dendrogram

Split: Start With All in 1 Cluster, Move Down

Page 55: Support Vector Machines

Hierarchical Clustering

Aggregate or Split, to get Dendrogram

Split: Start With All in 1 Cluster, Move Down, End With Individuals

Page 56: Support Vector Machines

Hierarchical Clustering
Vertical Axis in Dendrogram:

Based on:

Distance Metric

(Ǝ many, e.g. L2, L1, L∞, …)

Linkage Function

(Reflects Clustering Type)

Page 57: Support Vector Machines

Hierarchical Clustering
Vertical Axis in Dendrogram:

Based on:

Distance Metric

(Ǝ many, e.g. L2, L1, L∞, …)

Linkage Function

(Reflects Clustering Type)

A Lot of “Art” Involved

Page 58: Support Vector Machines

Hierarchical Clustering

Dendrogram Interpretation

Branch Length Reflects Cluster Strength

Page 59: Support Vector Machines

SigClust

• Statistical Significance of Clusters in HDLSS Data
• When is a cluster “really there”?

Page 60: Support Vector Machines

SigClust

• Statistical Significance of Clusters in HDLSS Data
• When is a cluster “really there”?

Liu et al (2007), Huang et al (2014)

Page 61: Support Vector Machines

SigClust

Co-authors:

Andrew Nobel – UNC Statistics & OR

C. M. Perou – UNC Genetics

D. N. Hayes – UNC Oncology & Genetics

Yufeng Liu – UNC Statistics & OR

Hanwen Huang – U Georgia Biostatistics

Page 62: Support Vector Machines

Common Microarray Analytic Approach: Clustering

From: Perou, Brown, Botstein (2000) Molecular Medicine Today

d = 1161 genes

Zoomed to “relevant” gene subsets

Page 63: Support Vector Machines

Interesting Statistical Problem

For HDLSS data: When clusters seem to appear
E.g. found by clustering method
How do we know they are really there?
Question asked by Neil Hayes

Define appropriate statistical significance?

Can we calculate it?

Page 64: Support Vector Machines

First Approaches: Hypo Testing

Idea: See if clusters seem to be

Significantly Different Distributions

Page 65: Support Vector Machines

First Approaches: Hypo Testing

Idea: See if clusters seem to be

Significantly Different Distributions

There Are Several Hypothesis Tests

Most Consistent with Visualization:

Direction, Projection, Permutation

Page 66: Support Vector Machines

DiProPerm Hypothesis Test

Context: 2-sample means
$H_0: \mu_{+1} = \mu_{-1}$ vs. $H_1: \mu_{+1} \neq \mu_{-1}$

(in High Dimensions)

Wei et al (2013)

Page 67: Support Vector Machines

DiProPerm Hypothesis Test

Context: 2-sample means
$H_0: \mu_{+1} = \mu_{-1}$ vs. $H_1: \mu_{+1} \neq \mu_{-1}$

Challenges: Distributional Assumptions, Parameter Estimation

Page 68: Support Vector Machines

DiProPerm Hypothesis Test

Context: 2-sample means
$H_0: \mu_{+1} = \mu_{-1}$ vs. $H_1: \mu_{+1} \neq \mu_{-1}$

Challenges: Distributional Assumptions, Parameter Estimation

HDLSS space is slippery

Page 69: Support Vector Machines

DiProPerm Hypothesis Test

Toy 2-Class Example

See Structure?

Page 70: Support Vector Machines

DiProPerm Hypothesis Test

Toy 2-Class Example

See Structure?

Careful, Only PC1-4

Page 71: Support Vector Machines

DiProPerm Hypothesis Test

Toy 2-Class Example

Structure Looks Real???

Page 72: Support Vector Machines

DiProPerm Hypothesis Test

Toy 2-Class Example

Actually Both Classes Are N(0, I), d = 1000

Page 73: Support Vector Machines

DiProPerm Hypothesis Test

Toy 2-Class Example

Actually Both Classes Are N(0, I), d = 1000

Page 74: Support Vector Machines

DiProPerm Hypothesis Test

Toy 2-Class Example

Separation Is Natural Sampling Variation

Page 75: Support Vector Machines

DiProPerm Hypothesis Test

Context: 2-sample means
$H_0: \mu_{+1} = \mu_{-1}$ vs. $H_1: \mu_{+1} \neq \mu_{-1}$

Challenges: Distributional Assumptions, Parameter Estimation

HDLSS space is slippery

Page 76: Support Vector Machines

DiProPerm Hypothesis Test

Context: 2-sample means
$H_0: \mu_{+1} = \mu_{-1}$ vs. $H_1: \mu_{+1} \neq \mu_{-1}$

Challenges: Distributional Assumptions, Parameter Estimation

Suggested Approach:

Permutation test

Page 77: Support Vector Machines

DiProPerm Hypothesis Test

Suggested Approach:
Find a DIrection (separating classes)

Page 78: Support Vector Machines

DiProPerm Hypothesis Test

Suggested Approach:
Find a DIrection (separating classes)
PROject the data (reduces to 1 dim)

Page 79: Support Vector Machines

DiProPerm Hypothesis Test

Suggested Approach:
Find a DIrection (separating classes)
PROject the data (reduces to 1 dim)
PERMute (class labels, to assess significance, with recomputed direction)
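A minimal sketch of the three steps, using the mean-difference direction (the lecture also uses DWD or SVM directions) and the two-sample t-statistic as the 1-d summary; the data and all settings are invented:

```python
import numpy as np
from scipy import stats

def diproperm(X, y, n_perm=1000, seed=None):
    """DiProPerm sketch: DIrection = mean difference, PROjection to 1-d,
    then PERMutation of class labels with the direction recomputed."""
    rng = np.random.default_rng(seed)

    def t_stat(X, y):
        w = X[y == 1].mean(axis=0) - X[y == -1].mean(axis=0)  # DIrection
        s = X @ w                                             # PROjection
        return stats.ttest_ind(s[y == 1], s[y == -1]).statistic

    observed = t_stat(X, y)
    null = np.array([t_stat(X, rng.permutation(y))            # PERMute
                     for _ in range(n_perm)])
    return observed, (null >= observed).mean()

# Toy HDLSS example: both classes N(0, I) with d = 1000, n = 50, so
# any apparent separation is sampling variation and p should be large.
rng = np.random.default_rng(5)
X = rng.normal(size=(50, 1000))
y = np.repeat([1, -1], 25)
obs, pval = diproperm(X, y, n_perm=200, seed=0)
print(f"observed t = {obs:.2f}, permutation p-value = {pval:.2f}")
```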

Page 80: Support Vector Machines

DiProPerm Hypothesis Test

Page 81: Support Vector Machines

DiProPerm Hypothesis Test

Page 82: Support Vector Machines

DiProPerm Hypothesis Test

Page 83: Support Vector Machines

DiProPerm Hypothesis Test

Page 84: Support Vector Machines

DiProPerm Hypothesis Test

⋮

Repeat this 1,000 times to get:

Page 85: Support Vector Machines

DiProPerm Hypothesis Test

Page 86: Support Vector Machines

DiProPerm Hypothesis Test

Toy 2-Class Example

p-value Not Significant

Page 87: Support Vector Machines

DiProPerm Hypothesis Test

Real Data Example: Autism

Caudate Shape

(sub-cortical brain structure)

Shape summarized by 3-d locations of 1032 corresponding points

Autistic vs. Typically Developing

Page 88: Support Vector Machines

DiProPerm Hypothesis Test

Finds Significant Difference

Despite Weak Visual Impression

Thanks to Josh Cates

Page 89: Support Vector Machines

DiProPerm Hypothesis Test

Also Compare: Developmentally Delayed

No Significant Difference
But Strong Visual Impression

Thanks to Josh Cates

Page 90: Support Vector Machines

DiProPerm Hypothesis Test

Two Examples: Which Is “More Distinct”?

Visually Better Separation?

Thanks to Katie Hoadley

Page 91: Support Vector Machines

DiProPerm Hypothesis Test

Two Examples: Which Is “More Distinct”?

Stronger Statistical Significance!

Thanks to Katie Hoadley

Page 92: Support Vector Machines

DiProPerm Hypothesis Test

Value of DiProPerm:
Visual Impression is Easily Misleading
(onto HDLSS projections, e.g. Maximal Data Piling)
Really Need to Assess Significance
DiProPerm used routinely

(even for variable selection)

Page 93: Support Vector Machines

DiProPerm Hypothesis Test

Choice of Direction:
• Distance Weighted Discrimination (DWD)
• Support Vector Machine (SVM)
• Mean Difference
• Maximal Data Piling

Page 94: Support Vector Machines

DiProPerm Hypothesis Test

Choice of 1-d Summary Statistic:
• 2-sample t-stat
• Mean difference
• Median difference
• Area Under ROC Curve
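Swapping the 1-d summary in the DiProPerm sketch above is a one-line change; hedged examples of the alternatives (here s is the vector of projected scores and y the ±1 labels; these helpers are mine):

```python
import numpy as np

def mean_diff(s, y):
    return s[y == 1].mean() - s[y == -1].mean()

def median_diff(s, y):
    return np.median(s[y == 1]) - np.median(s[y == -1])

def auc(s, y):
    # AUC = P(score of a +1 case exceeds score of a -1 case),
    # computed over all pairs (fine at toy sample sizes).
    return (s[y == 1][:, None] > s[y == -1][None, :]).mean()
```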

Page 95: Support Vector Machines

Interesting Statistical Problem

For HDLSS data: When clusters seem to appear
E.g. found by clustering method
How do we know they are really there?
Question asked by Neil Hayes

Define appropriate statistical significance?

Can we calculate it?

Page 96: Support Vector Machines

First Approaches: Hypo Testing

e.g. Direction, Projection, Permutation
Hypothesis test of: Significant difference between sub-populations
Effective and Accurate
I.e. Sensitive and Specific
There exist several such tests
But critical point is:

What result implies about clusters

Page 97: Support Vector Machines

Clarifying Simple Example

Why Population Difference Tests cannot indicate clustering

Andrew Nobel Observation:
For Gaussian Data (Clearly 1 Cluster!)
Assign Extreme Labels (e.g. by clustering)
Subpopulations are signif’ly different

Page 98: Support Vector Machines

Simple Gaussian Example

Clearly only 1 Cluster in this Example
But Extreme Relabelling looks different
Extreme T-stat strongly significant
Indicates 2 clusters in data

Page 99: Support Vector Machines

Simple Gaussian Example

Results:
Random relabelling T-stat is not significant
But extreme T-stat is strongly significant
This comes from clustering operation
Conclude sub-populations are different

Page 100: Support Vector Machines

Simple Gaussian Example

Results:
Random relabelling T-stat is not significant
But extreme T-stat is strongly significant
This comes from clustering operation
Conclude sub-populations are different
Now see that:

Not the same as clusters really there

Page 101: Support Vector Machines

Simple Gaussian Example

Results:
Random relabelling T-stat is not significant
But extreme T-stat is strongly significant
This comes from clustering operation
Conclude sub-populations are different
Now see that:
Not the same as clusters really there
Need a new approach to study clusters

Page 102: Support Vector Machines

Statistical Significance of Clusters

Basis of SigClust Approach:

What defines: A Single Cluster?
A Gaussian distribution (Sarle & Kou 1993)

Page 103: Support Vector Machines

Statistical Significance of Clusters

Basis of SigClust Approach:

What defines: A Single Cluster?
A Gaussian distribution (Sarle & Kou 1993)

So define SigClust test based on:
• 2-means cluster index (measure) as statistic
• Gaussian null distribution
• Currently compute by simulation
• Possible to do this analytically???

Page 104: Support Vector Machines

SigClust Statistic – 2-Means Cluster Index

Measure of non-Gaussianity: 2-means Cluster Index
Familiar Criterion from k-means Clustering

Page 105: Support Vector Machines

SigClust Statistic – 2-Means Cluster Index

Measure of non-Gaussianity: 2-means Cluster Index
Familiar Criterion from k-means Clustering
Within Class Sum of Squared Distances to Class Means
Prefer to divide (normalize) by Overall Sum of Squared Distances to Mean
Puts on scale of proportions

Page 106: Support Vector Machines

SigClust Statistic – 2-Means Cluster Index

Measure of non-Gaussianity: 2-means Cluster Index:

$$CI = \frac{\sum_{k=1}^{2} \sum_{j \in C_k} \|x_j - \bar{c}_k\|^2}{\sum_{j=1}^{n} \|x_j - \bar{x}\|^2}$$

Class Index Sets $C_k$, Class Means $\bar{c}_k$

“Within Class Var’n” / “Total Var’n”

Page 107: Support Vector Machines

SigClust Gaussian null distribut’n

Which Gaussian (for null)?

Page 108: Support Vector Machines

SigClust Gaussian null distribut’n

Which Gaussian (for null)?

Standard (sphered) normal?
No, not realistic
Rejection not strong evidence for clustering
Could also get that from an aspherical Gaussian

Page 109: Support Vector Machines

SigClust Gaussian null distribut’n

Which Gaussian (for null)?

Standard (sphered) normal?
No, not realistic
Rejection not strong evidence for clustering
Could also get that from an aspherical Gaussian

Need Gaussian more like data: Need Full model
Challenge: Parameter Estimation
Recall HDLSS Context

Page 110: Support Vector Machines

SigClust Gaussian null distribut’n

Estimated Mean (of Gaussian dist’n)?
1st Key Idea: Can ignore this

Page 111: Support Vector Machines

SigClust Gaussian null distribut’n

Estimated Mean (of Gaussian dist’n)?
1st Key Idea: Can ignore this
By appealing to shift invariance of CI:
When Data are (rigidly) shifted, CI remains the same
So enough to simulate with mean 0
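A tiny numerical check of the shift-invariance claim (toy data; the shift vector is arbitrary):

```python
import numpy as np

def cluster_index(X, labels):
    total = np.sum((X - X.mean(axis=0)) ** 2)
    within = sum(np.sum((X[labels == k] - X[labels == k].mean(axis=0)) ** 2)
                 for k in np.unique(labels))
    return within / total

rng = np.random.default_rng(7)
X = rng.normal(size=(40, 5))
labels = rng.integers(0, 2, size=40)
shift = rng.normal(size=5)  # an arbitrary rigid shift
# Shifting every case by the same vector leaves the CI unchanged.
print(np.isclose(cluster_index(X, labels), cluster_index(X + shift, labels)))
```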

Page 112: Support Vector Machines

SigClust Gaussian null distribut’n

Estimated Mean (of Gaussian dist’n)?
1st Key Idea: Can ignore this
By appealing to shift invariance of CI:
When Data are (rigidly) shifted, CI remains the same
So enough to simulate with mean 0
Other uses of invariance ideas?

Page 113: Support Vector Machines

SigClust Gaussian null distribut’n

Challenge: how to estimate cov. matrix Σ?

Page 114: Support Vector Machines

SigClust Gaussian null distribut’n

Challenge: how to estimate cov. matrix Σ?
Number of parameters: $d(d+1)/2$
E.g. Perou 500 data: Dimension $d = 9674$,
so $d(d+1)/2 = 46{,}797{,}975$
But Sample Size $n = 533$

Page 115: Support Vector Machines

SigClust Gaussian null distribut’n

Challenge: how to estimate cov. matrix Σ?
Number of parameters: $d(d+1)/2$
E.g. Perou 500 data: Dimension $d = 9674$,
so $d(d+1)/2 = 46{,}797{,}975$
But Sample Size $n = 533$
Impossible in HDLSS settings????
Way around this problem?

Page 116: Support Vector Machines

SigClust Gaussian null distribut’n

2nd Key Idea: Mod Out Rotations

Page 117: Support Vector Machines

SigClust Gaussian null distribut’n

2nd Key Idea: Mod Out Rotations
Replace full Cov. $\Sigma = M D M^t$ by diagonal matrix $D$
As done in PCA eigen-analysis
But then “not like data”???

Page 118: Support Vector Machines

SigClust Gaussian null distribut’n

2nd Key Idea: Mod Out Rotations
Replace full Cov. $\Sigma = M D M^t$ by diagonal matrix $D$
As done in PCA eigen-analysis
But then “not like data”???
OK, since k-means clustering (i.e. CI) is rotation invariant
(assuming e.g. Euclidean Distance)
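A minimal sketch of this step (toy sizes, not HDLSS scale): eigen-decompose the sample covariance and keep only the diagonal of eigenvalues when simulating the null, which is harmless for the CI because k-means with Euclidean distance is rotation invariant.

```python
import numpy as np

# Toy stand-in for a data matrix (rows = cases, columns = genes).
rng = np.random.default_rng(6)
X = rng.normal(size=(30, 200))

# Eigen-decompose the sample covariance: Sigma_hat = M D M^t.
Xc = X - X.mean(axis=0)
cov = Xc.T @ Xc / (X.shape[0] - 1)
eigvals, M = np.linalg.eigh(cov)  # M rotates; D = diag(eigvals)

# Mod out rotations: drop M and simulate the null from N(0, D) only
# (clamping tiny negative eigenvalues from round-off).
null_X = rng.normal(0.0, np.sqrt(np.maximum(eigvals, 0.0)), size=X.shape)
```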

Page 119: Support Vector Machines

SigClust Gaussian null distribut’n

2nd Key Idea: Mod Out Rotations

Only need to estimate diagonal matrix

But still have HDLSS problems?

Page 120: Support Vector Machines

SigClust Gaussian null distribut’n

2nd Key Idea: Mod Out Rotations

Only need to estimate diagonal matrix

But still have HDLSS problems?

E.g. Perou 500 data:
Dimension $d = 9674$, Sample Size $n = 533$
Still need to estimate $d = 9674$ parameters

Page 121: Support Vector Machines

SigClust Gaussian null distribut’n

3rd Key Idea: Factor Analysis Model