20 features 100 features 1000 features - Stanford University


Page 1: 20 features 100 features 1000 features - Stanford University

Elements of Statistical Learning (2nd Ed.), © Hastie, Tibshirani & Friedman 2009, Chap 18

[Figure 18.1: three boxplot panels, one each for 20, 100, and 1000 features. Vertical axis: Relative Test Error (1.0 to 3.0); horizontal axis: Effective Degrees of Freedom, with values 20/9/2, 99/35/7, and 99/87/43 shown under the three panels.]

FIGURE 18.1. Test-error results for simulation experiments. Shown are boxplots of the relative test errors over 100 simulations, for three different values of p, the number of features. The relative error is the test error divided by the Bayes error, σ². From left to right, results are shown for ridge regression with three different values of the regularization parameter λ: 0.001, 100 and 1000. The (average) effective degrees of freedom in the fit is indicated below each plot.
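For a ridge fit with penalty λ, the effective degrees of freedom referred to in the caption is df(λ) = tr[X(X'X + λI)^(-1) X'] = Σ_j d_j² / (d_j² + λ), where d_j are the singular values of X. A minimal numpy sketch; the design matrix below is simulated and is not the book's simulation setup:

```python
import numpy as np

def ridge_df(X, lam):
    """Effective degrees of freedom of a ridge fit:
    df(lam) = trace(X (X'X + lam*I)^{-1} X') = sum_j d_j^2 / (d_j^2 + lam),
    where d_j are the singular values of X."""
    d = np.linalg.svd(X, compute_uv=False)
    return np.sum(d**2 / (d**2 + lam))

# Hypothetical design: N = 100 samples, p = 20 features.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 20))
for lam in (0.001, 100, 1000):
    print(lam, round(float(ridge_df(X, lam)), 1))
```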

Page 2

[Figure 18.2: plot of the soft-thresholding function; the threshold Δ is marked relative to the origin (0,0).]

FIGURE 18.2. Soft thresholding function sign(x)(|x| − Δ)+ is shown in orange, along with the 45° line in red.
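The soft-thresholding operator in the caption is a one-line function; a quick numpy sketch:

```python
import numpy as np

def soft_threshold(x, delta):
    """sign(x) * (|x| - delta)_+ : shrink toward zero, and cut off at |x| <= delta."""
    return np.sign(x) * np.maximum(np.abs(x) - delta, 0.0)

print(soft_threshold(np.array([-2.0, -0.3, 0.0, 0.5, 3.0]), 1.0))
# [-1. -0.  0.  0.  2.]
```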

Page 3

[Figure 18.3: gene-expression heat map; samples grouped into the four classes BL, EWS, NB, and RMS.]

FIGURE 18.3. Heat-map of the chosen 43 genes. Within each of the horizontal partitions, we have ordered the genes by hierarchical clustering, and similarly for the samples within each vertical partition. Yellow represents over- and blue under-expression.

Page 4

[Figure 18.4, top panel: Misclassification Error (0.0 to 0.8) versus Amount of Shrinkage Δ (0 to 6), with curves for Training, 10-fold CV, and Test error; the number of genes retained (2308, 2059, 1223, 598, 284, 159, 81, 43, 23, 15, 10, 5, 1) is marked along the top. Bottom panels: class centroids for BL, EWS, NB, and RMS, plotted by gene as average expression centered at the overall centroid (−1.0 to 1.0).]

FIGURE 18.4 (Top): Error curves for the SRBCT
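Figure 18.4 (and the 43 chosen genes of Figure 18.3) come from the nearest-shrunken-centroids classifier, which standardizes the contrast between each class centroid and the overall centroid, soft-thresholds it by Δ, and drops any gene whose contrasts shrink to zero in every class. A simplified sketch; it omits the class-size factor m_k and the small offset s0 that the book includes in the standardization, and the data and Δ are placeholders:

```python
import numpy as np

def soft_threshold(x, delta):
    return np.sign(x) * np.maximum(np.abs(x) - delta, 0.0)

def shrunken_centroids(X, y, delta):
    """Simplified nearest-shrunken-centroids fit.
    X: (N, p) expression matrix, y: integer class labels.
    Omits the class-size factor m_k and the offset s0 used in the book."""
    y = np.asarray(y)
    classes = np.unique(y)
    overall = X.mean(axis=0)
    s = X.std(axis=0, ddof=1)                       # per-gene scale, simplified
    cents = np.array([X[y == k].mean(axis=0) for k in classes])
    d = (cents - overall) / s                       # standardized contrasts
    d_shrunk = soft_threshold(d, delta)             # shrink each contrast by delta
    cents_shrunk = overall + d_shrunk * s           # shrunken class centroids
    active = np.any(d_shrunk != 0.0, axis=0)        # genes surviving in some class
    return classes, cents_shrunk, s, active

def predict(Xnew, classes, cents_shrunk, s):
    # classify each row to the nearest shrunken centroid in standardized distance
    dist = ((Xnew[:, None, :] - cents_shrunk[None, :, :]) / s) ** 2
    return classes[np.argmin(dist.sum(axis=2), axis=1)]

# Usage sketch: classes, cents, s, active = shrunken_centroids(X_train, y_train, delta=4.0)
#               y_hat = predict(X_test, classes, cents, s)
```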

Page 5

[Figure 18.5: two coefficient-path panels, Lasso (left) and Elastic Net (right); Coefficients (0.0 to 2.0) versus log(λ) (−8 to −1).]

FIGURE 18.5. Regularized logistic regression paths for the leukemia data. The left panel is the lasso path, the right panel the elastic-net path with α = 0.8. At the ends of the path (extreme left), there are 19 nonzero coefficients for the lasso, and 39 for the elastic net. The averaging effect of the elastic net results in more non-zero coefficients than the lasso, but with smaller magnitudes.
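Paths like those in Figure 18.5 can be traced by refitting an L1- or elastic-net-penalized logistic regression over a grid of λ values. A sketch with scikit-learn; the data are a random stand-in for the leukemia set, the λ grid is a placeholder, and the mapping C ≈ 1/(Nλ) between scikit-learn's C and the path parameter is approximate:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.standard_normal((72, 500))                 # random stand-in for the leukemia data
y = (X[:, :5].sum(axis=1) + rng.standard_normal(72) > 0).astype(int)

lambdas = np.logspace(-1, -4, 10)                  # decreasing regularization (placeholder grid)
lasso_coefs, enet_coefs = [], []
for lam in lambdas:
    C = 1.0 / (len(y) * lam)                       # approximate map to sklearn's C
    lasso = LogisticRegression(penalty="l1", C=C, solver="liblinear").fit(X, y)
    enet = LogisticRegression(penalty="elasticnet", l1_ratio=0.8, C=C,
                              solver="saga", max_iter=5000).fit(X, y)
    lasso_coefs.append(lasso.coef_.ravel())
    enet_coefs.append(enet.coef_.ravel())

print("nonzero at end of path:",
      np.sum(lasso_coefs[-1] != 0), "(lasso)",
      np.sum(enet_coefs[-1] != 0), "(elastic net)")
```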

Page 6

[Figure 18.6: two panels versus log(λ) (−8 to −1); left: Misclassification Error (0.0 to 0.4), right: Deviance (0 to 30); curves for Training, Test, and 10-fold CV.]

FIGURE 18.6. Training, test, and 10-fold cross-validation curves for lasso logistic regression on the leukemia data. The left panel shows misclassification errors; the right panel shows deviance.
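Curves like the cross-validation curves in Figure 18.6 can be approximated by scoring each penalized fit with 10-fold CV. A sketch on the same kind of synthetic stand-in data as above, reporting misclassification error and per-sample binomial deviance (twice the log loss); the λ values are placeholders:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.standard_normal((72, 500))        # random stand-in for the leukemia data
y = (X[:, :5].sum(axis=1) + rng.standard_normal(72) > 0).astype(int)

for lam in (0.1, 0.01, 0.001):            # placeholder grid of penalty values
    C = 1.0 / (len(y) * lam)
    model = LogisticRegression(penalty="l1", C=C, solver="liblinear")
    err = 1.0 - cross_val_score(model, X, y, cv=10, scoring="accuracy").mean()
    dev = -2.0 * cross_val_score(model, X, y, cv=10, scoring="neg_log_loss").mean()
    print(f"lambda={lam:g}  CV misclassification={err:.3f}  CV deviance/sample={dev:.3f}")
```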

Page 7

[Figure 18.7: Intensity (10 to 40) versus m/z (2×10³ to 2×10⁵, log scale); average profiles for Normal and Cancer.]

FIGURE 18.7. Protein mass spectrometry data: average profiles from normal and prostate cancer patients.

Page 8

[Figure 18.8: log2 ratio (−2 to 4) versus Genome order (0 to 1000).]

FIGURE 18.8. Fused lasso applied to CGH data. Each point represents the copy-number of a gene in a tumor sample, relative to that of a control (on the log base-2 scale).
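The fused-lasso fit behind Figure 18.8 minimizes a squared-error loss plus L1 penalties on both the coefficients and their successive differences: sum_i (y_i − β_i)² + λ1 · sum_i |β_i| + λ2 · sum_i |β_i − β_{i−1}|. A sketch using cvxpy (an assumption; any convex solver would do) with an explicit first-difference matrix; the penalty weights and the piecewise-constant test signal are placeholders:

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
n = 200
truth = np.zeros(n); truth[60:90] = 2.0; truth[140:160] = -1.5   # piecewise-constant "copy number"
y = truth + 0.5 * rng.standard_normal(n)                         # stand-in for log2-ratio data

D = np.eye(n - 1, n, k=1) - np.eye(n - 1, n)     # first-difference matrix: (D @ beta)_i = beta_{i+1} - beta_i
beta = cp.Variable(n)
lam1, lam2 = 0.1, 2.0                            # placeholder penalty weights
objective = cp.Minimize(cp.sum_squares(y - beta)
                        + lam1 * cp.norm1(beta)
                        + lam2 * cp.norm1(D @ beta))
cp.Problem(objective).solve()

print("fitted jumps at positions:", np.where(np.abs(np.diff(beta.value)) > 1e-3)[0])
```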

Page 9

[Figure 18.9: "ROC Curves for String Kernel"; Sensitivity versus Specificity (each 0.0 to 1.0); legend: SVM 0.84, Nearest Centroid 0.84, One-Nearest Neighbor 0.86.]

FIGURE 18.9. Cross-validated ROC curves for the protein example using the string kernel. The numbers next to each method in the legend give the area under the curve, an overall measure of accuracy. The SVM achieves better sensitivities than the other two, which achieve better specificities.
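With a string kernel only the Gram matrix of pairwise similarities is available, but an SVM can still be trained on it as a precomputed kernel, and a cross-validated ROC curve built from the held-out decision values. A sketch; the Gram matrix below is a random positive semi-definite stand-in, not an actual string kernel, and the labels are simulated:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.default_rng(0)
n = 100
B = rng.standard_normal((n, 20))
K = B @ B.T                                      # stand-in for a string-kernel Gram matrix
y = (B[:, 0] + 0.5 * rng.standard_normal(n) > 0).astype(int)

scores = np.zeros(n)
for train, test in StratifiedKFold(n_splits=10, shuffle=True, random_state=0).split(K, y):
    svm = SVC(kernel="precomputed").fit(K[np.ix_(train, train)], y[train])
    scores[test] = svm.decision_function(K[np.ix_(test, train)])   # held-out decision values

fpr, tpr, _ = roc_curve(y, scores)               # points on the cross-validated ROC curve
print("cross-validated AUC:", round(roc_auc_score(y, scores), 2))
```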

Page 10

[Figure 18.10: three panels of scores (BE, HT, JF) for the top 20 words: predictive, bayes, using, than, algorithm, are, procedure, technology, values, accuracy, variables, when, inference, those, bayesian, frequentist, propose, presented, method, problems.]

FIGURE 18.10. Abstracts example: top 20 scores from nearest shrunken centroids. Each score is the standardized difference in frequency for the word in the given class (BE, HT or JF) versus all classes. Thus a positive score (to the right of the vertical grey zero lines) indicates a higher frequency in that class; a negative score indicates a lower relative frequency.

Page 11

[Figure 18.11: Patient (1 to 4) versus Time in days (0, 100, 200, 300, 365).]

FIGURE 18.11. Censored survival data. For illustration there are four patients. The first and third patients die before the study ends. The second patient is alive at the end of the study (365 days), while the fourth patient is lost to follow-up before the study ends. For example, this patient might have moved out of the country. The survival times for patients two and four are said to be "censored."

Page 12

[Figure 18.12: "Survival Function"; Pr(T ≥ t) (0.0 to 1.0) versus Months t (0 to 20), with tick marks along the curve.]

FIGURE 18.12. Lymphoma data. The Kaplan–Meier estimate of the survival function for the 80 patients in the test set, along with one-standard-error curves. The curve estimates the probability of surviving past t months. The ticks indicate censored observations.
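The Kaplan–Meier estimate plotted in Figure 18.12 is the product over observed event times of (1 − d_t/n_t), where d_t is the number of deaths at time t and n_t the number still at risk. A minimal numpy sketch; the four-patient toy data echo Figure 18.11 and are illustrative only:

```python
import numpy as np

def kaplan_meier(time, event):
    """Kaplan-Meier estimate S(t) = prod_{t_i <= t} (1 - d_i / n_i).
    time: follow-up time; event: 1 = death observed, 0 = censored."""
    time, event = np.asarray(time, float), np.asarray(event, int)
    surv, s = [], 1.0
    for t in np.unique(time[event == 1]):        # distinct observed event times
        n_at_risk = np.sum(time >= t)            # still being followed at time t
        d = np.sum((time == t) & (event == 1))   # deaths at time t
        s *= 1.0 - d / n_at_risk
        surv.append((t, s))
    return surv

# Toy data in the spirit of Figure 18.11: patients 1 and 3 die,
# patient 2 is alive at 365 days, patient 4 is lost to follow-up.
print(kaplan_meier([120, 365, 250, 180], [1, 0, 1, 0]))
# [(120.0, 0.75), (250.0, 0.375)]
```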

Page 13

[Figure 18.13: schematic of two groups, Poor Cell Type and Good Cell Type, arrayed along Survival Time.]

FIGURE 18.13. Underlying conceptual model for supervised principal components. There are two cell types, and patients with the good cell type live longer on the average. Supervised principal components estimate the cell type, by averaging the expression of genes that reflect it.

Page 14

[Figure 18.14, left: heat map of Genes (index labels 1, 27, 50, 7350, 7399) by Patients (1 to 160), with the Supervised PC bar at the top and the Absolute Cox Score (0 to 4) in the middle column. Right: three Kaplan–Meier panels of Probability of Survival (0.0 to 1.0) versus Months (0 to 20), splitting patients into low-score and high-score groups: Best Single Gene (P=0.15), Supervised Principal Component with 27 Genes (P=0.006), Principal Component with 7399 Genes (P=0.14).]

FIGURE 18.14. Supervised principal components on the lymphoma data. The left panel shows a heatmap of a subset of the gene-expression training data. The rows are ordered by the magnitude of the univariate Cox-score, shown in the middle vertical column. The top 50 and bottom 50 genes are shown. The supervised principal component uses the top 27 genes (chosen by 10-fold CV). It is represented by the bar at the top of
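The supervised principal component of Figure 18.14 is built by scoring each gene with a univariate measure of association with the outcome (a Cox score in the book), keeping the genes above a threshold, and taking the leading principal component of the reduced expression matrix. A simplified sketch that swaps in a plain standardized-covariance score for a continuous outcome, since the Cox partial-likelihood score needs censored survival data; the data and the cutoff of 27 genes are placeholders:

```python
import numpy as np

def supervised_pc(X, y, n_keep=27):
    """Simplified supervised principal components.
    Screens features by an absolute univariate (standardized-covariance) score,
    then returns the first principal component of the retained columns.
    The book uses a univariate Cox score for censored survival outcomes."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    score = np.abs(Xc.T @ yc) / (Xc.std(axis=0, ddof=1) * np.sqrt(len(y)))
    keep = np.argsort(score)[::-1][:n_keep]              # top-scoring genes
    U, d, Vt = np.linalg.svd(Xc[:, keep], full_matrices=False)
    return U[:, 0] * d[0], keep                          # first PC scores, selected genes

# Hypothetical data with the same shape as the lymphoma example (160 patients, 7399 genes).
rng = np.random.default_rng(0)
X = rng.standard_normal((160, 7399))
y = X[:, :27].mean(axis=1) + rng.standard_normal(160)   # hypothetical continuous outcome
pc, genes = supervised_pc(X, y, n_keep=27)
```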

Page 15

FIGURE 18.15. Heatmap of the outcome (left column) and first 500 genes from a realization from model (18.37). The genes are in the columns, and the samples are in the rows.

Page 16

[Figure 18.16: Relative Root Mean Square Test Error (1.00 to 1.25) versus Number of Features (0 to 300, then 5000); curves for Thresholded PLS and Supervised Principal Components.]

FIGURE 18.16. Root mean squared test error (± one standard error), for supervised principal components and thresholded PLS on 100 realizations from model (18.37). All methods use one component, and the errors are relative to the noise standard deviation (the Bayes error is 1.0). For both methods, different values for the filtering threshold were tried and the number of features retained is shown on the horizontal axis. The extreme right points correspond to regular principal components and partial least squares, using all the genes.

Page 17

[Figure 18.17: Mean Test Error (3.0 to 4.4) versus Number of Features in Model (0 to 250); curves for Lasso, Supervised Principal Components, and Preconditioned Lasso.]

FIGURE 18.17. Test errors for the lasso, supervised principal components, and pre-conditioned lasso, for one realization from model (18.37). Each model is indexed by the number of non-zero features. The supervised principal component path is truncated at 250 features. The lasso self-truncates at 100, the sample size (see Section 18.4). In this case, the pre-conditioned lasso achieves the lowest error with about 25 features.
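The pre-conditioned lasso replaces the outcome with fitted values from an initial, less noisy estimate (here a supervised-principal-component fit) and then runs the lasso against those fitted values. A sketch with scikit-learn; the data, the screening size of 50 genes, and the penalty value are all placeholders:

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.default_rng(0)
N, p = 100, 1000
X = rng.standard_normal((N, p))
y = X[:, :10].sum(axis=1) + 2.0 * rng.standard_normal(N)   # simulated outcome

# Step 1: initial estimate via a (simplified) supervised principal component.
score = np.abs((X - X.mean(0)).T @ (y - y.mean()))          # univariate screening score
keep = np.argsort(score)[::-1][:50]                          # top 50 screened features (placeholder)
pc = np.linalg.svd(X[:, keep] - X[:, keep].mean(0), full_matrices=False)[0][:, 0]
y_hat = LinearRegression().fit(pc[:, None], y).predict(pc[:, None])

# Step 2: lasso on the pre-conditioned (fitted) outcome rather than on y itself.
pre_lasso = Lasso(alpha=0.05, max_iter=10000).fit(X, y_hat)
plain_lasso = Lasso(alpha=0.05, max_iter=10000).fit(X, y)
print("nonzero coefficients:",
      np.sum(pre_lasso.coef_ != 0), "(pre-conditioned)",
      np.sum(plain_lasso.coef_ != 0), "(plain lasso)")
```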

Page 18

[Figure 18.18: histogram of t-statistics (−4 to 4); counts from 0 to 800.]

FIGURE 18.18. Radiation sensitivity microarray example. A histogram of the 12,625 t-statistics comparing the radiation-sensitive versus insensitive groups. Overlaid in blue is the histogram of the t-statistics from 1000 permutations of the sample labels.
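The permutation null overlaid in Figure 18.18 is obtained by shuffling the group labels and recomputing all gene-wise two-sample t-statistics for each shuffle. A numpy sketch with simulated data and hypothetical group sizes; the figure uses 1000 permutations over 12,625 genes, while the sketch uses 100 permutations to keep it quick:

```python
import numpy as np

def tstats(X, labels):
    """Gene-wise two-sample t-statistics (pooled variance) for a 0/1 label vector."""
    a, b = X[labels == 1], X[labels == 0]
    na, nb = len(a), len(b)
    sp2 = ((na - 1) * a.var(axis=0, ddof=1) + (nb - 1) * b.var(axis=0, ddof=1)) / (na + nb - 2)
    return (a.mean(axis=0) - b.mean(axis=0)) / np.sqrt(sp2 * (1 / na + 1 / nb))

rng = np.random.default_rng(0)
X = rng.standard_normal((58, 12625))          # stand-in expression matrix, 12,625 genes
labels = np.array([1] * 30 + [0] * 28)        # hypothetical group sizes

observed = tstats(X, labels)
null = np.concatenate([tstats(X, rng.permutation(labels)) for _ in range(100)])
# 'observed' and 'null' give the two histograms overlaid in the figure.
```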

Page 19

[Figure 18.19: p-value (5×10⁻⁶ to 5×10⁻³, log scale) versus Genes ordered by p-value (1 to 100, log scale); the ordered p-values are plotted as points.]

FIGURE 18.19. Microarray example continued. Shown is a plot of the ordered p-values p(j) and the line 0.15 · (j/12,625), for the Benjamini–Hochberg method. The largest j for which the p-value p(j) falls below the line gives the BH threshold. Here this occurs at j = 11, indicated by the vertical line. Thus the BH method calls significant the 11 genes (in red) with smallest p-values.
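The Benjamini–Hochberg rule in the caption finds the largest j with p_(j) <= α · j/M and calls the j smallest p-values significant. A minimal sketch at the α = 0.15 level used in the figure; the p-values themselves are simulated:

```python
import numpy as np

def benjamini_hochberg(pvals, alpha=0.15):
    """Return a boolean mask of the hypotheses rejected at FDR level alpha."""
    p = np.asarray(pvals)
    M = len(p)
    order = np.argsort(p)
    below = p[order] <= alpha * (np.arange(1, M + 1) / M)   # p_(j) <= alpha * j / M
    reject = np.zeros(M, dtype=bool)
    if below.any():
        j_max = np.max(np.where(below)[0])                  # largest j passing the line
        reject[order[: j_max + 1]] = True                   # reject the j smallest p-values
    return reject

# Simulated p-values; the figure uses alpha = 0.15 and M = 12,625 genes.
rng = np.random.default_rng(0)
pvals = np.concatenate([rng.uniform(0, 1e-4, 15), rng.uniform(size=12610)])
print("genes called significant:", benjamini_hochberg(pvals, alpha=0.15).sum())
```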
