Page 1

Survival-Time Classification of Breast Cancer Patients

DIMACS Workshop on Data Mining and Scalable Algorithms

August 22-24, 2001 - Rutgers University

Y.-J. Lee, O. L. Mangasarian & W.H. Wolberg

Second Annual Review

June 1, 2001

Data Mining Institute

University of Wisconsin - Madison

Page 2

American Cancer Society 2001 Breast Cancer Estimates

Breast cancer, the most common cancer among women, is the second leading cause of cancer deaths in women (after lung cancer)

192,200 new cases of breast cancer in women will be diagnosed in the United States

40,600 deaths will occur from breast cancer (40,200 among women, 400 among men) in the United States

According to the World Health Organization, more than 1.2 million people will be diagnosed with breast cancer this year worldwide

Page 3

Key Objective

Identify breast cancer patients for whom adjuvant chemotherapy prolongs survival time

Main Difficulty: Cannot carry out comparative tests on human subjects
  Similar patients must be treated similarly

Our Approach: Classify patients into Good, Intermediate & Poor groups
  Classification based on: 5 cytological features plus tumor size
  Classification criteria: Tumor size & lymph node status

Page 4

Principal Results
For 253 Breast Cancer Patients

All 69 patients in the Good group:
  Had the best survival rate
  Had no chemotherapy

All 73 patients in the Poor group:
  Had the worst survival rate
  Had chemotherapy

For the 111 patients in the Intermediate group:
  The 67 patients who had chemotherapy had a better survival rate than the 44 patients who did not have chemotherapy

This last result reverses the role of chemotherapy seen in both the overall population and the Good & Poor groups

Page 5

Outline

Tools used
  Support vector machines (SVMs): feature selection and classification
  Clustering: k-Median (k-Mean fails!)

Cluster chemo patients into chemo-good & chemo-poor

Cluster no-chemo patients into no-chemo-good & no-chemo-poor

Three final classes
  Good = No-chemo good
  Poor = Chemo poor
  Intermediate = Remaining patients

Generate survival curves for the three classes

Use SVM to classify new patients into one of the above three classes

Page 6

Support Vector Machines Used in this Work

Feature selection: SVM with 1-norm approach, SVM$_{\|\cdot\|_1}$:

$$\min_{w,\gamma,y} \; \nu e'y + \|w\|_1 \quad \text{s.t.} \quad D(Aw - e\gamma) + y \ge e, \quad y \ge 0,$$

where $D_{ii} = \pm 1$ denotes Lymph node > 0 or Lymph node = 0

6 out of 31 features selected:
  5 out of 30 cytological features that describe nuclear size, shape and texture
  Tumor size

Classification: Use SSVMs with Gaussian kernel
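The 1-norm SVM on this slide is a linear program, so it can be handed to any LP solver. The sketch below is one possible way to do that, not the authors' implementation: it assumes NumPy and SciPy are available, splits w into nonnegative parts p and q (a standard device for the 1-norm), and uses a tiny made-up data set in which only the first feature separates the classes. The function name `one_norm_svm` and the toy arrays are illustrative.

```python
import numpy as np
from scipy.optimize import linprog


def one_norm_svm(A, d, nu=1.0):
    """Solve  min nu*e'y + ||w||_1  s.t.  D(Aw - e*gamma) + y >= e,  y >= 0.

    A is the m-by-n data matrix and d holds the +1/-1 diagonal of D.
    """
    A = np.asarray(A, dtype=float)
    d = np.asarray(d, dtype=float)
    m, n = A.shape
    DA = d[:, None] * A  # rows of A scaled by their labels, i.e. D A
    # Variable order: [p (n), q (n), gamma (1), y (m)], with w = p - q.
    c = np.concatenate([np.ones(2 * n), [0.0], nu * np.ones(m)])
    # D(A(p - q) - e*gamma) + y >= e   <=>   -DA p + DA q + d*gamma - y <= -e
    A_ub = np.hstack([-DA, DA, d[:, None], -np.eye(m)])
    b_ub = -np.ones(m)
    bounds = [(0, None)] * (2 * n) + [(None, None)] + [(0, None)] * m
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    w = res.x[:n] - res.x[n:2 * n]
    gamma = res.x[2 * n]
    return w, gamma


# Made-up toy data: feature 0 separates the classes, feature 1 is noise.
A_toy = np.array([[2.0, 0.3], [3.0, -0.2], [2.5, 0.1],
                  [-2.0, 0.2], [-3.0, -0.3], [-2.5, 0.0]])
d_toy = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
w_toy, gamma_toy = one_norm_svm(A_toy, d_toy, nu=1.0)
selected = np.flatnonzero(np.abs(w_toy) > 1e-6)  # indices of features kept
```

The 1-norm penalty tends to drive components of w to exactly zero, which is why the nonzero components can be read off as the selected features.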

Page 7

Clustering in Data Mining

General Objective

Given: A dataset of m points in n-dimensional real space

Problem: Extract hidden distinct properties by clustering the dataset

Page 8

Concave Minimization Formulation of Clustering Problem

Given: Set A of m points in $R^n$ represented by the matrix $A \in R^{m \times n}$, and a number k of desired clusters

Problem: Determine centers $C_\ell$, $\ell = 1,\ldots,k$, in $R^n$ such that the sum of the minima over $\ell \in \{1,\ldots,k\}$ of the 1-norm distance between each point $A_i$, $i = 1,\ldots,m$, and cluster centers $C_\ell$, $\ell = 1,\ldots,k$, is minimized

Objective: Sum of m minima of k linear functions, hence it is piecewise-linear concave

Difficulty: Minimizing a general piecewise-linear concave function over a polyhedral set is NP-hard
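To make the quantity being minimized concrete, the following minimal sketch evaluates the clustering objective for a given set of centers: for each point, take the smallest 1-norm distance to any center, then sum over points. The function names and the toy points are illustrative assumptions, not part of the slides.

```python
def one_norm(u, v):
    """1-norm distance between two points of equal dimension."""
    return sum(abs(a - b) for a, b in zip(u, v))


def clustering_objective(points, centers):
    """Sum over points of the 1-norm distance to the closest center."""
    return sum(min(one_norm(p, c) for c in centers) for p in points)


# Made-up example: two tight groups and one center placed in each group.
toy_points = [(0, 0), (1, 0), (10, 10), (11, 10)]
toy_centers = [(0, 0), (10, 10)]
toy_value = clustering_objective(toy_points, toy_centers)
```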

Page 9

Clustering via Concave Minimization

Minimize the sum of 1-norm distances between each data point $A_i$ and the closest cluster center $C_\ell$:

$$\min_{C_\ell,\, D_{i\ell}} \; \sum_{i=1}^{m} \min_{\ell = 1,\ldots,k} \{ e'D_{i\ell} \} \quad \text{s.t.} \quad -D_{i\ell} \le A_i' - C_\ell \le D_{i\ell}, \quad i = 1,\ldots,m, \; \ell = 1,\ldots,k$$

Reformulation:

$$\min_{C_\ell,\, D_{i\ell} \in R^n,\, T_{i\ell} \in R} \; \sum_{i=1}^{m} \sum_{\ell=1}^{k} T_{i\ell}\, e'D_{i\ell} \quad \text{s.t.} \quad -D_{i\ell} \le A_i' - C_\ell \le D_{i\ell}, \quad \sum_{\ell=1}^{k} T_{i\ell} = 1, \quad T_{i\ell} \ge 0, \quad i = 1,\ldots,m, \; \ell = 1,\ldots,k$$
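A short note on why the two formulations on this slide share the same optimal value, written in the slide's notation (the starred symbol $T^*$ is introduced here only for the argument). For fixed $C_\ell$ and $D_{i\ell}$, the problem in $T$ separates over $i$, and each piece is a linear function minimized over a simplex, so its minimum is attained at a vertex:

```latex
\min_{T_{i1},\ldots,T_{ik}}
  \Big\{ \sum_{\ell=1}^{k} T_{i\ell}\, e'D_{i\ell}
  \;:\; \sum_{\ell=1}^{k} T_{i\ell} = 1,\; T_{i\ell} \ge 0 \Big\}
  \;=\; \min_{\ell = 1,\ldots,k} e'D_{i\ell},
\qquad
T^{*}_{i\ell} =
  \begin{cases}
    1 & \text{for one } \ell \text{ attaining the minimum},\\
    0 & \text{otherwise}.
  \end{cases}
```

Summing over $i = 1,\ldots,m$ recovers the first objective, so $T_{i\ell}$ acts as an assignment of point $A_i$ to its closest center.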

Page 10

Finite K-Median Clustering Algorithm
(Minimizing Piecewise-linear Concave Function)

Step 0 (Initialization): Given k initial cluster centers
  Different initial centers will lead to different clusters

Step 1 (Cluster Assignment): Assign points to the cluster with the nearest cluster center in 1-norm

Step 2 (Center Update): Recompute location of center for each cluster as the cluster median (closest point to all cluster points in 1-norm)

Step 3 (Stopping Criterion): Stop if the cluster centers are unchanged, else go to Step 1
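Steps 0 to 3 can be sketched in a few lines of standard-library Python. This is a minimal illustration under stated assumptions rather than the authors' code: the center update uses the coordinate-wise median (which minimizes the sum of 1-norm distances to the cluster's points), an empty cluster simply keeps its previous center, and `max_iter` is a safety cap added here. The function names and toy points are made up.

```python
from statistics import median


def one_norm(u, v):
    """1-norm distance between two points of equal dimension."""
    return sum(abs(a - b) for a, b in zip(u, v))


def k_median(points, centers, max_iter=100):
    """Run the k-Median iteration from the given initial centers (Step 0)."""
    centers = [tuple(c) for c in centers]
    labels = []
    for _ in range(max_iter):
        # Step 1: assign each point to the nearest center in the 1-norm.
        labels = [min(range(len(centers)), key=lambda l: one_norm(p, centers[l]))
                  for p in points]
        # Step 2: move each center to the coordinate-wise median of its cluster.
        new_centers = []
        for l, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == l]
            if members:
                new_centers.append(tuple(median(col) for col in zip(*members)))
            else:
                new_centers.append(old)  # assumption: keep an empty cluster's center
        # Step 3: stop when the centers no longer change.
        if new_centers == centers:
            break
        centers = new_centers
    return centers, labels


# Made-up toy data with two obvious groups.
toy_points = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10), (10, 11)]
toy_centers, toy_labels = k_median(toy_points, [(1, 0), (11, 10)])
```

Because the result depends on the starting centers, as Step 0 notes, the choice of initial centers matters as much as the iteration itself.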

Page 11

Clustering Process: Feature Selection & Initial Cluster Centers

6 out of 31 features selected by a linear SVM (SVM$_{\|\cdot\|_1}$)
  SVM separating lymph node positive (Lymph > 0) from lymph node negative (Lymph = 0)

Perform k-Median algorithm in 6-dimensional feature space

Initial cluster centers used: Medians of Good1 & Poor1
  Good1: Patients with Lymph = 0 AND Tumor < 2
  Poor1: Patients with Lymph > 4 OR Tumor ≥ 4 (typical indicator for chemotherapy)
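The initial centers described on this slide can be computed as follows. This is a sketch under assumptions: the record layout (`lymph`, `tumor`, `features` keys) and the function names are hypothetical, the patient records are made up, and two features stand in for the six selected ones to keep the example short.

```python
from statistics import median


def feature_median(rows):
    """Coordinate-wise median of a non-empty list of feature vectors."""
    if not rows:
        raise ValueError("no patients satisfy the rule")
    return tuple(median(col) for col in zip(*rows))


def initial_centers(patients):
    """Return (Good1 median, Poor1 median) using the rules on the slide."""
    good1 = [p["features"] for p in patients
             if p["lymph"] == 0 and p["tumor"] < 2]
    poor1 = [p["features"] for p in patients
             if p["lymph"] > 4 or p["tumor"] >= 4]
    return feature_median(good1), feature_median(poor1)


# Made-up records; real feature vectors would have six entries.
toy_patients = [
    {"lymph": 0, "tumor": 1.0, "features": (1.0, 2.0)},
    {"lymph": 0, "tumor": 1.5, "features": (3.0, 2.0)},
    {"lymph": 0, "tumor": 1.2, "features": (2.0, 5.0)},
    {"lymph": 6, "tumor": 2.0, "features": (8.0, 9.0)},
    {"lymph": 1, "tumor": 4.5, "features": (10.0, 7.0)},
    {"lymph": 2, "tumor": 3.0, "features": (5.0, 5.0)},  # in neither group
]
good_center, poor_center = initial_centers(toy_patients)
```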

Page 12

Clustering Process

253 Patients (113 NoChemo, 140 Chemo)

Compute Initial Cluster Centers
  Good1: Lymph=0 AND Tumor<2; Compute Median Using 6 Features
  Poor1: Lymph>=5 OR Tumor>=4; Compute Median Using 6 Features

Cluster 113 NoChemo Patients: Use k-Median Algorithm with Initial Centers: Medians of Good1 & Poor1
  69 NoChemo Good
  44 NoChemo Poor

Cluster 140 Chemo Patients: Use k-Median Algorithm with Initial Centers: Medians of Good1 & Poor1
  67 Chemo Good
  73 Chemo Poor

Final groups
  Good: 69 NoChemo Good
  Intermediate: 44 NoChemo Poor & 67 Chemo Good
  Poor: 73 Chemo Poor
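The rule that merges the four clusters into three final groups is small enough to write down directly. In the sketch below the function name and the string labels are illustrative; the counts are the ones given on this slide, and the tally simply checks that they add up to 253.

```python
from collections import Counter


def final_class(had_chemo, cluster):
    """Map (chemotherapy status, k-Median cluster) to the final group."""
    if not had_chemo and cluster == "good":
        return "Good"
    if had_chemo and cluster == "poor":
        return "Poor"
    return "Intermediate"  # NoChemo Poor or Chemo Good


# Cluster sizes reported on the slide: (had_chemo, cluster) -> count.
cluster_sizes = {
    (False, "good"): 69,
    (False, "poor"): 44,
    (True, "good"): 67,
    (True, "poor"): 73,
}
group_sizes = Counter()
for (had_chemo, cluster), count in cluster_sizes.items():
    group_sizes[final_class(had_chemo, cluster)] += count
```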

Page 13

Survival Curves for Good, Intermediate & Poor Groups
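The transcript does not say how the survival curves were estimated. One common choice, assumed here purely for illustration, is the Kaplan-Meier product-limit estimator, which handles patients whose follow-up ended before an event was observed. The function name, argument layout and the follow-up data below are made up.

```python
def kaplan_meier(times, observed):
    """Return [(t, S(t))] at each time t where at least one event occurred.

    times[i] is the follow-up time of patient i; observed[i] is True if the
    event was seen at that time and False if the patient was censored.
    """
    data = sorted(zip(times, observed))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        events = 0
        leaving = 0
        # Gather everyone whose follow-up ends exactly at time t.
        while i < len(data) and data[i][0] == t:
            events += 1 if data[i][1] else 0
            leaving += 1
            i += 1
        if events:
            survival *= 1 - events / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve


# Made-up follow-up data: three observed events and two censored patients.
toy_curve = kaplan_meier([1, 2, 2, 3, 4], [True, True, False, True, False])
```

Running the estimator once per group (Good, Intermediate, Poor) would give one curve per group to plot against time.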

Page 14

Survival Curves for Intermediate Group: Split by Chemo & NoChemo

Page 15

Survival Curves for All Patients: Split by Chemo & NoChemo

Page 16

Survival Curves for Intermediate Group: Split by Lymph Node & Chemotherapy

Page 17

Survival Curves for All Patients: Split by Lymph Node Positive & Negative

Page 18

Nonlinear SVM Classifier
82.7% Tenfold Test Correctness

Four groups from the clustering result: Good, Intermediate (ChemoGood), Intermediate (NoChemoPoor), Poor

Good2: Good & ChemoGood

Poor2: NoChemoPoor & Poor

First SVM: separates Good2 from Poor2
  Good2 branch: Compute LI(x) & CI(x), then an SVM separates Good from Intermediate
  Poor2 branch: Compute LI(x) & CI(x), then an SVM separates Poor from Intermediate
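The two-stage structure of this classifier can be sketched as a small cascade of Gaussian-kernel decision functions. Everything concrete below is an assumption made for illustration: the kernel width `mu`, the support points and coefficients, and the function names are invented, and the "Compute LI(x) & CI(x)" step is left out because the transcript does not define those quantities. The sketch also does not train anything; it only shows how three already-trained classifiers would be combined.

```python
import math


def gaussian_kernel(u, v, mu=0.1):
    """Gaussian kernel exp(-mu * ||u - v||^2)."""
    return math.exp(-mu * sum((a - b) ** 2 for a, b in zip(u, v)))


def kernel_classifier(points, coeffs, gamma, mu=0.1):
    """Build f(x) = sum_i coeffs[i] * K(x, points[i]) - gamma."""
    def f(x):
        return sum(c * gaussian_kernel(x, p, mu)
                   for c, p in zip(coeffs, points)) - gamma
    return f


def classify_patient(x, top_svm, good_svm, poor_svm):
    """First split Good2 vs Poor2, then refine within the chosen branch."""
    if top_svm(x) > 0:  # Good2 branch
        return "Good" if good_svm(x) > 0 else "Intermediate"
    return "Poor" if poor_svm(x) > 0 else "Intermediate"  # Poor2 branch


# Hypothetical one-dimensional classifiers standing in for trained SVMs.
top = kernel_classifier([(0.0,), (10.0,)], [1.0, -1.0], 0.0)
good = kernel_classifier([(0.0,), (4.0,)], [1.0, -1.0], 0.0)
poor = kernel_classifier([(10.0,), (6.0,)], [1.0, -1.0], 0.0)
toy_labels = [classify_patient((v,), top, good, poor)
              for v in (1.0, 4.0, 6.0, 9.0)]
```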

Page 19

Conclusion

Used five features from a fine needle aspirate & tumor size to cluster breast cancer patients into 3 groups:
  Good – No chemotherapy recommended
  Intermediate – Chemotherapy likely to prolong survival
  Poor – Chemotherapy may or may not enhance survival

3 groups have very distinct survival curves

First categorization of a breast cancer group for which chemotherapy enhances longevity

Prescribed an SVM classification procedure to classify new patients into one of the above three groups

Page 20

Simplest Support Vector Machine
Linear Surface Maximizing the Margin

Bounding planes $x'w = \gamma + 1$ and $x'w = \gamma - 1$, with normal $w$, separating the point sets A+ and A-

Margin $= \frac{2}{\|w\|_2}$
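For readers who want the missing step, the margin value follows from the distance between the two parallel bounding planes. The points $x_1$, $x_2$ and the scalar $t$ below are introduced only for this derivation. Take $x_1$ on the plane $x'w = \gamma + 1$ and move along the normal to $x_2 = x_1 - t\,w$ on the plane $x'w = \gamma - 1$:

```latex
x_2'w = x_1'w - t\, w'w = (\gamma + 1) - t\,\|w\|_2^2 = \gamma - 1
\;\Longrightarrow\;
t = \frac{2}{\|w\|_2^2},
\qquad
\|x_1 - x_2\|_2 = t\,\|w\|_2 = \frac{2}{\|w\|_2}.
```

Maximizing the margin is therefore the same as minimizing $\|w\|_2$ subject to the planes bounding the two sets.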

Page 21

Key Objective

Identify breast cancer patients for whom adjuvant chemotherapy prolongs survival time

Main Difficulty: Cannot carry out comparative tests on human subjects
  Similar patients must be treated similarly

Our Approach: Classify patients into good, intermediate & poor groups
  Characterize classes by: Tumor size & lymph node status
  Classification based on: 5 cytological features plus tumor size

Page 22

Clustering Process: Feature Selection & Initial Cluster Centers

6 out of 31 features selected by a linear SVM
  SVM separating lymph node positive (Lymph>0) from lymph node negative (Lymph=0)

Clustering performed in 6-dimensional feature space

Initial cluster centers used:
  Good: Median in 6-dimensional space of patients with Lymph=0 AND Tumor<2
  Poor: Median in 6-dimensional space of patients with Lymph>4 OR Tumor>=4 (typical indicator for chemotherapy)

Page 23

Conclusion

By using five features from a fine needle aspirate & tumor size, breast cancer patients can be classified into 3 classes:
  Good – Requiring no chemotherapy
  Intermediate – Chemotherapy recommended for longer survival
  Poor – Chemotherapy may or may not enhance survival

3 classes have very distinct survival curves

First categorization of a breast cancer group for which chemotherapy enhances longevity