1
Classification and Feature Selection Algorithms for Multi-class
CGH data
Jun Liu, Sanjay Ranka, Tamer Kahveci
http://www.cise.ufl.edu
2
Gene copy number
• The number of copies of genes can vary from person to person.
  – About 0.4% of gene copy numbers differ between pairs of people.
• Variations in copy numbers can alter resistance to disease.
  – EGFR copy number can be higher than normal in non-small cell lung cancer.
[Figure: healthy vs. cancerous lung images (ALA)]
3
Comparative Genomic Hybridization (CGH)
4
Raw and smoothed CGH data
5
Example CGH dataset
862 genomic intervals in the Progenetix database
6
Problem description
• Given a new sample, which class does this sample belong to?
• Which features should we use to make this decision?
7
Outline
• Support Vector Machine (SVM)
• SVM for CGH data
• Maximum Influence Feature Selection algorithm
• Results
8
SVM in a nutshell
9
Classification with SVM
• Consider a two-class, linearly separable classification problem.
• Many decision boundaries are possible!
• The decision boundary should be as far away from the data of both classes as possible.
  – We should maximize the margin, $m$.
[Figure: two classes separated by a decision boundary with margin m]
10
SVM Formulation
• Let $\{x_1, \ldots, x_n\}$ be our data set and let $y_i \in \{1, -1\}$ be the class label of $x_i$.
• Maximize $J$ over $\alpha_i$:
  $J = \sum_i \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j (x_i^T x_j)$
  subject to $\alpha_i \ge 0$ and $\sum_i \alpha_i y_i = 0$,
  where the inner product $x_i^T x_j$ measures the similarity between $x_i$ and $x_j$.
• The decision boundary can be constructed as
  $f(z) = \mathrm{sign}\big(\sum_i \alpha_i y_i (x_i^T z) + b\big)$.
11
SVM for CGH data
12
Pairwise similarity measures
• Raw measure
  – Count the number of genomic intervals at which both samples have a gain (or both have a loss).
[Figure: example pair of samples with Raw = 3]
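As a concrete illustration (a sketch, not code from the slides), the Raw measure can be computed directly from two aberration vectors over {1, 0, -1}; the function name raw_similarity is ours:

```python
def raw_similarity(x, y):
    """Count genomic intervals where both samples show the same
    aberration: both gain (1) or both loss (-1)."""
    return sum(1 for a, b in zip(x, y) if a == b and a != 0)

# Toy example over six genomic intervals (same vectors as the proof slide)
X = [0, 1, 1, 0, 1, -1]
Y = [0, 1, 0, -1, -1, -1]
print(raw_similarity(X, Y))  # 2: one shared gain, one shared loss
```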
13
SVM based on Raw kernel
• Using SVM with the Raw kernel amounts to solving the following quadratic program. Maximize $J$ over $\alpha_i$:
  $J = \sum_i \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j \,\mathrm{Raw}(x_i, x_j)$,
  i.e., the Raw kernel replaces the inner product $x_i^T x_j$.
• The resulting decision function is
  $f(z) = \mathrm{sign}\big(\sum_i \alpha_i y_i \,\mathrm{Raw}(x_i, z) + b\big)$.
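To show how this could be wired up in practice, here is a hedged sketch that feeds a precomputed Raw kernel matrix to scikit-learn's SVC; the tiny training set and labels are invented for illustration:

```python
import numpy as np
from sklearn.svm import SVC

def raw_similarity(x, y):
    # Number of intervals with a shared gain or a shared loss
    return sum(1 for a, b in zip(x, y) if a == b and a != 0)

def raw_kernel_matrix(A, B):
    # K[i, j] = Raw(A[i], B[j])
    return np.array([[raw_similarity(a, b) for b in B] for a in A], dtype=float)

# Invented CGH-like samples over {1, 0, -1}^m and their class labels
X_train = np.array([[0, 1, 1, 0, 1, -1],
                    [0, 1, 0, -1, -1, -1],
                    [1, 0, 0, 1, 0, 0],
                    [0, 0, -1, 1, 0, 0]])
y_train = np.array([1, 1, -1, -1])

clf = SVC(kernel='precomputed')
clf.fit(raw_kernel_matrix(X_train, X_train), y_train)

X_test = np.array([[0, 1, 1, 0, 0, -1]])
print(clf.predict(raw_kernel_matrix(X_test, X_train)))
```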
14
Is Raw kernel valid?
• Not every similarity function can serve as a kernel. A valid kernel requires the underlying kernel matrix $M$ to be positive semi-definite.
• $M$ is positive semi-definite if $v^T M v \ge 0$ for all vectors $v$.
15
Is Raw kernel valid?
• Proof: define a function $\Phi: \{1, 0, -1\}^m \rightarrow \{0, 1\}^{2m}$ that encodes each genomic interval with two bits:
  – $\Phi(\text{gain}) = \Phi(1) = 01$
  – $\Phi(\text{no-change}) = \Phi(0) = 00$
  – $\Phi(\text{loss}) = \Phi(-1) = 10$
• Then $\mathrm{Raw}(X, Y) = \Phi(X)^T \Phi(Y)$. Example:
  X = 0 1 1 0 1 -1
  Y = 0 1 0 -1 -1 -1
  Φ(X) = 00 01 01 00 01 10
  Φ(Y) = 00 01 00 10 10 10
  Raw(X, Y) = 2 and Φ(X)^T Φ(Y) = 2 (the shared gain at interval 2 and the shared loss at interval 6).
16
Raw kernel is valid!
• The Raw kernel can be written as $\mathrm{Raw}(X, Y) = \Phi(X)^T \Phi(Y)$.
• Define the $2m \times n$ matrix $A = [\Phi(x_1), \ldots, \Phi(x_n)]$ and let $M = A^T A$ denote the kernel matrix of Raw.
• Therefore, for every vector $v$, $v^T M v = v^T A^T A v = \|Av\|^2 \ge 0$, so $M$ is positive semi-definite.
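A small numerical check of this proof (our own sketch): implement Φ as the bit encoding above and verify both Raw(X, Y) = Φ(X)ᵀΦ(Y) and that M = AᵀA has no negative eigenvalues:

```python
import numpy as np

# Bit encoding from the proof: gain -> 01, no-change -> 00, loss -> 10
ENCODE = {1: (0, 1), 0: (0, 0), -1: (1, 0)}

def phi(x):
    """Map a sample in {1, 0, -1}^m to {0, 1}^(2m)."""
    return np.array([bit for v in x for bit in ENCODE[v]])

X = [0, 1, 1, 0, 1, -1]
Y = [0, 1, 0, -1, -1, -1]
print(int(phi(X) @ phi(Y)))  # 2, equal to Raw(X, Y)

# Stack Phi(x_i) as columns of a 2m-by-n matrix A; then M = A^T A
samples = [X, Y, [1, 1, 0, 0, 0, 0]]
A = np.column_stack([phi(s) for s in samples])
M = A.T @ A
print(bool(np.all(np.linalg.eigvalsh(M) >= -1e-9)))  # True: M is PSD
```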
17
MIFS algorithm
18
MIFS for multi-class data
• Train one one-versus-all SVM per class.
• Each classifier ranks every feature by its contribution, from high to low. Example ranks across three classifiers: Feature 1: [8, 1, 3], Feature 2: [2, 31, 1], Feature 3: [12, 4, 3], Feature 4: [5, 15, 8].
• Sort each feature's ranks in increasing order: [1, 3, 8], [1, 2, 31], [3, 4, 12], [5, 8, 15].
• Sort the features by their sorted rank vectors and insert the most promising feature (Feature 4 in this example) into the feature set.
• Repeating this yields an ordered feature list, e.g.: 1. Feature 8, 2. Feature 4, 3. Feature 9, 4. Feature 33, 5. Feature 2, 6. Feature 48, 7. Feature 27, 8. Feature 1, ...
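A rough sketch of this scheme (assumptions: linear one-versus-all SVMs whose absolute weights determine the per-classifier ranks, and lexicographic comparison of the sorted rank vectors; the paper's exact ordering rule may differ):

```python
import numpy as np
from sklearn.svm import LinearSVC

def mifs_select(X, y, n_select):
    """MIFS-style selection: rank features in each one-versus-all
    linear SVM, sort each feature's ranks ascending, then order
    features by their sorted rank vectors."""
    classes = np.unique(y)
    n_features = X.shape[1]
    ranks = np.zeros((n_features, len(classes)), dtype=int)
    for c_idx, c in enumerate(classes):
        svm = LinearSVC().fit(X, np.where(y == c, 1, -1))
        w = np.abs(svm.coef_.ravel())
        order = np.argsort(-w)                # best feature first
        ranks[order, c_idx] = np.arange(1, n_features + 1)
    sorted_ranks = np.sort(ranks, axis=1)     # ascending per feature
    # Lexicographic comparison of the sorted rank vectors (our assumption)
    best_first = sorted(range(n_features), key=lambda f: tuple(sorted_ranks[f]))
    return best_first[:n_select]
```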
19
Results
20
Dataset Details
Data taken from the Progenetix database.
21
Datasets
Dataset size at each similarity level:

  #cancers   best   good   fair   poor
     2        478    466    351    373
     4       1160    790    800    800
     6       1100    850    880    810
     8       1000    830    750    760
22
Experimental results
• Comparison of linear and Raw kernels.
[Figure: predictive accuracy (y-axis, 0.3 to 0.9) of the linear vs. Raw kernel on each dataset]
On average, the Raw kernel improves predictive accuracy by 6.4% over the linear kernel across the sixteen datasets.
23
Experimental results
[Figure: accuracy (y-axis, 0.5 to 0.8) versus number of selected features (x-axis, 4 to 862) for all features, MIFS, MRMR (Ding and Peng, 2005), and SVM-RFE (Fu and Fu-Liu, 2005)]
• Using 40 features results in accuracy that is comparable to using all features.
• Using 80 features results in accuracy that is comparable to or better than using all features.
24
Using MIFS for feature selection
• Results testing the hypothesis that 40 features are enough and 80 features are better.
25
A Web Server for Mining CGH Data
http://cghmine.cise.ufl.edu:8007/CGH/Default.html
26
Thank you
27
Appendix
28
Minimum Redundancy and Maximum Relevance (MRMR)
Example dataset with six samples x1 to x6, four features, and binary class labels:

  Sample   Feature 1   Feature 2   Feature 3   Feature 4   Class
  x1           0           1           1           0          1
  x2           0           1           1           0          1
  x3           0           1           1           0          1
  x4           0           0           0           1          1
  x5           0           0           0           0         -1
  x6           0           0           0           0         -1

• Relevance V is defined as the average mutual information between features and class labels.
• Redundancy W is defined as the average mutual information between all pairs of features.
• Incrementally select features by maximizing (V / W) or (V − W).
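A minimal sketch of this greedy loop (function name is ours; CGH features are discrete, so scikit-learn's mutual_info_score applies directly); the scheme argument switches between the (V − W) and (V / W) criteria:

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def mrmr_select(X, y, n_select, scheme='difference'):
    """Greedily add the feature with maximal relevance-minus-redundancy
    (or relevance-over-redundancy) with respect to already-picked ones."""
    n_features = X.shape[1]
    relevance = np.array([mutual_info_score(X[:, f], y)
                          for f in range(n_features)])
    selected = [int(np.argmax(relevance))]   # start from the most relevant
    while len(selected) < n_select:
        best_f, best_score = None, -np.inf
        for f in set(range(n_features)) - set(selected):
            redundancy = np.mean([mutual_info_score(X[:, f], X[:, s])
                                  for s in selected])
            score = (relevance[f] - redundancy if scheme == 'difference'
                     else relevance[f] / (redundancy + 1e-12))
            if score > best_score:
                best_f, best_score = f, score
        selected.append(best_f)
    return selected
```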
29
Support Vector Machine Recursive Feature Elimination (SVM-RFE)
1. Train a linear SVM based on the current feature set.
2. Compute the weight vector $w$ and the ranking coefficient $w_i^2$ for the $i$-th feature.
3. Remove the feature with the smallest ranking coefficient.
4. If the feature set is not empty, go back to step 1.
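A compact sketch of this flowchart using scikit-learn's LinearSVC (one feature removed per iteration, as in the flowchart; practical implementations often drop features in batches for speed):

```python
import numpy as np
from sklearn.svm import LinearSVC

def svm_rfe(X, y):
    """Repeatedly train a linear SVM and eliminate the feature with the
    smallest ranking coefficient w_i^2; return features best-first."""
    remaining = list(range(X.shape[1]))
    eliminated = []
    while remaining:
        svm = LinearSVC().fit(X[:, remaining], y)
        scores = (svm.coef_ ** 2).sum(axis=0)   # w_i^2 per remaining feature
        worst = int(np.argmin(scores))
        eliminated.append(remaining.pop(worst))
    return eliminated[::-1]  # last-eliminated = most important
```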
30
Pairwise similarity measures
• Sim measure
  – A segment is a contiguous block of aberrations of the same type.
  – Count the number of overlapping segment pairs.
[Figure: example pair of samples with Sim = 2]
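A sketch of the Sim measure (helper names are ours): extract maximal runs of identical non-zero aberrations, then count same-type segment pairs that overlap in at least one interval:

```python
def segments(x):
    """Maximal runs of identical non-zero values, as (start, end, type)."""
    segs, start = [], None
    for i, v in enumerate(list(x) + [0]):   # sentinel closes the last run
        if start is not None and v != x[start]:
            segs.append((start, i - 1, x[start]))
            start = None
        if v != 0 and start is None:
            start = i
    return segs

def sim_similarity(x, y):
    """Count pairs of same-type segments (one from each sample)
    that overlap in at least one genomic interval."""
    return sum(1
               for (s1, e1, t1) in segments(x)
               for (s2, e2, t2) in segments(y)
               if t1 == t2 and s1 <= e2 and s2 <= e1)

X = [1, 1, 0, -1, -1, 0]
Y = [0, 1, 1, -1, 0, 0]
print(sim_similarity(X, Y))  # 2: one overlapping gain pair, one loss pair
```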
31
Non-linear Decision Boundary
• How can we generalize SVM when the two-class classification problem is not linearly separable?
• Key idea: transform $x_i$ to a higher-dimensional space to "make life easier".
  – Input space: the space where the points $x_i$ are located.
  – Feature space: the space of $\Phi(x_i)$ after transformation.
[Figure: points mapped by Φ(·) from the input space to the feature space, where a linear decision boundary can be found]
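As a toy illustration (ours, not from the slides): a quadratic feature map Φ(x) = (x, x²) turns one-dimensional data that is not linearly separable in the input space into data a linear threshold separates in the feature space:

```python
import numpy as np

# Class 1 sits between two class -1 clusters: no single threshold on x works
x = np.array([-3.0, -2.0, -1.0, 1.0, 2.0, 3.0])
y = np.array([-1, -1, 1, 1, -1, -1])

# Lift to 2-D with phi(x) = (x, x^2); the line x^2 = 2.5 now separates them
phi = np.column_stack([x, x ** 2])
print(bool(np.all((phi[:, 1] < 2.5) == (y == 1))))  # True
```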