
Bioinformatics Challenge

Learning in very high dimensions with very few samples

Acute leukemia dataset: 7129 genes vs. 72 samples

Colon cancer dataset: 2000 genes vs. 62 samples

Feature selection will be needed

Feature Selection Approach

Filter model: weight score approach

Wrapper model: 1-norm SVM, IRSVM

Feature Selection – Filter Model Using Weight Score Approach

Feature 1: $(\mu_1^+ - \mu_1^-)$   Feature 2: $(\mu_2^+ - \mu_2^-)$   Feature 3: $(\mu_3^+ - \mu_3^-)$

Filter Model – Weight Score Approach

Weight score:

$$w_j = \frac{\mu_j^+ - \mu_j^-}{\sigma_j^+ + \sigma_j^-}$$

where $\mu_j$ and $\sigma_j$ are the mean and standard deviation of the $j$-th feature over the training examples of the positive or negative class.

Filter Model – Weight Score Approach

The weight score $w_j$ is defined as the ratio between the difference of the means of the expression levels and the sum of the standard deviations in the two classes.

Select the genes with the largest $|w_j|$ as the top features.

The weight score $w_j$ is calculated from the information of a single feature alone, so highly linearly correlated features might all be selected by this approach.
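As a concrete illustration, here is a minimal numpy sketch of the weight score computation; the expression matrix `X`, labels `y`, and all function names are hypothetical, not from the original slides.

```python
import numpy as np

def weight_scores(X, y):
    """Filter-model weight score w_j for every feature.

    X : (n_samples, n_features) expression matrix (hypothetical layout)
    y : labels in {+1, -1}
    """
    pos, neg = X[y == +1], X[y == -1]
    mu_pos, mu_neg = pos.mean(axis=0), neg.mean(axis=0)
    sd_pos, sd_neg = pos.std(axis=0), neg.std(axis=0)
    # w_j = (mu_j^+ - mu_j^-) / (sigma_j^+ + sigma_j^-)
    return (mu_pos - mu_neg) / (sd_pos + sd_neg)

def top_features(X, y, k=50):
    # rank genes by |w_j| and keep the k largest
    w = weight_scores(X, y)
    return np.argsort(-np.abs(w))[:k]
```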

1-Norm SVM (Different Measure of Margin)

$$\min_{(w,b,\xi)\in\mathbb{R}^{n+1+l}} \|w\|_1 + C\,e'\xi \quad \text{s.t.}\quad D(Aw + eb) + \xi \ge e,\;\; \xi \ge 0$$

1-Norm SVM

Equivalent to:

$$\min_{(s,w,b,\xi)\in\mathbb{R}^{2n+1+l}} e's + C\,e'\xi \quad \text{s.t.}\quad D(Aw + eb) + \xi \ge e,\;\; \xi \ge 0,\;\; -s \le w \le s$$

Good for feature selection!
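To make this concrete, here is a hedged sketch using scikit-learn's l1-penalized linear SVM as a stand-in for the LP above; it minimizes $\|w\|_1$ with a squared hinge loss rather than solving the exact linear program, and the data below is random placeholder data, not the leukemia or colon sets.

```python
import numpy as np
from sklearn.svm import LinearSVC

# Placeholder data with the colon-cancer shape (62 samples x 2000 genes);
# real labels would come from the dataset, these are random.
rng = np.random.default_rng(0)
X = rng.standard_normal((62, 2000))
y = rng.choice([-1, 1], size=62)

# penalty="l1" puts the ||w||_1 term in the objective, driving many w_j to
# exactly zero -- the zero/nonzero pattern performs the feature selection.
clf = LinearSVC(penalty="l1", loss="squared_hinge", dual=False, C=1.0,
                max_iter=10_000).fit(X, y)
selected = np.flatnonzero(clf.coef_)
print(f"{selected.size} of {X.shape[1]} features kept")
```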

Clustering Process: Feature Selection & Initial Cluster Centers

6 out of 31 features were selected by a linear 1-norm SVM ($\|\cdot\|_1$-SVM):

mean of area, standard error of area, worst area, worst texture, worst perimeter, and tumor size

Reduced Support Vector Machine

(i) Choose a random subset matrix $\bar{A}\in\mathbb{R}^{\bar{m}\times n}$ of the entire data matrix $A\in\mathbb{R}^{m\times n}$ ($\bar{m}\ll m$).

(ii) Solve the following problem by Newton's method:

$$\min_{(\bar{u},b)\in\mathbb{R}^{\bar{m}+1}}\; \frac{\nu}{2}\,\left\|p\big(e - D(K(A,\bar{A}')\bar{u} + eb),\,\alpha\big)\right\|_2^2 + \frac{1}{2}\left(\|\bar{u}\|_2^2 + b^2\right)$$

(iii) The nonlinear classifier is defined by the optimal solution $(\bar{u},b)$ of step (ii):

Nonlinear classifier:

$$K(x',\bar{A}')\bar{u} + b = \sum_{i=1}^{\bar{m}} \bar{u}_i\,K(\bar{A}_i,x) + b = 0$$

Using only the small square kernel $K(\bar{A},\bar{A}')$ (i.e., training on the subset alone) gives lousy results!
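The following is an illustrative sketch of steps (i)–(iii), assuming a Gaussian kernel; a ridge-regularized least-squares fit stands in here for the Newton solve of the smoothed objective, so this is a simplification, not the authors' solver.

```python
import numpy as np

def rbf_kernel(X, Z, gamma=0.1):
    # Gaussian kernel matrix: K[i, j] = exp(-gamma * ||X_i - Z_j||^2)
    return np.exp(-gamma * ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1))

def rsvm_fit(A, y, m_bar=20, nu=10.0, gamma=0.1, seed=0):
    # (i) random reduced set A_bar, with m_bar << m
    rng = np.random.default_rng(seed)
    A_bar = A[rng.choice(len(A), size=m_bar, replace=False)]
    # (ii) fit (u, b) on the rectangular reduced kernel K(A, A_bar');
    # a ridge-regularized least-squares solve replaces the Newton step here
    K = np.hstack([rbf_kernel(A, A_bar, gamma), np.ones((len(A), 1))])
    ub = np.linalg.solve(K.T @ K + np.eye(m_bar + 1) / nu, K.T @ y)
    return A_bar, ub[:-1], ub[-1]

def rsvm_predict(x, A_bar, u, b, gamma=0.1):
    # (iii) nonlinear classifier: sign(K(x', A_bar') u + b)
    return np.sign(rbf_kernel(np.atleast_2d(x), A_bar, gamma) @ u + b)
```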

Reduced Set: Plays the Most Important Role in RSVM

It is natural to raise two questions:

Is there a way to choose the reduced set, other than random selection, so that RSVM achieves better performance?

Is there a mechanism to determine the size of the reduced set $\bar{A}\in\mathbb{R}^{\bar{m}\times n}$ automatically or dynamically?

Reduced Set Selection According to the Data Scatter in Input Space

These points are expected to be a representative sample.

Choose the reduced set randomly, but keep only the points that are more than a certain minimal distance apart (a sketch follows the figure below).

[Figure: twelve sample points, labeled 1–12, scattered in the input space.]
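Here is a minimal sketch of this selection rule, assuming plain Euclidean distance in input space; the function name and parameters are illustrative.

```python
import numpy as np

def scatter_reduced_set(A, m_bar, min_dist, seed=0):
    # Scan the data in random order, keeping a point only if it lies at
    # least min_dist away from every point already kept.
    rng = np.random.default_rng(seed)
    kept = []
    for i in rng.permutation(len(A)):
        if all(np.linalg.norm(A[i] - A[j]) >= min_dist for j in kept):
            kept.append(i)
        if len(kept) == m_bar:
            break
    return A[kept]
```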

A Better Way: According to the Data Scatter in Feature Space

An example is given as follows: training data analogous to the XOR problem.

Mapping to Feature Space

Map the input data via the nonlinear mapping:

$$\phi : (x_1, x_2) \mapsto \left(x_1^2,\; x_2^2,\; \sqrt{2}\,x_1 x_2\right)$$

This is equivalent to the polynomial kernel of degree 2:

$$K(x,z) = (x \cdot z)^2 = (x_1 z_1 + x_2 z_2)^2 = x_1^2 z_1^2 + x_2^2 z_2^2 + 2 x_1 x_2 z_1 z_2$$
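A quick numerical check of this identity (values chosen arbitrarily):

```python
import numpy as np

# Explicit degree-2 feature map for 2-D inputs
phi = lambda x: np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

x, z = np.array([1.0, 2.0]), np.array([3.0, -1.0])
k_direct = (x @ z) ** 2          # kernel evaluated in input space: (x . z)^2
k_mapped = phi(x) @ phi(z)       # inner product after the explicit map
assert np.isclose(k_direct, k_mapped)   # both equal 1.0 for these values
```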

Data Points in the Feature Space

[Figure: the twelve points in input space (labels 1–12) and after the mapping; in feature space the pairs (3,6), (2,5), (1,4), (8,11), (9,12), and (7,10) nearly coincide.]

The Polynomial Kernel Matrix

[Figure: the degree-2 polynomial kernel matrix for the twelve points.]

Experiment Results

Mathematical Observations: Another Reason for IRSVM

In SVMs, the nonlinear separating surface is:

$$K(x', A')\,D u + b = 0$$

It is a linear combination of a set of kernel functions.

In RSVMs, the nonlinear separating surface is:

$$\sum_{i=1}^{\bar{m}} K(x', \bar{A}_i')\,\bar{D}_{ii}\,\bar{u}_i + b = 0$$

If these kernel functions are very similar, the hypothesis space they span will be very limited.

Incremental Reduced SVMs: The Strength of Weak Ties

Start with a very small reduced set, then add a new data point only when its kernel vector is dissimilar to the current function set.

Such a point contributes the most extra information for generating the separating surface.

Repeat until several successive points cannot be added.

The Strength of Weak Ties (cont.)

The criterion for adding a point to the reduced set: the distance from its kernel vector to the column space of the current reduced kernel matrix $K(A,\bar{A}')$ is greater than a threshold.

This distance can be determined by solving a least squares problem.

How to Measure the Dissimilarity? Solving Least Squares Problems

$$\min_{\beta\in\mathbb{R}^{\bar{m}}}\; \left\| K(A,\bar{A}')\,\beta - K(A,A_i') \right\|_2^2$$

It has a unique solution $\beta^*$, and the distance is

$$\delta = \left\| K(A,\bar{A}')\,\beta^* - K(A,A_i') \right\|_2$$
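A minimal numpy sketch of this computation; the argument names are illustrative.

```python
import numpy as np

def kernel_vector_distance(K_reduced, k_new):
    """Distance from a candidate kernel vector k_new = K(A, A_i') to the
    column space of the reduced kernel matrix K_reduced = K(A, A_bar')."""
    beta, *_ = np.linalg.lstsq(K_reduced, k_new, rcond=None)  # unique beta*
    return np.linalg.norm(K_reduced @ beta - k_new)           # the distance delta
```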

IRSVM Algorithm Pseudo-code (Sequential Version)

1  Randomly choose two data points from the training data as the initial reduced set
2  Compute the reduced kernel matrix
3  For each data point not in the reduced set:
4      Compute its kernel vector
5      Compute the distance from the kernel vector to the
6          column space of the current reduced kernel matrix
7      If its distance exceeds a certain threshold:
8          Add this point to the reduced set and form the new reduced kernel matrix
9  Repeat until several successive failures happen in line 7
10 Solve the QP problem of nonlinear SVMs with the obtained reduced kernel
11 A new data point is classified by the separating surface
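Below is a hedged Python rendering of lines 1–9, assuming a Gaussian kernel; the threshold, failure count, and scan order are illustrative choices, not from the slides.

```python
import numpy as np

def rbf_kernel(X, Z, gamma=0.1):
    # Gaussian kernel matrix: K[i, j] = exp(-gamma * ||X_i - Z_j||^2)
    return np.exp(-gamma * ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1))

def irsvm_reduced_set(A, gamma=0.1, threshold=0.1, max_failures=5, seed=0):
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(A))
    reduced = list(order[:2])                    # line 1: two initial points
    K = rbf_kernel(A, A[reduced])                # line 2: reduced kernel matrix
    failures = 0
    for i in order[2:]:                          # line 3
        k_i = rbf_kernel(A, A[[i]]).ravel()      # line 4: kernel vector
        beta, *_ = np.linalg.lstsq(K, k_i, rcond=None)  # lines 5-6: distance to
        dist = np.linalg.norm(K @ beta - k_i)           # the column space of K
        if dist > threshold:                     # line 7
            reduced.append(i)                    # line 8: grow the reduced set
            K = rbf_kernel(A, A[reduced])
            failures = 0
        else:
            failures += 1
            if failures >= max_failures:         # line 9: stop after repeated failures
                break
    return A[reduced]                            # line 10 would train on K(A, A_bar')
```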

Wrapper Model – IRSVM: Find a Linear Classifier

I. Randomly choose a very small feature subset from the input features as the initial feature reduced set.

II. Select a feature vector not in the current feature reduced set and compute the distance between this vector and the space spanned by the current feature reduced set.

III. If the distance is larger than a given gap, add this feature vector to the feature reduced set.

IV. Repeat steps II and III until no more features can be added to the current feature reduced set.

V. The features in the resulting feature reduced set are the final result of feature selection (a sketch follows the classifier below).

The resulting linear classifier:

$$f(x) = \sum_{i=1}^{n} w_i x_i + b$$
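Here is a sketch of steps I–IV applied to feature columns, mirroring the kernel-column version above; the failure-count stopping rule is an assumption standing in for step IV's "no feature can be added" test.

```python
import numpy as np

def irsvm_feature_selection(X, gap=1.0, max_failures=5, seed=0):
    rng = np.random.default_rng(seed)
    order = rng.permutation(X.shape[1])
    kept = list(order[:1])                        # step I: tiny initial subset
    failures = 0
    for j in order[1:]:                           # step II: scan remaining features
        F = X[:, kept]
        beta, *_ = np.linalg.lstsq(F, X[:, j], rcond=None)
        dist = np.linalg.norm(F @ beta - X[:, j])  # distance to span of kept columns
        if dist > gap:                            # step III: far enough -> keep it
            kept.append(j)
            failures = 0
        else:
            failures += 1
            if failures >= max_failures:          # approximates step IV's stop test
                break
    return kept                                   # step V: the selected features
```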