
Page 1: Feature Selection for Regression Problems

Feature Selection for Regression Problems

M. Karagiannopoulos, D. Anyfantis, S. B. Kotsiantis, P. E. Pintelas

Educational Software Development Laboratory and Computers and Applications Laboratory

Department of Mathematics, University of Patras, Greece

Page 2: Feature Selection for Regression Problems

Scope

To investigate the most suitable wrapper feature selection technique (if any) for some well-known regression algorithms.

Page 3: Feature Selection for Regression Problems

Contents

• Introduction
• Feature selection techniques
• Wrapper algorithms
• Experiments
• Conclusions

Page 4: Feature Selection for Regression Problems

Introduction

What is the feature subset selection problem?

• Occurs prior to the learning (induction) algorithm.
• Selection of the relevant features (variables) that influence the prediction of the learning algorithm.

Page 5: Feature Selection for Regression Problems

Why is feature selection important?

May improve performance of learning algorithm

The learning algorithm may not scale up to the size of the full feature set, either in sample size or running time

Allows us to better understand the domain

Cheaper to collect a reduced set of features

Page 6: Feature Selection for Regression Problems

Characterising features

Generally, features are characterised as:

Relevant: features which have an influence on the output and whose role cannot be assumed by the rest.

Irrelevant: features not having any influence on the output, and whose values are generated at random for each example.

Redundant: a redundancy exists whenever a feature can take the role of another (perhaps the simplest way to model redundancy).
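As a small illustration of these three categories (not part of the original slides), the toy data below contains one relevant feature, one irrelevant feature drawn at random, and one redundant feature that is simply a rescaled copy of the relevant one; all names and numbers are ours.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

x_relevant = rng.normal(size=n)        # influences the output
x_irrelevant = rng.normal(size=n)      # generated at random, no influence on the output
x_redundant = 2.0 * x_relevant         # can take the role of x_relevant

y = 3.0 * x_relevant + rng.normal(scale=0.1, size=n)

# Correlation with the target flags the relevant feature, but it cannot by itself
# tell the relevant feature apart from its redundant copy.
for name, x in [("relevant", x_relevant),
                ("irrelevant", x_irrelevant),
                ("redundant", x_redundant)]:
    print(name, round(float(np.corrcoef(x, y)[0, 1]), 3))
```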

Page 7: Feature Selection for Regression Problems

Typical Feature Selection – First step

[Diagram: Original feature set → (1) Generation → subset → (2) Evaluation → goodness of the subset → (3) Stopping criterion (No: loop back; Yes) → (4) Validation]

Generates a subset of features for evaluation.

Can start with:
• no features
• all features
• a random subset of features

Page 8: Feature Selection for Regression Problems

Typical Feature Selection – Second step

[Same four-step diagram as on the previous slide]

Measures the goodness of the subset and compares it with the previous best subset.

If it is found to be better, it replaces the previous best subset.

Page 9: Feature Selection for Regression Problems

Typical Feature Selection – Third step

[Same four-step diagram as on the previous slides]

Based on the generation procedure:
• a pre-defined number of features
• a pre-defined number of iterations

Based on the evaluation function:
• whether addition or deletion of a feature does not produce a better subset
• whether an optimal subset, according to some evaluation function, is achieved

Page 10: Feature Selection for Regression Problems

Typical Feature Selection - Fourth step

[Same four-step diagram as on the previous slides]

This step is basically not part of the feature selection process itself: the results are compared with already established results, or with results from competing feature selection methods.
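Putting the four steps together, the following is a minimal sketch of the generic wrapper-style loop, assuming scikit-learn-style regressors and using the mean cross-validated score as the "goodness" measure; the function names and the candidate-generation interface are our own illustration, not the original study's code.

```python
import numpy as np
from sklearn.base import clone
from sklearn.model_selection import cross_val_score

def goodness(learner, X, y, subset, cv=10):
    # Step 2 (Evaluation): cross-validated score of the learner on the feature subset.
    if not subset:
        return -np.inf
    return cross_val_score(clone(learner), X[:, sorted(subset)], y, cv=cv).mean()

def select_features(learner, X, y, generate_candidates, max_iter=30):
    # Step 1 (Generation) proposes candidate subsets; Step 3 (Stopping criterion) ends
    # the search after a fixed number of iterations or when no candidate improves.
    best, best_score = frozenset(), -np.inf
    for _ in range(max_iter):
        candidates = generate_candidates(best, X.shape[1])
        if not candidates:
            break
        scored = [(goodness(learner, X, y, s), s) for s in candidates]
        score, subset = max(scored, key=lambda pair: pair[0])
        if score <= best_score:
            break
        best, best_score = subset, score
    return best, best_score

# Step 4 (Validation) happens outside the search: the chosen subset is compared with
# established results or with competing feature selection methods on held-out data.
```

Passing, for instance, `generate_candidates = lambda current, n: [current | {f} for f in range(n) if f not in current]` turns this loop into forward selection.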

Page 11: Feature Selection for Regression Problems

Categorization of feature selection techniques

Feature selection methods are grouped into two broad groups:

• Filter methods take the set of data (features), attempt to trim some, and then hand this new set of features to the learning algorithm.

• Wrapper methods use the accuracy of the learning algorithm as the evaluation measure.
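The practical difference is in what gets scored. A minimal sketch, assuming scikit-learn: the filter ranks features from the data alone (here with a univariate F-test), while the wrapper scores a candidate subset by the learning algorithm's own cross-validated performance; the choice of statistic and of scoring scheme is ours.

```python
import numpy as np
from sklearn.feature_selection import f_regression
from sklearn.model_selection import cross_val_score

def filter_select(X, y, k):
    # Filter: score each feature from the data alone (univariate F-test) and keep
    # the top k, without consulting any learning algorithm.
    scores, _ = f_regression(X, y)
    return np.argsort(scores)[::-1][:k]

def wrapper_goodness(learner, X, y, subset):
    # Wrapper: the goodness of a subset is the learning algorithm's own
    # cross-validated performance when trained on that subset.
    return cross_val_score(learner, X[:, list(subset)], y, cv=10).mean()
```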

Page 12: Feature Selection for Regression Problems

Argument for wrapper methods

The estimated accuracy of the learning algorithm is the best available heuristic for measuring the value of features.

Different learning algorithms may perform better with different feature sets, even if they are using the same training set.

Page 13: Feature Selection for Regression Problems

Wrapper selection algorithms (1)

The simplest method is forward selection (FS). It starts with the empty set and greedily adds features one at a time (without backtracking).

Backward stepwise selection (BS) starts with all features in the feature set and greedily removes them one at a time (without backtracking).
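A minimal sketch of both greedy searches, assuming scikit-learn-style regressors and the mean cross-validated score as the evaluation measure; the helper names are ours.

```python
import numpy as np
from sklearn.base import clone
from sklearn.model_selection import cross_val_score

def cv_score(learner, X, y, subset):
    if not subset:
        return -np.inf
    return cross_val_score(clone(learner), X[:, sorted(subset)], y, cv=10).mean()

def forward_selection(learner, X, y):
    # FS: start from the empty set and greedily add the single most helpful feature,
    # stopping when no addition improves the score (no backtracking).
    selected, best = set(), -np.inf
    while True:
        moves = [(cv_score(learner, X, y, selected | {f}), f)
                 for f in range(X.shape[1]) if f not in selected]
        if not moves:
            break
        score, f = max(moves, key=lambda pair: pair[0])
        if score <= best:
            break
        selected.add(f)
        best = score
    return selected

def backward_selection(learner, X, y):
    # BS: start from the full feature set and greedily remove the feature whose
    # removal helps most, stopping when no removal improves the score.
    selected = set(range(X.shape[1]))
    best = cv_score(learner, X, y, selected)
    while len(selected) > 1:
        moves = [(cv_score(learner, X, y, selected - {f}), f) for f in selected]
        score, f = max(moves, key=lambda pair: pair[0])
        if score <= best:
            break
        selected.remove(f)
        best = score
    return selected
```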

Page 14: Feature Selection for Regression Problems

Wrapper selection algorithms (2)

The Best First search starts with an empty set of features and generates all possible single-feature expansions. The subset with the highest evaluation is chosen and is expanded in the same manner by adding single features (with backtracking). Best First search can be combined with forward (BFFS) or backward (BFBS) selection.

Genetic algorithm selection. A solution is typically a fixed length binary string representing a feature subset—the value of each position in the string represents the presence or absence of a particular feature. The algorithm is an iterative process where each successive generation is produced by applying genetic operators such as crossover and mutation to the members of the current generation.
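A minimal sketch of the bit-string encoding and of one generation of the genetic search, with the learner's cross-validated score as the fitness; the population handling, selection scheme, and mutation rate are illustrative assumptions rather than the settings used in the study.

```python
import numpy as np
from sklearn.base import clone
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

def fitness(learner, X, y, bits):
    # A candidate subset is a fixed-length 0/1 string: bits[i] == 1 means feature i is kept.
    if bits.sum() == 0:
        return -np.inf
    return cross_val_score(clone(learner), X[:, bits.astype(bool)], y, cv=10).mean()

def next_generation(population, fitnesses, p_mutate=0.02):
    # One iteration: keep the fitter half as parents, then build children by
    # one-point crossover and bit-flip mutation.
    n, length = population.shape
    parents = population[np.argsort(fitnesses)[::-1][: n // 2]]
    children = []
    while len(children) < n:
        a, b = parents[rng.integers(len(parents), size=2)]
        cut = rng.integers(1, length)
        child = np.concatenate([a[:cut], b[cut:]])
        flips = rng.random(length) < p_mutate
        children.append(np.where(flips, 1 - child, child))
    return np.array(children)
```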

Page 15: Feature Selection for Regression Problems

Experiments

For the purpose of the present study, we used four well-known learning algorithms (RepTree, M5rules, K*, SMOreg), the feature selection algorithms presented above, and 12 datasets from the UCI repository.

Page 16: Feature Selection for Regression Problems

Methodology of experiments

The whole training set was divided into ten mutually exclusive and equal-sized subsets, and for each subset the learner was trained on the union of all the other subsets.

The best features are selected according to the feature selection algorithm and the performance of the subset is measured by how well it predicts the values of the test instances.

This cross-validation procedure was run 10 times for each algorithm, and the average value over the 10 cross-validations was calculated.
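A minimal sketch of this protocol, assuming scikit-learn-style regressors and taking the correlation coefficient between predicted and actual values on the test folds as the performance measure; the WEKA learners used in the study (RepTree, M5rules, K*, SMOreg) are not reproduced here, so any regressor stands in for them.

```python
import numpy as np
from sklearn.base import clone
from sklearn.model_selection import KFold

def repeated_cv_correlation(learner, X, y, repeats=10, folds=10):
    # Average, over `repeats` runs, of the correlation coefficient between the
    # pooled out-of-fold predictions and the true target values.
    run_scores = []
    for run in range(repeats):
        kfold = KFold(n_splits=folds, shuffle=True, random_state=run)
        predicted, actual = [], []
        for train, test in kfold.split(X):
            model = clone(learner).fit(X[train], y[train])
            predicted.append(model.predict(X[test]))
            actual.append(y[test])
        run_scores.append(np.corrcoef(np.concatenate(predicted),
                                      np.concatenate(actual))[0, 1])
    return float(np.mean(run_scores))
```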

Page 17: Feature Selection for Regression Problems

Experiment with regression tree - RepTree

BS is a slightly better feature selection method (on average) than the others for RepTree.

Average correlation coefficient:

WS     FS     BS     BFFS   BFBS   GS
0.72   0.73   0.74   0.73   0.73   0.73

Page 18: Feature Selection for Regression Problems

Experiment with rule learner - M5rules

Average correlation coefficient:

WS     FS     BS     BFFS   BFBS   GS
0.79   0.82   0.83   0.82   0.83   0.83

BS, BFBS and GS are the best feature selection methods (on average) for the M5rules learner.

Page 19: Feature Selection for Regression Problems

Experiment with instance-based learner - K*

Average correlation coefficient:

WS     FS     BS     BFFS   BFBS   GS
0.71   0.79   0.80   0.79   0.80   0.79

BS and BFBS are the best feature selection methods (on average) for the K* algorithm.

Page 20: Feature Selection for Regression Problems

Experiment with SMOreg

Average correlation coefficient:

WS     FS     BS     BFFS   BFBS   GS
0.80   0.81   0.81   0.81   0.81   0.81

All feature selection methods give similar results for SMOreg.

Page 21: Feature Selection for Regression Problems

Conclusions

None of the described feature selection algorithms is superior to the others on all data sets for a specific learning algorithm.

More generally, none of the described feature selection algorithms is superior to the others on all data sets.

Backward selection strategies are very inefficient for large-scale datasets, which may have hundreds of original features.

Forward selection wrapper methods are less able to improve the performance of a given learner, but they are less expensive in terms of computational effort and use fewer features for the induction.

Genetic selection typically requires a large number of evaluations to reach a minimum.

Page 22: Feature Selection for Regression Problems

Future Work

We will use a light filter feature selection procedure as a preprocessing step in order to reduce the computational cost of the wrapping procedure without harming accuracy.
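A minimal sketch of that idea, assuming scikit-learn: a cheap univariate filter first keeps the top-k features, and the wrapper search (forward selection here) then runs only over the reduced set; the choice of filter statistic, of k, and of search strategy are illustrative assumptions, not part of the slides.

```python
import numpy as np
from sklearn.base import clone
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.model_selection import cross_val_score

def filter_then_wrap(learner, X, y, k=20):
    # Light filter step: keep the k features with the highest univariate F-score,
    # so that the expensive wrapper search only explores a reduced space.
    keep = SelectKBest(f_regression, k=min(k, X.shape[1])).fit(X, y).get_support(indices=True)

    # Wrapper step: forward selection restricted to the pre-filtered features,
    # scored by the learner's cross-validated performance.
    selected, best = [], -np.inf
    while True:
        moves = [(cross_val_score(clone(learner), X[:, selected + [f]], y, cv=10).mean(), f)
                 for f in keep if f not in selected]
        if not moves:
            break
        score, f = max(moves, key=lambda pair: pair[0])
        if score <= best:
            break
        selected.append(int(f))
        best = score
    return selected
```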