Interval-valued Fuzzy-Rough Feature Selection in Datasets with Missing Values

Richard Jensen and Qiang Shen

Prof Qiang Shen, Aberystwyth University, UK, [email protected]
Dr. Richard Jensen, Aberystwyth University, UK, [email protected]

FUZZ-IEEE 2009



Outline

• The importance of feature selection
• Rough set theory
• Fuzzy-rough feature selection (FRFS)
• Interval-valued FRFS
• Experimentation
• Conclusion

• Why dimensionality reduction/feature selection?
  • Growth of information - need to manage this effectively
  • Curse of dimensionality - a problem for machine learning
  • Data visualisation - graphing data

[Diagram labels: High dimensional data, Dimensionality Reduction, Low dimensional data, Processing System, Intractable, Feature selection]

Feature selection

• Feature selection (FS) is a DR technique that preserves data semantics (meaning of data)
  • Subset generation: forwards, backwards, random…
  • Evaluation function: determines ‘goodness’ of subsets
  • Stopping criterion: decide when to stop subset search

[Diagram labels: Feature set, Generation, Subset, Evaluation, Subset suitability, Stopping Criterion, Continue, Stop, Validation]

Rough set theory

Rx is the set of all points that are indiscernible with point x in terms of feature subset B

[Diagram labels: Upper Approximation, Set A, Lower Approximation, Equivalence class Rx]

Rough set feature selection

• Attempts to remove unnecessary or redundant features
  • Evaluation: function based on rough set concept of lower approximation
  • Generation: greedy hill-climbing algorithm employed
  • Stopping criterion: when maximum evaluation value is reached
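The three components on this slide can be sketched in Python. This is only an illustrative sketch, not the authors' implementation: it assumes the evaluation function is the usual rough-set dependency degree (the fraction of objects in the positive region), and the names `partition`, `dependency`, and `greedy_reduct` are hypothetical.

```python
from collections import defaultdict


def partition(rows, features):
    """Group object indices into equivalence classes by their values on `features`."""
    blocks = defaultdict(list)
    for i, row in enumerate(rows):
        blocks[tuple(row[f] for f in features)].append(i)
    return list(blocks.values())


def dependency(rows, decisions, features):
    """Evaluation (assumed): fraction of objects whose equivalence class
    falls entirely inside one decision class (the positive region)."""
    positive = 0
    for block in partition(rows, features):
        if len({decisions[i] for i in block}) == 1:
            positive += len(block)
    return positive / len(rows)


def greedy_reduct(rows, decisions):
    """Generation: greedy hill-climbing, adding the best single feature each step.
    Stopping: when the evaluation reaches its maximum (that of the full feature set)."""
    all_features = list(range(len(rows[0])))
    target = dependency(rows, decisions, all_features)
    selected, best = [], 0.0
    while best < target:
        candidates = [f for f in all_features if f not in selected]
        scores = {f: dependency(rows, decisions, selected + [f]) for f in candidates}
        f_best = max(candidates, key=lambda f: scores[f])
        selected.append(f_best)
        best = scores[f_best]
    return selected, best


# Toy data (made up for illustration): the decision depends on features 0 and 1 only.
rows = [(0, 1, 0), (0, 1, 1), (1, 0, 0), (1, 0, 1), (0, 0, 0), (0, 0, 1)]
decisions = ["yes", "yes", "yes", "yes", "no", "no"]
subset, score = greedy_reduct(rows, decisions)
print(subset, score)
```

On this toy data the irrelevant third feature is never selected, because adding it does not enlarge the positive region.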

Fuzzy-rough sets

Fuzzy-rough set

Fuzzy similarity

Fuzzy-rough sets

• Fuzzy-rough feature selection
  • Evaluation: function based on fuzzy-rough lower approximation
  • Generation: greedy hill-climbing
  • Stopping criterion: when maximal ‘goodness’ is reached (or to degree α)
• Problem #1: how to choose fuzzy similarity?
• Problem #2: how to handle missing values?

Interval-valued FRFS

IV fuzzy rough set

IV fuzzy similarity

• Answer #1: Model uncertainty in fuzzy similarity by interval-valued similarity

Interval-valued FRFS

Missing values

• When comparing two object values for a given attribute, what to do if at least one is missing?
• Answer #2: Model missing values via the unit interval
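A minimal Python sketch of this idea follows. It rests on assumptions not stated on the slide: a known pair of values is given a degenerate interval [s, s], a missing value is encoded as `None`, and intervals are combined across attributes with the min t-norm applied to lower and upper bounds separately. The function names are hypothetical.

```python
def iv_similarity(x, y, value_range):
    """Interval-valued similarity of two values of one attribute."""
    if x is None or y is None:
        # Missing value: the true similarity could be anywhere in [0, 1].
        return (0.0, 1.0)
    s = 1.0 - abs(x - y) / value_range
    return (s, s)  # assumption: known values yield a degenerate interval


def iv_subset_similarity(obj_x, obj_y, ranges):
    """Combine per-attribute intervals with the min t-norm, bound by bound."""
    intervals = [iv_similarity(a, b, r) for a, b, r in zip(obj_x, obj_y, ranges)]
    lower = min(lo for lo, _ in intervals)
    upper = min(hi for _, hi in intervals)
    return (lower, upper)


# Second attribute of the first object is missing.
print(iv_subset_similarity((0.25, None), (0.5, 0.875), (1.0, 1.0)))
# No missing values: the interval collapses to a single similarity degree.
print(iv_subset_similarity((0.25, 0.5), (0.5, 0.5), (1.0, 1.0)))
```

The width of the resulting interval reflects how much the missing data leaves undetermined: with no missing values the two bounds coincide.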

Other measures

• Boundary region
• Discernibility function

Experimentation

• Datasets corrupted with noise
• 10-fold cross validation with JRip

Results: lower

Results: boundary

Results: discernibility

Conclusion

• New approaches to fuzzy-rough feature selection based on IVFS
  • Can handle missing values effectively
  • Allows greater flexibility w.r.t. similarity relations
• Future work
  • Further investigations
  • Development and extension of other fuzzy-rough methods to handle missing values (classifiers, clusterers, etc.)

• WEKA implementations of all fuzzy-rough feature selectors and classifiers can be downloaded from:

RSAR approximations

• Approximating a concept X using knowledge in P
  • Lower approximation: contains objects that definitely belong to X
  • Upper approximation: contains objects that possibly belong to X

$$\underline{P}X = \{x \in U : [x]_P \subseteq X\}$$

$$\overline{P}X = \{x \in U : [x]_P \cap X \neq \emptyset\}$$
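As a small illustration of the two approximations, the following Python sketch computes them for a toy universe. The data and the function names are made up for the example; objects are identified by their row index, and the concept X is given as a set of indices.

```python
from collections import defaultdict


def equivalence_classes(rows, features):
    """Equivalence classes [x]_P: objects with identical values on the features in P."""
    classes = defaultdict(set)
    for i, row in enumerate(rows):
        classes[tuple(row[f] for f in features)].add(i)
    return list(classes.values())


def lower_approximation(rows, features, concept):
    """Objects whose equivalence class is a subset of X (definitely in X)."""
    return {i for block in equivalence_classes(rows, features)
            if block <= concept for i in block}


def upper_approximation(rows, features, concept):
    """Objects whose equivalence class intersects X (possibly in X)."""
    return {i for block in equivalence_classes(rows, features)
            if block & concept for i in block}


# Five objects described by one feature; the concept X contains objects 0, 1 and 2.
rows = [("a",), ("a",), ("b",), ("b",), ("c",)]
X = {0, 1, 2}
print(lower_approximation(rows, [0], X))
print(upper_approximation(rows, [0], X))
```

Objects 2 and 3 are indiscernible but only one of them is in X, so both fall in the upper approximation and neither in the lower.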

FRFS

• Based on fuzzy similarity

$$\mu_{R_a}(x,y) = 1 - \frac{|a(x) - a(y)|}{|a_{\max} - a_{\min}|}$$

$$\mu_{R_P}(x,y) = \mathcal{T}_{a \in P}\{\mu_{R_a}(x,y)\}$$

• Lower/upper approximations

$$\mu_{\underline{R_P}X}(x) = \inf_{y \in U} \mathcal{I}(\mu_{R_P}(x,y), \mu_X(y))$$

$$\mu_{\overline{R_P}X}(x) = \sup_{y \in U} \mathcal{T}(\mu_{R_P}(x,y), \mu_X(y))$$
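The fuzzy similarity relation and the lower/upper approximations on this slide can be sketched in Python as follows. The slide does not name the t-norm or implicator, so this sketch assumes the min t-norm for combining attributes and the Łukasiewicz t-norm and implicator inside the approximations; other choices are equally possible. All function names are hypothetical, and each attribute is assumed to be numeric and non-constant.

```python
def similarity(data, a, x, y):
    """Similarity of objects x and y on attribute a:
    1 - |a(x) - a(y)| / |a_max - a_min| (attribute assumed non-constant)."""
    values = [row[a] for row in data]
    return 1.0 - abs(data[x][a] - data[y][a]) / abs(max(values) - min(values))


def subset_similarity(data, features, x, y):
    """Similarity on a feature subset P, combined with the min t-norm (assumption)."""
    return min(similarity(data, a, x, y) for a in features)


def implicator(a, b):
    """Lukasiewicz implicator (assumption)."""
    return min(1.0, 1.0 - a + b)


def tnorm(a, b):
    """Lukasiewicz t-norm (assumption)."""
    return max(0.0, a + b - 1.0)


def lower(data, features, membership, x):
    """Membership of x in the fuzzy-rough lower approximation of X."""
    return min(implicator(subset_similarity(data, features, x, y), membership[y])
               for y in range(len(data)))


def upper(data, features, membership, x):
    """Membership of x in the fuzzy-rough upper approximation of X."""
    return max(tnorm(subset_similarity(data, features, x, y), membership[y])
               for y in range(len(data)))


# Three objects with one real-valued attribute; X is a crisp concept {0, 1}.
data = [(0.0,), (0.25,), (1.0,)]
mu_X = [1.0, 1.0, 0.0]
print([lower(data, [0], mu_X, x) for x in range(3)])
print([upper(data, [0], mu_X, x) for x in range(3)])
```

Object 1 belongs to X but is somewhat similar to object 2, which does not, so its lower-approximation membership drops below 1.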