Comparison of feature selection methods for web classification

Upload: hakan-oezpalamutcu

Post on 05-Apr-2018


  • 7/31/2019 ion of Feature Selection Methods for Web Classification

    1/14

    Comparison of Feature Selection Methods for Web Classification

    Hakan Özpalamutcu
    May 2012


    Feature Selection

    Prepares data for data mining and machine learning.

    Commonly used on high-dimensional data.

    Studies how to select a subset or list of attributes or variables that are used to construct models describing data.

    Purposes include reducing dimensionality and removing irrelevant and redundant features.


    Feature Selection for Classification

    Select, from a set of variables, the smallest subset that maximizes classification performance.

    Given: a set of predictors (features) and a class/category.

    Found: the minimum set that achieves maximum classification performance.
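
    The selection goal above can be made concrete with a brute-force search. This is only a sketch: `smallest_best_subset` and `toy_score` are hypothetical names, and the scoring function stands in for whatever classification performance measure is used. Exhaustive enumeration grows as 2^n in the number of features, which is why heuristic search methods are used in practice.

```python
from itertools import combinations


def smallest_best_subset(features, score):
    """Return the smallest subset of `features` with the highest score.

    Subsets are enumerated by increasing size and only a strictly better
    score replaces the current best, so ties go to the smaller subset.
    """
    best_subset, best_score = (), score(())
    for size in range(1, len(features) + 1):
        for subset in combinations(features, size):
            current = score(subset)
            if current > best_score:
                best_subset, best_score = subset, current
    return best_subset, best_score


# Hypothetical toy score: "a" and "b" help, "c" is irrelevant.
def toy_score(subset):
    return 0.5 + 0.2 * ("a" in subset) + 0.25 * ("b" in subset)


subset, value = smallest_best_subset(["a", "b", "c"], toy_score)
print(subset, round(value, 2))
```

    The irrelevant feature "c" never enters the result because adding it does not raise the score.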


    Why Is Feature Selection Important?

    May improve the performance of the classification algorithm.

    The classification algorithm may not scale up to the size of the full feature set, either in sample or in time.

    Allows us to better understand the domain.


    Comparison Steps

    Choosing the data set

    Preprocessing the data

    Converting the data to Weka format

    Applying feature selection methods to the data

    Applying classification to the data


    Choosing data set

    Category      # of Documents
    Course        927
    Department    140
    Faculty       1124
    Other         3761
    Project       504
    Staff         137
    Student       1640

    Category      # of Positive Documents    # of Negative Documents
    Course        100                        25
    Faculty       100                        25


    Preprocessing of Data

    Removal of HTML tags

    Removal of punctuation characters and numeric values
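
    A minimal sketch of these two preprocessing steps is shown below. The function name `preprocess` is hypothetical, and the regular expression for tags is a simplification that does not handle scripts, comments, or character entities; the slides do not say which tool was actually used.

```python
import re
import string

TAG_RE = re.compile(r"<[^>]+>")
# Map every ASCII punctuation character to a space.
PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def preprocess(html):
    """Strip HTML tags, then punctuation, then numeric values."""
    text = TAG_RE.sub(" ", html)        # tags first, while "<" and ">" still exist
    text = text.translate(PUNCT_TABLE)  # punctuation characters
    text = re.sub(r"\d+", " ", text)    # numeric values
    return " ".join(text.split())       # collapse leftover whitespace


print(preprocess("<h1>CS 101: Intro!</h1><p>Room 42, fall 2012.</p>"))
```

    Tags are removed before punctuation because the angle brackets are themselves punctuation characters.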


    Converting data to WEKA format

    The Text2arff tool is used.

    Stopwords are removed.

    Minimum frequency is 100.

    Frequency is calculated using the tf-idf scheme.

    Category      # of Initial Attributes
    Course        76
    Faculty       93
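
    The conversion settings above can be illustrated with a small tf-idf sketch. It does not reproduce Text2arff: the function `tfidf_matrix` is hypothetical, the minimum frequency is assumed to mean total occurrences across the corpus, and the weighting is assumed to be the plain tf * log(N / df) variant.

```python
import math
from collections import Counter


def tfidf_matrix(docs, stopwords, min_freq):
    """Return (vocabulary, rows) with one tf-idf row per document.

    Assumptions: whitespace tokenization, corpus-wide minimum frequency,
    and weight = tf * log(N / df).
    """
    tokenized = [[w for w in d.split() if w not in stopwords] for d in docs]
    total = Counter(w for doc in tokenized for w in doc)
    vocab = sorted(w for w, count in total.items() if count >= min_freq)
    doc_freq = Counter(w for doc in tokenized for w in set(doc))
    n_docs = len(docs)
    rows = []
    for doc in tokenized:
        tf = Counter(doc)
        rows.append([tf[w] * math.log(n_docs / doc_freq[w]) for w in vocab])
    return vocab, rows


docs = ["the course homework course", "the faculty research", "course syllabus the"]
vocab, rows = tfidf_matrix(docs, stopwords={"the"}, min_freq=2)
print(vocab, rows)
```

    With a threshold of 2 on this toy corpus only "course" survives, which mirrors how a threshold of 100 shrinks the attribute set to the counts in the table.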


    Applying feature selection methods to data

    Attribute evaluators:
        CfsSubsetEval
        ConsistencySubsetEval
        ClassifierSubsetEval

    Search methods:
        GeneticSearch
        BestFirst
        RankSearch

    Attribute evaluator      Search method
    CfsSubsetEval            GeneticSearch
    CfsSubsetEval            BestFirst
    ConsistencySubsetEval    RankSearch
    ConsistencySubsetEval    BestFirst
    ClassifierSubsetEval     GeneticSearch
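
    To show how an attribute evaluator and a search method work together, the sketch below pairs the correlation-based merit that CFS is built on, k * r_cf / sqrt(k + k(k-1) * r_ff), with a plain greedy forward search. It is not Weka's implementation: the greedy loop is a simpler stand-in for BestFirst (no backtracking), and all correlation values are made up for illustration.

```python
import math
from itertools import combinations


def cfs_merit(subset, r_cf, r_ff):
    """CFS merit: high feature-class correlation, low feature-feature correlation."""
    k = len(subset)
    if k == 0:
        return 0.0
    mean_cf = sum(r_cf[f] for f in subset) / k
    pairs = list(combinations(sorted(subset), 2))
    mean_ff = sum(r_ff[p] for p in pairs) / len(pairs) if pairs else 0.0
    return k * mean_cf / math.sqrt(k + k * (k - 1) * mean_ff)


def greedy_forward(features, r_cf, r_ff):
    """Add the feature that most improves the merit until nothing improves it."""
    selected, best = [], 0.0
    remaining = list(features)
    while remaining:
        merit, feature = max((cfs_merit(selected + [f], r_cf, r_ff), f) for f in remaining)
        if merit <= best:
            break
        selected.append(feature)
        remaining.remove(feature)
        best = merit
    return selected, best


# Hypothetical correlations: features 1 and 2 are both predictive but redundant.
r_cf = {1: 0.8, 2: 0.75, 3: 0.5}
r_ff = {(1, 2): 0.9, (1, 3): 0.1, (2, 3): 0.8}
selected, merit = greedy_forward([1, 2, 3], r_cf, r_ff)
print(selected, round(merit, 4))
```

    Feature 2 is individually stronger than feature 3, yet it is left out because it is highly correlated with feature 1, which is the redundancy-removal behaviour described earlier in the slides.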


    Applying feature selection methods to data

    Category   Attribute evaluator      Search method    # of Features Selected   Selected Features
    Course     CfsSubsetEval            GeneticSearch    18                       3,6,9,13,14,19,34,40,42,43,45,48,49,52,59,65,69,70
    Course     CfsSubsetEval            BestFirst        12                       3,6,13,18,19,42,43,45,48,64,67,70
    Course     ConsistencySubsetEval    RankSearch       12                       6,14,18,19,40,42,43,45,48,64,67,70
    Course     ConsistencySubsetEval    BestFirst        7                        3,6,13,19,42,48,70
    Course     ClassifierSubsetEval     GeneticSearch    6                        2,27,31,33,73,75

    Category   Attribute evaluator      Search method    # of Features Selected   Selected Features
    Faculty    CfsSubsetEval            GeneticSearch    20                       2,6,12,16,19,27,42,43,53,56,58,61,65,67,73,74,76,84,90,92
    Faculty    CfsSubsetEval            BestFirst        3                        16,43,74
    Faculty    ConsistencySubsetEval    RankSearch       3                        16,43,74
    Faculty    ConsistencySubsetEval    BestFirst        3                        16,43,74
    Faculty    ClassifierSubsetEval     GeneticSearch    10                       1,3,35,47,49,59,64,67,81,90
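
    One way to read the Course rows is to measure how much the selected subsets overlap. The sketch below copies the attribute indices from the table and compares them with the Jaccard index; the helper name `jaccard` and the choice of this measure are illustrative additions, not part of the original comparison.

```python
# Selected attribute indices for the Course category, copied from the table.
course = {
    ("CfsSubsetEval", "GeneticSearch"): {3, 6, 9, 13, 14, 19, 34, 40, 42, 43, 45, 48, 49, 52, 59, 65, 69, 70},
    ("CfsSubsetEval", "BestFirst"): {3, 6, 13, 18, 19, 42, 43, 45, 48, 64, 67, 70},
    ("ConsistencySubsetEval", "RankSearch"): {6, 14, 18, 19, 40, 42, 43, 45, 48, 64, 67, 70},
    ("ConsistencySubsetEval", "BestFirst"): {3, 6, 13, 19, 42, 48, 70},
    ("ClassifierSubsetEval", "GeneticSearch"): {2, 27, 31, 33, 73, 75},
}


def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    return len(a & b) / len(a | b)


filter_sets = [s for (evaluator, _), s in course.items() if evaluator != "ClassifierSubsetEval"]
shared_by_filters = set.intersection(*filter_sets)
wrapper_set = course[("ClassifierSubsetEval", "GeneticSearch")]

print(sorted(shared_by_filters))
print(round(jaccard(course[("CfsSubsetEval", "BestFirst")],
                    course[("ConsistencySubsetEval", "RankSearch")]), 3))
print(any(wrapper_set & s for s in filter_sets))
```

    On these numbers, the four CfsSubsetEval and ConsistencySubsetEval runs agree on five attributes, while the ClassifierSubsetEval subset shares no attribute with any of them.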


    Applying classification to data

    Classifiers

    Naive Bayes (bayes): Class for a Naive Bayes classifier using estimator classes.

    Bagging (meta): Class for bagging a classifier to reduce variance. Can do classification and regression, depending on the base learner.

    J48 (trees): Class for generating a pruned or unpruned C4.5 decision tree.
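
    The idea behind bagging (train a base learner on several bootstrap samples and let the models vote) can be sketched in a few lines. This is not Weka's Bagging class: the one-dimensional threshold stump used as base learner, the function names, and the toy data are all assumptions chosen to keep the example short.

```python
import random
from collections import Counter


def train_stump(sample):
    """Base learner: choose the threshold t so that `x > t` best matches the labels."""
    best = None
    for threshold, _ in sample:
        errors = sum((x > threshold) != label for x, label in sample)
        if best is None or errors < best[0]:
            best = (errors, threshold)
    return best[1]


def bagging_fit(data, n_models, seed=0):
    """Train one stump per bootstrap sample (sampling with replacement)."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        bootstrap = [rng.choice(data) for _ in data]
        models.append(train_stump(bootstrap))
    return models


def bagging_predict(models, x):
    """Majority vote over the individual stump predictions."""
    votes = Counter(x > threshold for threshold in models)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: (feature value, class label).
data = [(0.1, False), (0.2, False), (0.35, False), (0.6, True), (0.8, True), (0.9, True)]
models = bagging_fit(data, n_models=11)
print(bagging_predict(models, 0.05), bagging_predict(models, 0.95))
```

    Averaging the votes of models trained on different resamples is what reduces the variance of an unstable base learner.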


    Results

    Quality measures:

    CCI (correctly classified instances)

    F-measure
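
    Both measures can be computed from the counts of a binary confusion matrix. The sketch below uses the standard definitions; the counts are invented for illustration, and the slides do not state whether the reported F-measure is per class or averaged over classes.

```python
def correctly_classified(tp, tn):
    """CCI: instances whose predicted class equals the true class."""
    return tp + tn


def f_measure(tp, fp, fn):
    """Harmonic mean of precision and recall for the positive class."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: 8 true positives, 2 false positives,
# 2 false negatives, 3 true negatives.
print(correctly_classified(tp=8, tn=3), round(f_measure(tp=8, fp=2, fn=2), 3))
```

    CCI rewards every correct prediction equally, whereas the F-measure of a class ignores true negatives, so the two can rank classifiers differently on unbalanced data.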


    Results

    CATEGORY: COURSE Classification

    Before applying feature selection

                  Naive Bayes         J48                 Bagging
    Category      CCI    F-Measure    CCI    F-Measure    CCI    F-Measure
    Course        106    0.857        121    0.967        107    0.821

    After applying feature selection

    Feature Selection                         Naive Bayes         J48                 Bagging
    Attribute evaluator      Search method    CCI    F-Measure    CCI    F-Measure    CCI    F-Measure
    CfsSubsetEval            GeneticSearch    106    0.853        120    0.959        104    0.779
    CfsSubsetEval            BestFirst        108    0.867        118    0.938        112    0.884
    ConsistencySubsetEval    RankSearch       100    0.801        118    0.938        112    0.881
    ConsistencySubsetEval    BestFirst        105    0.840        109    0.855        110    0.863
    ClassifierSubsetEval     GeneticSearch    104    0.813        119    0.947        105    0.794
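
    The CCI counts in the Course table are easier to compare as accuracies. The sketch below assumes each classifier was evaluated on the 125 Course documents (100 positive plus 25 negative) listed earlier; the slides do not state the evaluation protocol, so treat that total as an assumption. All CCI and F-measure values are copied from the table.

```python
# Assumption: evaluation on all 125 documents (100 positive + 25 negative).
TOTAL_DOCS = 125

# classifier -> (CCI, F-measure) before feature selection.
course_before = {"Naive Bayes": (106, 0.857), "J48": (121, 0.967), "Bagging": (107, 0.821)}

# (attribute evaluator, search method) -> classifier -> (CCI, F-measure).
course_after = {
    ("CfsSubsetEval", "GeneticSearch"): {"Naive Bayes": (106, 0.853), "J48": (120, 0.959), "Bagging": (104, 0.779)},
    ("CfsSubsetEval", "BestFirst"): {"Naive Bayes": (108, 0.867), "J48": (118, 0.938), "Bagging": (112, 0.884)},
    ("ConsistencySubsetEval", "RankSearch"): {"Naive Bayes": (100, 0.801), "J48": (118, 0.938), "Bagging": (112, 0.881)},
    ("ConsistencySubsetEval", "BestFirst"): {"Naive Bayes": (105, 0.840), "J48": (109, 0.855), "Bagging": (110, 0.863)},
    ("ClassifierSubsetEval", "GeneticSearch"): {"Naive Bayes": (104, 0.813), "J48": (119, 0.947), "Bagging": (105, 0.794)},
}


def accuracy(cci, total=TOTAL_DOCS):
    return cci / total


def best_config(results, classifier):
    """Configuration with the highest CCI; ties are broken by F-measure."""
    return max(results, key=lambda cfg: results[cfg][classifier])


for clf, (cci, _) in course_before.items():
    cfg = best_config(course_after, clf)
    after_cci = course_after[cfg][clf][0]
    print(f"{clf}: before {accuracy(cci):.1%}, best after {accuracy(after_cci):.1%} ({cfg[0]} + {cfg[1]})")
```

    Under that assumption, Naive Bayes and Bagging gain from CfsSubsetEval with BestFirst, while J48 stays slightly below its result on the full attribute set.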


    Results

    CATEGORY: FACULTY Classification

    Before applying feature selection

                  Naive Bayes         J48                 Bagging
    Category      CCI    F-Measure    CCI    F-Measure    CCI    F-Measure
    Faculty       99     0.811        121    0.967        114    0.902

    After applying feature selection

    Feature Selection                         Naive Bayes         J48                 Bagging
    Attribute evaluator      Search method    CCI    F-Measure    CCI    F-Measure    CCI    F-Measure
    CfsSubsetEval            GeneticSearch    101    0.815        119    0.951        112    0.898
    CfsSubsetEval            BestFirst        104    0.808        105    0.802        107    0.821
    ConsistencySubsetEval    RankSearch       104    0.808        105    0.802        107    0.821
    ConsistencySubsetEval    BestFirst        104    0.808        105    0.802        107    0.821
    ClassifierSubsetEval     GeneticSearch    92     0.750        105    0.794        107    0.821
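
    For the Faculty category, a quick way to summarize the table is the change in F-measure relative to the run without feature selection. The sketch below only rearranges the numbers from the table; the function name `f_measure_deltas` is hypothetical.

```python
# F-measure before feature selection, copied from the Faculty table.
faculty_before = {"Naive Bayes": 0.811, "J48": 0.967, "Bagging": 0.902}

# Three configurations selected the same attributes (16, 43, 74) and share one result row.
same_subset = {"Naive Bayes": 0.808, "J48": 0.802, "Bagging": 0.821}
faculty_after = {
    ("CfsSubsetEval", "GeneticSearch"): {"Naive Bayes": 0.815, "J48": 0.951, "Bagging": 0.898},
    ("CfsSubsetEval", "BestFirst"): same_subset,
    ("ConsistencySubsetEval", "RankSearch"): same_subset,
    ("ConsistencySubsetEval", "BestFirst"): same_subset,
    ("ClassifierSubsetEval", "GeneticSearch"): {"Naive Bayes": 0.750, "J48": 0.794, "Bagging": 0.821},
}


def f_measure_deltas(before, after):
    """Map (classifier, evaluator, search) to F-measure change versus the baseline."""
    return {
        (clf, evaluator, search): round(scores[clf] - before[clf], 3)
        for (evaluator, search), scores in after.items()
        for clf in before
    }


deltas = f_measure_deltas(faculty_before, faculty_after)
improved = sorted(key for key, delta in deltas.items() if delta > 0)
print(improved)
print(deltas[("J48", "CfsSubsetEval", "BestFirst")])
```

    On these figures only Naive Bayes with CfsSubsetEval and GeneticSearch improves on its baseline F-measure; the three-attribute subsets cost J48 the most.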