WEKA Lab Manual
Post on 28-Apr-2015
Data Mining Lab
S.K.T.R.M College of Engineering 1
LABORATORY MANUAL on
DATA MINING
Prepared by
INDRANEEL K Associate Professor
CSE Department
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SRI KOTTAM TULASI REDDY MEMORIAL COLLEGE OF ENGINEERING (Affiliated to JNTU, Hyderabad, Approved by AICTE, Accredited by NBA)
KONDAIR, MAHABOOBNAGAR (Dist), AP - 509125
INDEX
The objective of the lab exercises is to use data mining techniques to identify customer segments and understand their buying behavior, and to use standard databases to understand DM processes using WEKA (or any other DM tool).
1. Gain insight into running pre-defined decision trees and explore results using MS OLAP Analytics.
2. Using IBM OLAP Miner, understand the use of data mining for evaluating the content of multidimensional cubes.
3. Using Teradata Warehouse Miner, create mining models that are executed in SQL.
(BI Portal Lab: The objective of the lab exercises is to integrate pre-built reports into a portal application.)
4. Publish Cognos cubes to a business intelligence portal.
Metadata & ETL Lab: The objective of the lab exercises is to implement metadata import agents to pull metadata from leading business intelligence tools and populate a metadata repository, and to understand ETL processes.
5. Import metadata from specific business intelligence tools and populate a metadata repository.
6. Publish metadata stored in the repository.
7. Load data from heterogeneous sources, including text files, into a pre-defined warehouse schema.
CONTENTS
S.No  Experiment                                                                  Week  Pages
1.  Defining weather relation for different attributes                             1    7-18
2.  Defining employee relation for different attributes                            2    19-28
3.  Defining labor relation for different attributes                               3    29-38
4.  Defining student relation for different attributes                             4    39-49
5.  Exploring weather relation using Experimenter and obtaining results
    in various schemes                                                             5    49-59
6.  Exploring employee relation using Experimenter                                 6    60-65
7.  Exploring labor relation using Experimenter                                    7    66-71
8.  Exploring student relation using Experimenter                                  8    72-78
9.  Setting up a flow to load an ARFF file (batch mode) and perform
    a cross-validation using J48                                                   9    86-112
10. Designing a knowledge flow layout to load a file, perform attribute
    selection, normalize the attributes, and store the result with a CSV saver    10    116-117
Aim: Implementation of data mining algorithms using the Attribute-Relation File Format (ARFF).

Introduction to Weka (Data Mining Tool)
• Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset (using the GUI) or called from your own Java code (using the Weka Java library).
• Tools (or functions) in Weka include:
  • Data preprocessing (e.g., data filters),
  • Classification (e.g., BayesNet, KNN, C4.5 decision tree, neural networks, SVM),
  • Regression (e.g., linear regression, isotonic regression, SVM for regression),
  • Clustering (e.g., simple k-means, Expectation Maximization (EM)),
  • Association rules (e.g., the Apriori algorithm, predictive accuracy, confirmation guided),
  • Feature selection (e.g., CFS subset evaluation, information gain, chi-squared statistic), and
  • Visualization (e.g., view different two-dimensional plots of the data).
Launching WEKA
The Weka GUI Chooser (class weka.gui.GUIChooser) provides a starting point for launching Weka's main GUI applications and supporting tools. If one prefers an MDI ("multiple document interface") appearance, this is provided by an alternative launcher called "Main" (class weka.gui.Main). The GUI Chooser consists of four buttons, one for each of the four major Weka applications, and four menus. The buttons can be used to start the following applications:
• Explorer An environment for exploring data with WEKA (the rest of this documentation deals with this application in more detail).
• Experimenter An environment for performing experiments and conducting statistical tests between learning schemes.
• Knowledge Flow This environment supports essentially the same functions as the Explorer but with a drag-and-drop interface. One advantage is that it supports incremental learning.
• Simple CLI Provides a simple command-line interface that allows direct execution of WEKA commands for operating systems that do not provide their own command-line interface.
Working with Explorer

Weka Data File Format (Input)
The most popular data input format of Weka is "arff" (with .arff being the extension of your input data file).
Experiment 1:
WEATHER RELATION (INPUT):

% ARFF file for weather data with some numeric features
@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute humidity numeric
@attribute windy {true, false}
@attribute play {yes, no}
@data
sunny, 85, 85, false, no
sunny, 80, 90, true, no
overcast, 83, 86, false, yes
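The header and data sections above fit together mechanically: each @attribute line declares a column, and each @data row supplies one value per column. As a quick illustration, here is a minimal, hypothetical Python sketch that parses only the subset of ARFF used in this manual; Weka's own loaders in weka.core.converters are the authoritative implementation.

```python
def parse_arff(text):
    """Parse a tiny subset of ARFF: @relation, @attribute, @data."""
    relation, attributes, data = None, [], []
    in_data = False
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith('%'):   # skip blanks and % comments
            continue
        lower = line.lower()
        if lower.startswith('@relation'):
            relation = line.split(None, 1)[1]
        elif lower.startswith('@attribute'):
            _, name, spec = line.split(None, 2)
            attributes.append((name, spec))
        elif lower.startswith('@data'):
            in_data = True
        elif in_data:
            data.append([v.strip() for v in line.split(',')])
    return relation, attributes, data

arff = """% comment
@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@data
sunny, 85, 85
overcast, 83, 86
"""
rel, attrs, rows = parse_arff(arff)
```

After parsing, `rel` is the relation name, `attrs` pairs each attribute name with its type specification, and `rows` holds the raw data values as strings.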
PREPROCESSING: In order to experiment with the application, the data set needs to be
presented to WEKA in a format the program understands. There are rules for the type of data that WEKA will accept and three options for loading data into the program.
Open File - allows the user to select files residing on the local machine or recorded medium.
Open URL - provides a mechanism to locate a file or data source from a different location specified by the user.
Open Database - allows the user to retrieve files or data from a database source provided by the user.
CLASSIFICATION:
The user has the option of applying many different algorithms to the data set in order to produce a representation of information. The best approach is to independently apply a mixture of the available choices and see what yields something close to the desired results. The Classify tab is where the user selects the classifier choices. Figure 5 shows some of the categories.
Output:
Correctly Classified Instances          9               64.2857 %
Incorrectly Classified Instances        5               35.7143 %
Kappa statistic                         0
Mean absolute error                     0.4762
Root mean squared error                 0.4934
Relative absolute error               100      %
Root relative squared error           100      %
Total Number of Instances              14

=== Detailed Accuracy By Class ===
               TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
               1        1        0.643      1       0.783      0.178     yes
               0        0        0          0       0          0.178     no
Weighted Avg.  0.643    0.643    0.413      0.643   0.503      0.178

=== Confusion Matrix ===
 a b   <-- classified as
 9 0 | a = yes
 5 0 | b = no
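These numbers come from Weka's default ZeroR classifier, which always predicts the majority class. With 9 yes and 5 no instances it gets the 9 yes right (64.2857 %) and the 5 no wrong (35.7143 %), and the kappa statistic is 0 because the model uses nothing beyond the class prior. A small sketch of the idea:

```python
from collections import Counter

# ZeroR sketch: always predict the most frequent class in the training data.
# The weather data has 9 "yes" and 5 "no" instances for the play attribute.
labels = ['yes'] * 9 + ['no'] * 5

majority = Counter(labels).most_common(1)[0][0]
predictions = [majority] * len(labels)
accuracy = sum(p == t for p, t in zip(predictions, labels)) / len(labels)
print(majority, round(accuracy * 100, 4))   # yes 64.2857
```

ZeroR is useful precisely as this kind of baseline: any real classifier should beat it.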
CLUSTERING:
The Cluster tab opens the process that is used to identify commonalities or clusters of occurrences within the data set and produce information for the user to analyze. There are a few options within the cluster window that are similar to those described in the Classify tab. These options are: use training set, supplied test set and percentage split. The fourth option is classes to cluster evaluation, which compares how well the data compares with a pre-assigned class within the data. While in cluster mode, users have the option of ignoring some of the attributes from the data set. This can be useful if there are specific attributes causing the results to be out of range, or for large data sets. Figure 6 shows the Cluster window and some of its options.
Output:
=== Run information ===
Scheme: weka.clusterers.EM -I 100 -N -1 -M 1.0E-6 -S 100
Relation: weather
Instances: 14
Attributes: 5
  outlook
  temperature
  humidity
  windy
  play
Test mode: evaluate on training data

=== Model and evaluation on training set ===
Number of clusters selected by cross validation: 1

            Cluster
Attribute         0
               (1)
======================
outlook
  sunny           6
  overcast        5
  rainy           6
  [total]        17
temperature
  mean      73.5714
  std. dev.  6.3326
humidity
  mean      81.6429
  std. dev.  9.9111
windy
  TRUE            7
  FALSE           9
  [total]        16
play
  yes            10
  no              6
  [total]        16

Clustered Instances
0    14 (100%)

Log likelihood: -9.4063
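In this single-cluster model, the per-attribute mean and std. dev. are just the population statistics of each numeric attribute over all 14 instances. Assuming the standard weather dataset's temperature values (the @data listing above shows only the first three rows), they can be reproduced with the statistics module:

```python
import statistics

# Temperatures of the standard 14-instance weather dataset (assumed, since
# the ARFF listing above is truncated to three rows).
temps = [85, 80, 83, 70, 68, 65, 64, 72, 69, 75, 75, 72, 81, 71]

mean = statistics.mean(temps)    # matches the reported 73.5714
std = statistics.pstdev(temps)   # population std dev, matches 6.3326
print(round(mean, 4), round(std, 4))
```

Note that `pstdev` (population standard deviation, dividing by n) reproduces Weka's figure; the sample standard deviation (`stdev`, dividing by n-1) would not.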
Choosing the relation for clustering:
ASSOCIATION:
The Associate tab opens a window to select the options for associations within the data set. The user selects one of the choices and presses Start to yield the results. There are a few options for this window; one of the most popular, Apriori, is shown in the figure below.
=== Run information ===
Scheme: weka.associations.FilteredAssociator -F "weka.filters.MultiFilter -F \"weka.filters.unsupervised.attribute.ReplaceMissingValues \"" -c -1 -W weka.associations.Apriori -- -N 10 -T 0 -C 0.9 -D 0.05 -U 1.0 -M 0.1 -S -1.0 -c -1
Relation: weather
Instances: 14
Attributes: 5
  outlook
  temperature
  humidity
  windy
  play
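In the Apriori options above, -C 0.9 sets the minimum confidence, -M 0.1 the lower bound on minimum support, and -D 0.05 the step by which the support threshold is iteratively lowered. The two measures themselves are simple to state; here is a sketch with made-up transactions, assuming nothing about Weka's internals:

```python
# Support/confidence sketch for association rules. The transactions below
# are made-up attribute=value sets for illustration only.
transactions = [
    {'outlook=sunny', 'windy=false', 'play=no'},
    {'outlook=sunny', 'windy=true', 'play=no'},
    {'outlook=overcast', 'windy=false', 'play=yes'},
]

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """Confidence of the rule antecedent => consequent."""
    return support(antecedent | consequent) / support(antecedent)

print(support({'outlook=sunny'}))                  # 2 of 3 transactions
print(confidence({'outlook=sunny'}, {'play=no'}))  # 1.0
```

Apriori works by growing itemsets whose support stays above the threshold and then keeping only rules whose confidence meets -C.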
SELECTING ATTRIBUTES:
The next tab is used to select the specific attributes used for the calculation process. By default all of the available attributes are used in the evaluation of the data set. If the user wants to exclude certain categories of the data, they deselect those specific choices from the list in the window. This is useful if some of the attributes are of a different form, such as alphanumeric data, that could alter the results. The software searches through the selected attributes to decide which of them will best fit the desired calculation. To perform this, the user has to select two options, an attribute evaluator and a search method. Once this is done the program evaluates the data based on the subset of the attributes, then it performs the necessary search for commonality within the data. Figure 8 shows the options for attribute evaluation.
OUTPUT:
=== Run information ===
Evaluator: weka.attributeSelection.CfsSubsetEval
Search: weka.attributeSelection.BestFirst -D 1 -N 5
Relation: weather
Instances: 14
Attributes: 5
  outlook
  temperature
  humidity
  windy
  play
Evaluation mode: evaluate on all training data

=== Attribute Selection on all input data ===
Search Method:
  Best first.
  Start set: no attributes
  Search direction: forward
  Stale search after 5 node expansions
  Total number of subsets evaluated: 11
  Merit of best subset found: 0.196

Attribute Subset Evaluator (supervised, Class (nominal): 5 play):
  CFS Subset Evaluator
  Including locally predictive attributes

Selected attributes: 1,4 : 2
  outlook
  windy
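CFS merit is Weka-specific, but the information-gain evaluator mentioned earlier gives a feel for why outlook ranks highly: on the standard all-nominal weather data its gain is about 0.247 bits. A sketch of that computation:

```python
import math

def entropy(counts):
    """Entropy in bits of a class distribution given as counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

# Class distribution of play: 9 yes, 5 no.
base = entropy([9, 5])                       # about 0.940 bits

# (yes, no) counts per outlook value: sunny, overcast, rainy.
splits = [(2, 3), (4, 0), (3, 2)]
remainder = sum(sum(s) / 14 * entropy(s) for s in splits)
gain = base - remainder
print(round(gain, 3))                        # 0.247
```

This is the same quantity C4.5-style decision trees use when choosing a split attribute.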
VISUALIZATION: The last tab in the window is the visualization tab. Using the other tabs in the
program, calculations and comparisons have occurred on the data set. Selections of attributes and methods of manipulation have been chosen. The final piece of the puzzle is looking at the information that has been derived throughout the process. The user can now actually see the data displayed in a two dimensional representation of the information. The first screen that the user sees when they select the visualization option is a matrix of plots representing the different attributes within the data set plotted against the other attributes. If a lot of attributes are selected, there is a scroll bar to view all of the produced plots. The user can select a specific plot from the matrix to analyze its contents in a larger, popup window. A grid pattern of the plots allows the user to select the attribute positioning to their liking for better understanding. Once a specific plot has been selected, the user can change the attributes from one view to another.
Experiment 2:
EMPLOYEE RELATION (INPUT):

% ARFF file for employee data with some numeric features
@relation employee
@attribute ename {john, tony, ravi}
@attribute eid numeric
@attribute esal numeric
@attribute edept {sales, admin}
@data
john, 85, 8500, sales
tony, 85, 9500, admin
john, 85, 8500, sales
OUTPUT
PREPROCESSING:
In order to experiment with the application, the data set needs to be presented to WEKA in a format the program understands. There are rules for the type of data that WEKA will accept and three options for loading data into the program.
Open File - allows the user to select files residing on the local machine or recorded medium.
Open URL - provides a mechanism to locate a file or data source from a different location specified by the user.
Open Database - allows the user to retrieve files or data from a database source provided by the user.
CLASSIFICATION:
The user has the option of applying many different algorithms to the data set in order to produce a representation of information. The best approach is to independently apply a mixture of the available choices and see what yields something close to the desired results. The Classify tab is where the user selects the classifier choices.

=== Run information ===
OUTPUT:
Scheme: weka.classifiers.rules.ZeroR
Relation: employee
Instances: 3
Attributes: 4
  ename
  eid
  esal
  edept
Test mode: 10-fold cross-validation

=== Classifier model (full training set) ===
ZeroR predicts class value: sales
Time taken to build model: 0 seconds
CLUSTERING: The Cluster tab opens the process that is used to identify commonalties or
clusters of occurrences within the data set and produce information for the user to analyze. There are a few options within the cluster window that are similar to those described in the Classify tab. These options are: use training set, supplied test set and percentage split. The fourth option is classes to cluster evaluation, which compares how well the data compares with a pre-assigned class within the data. While in cluster mode, users have the option of ignoring some of the attributes from the data set. This can be useful if there are specific attributes causing the results to be out of range, or for large data sets. Figure 6 shows the Cluster window and some of its options.
OUTPUT:
Scheme: weka.clusterers.EM -I 100 -N -1 -M 1.0E-6 -S 100
Relation: employee
Instances: 3
Attributes: 4
  ename
  eid
  esal
  edept
Test mode: evaluate on training data

=== Model and evaluation on training set ===
EM
==
Number of clusters selected by cross validation: 1

            Cluster
Attribute         0
               (1)
======================
ename
  john            3
  tony            2
  ravi            1
  [total]         6
eid
  mean           85
  std. dev.       0
esal
  mean    8833.3333
  std. dev. 471.4045
edept
  sales           3
  admin           2
  [total]         5

Clustered Instances
0    3 (100%)

Log likelihood: 3.84763
ASSOCIATION:
The Associate tab opens a window to select the options for associations within the data set. The user selects one of the choices and presses Start to yield the results. There are a few options for this window; one of the most popular, Apriori, is shown in the figure below.
=== Run information ===
Scheme: weka.associations.FilteredAssociator -F "weka.filters.MultiFilter -F \"weka.filters.unsupervised.attribute.ReplaceMissingValues \"" -c -1 -W weka.associations.Apriori -- -N 10 -T 0 -C 0.9 -D 0.05 -U 1.0 -M 0.1 -S -1.0 -c -1
Relation: employee
Instances: 3
Attributes: 4
  ename
  eid
  esal
  edept
SELECTING ATTRIBUTES:
The next tab is used to select the specific attributes used for the calculation process. By default all of the available attributes are used in the evaluation of the data set. If the user wants to exclude certain categories of the data, they deselect those specific choices from the list in the window. This is useful if some of the attributes are of a different form, such as alphanumeric data, that could alter the results. The software searches through the selected attributes to decide which of them will best fit the desired calculation. To perform this, the user has to select two options, an attribute evaluator and a search method. Once this is done the program evaluates the data based on the subset of the attributes, then it performs the necessary search for commonality within the data. Figure 8 shows the options for attribute evaluation.
OUTPUT:
=== Attribute Selection on all input data ===
Search Method:
  Best first.
  Start set: no attributes
  Search direction: forward
  Stale search after 5 node expansions
  Total number of subsets evaluated: 11
  Merit of best subset found: 0.196

Attribute Subset Evaluator (supervised, Class (nominal): 5 play):
  CFS Subset Evaluator
  Including locally predictive attributes

Selected attributes: 1,4 : 2
  outlook
  windy
VISUALIZATION:
The last tab in the window is the visualization tab. Using the other tabs in the program, calculations and comparisons have occurred on the data set. Selections of attributes and methods of manipulation have been chosen. The final piece of the puzzle is looking at the information that has been derived throughout the process. The user can now actually see the data displayed in a two dimensional representation of the information. The first screen that the user sees when they select the visualization option is a matrix of plots representing the different attributes within the data set plotted against the other attributes. If a lot of attributes are selected, there is a scroll bar to view all of the produced plots. The user can select a specific plot from the matrix to analyze its contents in a larger, popup window. A grid pattern of the plots allows the user to select the attribute positioning to their liking for better understanding. Once a specific plot has been selected, the user can change the attributes from one view to another.
Experiment 3:
STUDENT RELATION (INPUT):

% ARFF file for student data with some numeric features
@relation student
@attribute sname {john, tony, ravi}
@attribute sid numeric
@attribute sbranch {ECE, CSE, IT}
@attribute sage numeric
@data
john, 285, ECE, 19
tony, 385, IT, 20
john, 485, ECE, 19
PREPROCESSING: In order to experiment with the application, the data set needs to be presented to
WEKA in a format the program understands. There are rules for the type of data that WEKA will accept and three options for loading data into the program.
Open File - allows the user to select files residing on the local machine or recorded medium.
Open URL - provides a mechanism to locate a file or data source from a different location specified by the user.
Open Database - allows the user to retrieve files or data from a database source provided by the user.
CLASSIFICATION:
The user has the option of applying many different algorithms to the data set in order to produce a representation of information. The best approach is to independently apply a mixture of the available choices and see what yields something close to the desired results. The Classify tab is where the user selects the classifier choices.
Output:
Scheme: weka.classifiers.rules.ZeroR
Relation: student
Instances: 3
Attributes: 4
  sname
  sid
  sbranch
  sage
Test mode: 2-fold cross-validation

=== Classifier model (full training set) ===
ZeroR predicts class value: 19.333333333333332
Time taken to build model: 0 seconds

=== Cross-validation ===
=== Summary ===
Correlation coefficient          -0.5
Mean absolute error               0.5
Root mean squared error           0.6455
Relative absolute error         100      %
Root relative squared error     100      %
Total Number of Instances         3
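For a numeric class attribute, ZeroR predicts the training-set mean, which is where the 19.3333… above comes from. Assuming sage values of 19, 20 and 19 (consistent with the reported mean):

```python
import statistics

# ZeroR with a numeric class simply predicts the training mean.
# sage values assumed to be 19, 20 and 19 (consistent with the mean above).
sage = [19, 20, 19]
prediction = statistics.mean(sage)
print(prediction)   # 19.333333333333332
```

The error figures in the output then measure how far this constant prediction falls from the actual values in each cross-validation fold.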
CLUSTERING: The Cluster tab opens the process that is used to identify commonalties or clusters of
occurrences within the data set and produce information for the user to analyze. There are a few options within the cluster window that are similar to those described in the Classify tab. These options are: use training set, supplied test set and percentage split. The fourth option is classes to cluster evaluation, which compares how well the data compares with a pre-assigned class within the data. While in cluster mode, users have the option of ignoring some of the attributes from the data set. This can be useful if there are specific attributes causing the results to be out of range, or for large data sets. Figure 6 shows the Cluster window and some of its options.
Scheme: weka.clusterers.EM -I 100 -N -1 -M 1.0E-6 -S 100
Relation: weather
Instances: 14
Attributes: 5
  outlook
  temperature
  humidity
  windy
  play
Test mode: evaluate on training data

=== Model and evaluation on training set ===
EM
==
Number of clusters selected by cross validation: 1

            Cluster
Attribute         0
               (1)
======================
outlook
  sunny           6
  overcast        5
  rainy           6
  [total]        17
temperature
  mean      73.5714
  std. dev.  6.3326
humidity
  mean      81.6429
  std. dev.  9.9111
windy
  TRUE            7
  FALSE           9
  [total]        16
play
  yes            10
  no              6
  [total]        16

Clustered Instances
0    14 (100%)

Log likelihood: -9.4063

ASSOCIATION:
The Associate tab opens a window to select the options for associations within the data set. The user selects one of the choices and presses Start to yield the results. There are a few options for this window; one of the most popular, Apriori, is shown in the figure below.
=== Run information ===
Scheme: weka.associations.FilteredAssociator -F "weka.filters.MultiFilter -F \"weka.filters.unsupervised.attribute.ReplaceMissingValues \"" -c -1 -W weka.associations.Apriori -- -N 10 -T 0 -C 0.9 -D 0.05 -U 1.0 -M 0.1 -S -1.0 -c -1
Relation: student
Instances: 3
Attributes: 4
  sname
  sid
  sbranch
  sage
SELECTING ATTRIBUTES:
The next tab is used to select the specific attributes used for the calculation process. By default all of the available attributes are used in the evaluation of the data set. If the user wants to exclude certain categories of the data, they deselect those specific choices from the list in the window. This is useful if some of the attributes are of a different form, such as alphanumeric data, that could alter the results. The software searches through the selected attributes to decide which of them will best fit the desired calculation. To perform this, the user has to select two options, an attribute evaluator and a search method. Once this is done the program evaluates the data based on the subset of the attributes, then it performs the necessary search for commonality within the data. Figure 8 shows the options for attribute evaluation.
Search Method:
  Best first.
  Start set: no attributes
  Search direction: forward
  Stale search after 5 node expansions
  Total number of subsets evaluated: 7
  Merit of best subset found: 1

Attribute Subset Evaluator (supervised, Class (numeric): 4 sage):
  CFS Subset Evaluator
  Including locally predictive attributes

Selected attributes: 1,3 : 2
  sname
  sbranch
VISUALIZATION: The last tab in the window is the visualization tab. Using the other tabs in the program, calculations and comparisons have occurred on the data set. Selections of attributes and methods of manipulation have been chosen. The final piece of the puzzle is looking at the information that has been derived throughout the process. The user can now actually see the data displayed in a two dimensional representation of the information. The first screen that the user sees when they select the visualization option is a matrix of plots representing the different attributes within the data set plotted against the other attributes. If a lot of attributes are selected, there is a scroll bar to view all of the produced plots. The user can select a specific plot from the matrix to analyze its contents in a larger, popup window. A grid pattern of the plots allows the user to select the attribute positioning to their liking for better understanding. Once a specific plot has been selected, the user can change the attributes from one view to another.
Experiment 4:
LABOR RELATION (INPUT):

% ARFF file for labor data with some numeric features
@relation labor
@attribute name {rom, tony, santu}
@attribute wage-increase-first-year numeric
@attribute wage-increase-second-year numeric
@attribute working-hours numeric
@attribute pension numeric
@attribute vacation numeric
@data
rom, 500, 600, 8, 200, 15
tony, 400, 450, 8, 200, 15
santu, 600, 650, 8, 200, 15
PREPROCESSING:
In order to experiment with the application, the data set needs to be presented to WEKA in a format the program understands. There are rules for the type of data that WEKA will accept and three options for loading data into the program.
Open File - allows the user to select files residing on the local machine or recorded medium.
Open URL - provides a mechanism to locate a file or data source from a different location specified by the user.
Open Database - allows the user to retrieve files or data from a database source provided by the user.
CLASSIFICATION:
The user has the option of applying many different algorithms to the data set in order to produce a representation of information. The best approach is to independently apply a mixture of the available choices and see what yields something close to the desired results. The Classify tab is where the user selects the classifier choices.
Output:
Scheme: weka.classifiers.rules.ZeroR
Relation: labor
Instances: 3
Attributes: 6
  name
  wage-increase-first-year
  wage-increase-second-year
  working-hours
  pension
  vacation
Test mode: 2-fold cross-validation

=== Classifier model (full training set) ===
ZeroR predicts class value: 15.0
Time taken to build model: 0 seconds

=== Cross-validation ===
=== Summary ===
Correlation coefficient           0
Mean absolute error               0
Root mean squared error           0
Relative absolute error         NaN %
Root relative squared error     NaN %
Total Number of Instances         3
CLUSTERING: The Cluster tab opens the process that is used to identify commonalties or clusters of
occurrences within the data set and produce information for the user to analyze. There are a few options within the cluster window that are similar to those described in the Classify tab. These options are: use training set, supplied test set and percentage split. The fourth option is classes to cluster evaluation, which compares how well the data compares with a pre-assigned class within the data. While in cluster mode, users have the option of ignoring some of the attributes from the data set. This can be useful if there are specific attributes causing the results to be out of range, or for large data sets. Figure 6 shows the Cluster window and some of its options.
Scheme: weka.clusterers.EM -I 100 -N -1 -M 1.0E-6 -S 100
Relation: labor
Instances: 3
Attributes: 6
  name
  wage-increase-first-year
  wage-increase-second-year
  working-hours
  pension
  vacation
Test mode: evaluate on training data

=== Model and evaluation on training set ===
EM
==
Number of clusters selected by cross validation: 1

            Cluster
Attribute         0
               (1)
=====================================
name
  rom             2
  tony            2
  santu           2
  [total]         6
wage-increase-first-year
  mean          500
  std. dev. 81.6497
wage-increase-second-year
  mean     566.6667
  std. dev. 84.9837
working-hours
  mean            8
  std. dev.       0
pension
  mean          200
  std. dev.       0
vacation
  mean           15
  std. dev.       0

Clustered Instances
0    3 (100%)

Log likelihood: 25.90833
ASSOCIATION:
The Associate tab opens a window to select the options for associations within the data set. The user selects one of the choices and presses Start to yield the results. There are a few options for this window; one of the most popular, Apriori, is shown in the figure below.
Scheme: weka.associations.FilteredAssociator -F "weka.filters.MultiFilter -F \"weka.filters.unsupervised.attribute.ReplaceMissingValues \"" -c -1 -W weka.associations.Apriori -- -N 10 -T 0 -C 0.9 -D 0.05 -U 1.0 -M 0.1 -S -1.0 -c -1
Relation: labor
Instances: 3
Attributes: 6
  name
  wage-increase-first-year
  wage-increase-second-year
  working-hours
  pension
  vacation
SELECTING ATTRIBUTES:
The next tab is used to select the specific attributes used for the calculation process. By default all of the available attributes are used in the evaluation of the data set. If the user wants to exclude certain categories of the data, they deselect those specific choices from the list in the window. This is useful if some of the attributes are of a different form, such as alphanumeric data, that could alter the results. The software searches through the selected attributes to decide which of them will best fit the desired calculation. To perform this, the user has to select two options, an attribute evaluator and a search method. Once this is done the program evaluates the data based on the subset of the attributes, then it performs the necessary search for commonality within the data. Figure 8 shows the options for attribute evaluation.

=== Attribute Selection on all input data ===
Search Method:
  Best first.
  Start set: no attributes
  Search direction: forward
  Stale search after 5 node expansions
  Total number of subsets evaluated: 19
  Merit of best subset found: 0

Attribute Subset Evaluator (supervised, Class (numeric): 6 vacation):
  CFS Subset Evaluator
  Including locally predictive attributes

Selected attributes: 1 : 1
  name

VISUALIZATION:
The last tab in the window is the visualization tab. Using the other tabs in the program, calculations and comparisons have occurred on the data set. Selections of attributes and methods of manipulation have been chosen. The final piece of the puzzle is looking at the information that has been derived throughout the process. The user can now actually see the data displayed in a two-dimensional representation of the information. The first screen that the user sees when they select the visualization option is a matrix of plots representing the different attributes within the data set plotted against the other attributes. If a lot of attributes are selected, there is a scroll bar to view all of the produced plots. The user can select a specific plot from the matrix to analyze its contents in a larger, popup window. A grid pattern of the plots allows the user to select the attribute positioning to their liking for better understanding. Once a specific plot has been selected, the user can change the attributes from one view to another.
EXPERIMENTER:
The Weka Experiment Environment enables the user to create, run, modify, and analyse experiments in a more convenient manner than is possible when processing the schemes individually. For example, the user can create an experiment that runs several schemes against a series of datasets and then analyse the results to determine if one of the schemes is (statistically) better than the other schemes. The Experiment Environment can also be run from the command line using the Simple CLI.
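The core of the "statistically better" judgement is a paired t-test over matched runs (the Experimenter's Analyse panel actually uses a corrected resampled variant of it). A plain paired t-test sketch, with made-up per-fold accuracies:

```python
import math
import statistics

# Made-up per-fold accuracies for two learning schemes on the same folds.
scheme_a = [0.80, 0.82, 0.79, 0.85, 0.81]
scheme_b = [0.74, 0.78, 0.75, 0.79, 0.76]

# Paired t-test on the per-fold differences.
diffs = [a - b for a, b in zip(scheme_a, scheme_b)]
mean_d = statistics.mean(diffs)
sd_d = statistics.stdev(diffs)                # sample std dev of differences
t = mean_d / (sd_d / math.sqrt(len(diffs)))   # compare against t, df = n - 1
print(round(t, 2))
```

If the resulting t exceeds the critical value for n-1 degrees of freedom, the difference between the schemes is declared significant; the corrected variant additionally accounts for the overlap between resampled training sets.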
Experiment:5
COMMAND LINE: java weka.experiment.Experiment -r -T data/weather.arff
Defining an Experiment
When the Experimenter is started, the Setup window (actually a pane) is displayed. Click New to initialize an experiment. This causes default parameters to be defined for the experiment.
To define the dataset to be processed by a scheme, first select "Use relative paths" in the Datasets panel of the Setup window and then click "Add New" to open the dialog box below.
Select iris.arff and click Open to select the iris dataset.
The dataset name is now displayed in the Datasets panel of the Setup window. Saving the Results of the Experiment
To identify a dataset to which the results are to be sent, click on the "CSVResultListener" entry in the Destination panel. Note that this window (and other similar windows in Weka) is not initially expanded and some of the information in the window is not visible. Drag the bottom right-hand corner of the window to resize the window until the scroll bars disappear.
The output file parameter is near the bottom of the window, beside the text "outputFile". Click on this parameter to display a file selection window.
Type the name of the output file, click Select, and then click close (x). The file name is displayed in the outputFile panel. Click on OK to close the window.
The dataset name is displayed in the Destination panel of the Setup window.
Saving the Experiment Definition
The experiment definition can be saved at any time. Select "Save..." at the top of the Setup window. Type the dataset name with the extension "exp" (or select the dataset name if the experiment definition dataset already exists).
The experiment can be restored by selecting Open in the Setup window and then selecting Experiment1.exp in the dialog window. Running an Experiment
To run the current experiment, click the Run tab at the top of the Experiment Environment window. The current experiment performs 10 randomized train and test runs on the Iris dataset, using 66% of the patterns for training and 34% for testing, and using the ZeroR scheme.
Click Start to run the experiment.
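What the Experimenter computes here (10 randomized 66%/34% train/test runs scored with ZeroR) can be mimicked in a few lines. This is a sketch, not Weka code: ZeroR simply predicts the majority class of the training split, and the iris-style label list below is an assumption for the demo.

```python
import random
from collections import Counter

def zero_r_accuracy(labels, train_frac=0.66, runs=10, seed=1):
    """Mimic the Experimenter's repeated randomized train/test runs with
    ZeroR: the classifier ignores all attributes and always predicts the
    majority class of the training split."""
    rng = random.Random(seed)
    accuracies = []
    for _ in range(runs):
        data = labels[:]
        rng.shuffle(data)
        cut = int(len(data) * train_frac)
        train, test = data[:cut], data[cut:]
        majority = Counter(train).most_common(1)[0][0]
        accuracies.append(sum(1 for y in test if y == majority) / len(test))
    return sum(accuracies) / len(accuracies)

# 150 iris-style labels, 50 per class (an assumption mirroring the iris data)
labels = ["setosa"] * 50 + ["versicolor"] * 50 + ["virginica"] * 50
print(round(zero_r_accuracy(labels), 3))
```

With three balanced classes the mean accuracy hovers around one third, which is why ZeroR is used as a baseline rather than a serious classifier.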
If the experiment was defined correctly, the 3 messages shown above will be displayed in the Log panel. The results of the experiment are saved to the file Experiment1.txt:

Dataset,Run,Scheme,Scheme_options,Scheme_version_ID,Date_time,Number_of_instances,Number_correct,Number_incorrect,Number_unclassified,Percent_correct,Percent_incorrect,Percent_unclassified,Mean_absolute_error,Root_mean_squared_error,Relative_absolute_error,Root_relative_squared_error,SF_prior_entropy,SF_scheme_entropy,SF_entropy_gain,SF_mean_prior_entropy,SF_mean_scheme_entropy,SF_mean_entropy_gain,KB_information,KB_mean_information,KB_relative_information,True_positive_rate,Num_true_positives,False_positive_rate,Num_false_positives,True_negative_rate,Num_true_negatives,False_negative_rate,Num_false_negatives,IR_precision,IR_recall,F_measure,Summary
iris,1,weka.classifiers.ZeroR,'',6077547173920530258,2.00102021558E7,51.0,15.0,36.0,0.0,29.41176470588235,70.58823529411765,0.0,0.4462386261694216,0.47377732045597576,100.0,100.0,81.5923629400546,81.5923629400546,0.0,1.5998502537265609,1.5998502537265609,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,31.0,1.0,20.0,0.0,0.0,0.0,?
iris,2,weka.classifiers.ZeroR,'',6077547173920530258,2.00102021558E7,51.0,11.0,40.0,0.0,21.568627450980394,78.43137254901961,0.0,0.4513648596693575,0.48049218646442554,100.0,100.0,83.58463098131035,83.58463098131035,0.0,1.6389143329668696,1.6389143329668696,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,31.0,1.0,20.0,0.0,0.0,0.0,?
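Result files like Experiment1.txt are ordinary CSV, so they can be post-processed outside Weka. Below is a sketch using only standard-library CSV parsing; the embedded data is abbreviated from the listing above to the columns actually used.

```python
import csv
import io

# Two rows from an Experimenter result file, abbreviated to the columns
# used here (the real file has ~38 columns, as listed in the text).
RESULTS = """\
Dataset,Run,Scheme,Percent_correct
iris,1,weka.classifiers.ZeroR,29.41176470588235
iris,2,weka.classifiers.ZeroR,21.568627450980394
"""

def mean_percent_correct(csv_text):
    """Average the Percent_correct column over all runs in a result file."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    values = [float(r["Percent_correct"]) for r in rows]
    return sum(values) / len(values)

print(round(mean_percent_correct(RESULTS), 2))  # → 25.49
```

The same approach works on the full file: pass `open("Experiment1.txt").read()` instead of the embedded string.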
Experiment:6
Aim: To set up standard experiments that run locally on a single machine, or remote experiments distributed between several hosts, for the employee relation.
Type this command in the Simple CLI:
java weka.experiment.Experiment -r -T data/emp.arff
Add a new relation using the "Add new" button on the right panel, give the database connection using JDBC, and click OK.
Choose the relation and click the OK button.
Choose ZeroR from the "Choose" button menu after clicking the "Add new" button on the right panel, and click OK.
Click on the run tab to get the output
The results of the experiment are saved to the file Experiment2.txt:

Dataset,Run,Scheme,Scheme_options,Scheme_version_ID,Date_time,Number_of_instances,Number_correct,Number_incorrect,Number_unclassified,Percent_correct,Percent_incorrect,Percent_unclassified,Mean_absolute_error,Root_mean_squared_error,Relative_absolute_error,Root_relative_squared_error,SF_prior_entropy,SF_scheme_entropy,SF_entropy_gain,SF_mean_prior_entropy,SF_mean_scheme_entropy,SF_mean_entropy_gain,KB_information,KB_mean_information,KB_relative_information,True_positive_rate,Num_true_positives,False_positive_rate,Num_false_positives,True_negative_rate,Num_true_negatives,False_negative_rate,Num_false_negatives,IR_precision,IR_recall,F_measure,Summary
iris,1,weka.classifiers.ZeroR,'',6077547173920530258,2.00102021558E7,51.0,15.0,36.0,0.0,29.41176470588235,70.58823529411765,0.0,0.4462386261694216,0.47377732045597576,100.0,100.0,81.5923629400546,81.5923629400546,0.0,1.5998502537265609,1.5998502537265609,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,31.0,1.0,20.0,0.0,0.0,0.0,?
iris,2,weka.classifiers.ZeroR,'',6077547173920530258,2.00102021558E7,51.0,11.0,40.0,0.0,21.568627450980394,78.43137254901961,0.0,0.4513648596693575,0.48049218646442554,100.0,100.0,83.58463098131035,83.58463098131035,0.0,1.6389143329668696,1.6389143329668696,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,31.0,1.0,20.0,0.0,0.0,0.0,?
Experiment:7
Aim: To set up standard experiments that run locally on a single machine, or remote experiments distributed between several hosts, for the labor relation.
Type this command in the Simple CLI:
java weka.experiment.Experiment -r -T data/labor.arff
Add a new relation using the "Add new" button on the right panel, give the database connection using JDBC, and click OK.
Choose the relation and click the OK button.
Choose ZeroR from the "Choose" button menu after clicking the "Add new" button on the right panel, and click OK.
Click on the run tab to get the output
The results of the experiment are saved to the file Experiment3.txt:

Dataset,Run,Scheme,Scheme_options,Scheme_version_ID,Date_time,Number_of_instances,Number_correct,Number_incorrect,Number_unclassified,Percent_correct,Percent_incorrect,Percent_unclassified,Mean_absolute_error,Root_mean_squared_error,Relative_absolute_error,Root_relative_squared_error,SF_prior_entropy,SF_scheme_entropy,SF_entropy_gain,SF_mean_prior_entropy,SF_mean_scheme_entropy,SF_mean_entropy_gain,KB_information,KB_mean_information,KB_relative_information,True_positive_rate,Num_true_positives,False_positive_rate,Num_false_positives,True_negative_rate,Num_true_negatives,False_negative_rate,Num_false_negatives,IR_precision,IR_recall,F_measure,Summary
iris,1,weka.classifiers.ZeroR,'',6077547173920530258,2.00102021558E7,51.0,15.0,36.0,0.0,29.41176470588235,70.58823529411765,0.0,0.4462386261694216,0.47377732045597576,100.0,100.0,81.5923629400546,81.5923629400546,0.0,1.5998502537265609,1.5998502537265609,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,31.0,1.0,20.0,0.0,0.0,0.0,?
iris,2,weka.classifiers.ZeroR,'',6077547173920530258,2.00102021558E7,51.0,11.0,40.0,0.0,21.568627450980394,78.43137254901961,0.0,0.4513648596693575,0.48049218646442554,100.0,100.0,83.58463098131035,83.58463098131035,0.0,1.6389143329668696,1.6389143329668696,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,31.0,1.0,20.0,0.0,0.0,0.0,?
Experiment:8
Aim: To set up standard experiments that run locally on a single machine, or remote experiments distributed between several hosts, for the student relation.
Type this command in the Simple CLI:
java weka.experiment.Experiment -r -T data/student.arff
Add a new relation using the "Add new" button on the right panel, give the database connection using JDBC, and click OK.
Choose the relation and click the OK button.
Choose ZeroR from the "Choose" button menu after clicking the "Add new" button on the right panel, and click OK.
Click on the run tab to get the output
The results of the experiment are saved to the file Experiment4.txt:

Dataset,Run,Scheme,Scheme_options,Scheme_version_ID,Date_time,Number_of_instances,Number_correct,Number_incorrect,Number_unclassified,Percent_correct,Percent_incorrect,Percent_unclassified,Mean_absolute_error,Root_mean_squared_error,Relative_absolute_error,Root_relative_squared_error,SF_prior_entropy,SF_scheme_entropy,SF_entropy_gain,SF_mean_prior_entropy,SF_mean_scheme_entropy,SF_mean_entropy_gain,KB_information,KB_mean_information,KB_relative_information,True_positive_rate,Num_true_positives,False_positive_rate,Num_false_positives,True_negative_rate,Num_true_negatives,False_negative_rate,Num_false_negatives,IR_precision,IR_recall,F_measure,Summary
iris,1,weka.classifiers.ZeroR,'',6077547173920530258,2.00102021558E7,51.0,15.0,36.0,0.0,29.41176470588235,70.58823529411765,0.0,0.4462386261694216,0.47377732045597576,100.0,100.0,81.5923629400546,81.5923629400546,0.0,1.5998502537265609,1.5998502537265609,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,31.0,1.0,20.0,0.0,0.0,0.0,?
iris,2,weka.classifiers.ZeroR,'',6077547173920530258,2.00102021558E7,51.0,11.0,40.0,0.0,21.568627450980394,78.43137254901961,0.0,0.4513648596693575,0.48049218646442554,100.0,100.0,83.58463098131035,83.58463098131035,0.0,1.6389143329668696,1.6389143329668696,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,31.0,1.0,20.0,0.0,0.0,0.0,?
KNOWLEDGE FLOW: The Knowledge Flow provides an alternative to the Explorer as a graphical front end to Weka's core algorithms. The Knowledge Flow is a work in progress so some of the functionality from the Explorer is not yet available. On the other hand, there are things that can be done in the Knowledge Flow but not in the Explorer.
The Knowledge Flow presents a "data-flow" inspired interface to Weka. The user can select Weka components from a tool bar, place them on a layout canvas and connect them together in order to form a "knowledge flow" for processing and analyzing data. At present, all of Weka's classifiers, filters, clusterers, loaders and savers are available in the KnowledgeFlow along with some extra tools.

Features of the KnowledgeFlow:
* intuitive data flow style layout
* process data in batches or incrementally
* process multiple batches or streams in parallel (each separate flow executes in its own thread)
* chain filters together
* view models produced by classifiers for each fold in a cross validation
* visualize performance of incremental classifiers during processing (scrolling plots of classification accuracy, RMS error, predictions etc.)
Launching the KnowledgeFlow The Weka GUI Chooser window is used to launch Weka's graphical environments. Select the button labeled "KnowledgeFlow" to start the KnowledgeFlow. Alternatively, you can launch the KnowledgeFlow from a terminal window by typing "java weka.gui.beans.KnowledgeFlow". At the top of the KnowledgeFlow window are seven tabs: DataSources, DataSinks, Filters, Classifiers, Clusterers, Evaluation and Visualization. The names are pretty much self explanatory.
Components Components available in the KnowledgeFlow:
DataSources All of WEKA's loaders are available.
DataSinks All of WEKA's savers are available.
Filters All of WEKA's filters are available.
Classifiers All of WEKA's classifiers are available.
Clusterers All of WEKA's clusterers are available.
Evaluation
• TrainingSetMaker - make a data set into a training set.
• TestSetMaker - make a data set into a test set.
• CrossValidationFoldMaker - split any data set, training set or test set into folds.
• TrainTestSplitMaker - split any data set, training set or test set into a training set and a test set.
• ClassAssigner - assign a column to be the class for any data set, training set or test set.
• ClassValuePicker - choose a class value to be considered as the "positive" class. This is useful when generating data for ROC style curves (see ModelPerformanceChart below and example 6.4.2).
• ClassifierPerformanceEvaluator - evaluate the performance of batch trained/tested classifiers.
• IncrementalClassifierEvaluator - evaluate the performance of incrementally trained classifiers.
• ClustererPerformanceEvaluator - evaluate the performance of batch trained/tested clusterers.
• PredictionAppender - append classifier predictions to a test set. For discrete class problems, can either append predicted class labels or probability distributions.
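As a concrete illustration, the job a CrossValidationFoldMaker performs (partitioning a dataset into k train/test fold pairs) can be sketched in plain Python. This is not Weka's implementation; the round-robin fold assignment below is one simple choice.

```python
def cross_validation_folds(instances, k=10):
    """Sketch of what a CrossValidationFoldMaker does: partition a dataset
    into k folds and yield one (training set, test set) pair per fold."""
    folds = [instances[i::k] for i in range(k)]   # round-robin fold assignment
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test

pairs = list(cross_validation_folds(list(range(20)), k=5))
print(len(pairs), len(pairs[0][0]), len(pairs[0][1]))  # → 5 16 4
```

Each instance appears in exactly one test fold, so a downstream evaluator sees every instance tested exactly once across the k folds.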
Visualization
• DataVisualizer - component that can pop up a panel for visualizing data in a single large 2D scatter plot.
• ScatterPlotMatrix - component that can pop up a panel containing a matrix of small scatter plots (clicking on a small plot pops up a large scatter plot).
• AttributeSummarizer - component that can pop up a panel containing a matrix of histogram plots - one for each of the attributes in the input data.
• ModelPerformanceChart - component that can pop up a panel for visualizing threshold (i.e. ROC style) curves.
• TextViewer - component for showing textual data. Can show data sets, classification performance statistics etc.
• GraphViewer - component that can pop up a panel for visualizing tree based models.
• StripChart - component that can pop up a panel that displays a scrolling plot of data (used for viewing the online performance of incremental classifiers).
Experiment:9
Aim: Setting up a flow to load an ARFF file (batch mode) and perform a cross validation using J48 (Weka's C4.5 implementation). First start the KnowledgeFlow. Next click on the DataSources tab and choose "ArffLoader" from the toolbar (the mouse pointer will change to "cross hairs").
Next place the ArffLoader component on the layout area by clicking somewhere on the layout (A copy of the ArffLoader icon will appear on the layout area). Next specify an arff file to load by first right clicking the mouse over the ArffLoader icon on the layout. A pop-up menu will appear. Select "Configure" under "Edit" in the list from this menu and browse to the location of your arff file.
Alternatively, you can
double-click on the icon to bring up the configuration dialog.
Next click the "Evaluation" tab at the top of the window and choose the "ClassAssigner" (allows you to choose which column to be the class) component from the toolbar. Place this on the layout.
Now connect the ArffLoader to the ClassAssigner: first right click
over the ArffLoader and select the "dataSet" under "Connections" in the menu. A "rubber band" line will appear.
Move the mouse over the ClassAssigner component and left click - a red line labeled "dataSet" will connect the two components.
Next right click over the ClassAssigner and choose "Configure" from the menu. This will pop up a window from which you can specify which column is the class in your data (last is the default).
Next grab a "CrossValidationFoldMaker" component from the Evaluation toolbar and place it on the layout.
Connect the ClassAssigner to the CrossValidationFoldMaker by right clicking over "ClassAssigner" and selecting "dataSet" from under "Connections" in the menu.
Next click on the "Classifiers" tab at the top of the window and scroll along the toolbar until you reach the "J48" component in the "trees" section.
Place a J48 component on the layout.
Connect the CrossValidationFoldMaker to J48 TWICE by first choosing "trainingSet" and then "testSet" from the pop-up menu for the CrossValidationFoldMaker.
Next go back to the "Evaluation" tab and place a "ClassifierPerformanceEvaluator" component on the layout.
Connect J48 to this component by selecting the "batchClassifier" entry from the pop-up menu for J48.
Next go to the "Visualization" toolbar and place a "TextViewer" component on the layout.
Connect the ClassifierPerformanceEvaluator to the TextViewer by selecting the "text" entry from the pop-up menu for ClassifierPerformanceEvaluator.
Now start the flow executing by selecting "Start loading" from the pop-up menu for ArffLoader.
When finished, you can view the results by choosing "Show results" from the pop-up menu for the TextViewer component.
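Conceptually, the flow just built (ArffLoader → ClassAssigner → CrossValidationFoldMaker → classifier → ClassifierPerformanceEvaluator → TextViewer) is a chain of functions. The sketch below wires the same stages together in plain Python; the "classifier" is a majority-class stand-in, not J48, and the toy weather-style rows are an assumption for the demo.

```python
from collections import Counter

def load(rows):                            # ArffLoader stand-in
    return rows

def assign_class(rows):                    # ClassAssigner: last column is the class (the default)
    return [(row[:-1], row[-1]) for row in rows]

def cv_folds(data, k):                     # CrossValidationFoldMaker
    folds = [data[i::k] for i in range(k)]  # round-robin fold assignment
    return [([x for j, f in enumerate(folds) if j != i for x in f], folds[i])
            for i in range(k)]

def train_majority(train):                 # classifier stand-in (NOT J48)
    return Counter(label for _, label in train).most_common(1)[0][0]

def evaluate(data, k=3):                   # ClassifierPerformanceEvaluator
    correct = total = 0
    for train, test in cv_folds(data, k):
        model = train_majority(train)
        correct += sum(1 for _, label in test if label == model)
        total += len(test)
    return correct / total

rows = [["sunny", "yes"], ["rainy", "no"], ["sunny", "yes"],
        ["overcast", "yes"], ["rainy", "yes"], ["sunny", "yes"]]
print(round(evaluate(assign_class(load(rows))), 2))  # TextViewer stand-in
```

In the KnowledgeFlow the "dataSet", "trainingSet", "testSet" and "batchClassifier" connections play the role of the function arguments passed between these stages.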
Simple CLI
The Simple CLI provides full access to all Weka classes, i.e., classifiers, filters, clusterers, etc., but without the hassle of the CLASSPATH (it uses the one with which Weka was started). It offers a simple Weka shell with separated command line and output.
Commands
The following commands are available in the Simple CLI:
• java <classname> [<args>] - invokes a java class with the given arguments (if any)
• break - stops the current thread, e.g., a running classifier, in a friendly manner
• kill - stops the current thread in an unfriendly fashion
• cls - clears the output area
• exit - exits the Simple CLI
• help [<command>] - provides an overview of the available commands if without a command name as argument, otherwise more help on the specified command
Command redirection
Starting with this version of Weka one can perform a basic redirection:
java weka.classifiers.trees.J48 -t test.arff > j48.txt
Note: the > must be preceded and followed by a space, otherwise it is not recognized as redirection, but part of another parameter.
Command completion Commands starting with java support completion for classnames and filenames
via Tab (Alt+BackSpace deletes parts of the command again). In case there are several matches, Weka lists all possible matches.
• package name completion
java weka.cl<Tab>
results in the following output of possible matches of package names:
Possible matches:
weka.classifiers
weka.clusterers
• classname completion
java weka.classifiers.meta.A<Tab>
lists the following classes:
Possible matches:
weka.classifiers.meta.AdaBoostM1
weka.classifiers.meta.AdditiveRegression
weka.classifiers.meta.AttributeSelectedClassifier
• filename completion
In order for Weka to determine whether the string under the cursor is a classname or a filename, filenames need to be absolute (Unix/Linux: /some/path/file; Windows: C:\Some\Path\file) or relative and starting with a dot (Unix/Linux: ./some/other/path/file; Windows: .\Some\Other\Path\file).
EXPERIMENT-10

AIM:
To design a knowledge flow layout that loads a dataset, applies attribute selection, normalizes the attributes, and stores the result with a CSV saver.

Procedure:
1) Click on "KnowledgeFlow" from the Weka GUI Chooser.
2) It opens a window called "Weka Knowledge Flow Environment".
3) Click on "DataSources" and select "ArffLoader" to read data from an ARFF source.
4) Now click on the knowledge flow layout area, which places the ArffLoader in the layout.
5) Click on "Filters" and select an attribute selector from the "supervised" filters. Place it on the design layout.
6) Now select another filter, from the "unsupervised" filters, to normalize the numeric attribute values. Place it on the design layout.
7) Click on "DataSinks" and choose "CSVSaver", which writes to a destination in CSV format. Place it on the design layout of the knowledge flow.
8) Now right click on "ArffLoader" and click on "dataSet" to direct the flow to the attribute selection filter.
9) Now right click on the attribute selection filter and select "dataSet" to direct the flow to "Normalize", from which the flow is directed to the CSVSaver in the same way.
10) Right click on the CSVSaver and click on "Configure" to specify the destination where to store the results; here z:\weka @ ravi is selected.
11) Now right click on "ArffLoader" and select "Configure" to specify the source data; here the iris relation is selected.
12) Now again right click on the "ArffLoader" and click on "Start loading", which results in the knowledge flow layout below.
13) We can observe the results of the above process by opening the file z:\Weka@ravi\iris-weka.filters.supervised.attribute… (a Microsoft Office Excel comma separated values file) in Notepad, which displays the results in comma-separated-value form:

Petal length  Petal width  Class
0.067797      0.041667     Iris-setosa
0.067797      0.041667     Iris-setosa
0.050847      0.041667     Iris-setosa
0.627119      0.541667     Iris-versicolor
0.830508      0.833333     Iris-virginica
0.677966      0.791667     Iris-virginica
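The values in the table above are consistent with the standard min-max transform that the unsupervised Normalize filter applies by default, (x - min) / (max - min). A plain-Python sketch follows; the sample column is hypothetical, so it will not reproduce the table's exact figures.

```python
def min_max_normalize(values):
    """Rescale a numeric attribute to [0, 1] via (x - min) / (max - min),
    the default behaviour of Weka's unsupervised Normalize filter."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical petal-length readings (cm); the figures in the table above came
# from the full iris data, so this small sample will not match them exactly.
petal_length = [1.4, 1.4, 1.3, 4.7, 6.0, 5.1]
print([round(v, 3) for v in min_max_normalize(petal_length)])
```

After normalization the smallest value maps to 0 and the largest to 1, which puts attributes with very different ranges on an equal footing.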
Description of the German credit dataset in ARFF (Attribute-Relation File Format):
Structure of ARFF Format:
% comment lines
@relation <relation name>
@attribute <attribute name> <datatype>
@data
Set of data items separated by commas.

% 1. Title: German Credit data
%
% 2. Source Information
% Professor Dr. Hans Hofmann
% Institut f"ur Statistik und "Okonometrie
% Universit"at Hamburg
% FB Wirtschaftswissenschaften
% Von-Melle-Park 5
% 2000 Hamburg 13
%
% 3. Number of Instances: 1000
%
% Two datasets are provided. The original dataset, in the form provided
% by Prof. Hofmann, contains categorical/symbolic attributes and
% is in the file "german.data".
%
% For algorithms that need numerical attributes, Strathclyde University
% produced the file "german.data-numeric". This file has been edited
% and several indicator variables added to make it suitable for
% algorithms which cannot cope with categorical variables. Several
% attributes that are ordered categorical (such as attribute 17) have
% been coded as integer. This was the form used by StatLog.
%
% 6. Number of Attributes german: 20 (7 numerical, 13 categorical)
% Number of Attributes german.numer: 24 (24 numerical)
%
% 7. Attribute description for german
%
% Attribute 1: (qualitative)
% Status of existing checking account
% A11 : ... < 0 DM
% A12 : 0 <= ... < 200 DM
% A13 : ... >= 200 DM / salary assignments for at least 1 year
% A14 : no checking account
% Attribute 2: (numerical)
% Duration in month
%
% Attribute 3: (qualitative)
% Credit history
% A30 : no credits taken/ all credits paid back duly
% A31 : all credits at this bank paid back duly
% A32 : existing credits paid back duly till now
% A33 : delay in paying off in the past
% A34 : critical account/ other credits existing (not at this bank)
%
% Attribute 4: (qualitative)
% Purpose
% A40 : car (new)
% A41 : car (used)
% A42 : furniture/equipment
% A43 : radio/television
% A44 : domestic appliances
% A45 : repairs
% A46 : education
% A47 : (vacation - does not exist?)
% A48 : retraining
% A49 : business
% A410 : others
%
% Attribute 5: (numerical)
% Credit amount
%
% Attribute 6: (qualitative)
% Savings account/bonds
% A61 : ... < 100 DM
% A62 : 100 <= ... < 500 DM
% A63 : 500 <= ... < 1000 DM
% A64 : .. >= 1000 DM
% A65 : unknown/ no savings account
%
% Attribute 7: (qualitative)
% Present employment since
% A71 : unemployed
% A72 : ... < 1 year
% A73 : 1 <= ... < 4 years
% A74 : 4 <= ... < 7 years
% A75 : .. >= 7 years
%
% Attribute 8: (numerical)
% Installment rate in percentage of disposable income
% Attribute 9: (qualitative)
% Personal status and sex
% A91 : male : divorced/separated
% A92 : female : divorced/separated/married
% A93 : male : single
% A94 : male : married/widowed
% A95 : female : single
%
% Attribute 10: (qualitative)
% Other debtors / guarantors
% A101 : none
% A102 : co-applicant
% A103 : guarantor
%
% Attribute 11: (numerical)
% Present residence since
%
% Attribute 12: (qualitative)
% Property
% A121 : real estate
% A122 : if not A121 : building society savings agreement/ life insurance
% A123 : if not A121/A122 : car or other, not in attribute 6
% A124 : unknown / no property
%
% Attribute 13: (numerical)
% Age in years
%
% Attribute 14: (qualitative)
% Other installment plans
% A141 : bank
% A142 : stores
% A143 : none
%
% Attribute 15: (qualitative)
% Housing
% A151 : rent
% A152 : own
% A153 : for free
%
% Attribute 16: (numerical)
% Number of existing credits at this bank
%
% Attribute 17: (qualitative)
% Job
% A171 : unemployed/ unskilled - non-resident
% A172 : unskilled - resident
% A173 : skilled employee / official
% A174 : management/ self-employed/
% highly qualified employee/ officer
%
% Attribute 18: (numerical)
% Number of people being liable to provide maintenance for
%
% Attribute 19: (qualitative)
% Telephone
% A191 : none
% A192 : yes, registered under the customer's name
%
% Attribute 20: (qualitative)
% foreign worker
% A201 : yes
% A202 : no
%
% 8. Cost Matrix
%
% This dataset requires use of a cost matrix (see below)
%
%        1    2
% ------------------
%  1     0    1
% ------------------
%  2     5    0
%
% (1 = Good, 2 = Bad)
%
% The rows represent the actual classification and the columns
% the predicted classification.
%
% It is worse to class a customer as good when they are bad (5),
% than it is to class a customer as bad when they are good (1).
%
% Relabeled values in attribute checking_status
% From: A11 To: '<0'
% From: A12 To: '0<=X<200'
% From: A13 To: '>=200'
% From: A14 To: 'no checking'
%
% Relabeled values in attribute credit_history
% From: A30 To: 'no credits/all paid'
% From: A31 To: 'all paid'
% From: A32 To: 'existing paid'
% From: A33 To: 'delayed previously'
% From: A34 To: 'critical/other existing credit'
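Applying the cost matrix above (rows = actual, columns = predicted; 1 = good, 2 = bad) to a classifier's predictions is a simple sum. A short sketch, with a hypothetical set of predictions for illustration:

```python
# Cost matrix from the dataset description: misclassifying a bad customer as
# good costs 5; rejecting a good customer costs 1; correct decisions cost 0.
COST = {("good", "good"): 0, ("good", "bad"): 1,
        ("bad",  "good"): 5, ("bad",  "bad"): 0}

def total_cost(actual, predicted):
    """Sum the cost-matrix entries over paired actual/predicted labels."""
    return sum(COST[(a, p)] for a, p in zip(actual, predicted))

# Hypothetical predictions: one good customer rejected, two bad accepted.
actual    = ["good", "good", "bad", "bad", "bad"]
predicted = ["bad",  "good", "good", "good", "bad"]
print(total_cost(actual, predicted))  # 1 + 5 + 5 = 11
```

Minimizing this total cost, rather than the plain error count, is what makes a cost-sensitive evaluation prefer cautious classifiers on this dataset.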
% Relabeled values in attribute purpose
% From: A40 To: 'new car'
% From: A41 To: 'used car'
% From: A42 To: furniture/equipment
% From: A43 To: radio/tv
% From: A44 To: 'domestic appliance'
% From: A45 To: repairs
% From: A46 To: education
% From: A47 To: vacation
% From: A48 To: retraining
% From: A49 To: business
% From: A410 To: other
%
% Relabeled values in attribute savings_status
% From: A61 To: '<100'
% From: A62 To: '100<=X<500'
% From: A63 To: '500<=X<1000'
% From: A64 To: '>=1000'
% From: A65 To: 'no known savings'
%
% Relabeled values in attribute employment
% From: A71 To: unemployed
% From: A72 To: '<1'
% From: A73 To: '1<=X<4'
% From: A74 To: '4<=X<7'
% From: A75 To: '>=7'
%
% Relabeled values in attribute personal_status
% From: A91 To: 'male div/sep'
% From: A92 To: 'female div/dep/mar'
% From: A93 To: 'male single'
% From: A94 To: 'male mar/wid'
% From: A95 To: 'female single'
%
% Relabeled values in attribute other_parties
% From: A101 To: none
% From: A102 To: 'co applicant'
% From: A103 To: guarantor
%
% Relabeled values in attribute property_magnitude
% From: A121 To: 'real estate'
% From: A122 To: 'life insurance'
% From: A123 To: car
% From: A124 To: 'no known property'
%
% Relabeled values in attribute other_payment_plans
% From: A141 To: bank
% From: A142 To: stores
% From: A143 To: none
%
% Relabeled values in attribute housing
% From: A151 To: rent
% From: A152 To: own
% From: A153 To: 'for free'
%
% Relabeled values in attribute job
% From: A171 To: 'unemp/unskilled non res'
% From: A172 To: 'unskilled resident'
% From: A173 To: skilled
% From: A174 To: 'high qualif/self emp/mgmt'
%
% Relabeled values in attribute own_telephone
% From: A191 To: none
% From: A192 To: yes
%
% Relabeled values in attribute foreign_worker
% From: A201 To: yes
% From: A202 To: no
%
% Relabeled values in attribute class
% From: 1 To: good
% From: 2 To: bad
%
@relation german_credit
@attribute checking_status { '<0', '0<=X<200', '>=200', 'no checking'}
@attribute duration real
@attribute credit_history { 'no credits/all paid', 'all paid', 'existing paid', 'delayed previously', 'critical/other existing credit'}
@attribute purpose { 'new car', 'used car', furniture/equipment, radio/tv, 'domestic appliance', repairs, education, vacation, retraining, business, other}
@attribute credit_amount real
@attribute savings_status { '<100', '100<=X<500', '500<=X<1000', '>=1000', 'no known savings'}
@attribute employment { unemployed, '<1', '1<=X<4', '4<=X<7', '>=7'}
@attribute installment_commitment real
@attribute personal_status { 'male div/sep', 'female div/dep/mar', 'male single', 'male mar/wid', 'female single'}
@attribute other_parties { none, 'co applicant', guarantor}
@attribute residence_since real
@attribute property_magnitude { 'real estate', 'life insurance', car, 'no known property'}
@attribute age real
@attribute other_payment_plans { bank, stores, none}
@attribute housing { rent, own, 'for free'}
@attribute existing_credits real
@attribute job { 'unemp/unskilled non res', 'unskilled resident', skilled, 'high qualif/self emp/mgmt'}
@attribute num_dependents real
@attribute own_telephone { none, yes}
@attribute foreign_worker { yes, no}
@attribute class { good, bad}

@data
'<0',6,'critical/other existing credit',radio/tv,1169,'no known savings','>=7',4,'male single',none,4,'real estate',67,none,own,2,skilled,1,yes,yes,good
'0<=X<200',48,'existing paid',radio/tv,5951,'<100','1<=X<4',2,'female div/dep/mar',none,2,'real estate',22,none,own,1,skilled,1,none,yes,bad
'no checking',12,'critical/other existing credit',education,2096,'<100','4<=X<7',2,'male single',none,3,'real estate',49,none,own,1,'unskilled resident',2,none,yes,good
'<0',42,'existing paid',furniture/equipment,7882,'<100','4<=X<7',2,'male single',guarantor,4,'life insurance',45,none,'for free',1,skilled,2,none,yes,good
'<0',24,'delayed previously','new car',4870,'<100','1<=X<4',3,'male single',none,4,'no known property',53,none,'for free',2,skilled,2,none,yes,bad
'no checking',36,'existing paid',education,9055,'no known savings','1<=X<4',2,'male single',none,4,'no known property',35,none,'for free',1,'unskilled resident',2,yes,yes,good
Lab Experiments
1. List all the categorical (or nominal) attributes and the real-valued attributes separately.
From the German Credit Assessment case study given to us, the following attributes are applicable for credit-risk assessment.

All valid attributes:
1. checking_status
2. duration
3. credit_history
4. purpose
5. credit_amount
6. savings_status
7. employment
8. installment_commitment
9. personal_status
10. other_parties
11. residence_since
12. property_magnitude
13. age
14. other_payment_plans
15. housing
16. existing_credits
17. job
18. num_dependents
19. own_telephone
20. foreign_worker

Categorical (nominal) attributes, which take discrete values such as yes/no:
1. checking_status
2. credit_history
3. purpose
4. savings_status
5. employment
6. personal_status
7. other_parties
8. property_magnitude
9. other_payment_plans
10. housing
11. job
12. own_telephone
13. foreign_worker

Real-valued attributes:
1. duration
2. credit_amount
3. installment_commitment
4. residence_since
5. age
6. existing_credits
7. num_dependents
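The split above can be double-checked by scanning the ARFF header programmatically. The sketch below is illustrative only: it embeds the header excerpt as a string rather than reading the file, and classifies each declaration as nominal (a `{...}` value list) or numeric (`real`).

```python
# Count nominal vs. numeric attributes in the german_credit ARFF header.
# The header string is an excerpt of the dataset listing above.
header = """@attribute checking_status { '<0', '0<=X<200', '>=200', 'no checking'}
@attribute duration real
@attribute credit_history { 'no credits/all paid', 'all paid', 'existing paid', 'delayed previously', 'critical/other existing credit'}
@attribute purpose { 'new car', 'used car', furniture/equipment, radio/tv, 'domestic appliance', repairs, education, vacation, retraining, business, other}
@attribute credit_amount real
@attribute savings_status { '<100', '100<=X<500', '500<=X<1000', '>=1000', 'no known savings'}
@attribute employment { unemployed, '<1', '1<=X<4', '4<=X<7', '>=7'}
@attribute installment_commitment real
@attribute personal_status { 'male div/sep', 'female div/dep/mar', 'male single', 'male mar/wid', 'female single'}
@attribute other_parties { none, 'co applicant', guarantor}
@attribute residence_since real
@attribute property_magnitude { 'real estate', 'life insurance', car, 'no known property'}
@attribute age real
@attribute other_payment_plans { bank, stores, none}
@attribute housing { rent, own, 'for free'}
@attribute existing_credits real
@attribute job { 'unemp/unskilled non res', 'unskilled resident', skilled, 'high qualif/self emp/mgmt'}
@attribute num_dependents real
@attribute own_telephone { none, yes}
@attribute foreign_worker { yes, no}
@attribute class { good, bad}"""

nominal, numeric = [], []
for line in header.splitlines():
    _, name, spec = line.split(None, 2)   # '@attribute', name, type spec
    (nominal if spec.startswith('{') else numeric).append(name)

# 14 nominal (13 predictors plus the class attribute), 7 numeric,
# matching the lists above.
print(len(nominal), len(numeric))
```

Excluding the class attribute, this reproduces the 13 categorical and 7 real-valued attributes listed above.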
2. What attributes do you think might be crucial in making the credit assessment? Come up with some simple rules in plain English using your selected attributes.
In my view, the following attributes may be crucial in making the credit-risk assessment:

1. credit_history
2. employment
3. property_magnitude
4. job
5. duration
6. credit_amount
7. installment_commitment
8. existing_credits
Based on the above attributes, we can make a decision whether to give credit or not.
checking_status = no checking AND other_payment_plans = none AND credit_history = critical/other existing credit: good
checking_status = no checking AND existing_credits <= 1 AND other_payment_plans = none AND purpose = radio/tv: good
checking_status = no checking AND foreign_worker = yes AND employment = 4<=X<7: good
foreign_worker = no AND personal_status = male single: good
checking_status = no checking AND purpose = used car AND other_payment_plans = none: good
duration <= 15 AND other_parties = guarantor: good
duration <= 11 AND credit_history = critical/other existing credit: good
checking_status = >=200 AND num_dependents <= 1 AND property_magnitude = car: good
checking_status = no checking AND property_magnitude = real estate AND other_payment_plans = none AND age > 23: good
savings_status = >=1000 AND property_magnitude = real estate: good
savings_status = 500<=X<1000 AND employment = >=7: good
credit_history = no credits/all paid AND housing = rent: bad
savings_status = no known savings AND checking_status = 0<=X<200 AND existing_credits > 1: good
checking_status = >=200 AND num_dependents <= 1 AND property_magnitude = life insurance: good
installment_commitment <= 2 AND other_parties = co applicant AND existing_credits > 1: bad
installment_commitment <= 2 AND credit_history = delayed previously AND existing_credits > 1 AND residence_since > 1: good
installment_commitment <= 2 AND credit_history = delayed previously AND existing_credits <= 1: good
duration > 30 AND savings_status = 100<=X<500: bad
credit_history = all paid AND other_parties = none AND other_payment_plans = bank: bad
duration > 30 AND savings_status = no known savings AND num_dependents > 1: good
duration > 30 AND credit_history = delayed previously: bad
duration > 42 AND savings_status = <100 AND residence_since > 1: bad
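A few of the rules above translate directly into code. The sketch below is illustrative: the function name `classify` and the sample applicant dictionaries are made up, and only three of the listed rules are encoded; an applicant matched by none of them is left undecided.

```python
def classify(a):
    """Apply three of the hand-written credit rules above to a dict of
    attribute values; returns 'good', 'bad', or None if no rule fires."""
    if (a['checking_status'] == 'no checking'
            and a['other_payment_plans'] == 'none'
            and a['credit_history'] == 'critical/other existing credit'):
        return 'good'
    if a['duration'] <= 15 and a['other_parties'] == 'guarantor':
        return 'good'
    if (a['credit_history'] == 'no credits/all paid'
            and a['housing'] == 'rent'):
        return 'bad'
    return None

# Hypothetical applicant matching the first rule.
applicant = {'checking_status': 'no checking', 'other_payment_plans': 'none',
             'credit_history': 'critical/other existing credit',
             'duration': 24, 'other_parties': 'none', 'housing': 'own'}
print(classify(applicant))   # good
```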
3. One type of model that you can create is a Decision Tree - train a Decision Tree using the complete dataset as the training data. Report the model obtained after training.
A decision tree is a flowchart-like tree structure in which each internal (non-leaf) node denotes a test on an attribute, each branch represents an outcome of the test, and each leaf (terminal) node holds a class label.

Decision trees can easily be converted into classification rules. Examples of decision-tree algorithms are ID3, C4.5, and CART.
J48 pruned tree
1. Using the WEKA tool, we can generate a decision tree by selecting the "Classify" tab.
2. In the Classify tab, click Choose; a list of classifiers is shown. From the trees group, select J48.
3. Under Test options, select "Use training set".
4. The resulting window in WEKA is as follows:
5. To view the decision tree, right-click on the entry in the result list and select the "Visualize tree" option; the decision tree will be displayed.
6. The decision tree obtained for credit-risk assessment is too large to fit on the screen.
7. The tree is hard to read because of the large number of attributes.
4. Suppose you use your above model trained on the complete dataset, and classify credit good/bad for each of the examples in the dataset. What % of examples can you classify correctly? (This is also called testing on the training set.) Why do you think you cannot get 100% training accuracy?
In the above model we trained on the complete dataset and classified credit good/bad for each example in the dataset.

For example:

IF purpose = vacation THEN
    credit = bad;
ELSE IF purpose = business THEN
    credit = good;

In this way each example in the dataset is classified.

85.5% of the examples are classified correctly and the remaining 14.5% are classified incorrectly. We cannot get 100% training accuracy because, out of the 20 attributes, some unnecessary attributes are also analyzed and trained on. This affects the accuracy, and hence 100% training accuracy is not reached.
5. Is testing on the training set as you did above a good idea? Why or why not?
It is a bad idea: if all the data goes into the training set, there is nothing left with which to test whether the classification is correct.

As a rule of thumb, for a reliable accuracy estimate we take about 2/3 of the dataset as the training set and the remaining 1/3 as the test set. In the model above, however, the complete dataset was used as the training set, which yields only 85.5% accuracy.

This comes from analyzing and training on unnecessary attributes that play no crucial role in credit-risk assessment; the added complexity ultimately lowers the accuracy. If part of the dataset is used as the training set and the rest as the test set, the results are more reliable and the computation time is shorter.

This is why we prefer not to use the complete dataset as the training set.
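The 2/3 train, 1/3 test holdout split described above can be sketched as follows. This is a conceptual illustration, not WEKA's implementation; the function name, seed, and shuffle are illustrative choices.

```python
import random

def holdout_split(instances, train_fraction=2/3, seed=42):
    """Shuffle a dataset and split it into a training and a test set."""
    data = list(instances)
    random.Random(seed).shuffle(data)   # random split, reproducible via seed
    cut = int(len(data) * train_fraction)
    return data[:cut], data[cut:]

# For the 1000-instance German credit data this gives 666 train / 334 test.
train, test = holdout_split(range(1000))
print(len(train), len(test))   # 666 334
```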
Use Training Set result for the table GermanCreditData:
Correctly Classified Instances 855 85.5 %
Incorrectly Classified Instances 145 14.5 %
Kappa statistic 0.6251
Mean absolute error 0.2312
Root mean squared error 0.34
Relative absolute error 55.0377 %
Root relative squared error 74.2015 %
Total Number of Instances 1000
6. One approach for solving the problem encountered in the previous question is using cross-validation. Describe briefly what cross-validation is. Train a Decision Tree again using cross-validation and report your results. Does your accuracy increase/decrease? Why?
Cross-validation:

In k-fold cross-validation, the initial data are randomly partitioned into k mutually exclusive subsets, or folds, D1, D2, D3, ..., Dk, each of approximately equal size. Training and testing are performed k times. In iteration i, partition Di is reserved as the test set and the remaining partitions are collectively used to train the model. That is, in the first iteration, subsets D2, D3, ..., Dk collectively serve as the training set to obtain the first model, which is tested on D1; the second iteration is trained on D1, D3, ..., Dk and tested on D2; and so on.
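The k-fold procedure just described can be sketched in a few lines of Python. This is a conceptual illustration of the partitioning, not WEKA's implementation; the function name is made up.

```python
import random

def kfold_indices(n, k, seed=0):
    """Partition indices 0..n-1 into k roughly equal, disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

folds = kfold_indices(1000, 10)
for test_fold in folds:
    # Each fold serves once as the test set; the rest form the training set.
    train_idx = [j for f in folds if f is not test_fold for j in f]
    # ... train the model on train_idx and evaluate on test_fold ...

print([len(f) for f in folds])   # ten folds of 100 instances each
```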
1. Select the Classify tab and the J48 decision tree; under Test options select the Cross-validation radio button with the number of folds set to 10.
2. The number of folds indicates the number of partitions of the dataset.
3. A Kappa statistic near 1 indicates nearly 100% accuracy, with the errors essentially zeroed out; in practice, no training set gives 100% accuracy.
Cross Validation Result at folds: 10 for the table GermanCreditData:
Correctly Classified Instances 705 70.5 %
Incorrectly Classified Instances 295 29.5 %
Kappa statistic 0.2467
Mean absolute error 0.3467
Root mean squared error 0.4796
Relative absolute error 82.5233 %
Root relative squared error 104.6565 %
Total Number of Instances 1000
Here there are 1000 instances with 100 instances per partition.
Cross Validation Result at folds: 20 for the table GermanCreditData:
Correctly Classified Instances 698 69.8 %
Incorrectly Classified Instances 302 30.2 %
Kappa statistic 0.2264
Mean absolute error 0.3571
Root mean squared error 0.4883
Relative absolute error 85.0006 %
Root relative squared error 106.5538 %
Total Number of Instances 1000
Cross Validation Result at folds: 50 for the table GermanCreditData:
Correctly Classified Instances 709 70.9 %
Incorrectly Classified Instances 291 29.1 %
Kappa statistic 0.2538
Mean absolute error 0.3484
Root mean squared error 0.4825
Relative absolute error 82.9304 %
Root relative squared error 105.2826 %
Total Number of Instances 1000
Cross Validation Result at folds: 100 for the table GermanCreditData:
Correctly Classified Instances 710 71 %
Incorrectly Classified Instances 290 29 %
Kappa statistic 0.2587
Mean absolute error 0.3444
Root mean squared error 0.4771
Relative absolute error 81.959 %
Root relative squared error 104.1164 %
Total Number of Instances 1000
Percentage split does not allow 100%; it allows only up to 99.9%.
Percentage Split Result at 50%:
Correctly Classified Instances 362 72.4 %
Incorrectly Classified Instances 138 27.6 %
Kappa statistic 0.2725
Mean absolute error 0.3225
Root mean squared error 0.4764
Relative absolute error 76.3523 %
Root relative squared error 106.4373 %
Total Number of Instances 500
Percentage Split Result at 99.9%:
Correctly Classified Instances 0 0 %
Incorrectly Classified Instances 1 100 %
Kappa statistic 0
Mean absolute error 0.6667
Root mean squared error 0.6667
Relative absolute error 221.7054 %
Root relative squared error 221.7054 %
Total Number of Instances 1
7. Check to see if the data shows a bias against "foreign workers" (attribute 20) or "personal_status" (attribute 9). One way to do this (perhaps rather simple-minded) is to remove these attributes from the dataset and see if the decision tree created in those cases is significantly different from the full-dataset case which you have already done. To remove an attribute you can use the Preprocess tab in WEKA's GUI Explorer. Did removing these attributes have any significant effect? Discuss.
The accuracy increases because the two attributes "foreign_worker" and "personal_status" are not very important for training and analysis. Removing them also reduces the training time somewhat, which again contributes to the increased accuracy. The decision tree built on the full dataset is very large compared to the tree we have trained now; this is the main difference between the two decision trees.

After foreign_worker is removed, the accuracy increases to 85.9%.
If we also remove the 9th attribute, the accuracy further increases to 86.6%, which shows that these two attributes are not significant for training.
Cross validation after removing 9th attribute.
Percentage split after removing 9th attribute.
After removing the 20th attribute, the cross validation is as above.
After removing 20th attribute, the percentage split is as above.
8. Another question might be: do you really need to input so many attributes to get good results? Maybe only a few would do. For example, you could try just having attributes 2, 3, 5, 7, 10, 17 (and 21, the class attribute, naturally). Try out some combinations. (You had removed two attributes in problem 7; remember to reload the ARFF data file to get all the attributes initially before you start selecting the ones you want.)
Select attributes 2, 3, 5, 7, 10, 17, and 21, and click Invert to remove the remaining attributes. Here the accuracy decreases.

Select random attributes and then check the accuracy.
After removing attributes 1, 4, 6, 8, 9, 11, 12, 13, 14, 15, 16, 18, 19, and 20, we select the remaining attributes and visualize them.
After removing 14 attributes, the accuracy drops to 76.4%; we can try further random combinations of attributes to increase the accuracy.
Cross validation
Percentage split
9. Sometimes, the cost of rejecting an applicant who actually has good credit (Case 1) might be higher than accepting an applicant who has bad credit (Case 2). Instead of counting the misclassifications equally in both cases, give a higher cost to the first case (say cost 5) and a lower cost to the second case. You can do this by using a cost matrix in WEKA. Train your Decision Tree again and report the Decision Tree and cross-validation results. Are they significantly different from the results obtained in problem 6 (using equal cost)?
In problem 6 we used equal costs when training the decision tree. Here we consider two cases with different costs: cost 5 in Case 1 and cost 2 in Case 2.

When we assign these costs and retrain, the resulting decision tree is almost the same as the one obtained in problem 6; what changes is the cost of the misclassifications:

                 Case 1 (cost 5)   Case 2 (cost 2)
Total cost            3820              1705
Average cost          3.82              1.705

There is no such cost factor in problem 6, where equal costs were used; this is the major difference between the results of problem 6 and problem 9.

The cost matrices used here:

Case 1:  5 1
         1 5

Case 2:  2 1
         1 2
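The total and average cost figures come from weighting the confusion matrix by the cost matrix. The sketch below is conceptual, not WEKA's implementation: the confusion-matrix counts are hypothetical placeholders (not this lab's actual numbers), and a zero-diagonal cost matrix is assumed for illustration, so only misclassifications incur cost.

```python
def evaluation_cost(confusion, cost):
    """Total and average cost of a classifier, given a confusion matrix
    confusion[actual][predicted] and a cost matrix cost[actual][predicted]."""
    total = sum(confusion[a][p] * cost[a][p]
                for a in range(len(confusion))
                for p in range(len(confusion)))
    n = sum(map(sum, confusion))
    return total, total / n

# Hypothetical confusion matrix: rows = actual (good, bad), cols = predicted.
confusion = [[600, 100],
             [140, 160]]
cost_case1 = [[0, 5],   # rejecting a good applicant costs 5
              [1, 0]]   # accepting a bad applicant costs 1
total, avg = evaluation_cost(confusion, cost_case1)
print(total, avg)   # 640 0.64
```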
1. Select the Classify tab.
2. Select More options under Test options.
3. Tick Cost-sensitive evaluation and click Set.
4. Set the number of classes to 2.
5. Click Resize to get the cost matrix.
6. Change the 2nd entry in the 1st row and the 2nd entry in the 1st column to 5.0.
7. The confusion matrix will be generated, from which you can see the difference between the good and bad classes.
8. Check whether the accuracy changes.
10. Do you think it is a good idea to prefer simple decision trees instead of having long complex decision trees? How does the complexity of a Decision Tree relate to the bias of the model?
When we use long, complex decision trees, the tree contains many unnecessary attributes, which increases the bias of the model; the accuracy of the model can also be affected.

This problem can be reduced by using a simple decision tree: with fewer attributes, the bias of the model decreases and the results are more accurate. So it is a good idea to prefer simple decision trees to long, complex ones.
1. Open any existing ARFF file, e.g. labor.arff.
2. In the Preprocess tab, click All to select all the attributes.
3. Go to the Classify tab and run J48 with "Use training set".
4. To view the decision tree, right-click on the result list and select the Visualize tree option; the decision tree will be displayed.
5. Right-click on the J48 algorithm to get the Generic Object Editor window.
6. In this window, set the unpruned option to True.
7. Press OK and then Start. We find that the tree becomes more complex when it is not pruned.
Visualize tree
8. The tree has become more complex.
11. You can make your Decision Trees simpler by pruning the nodes. One approach is to use Reduced Error Pruning - explain this idea briefly. Try reduced error pruning for training your Decision Trees using cross-validation (you can do this in WEKA) and report the Decision Tree you obtain. Also, report your accuracy using the pruned model. Does your accuracy increase?
Reduced-error pruning:

The idea of using a separate pruning set for pruning, which is applicable to decision trees as well as rule sets, is called reduced-error pruning. The variant described previously prunes a rule immediately after it has been grown and is called incremental reduced-error pruning. Another possibility is to build a full, unpruned rule set first and prune it afterwards by discarding individual tests; however, this method is much slower.

Of course, there are many different ways to assess the worth of a rule based on the pruning set. A simple measure is to consider how well the rule would do at discriminating the predicted class from the other classes if it were the only rule in the theory, operating under the closed-world assumption.

If the rule gets p instances right out of the t instances that it covers, and there are P instances of this class out of a total of T instances altogether, then it gets p positive instances right. The instances that it does not cover include N - n negative ones, where n = t - p is the number of negative instances that the rule covers and N = T - P is the total number of negative instances. Thus the rule has an overall success ratio of (p + (N - n)) / T, and this quantity, evaluated on the pruning set, has been used to evaluate the success of a rule when using reduced-error pruning.
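The success ratio can be computed directly from the counts defined above; a small sketch with hypothetical numbers (the example rule and its counts are made up for illustration):

```python
def rule_success_ratio(p, t, P, T):
    """Worth of a rule under reduced-error pruning: it covers t instances
    and gets p of them right; the class has P of T instances overall.
    n = t - p negatives are covered, N = T - P negatives exist in total,
    so N - n negatives are correctly left uncovered.
    Ratio = (p + (N - n)) / T."""
    n = t - p
    N = T - P
    return (p + (N - n)) / T

# Hypothetical rule: covers 40 instances, 35 correctly, for a class
# with 300 of 1000 instances.
print(rule_success_ratio(p=35, t=40, P=300, T=1000))   # 0.73
```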
1. Right-click on the J48 algorithm to get the Generic Object Editor window.
2. In this window, set the reducedErrorPruning option to True (leaving unpruned as False, so that pruning takes effect).
3. Press OK and then Start.
4. We find that the accuracy has increased by selecting the reduced-error-pruning option.
12. (Extra Credit): How can you convert a Decision Tree into "if-then-else" rules? Make up your own small Decision Tree consisting of 2-3 levels and convert it into a set of rules. There also exist different classifiers that output the model in the form of rules - one such classifier in WEKA is rules.PART; train this model and report the set of rules obtained. Sometimes just one attribute can be good enough in making the decision, yes, just one! Can you predict what attribute that might be in this dataset? The OneR classifier uses a single attribute to make decisions (it chooses the attribute based on minimum error). Report the rule obtained by training a OneR classifier. Rank the performance of J48, PART, and OneR.
In WEKA, rules.PART is one of the classifiers that output the model as "IF-THEN-ELSE" rules.

Converting decision trees into "IF-THEN-ELSE" rules using the rules.PART classifier:

PART decision list

outlook = overcast: yes (4.0)

windy = TRUE: no (4.0/1.0)

outlook = sunny: no (3.0/1.0)

: yes (3.0)

Number of Rules: 4
Yes, sometimes just one attribute can be good enough to make the decision. In this (weather) dataset, the single attribute used for making the decision is "outlook":

outlook:
    sunny -> no
    overcast -> yes
    rainy -> yes
(10/14 instances correct)
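The OneR result for outlook can be reproduced on the standard 14-instance weather.nominal data. The sketch below is a simplified OneR for a single, pre-chosen attribute (the full algorithm would also compare the error rates of the other attributes); the (outlook, play) pairs are from WEKA's well-known weather dataset.

```python
from collections import Counter

# (outlook, play) pairs from WEKA's weather.nominal dataset.
data = [('sunny', 'no'), ('sunny', 'no'), ('overcast', 'yes'),
        ('rainy', 'yes'), ('rainy', 'yes'), ('rainy', 'no'),
        ('overcast', 'yes'), ('sunny', 'no'), ('sunny', 'yes'),
        ('rainy', 'yes'), ('sunny', 'yes'), ('overcast', 'yes'),
        ('overcast', 'yes'), ('rainy', 'no')]

# OneR on one attribute: predict the majority class for each value.
rule = {}
for value in {o for o, _ in data}:
    classes = Counter(c for o, c in data if o == value)
    rule[value] = classes.most_common(1)[0][0]

correct = sum(rule[o] == c for o, c in data)
print(rule)             # {'sunny': 'no', 'overcast': 'yes', 'rainy': 'yes'}
print(f'{correct}/14')  # 10/14 instances correct
```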
With respect to time, the OneR classifier ranks first, J48 second, and PART third.

              J48     PART    OneR
Time (sec)    0.12    0.14    0.04
Rank          II      III     I

But considering accuracy, J48 ranks first, PART second, and OneR last.

              J48     PART    OneR
Accuracy (%)  70.5    70.2    66.8
1. Open an existing file, e.g. weather.nominal.arff.
2. Select All.
3. Go to the Classify tab.
4. Click Start.
Here the accuracy is 100%
The tree corresponds to "if-then-else" rules:

If outlook = overcast then
    play = yes
If outlook = sunny and humidity = high then
    play = no
else
    play = yes
If outlook = rainy and windy = true then
    play = no
else
    play = yes
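These rules translate directly into code; a minimal sketch (the function name is made up, and windy is taken as a boolean):

```python
def play(outlook, humidity, windy):
    """Weather decision tree from above, as nested if-then-else rules."""
    if outlook == 'overcast':
        return 'yes'
    if outlook == 'sunny':
        return 'no' if humidity == 'high' else 'yes'
    # outlook == 'rainy'
    return 'no' if windy else 'yes'

print(play('overcast', 'high', True))   # yes
print(play('sunny', 'high', False))     # no
print(play('rainy', 'normal', True))    # no
```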
To extract the rules:
1. Click Choose, then under Rules select PART.
2. Click Start.
3. Do the same for the OneR algorithm.
If outlook = overcast then
    play = yes
If outlook = sunny and humidity = high then
    play = no
If outlook = sunny and humidity = normal then
    play = yes