

Weka

Valeria Guevara

Thompson Rivers University

Author Note

This is a final project for COMP 4910, for the Bachelor of Computing Science at Thompson Rivers University, supervised by Mila Kwiatkowska.


Abstract

This project focuses on document classification using text mining, through a classification model generated by the open-source software WEKA. This software is a repository of machine learning algorithms for knowledge discovery. Weka easily preprocesses the training documents so that different algorithm configurations can be compared. The accuracy of the generated predictive model will be measured with a confusion matrix. This project will help to illustrate text mining preprocessing and classification using WEKA. The result will be the development of a tool to generate the ARFF input data files, and a video tutorial on document classification in Weka in English and Spanish.

Keywords: Weka, document classification, ARFF, stopwords, tokenizer, pruning, decision tree C4.5, word vector, text mining, F-measure, machine learning, text classification, stemming, knowledge society.


Weka

Weka is a native New Zealand bird that does not fly but has a penchant for shiny objects [30] Newzealand.com (2015). Old legends from New Zealand narrate that these birds steal shiny items. The University of Waikato in New Zealand started the development of a tool with that name because it would contain algorithms for data analysis. Currently the WEKA package is a collection of machine learning algorithms for data mining tasks. The Waikato Environment for Knowledge Analysis package contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization [31] Hall, M., Frank, E., Geoffrey H., Pfahringer, B., Reutemann, P., & Witten, I. H. (2009). This software analyzes large amounts of data and decides which parts are the most important. It aims to make automatic predictions that help decision making.

Weka VS Other Machine Learning Tools

There are other tools for data mining as RapidMiner, IBM Cognos Business Intelligence,

Microsoft SharePoint and Pentaho. IBM Cognos Business Intelligence provides a not very user-

friendly display. Microsoft SharePoint creates predictive models of mining business but their

information is not their main objective. Where RapidMiner offers a great display of results, but

the datasets are loaded slower than in Weka. Pentaho its graphical interface is not difficult to

understand to describe your options as Weka does.

The Weka tool implements machine learning techniques in Java, which is easy to learn, under a GNU General Public License. WEKA provides three ways to be used: through its graphical interface, through command-line interfaces, and through application code using its Java API. Although WEKA has not been used primarily for prediction problems in business, it helps the construction of new algorithms. Therefore it turns out to be a most suitable software for initial data analysis, classification, clustering algorithms, and research.

In this project the Weka tool is used to create a predictive model using machine learning text classification algorithms.

Installation

Weka can be downloaded at: http://www.cs.waikato.ac.nz/~ml/weka/. In this case we speak of the latest version, Weka 3.6.12. At the same URL you can find instructions for installation on different platforms.

On Windows, the Weka launcher is located in the folder of the downloaded Weka version, in this case weka-3-6. Weka's default directory is the same directory where the file is loaded.

On Linux, open a terminal and type: java -jar /installation/directory/weka.jar

It is common to find an error of insufficient memory, which is resolved by specifying more heap memory, for example 2 GB with "-Xmx2048m", in the setup files. Further information can be found at weka.wikispaces.com/OutOfMemoryException. The memory can be set with the -Xms and -Xmx parameters, indicating the minimum and maximum RAM respectively.

On Windows you can edit the file RunWeka.bat or RunWeka.ini in the installation directory, changing the line maxheap=128m to maxheap=1024m. You cannot assign more than 1.4 GB to the JVM this way. You can also assign memory to the virtual machine with the command:

java -Xms<minimum-memory>m -Xmx<maximum-memory>m -jar weka.jar

[32] Garcia, D., (2006).

On Linux the -XmxMemorySizem option is used, replacing MemorySize with the required size in megabytes, for instance:

java -Xmx512m -jar /installation/directory/weka.jar

Execution

The first screen Weka shows is a chooser of interfaces called "Applications", where in this version the Explorer, Experimenter, KnowledgeFlow, and Simple CLI tools are offered. Explorer is responsible for conducting exploration operations on a data set. Experimenter performs statistical experiments to test, in an automated manner, different algorithms on different data. KnowledgeFlow shows Weka's operations graphically in a workflow panel. Simple CLI, or simple client, provides the command-line interface to enter commands.

The main user interface, "Explorer", consists of six panels. Preprocess is the first window this interface opens. In this window, the data are loaded. Weka accepts loading the data set from a URL, a database, or CSV or ARFF files. The ARFF file is the primary format used for any classification task in WEKA.

Input data.

As previously described, three data inputs are considered in data mining: concepts, instances, and attributes. An Attribute-Relation File Format file is a file that describes a concept as a list of instances with their respective attributes. These files are used by Weka for text classification and clustering applications.


ARFF files.

These files have two parts: the header information and the data information. The first section contains the name of the relation and the attributes (name and type). The relation name is defined in the first line of the ARFF file, where relation-name is a string, with the following format:

@relation <relation-name>

The next section is the attribute declarations. This is an ordered sequence of declarations of each instance attribute. These declarations uniquely define an attribute's name and data type. The order in which the attributes are declared indicates the position they occupy in the instances. For example, an attribute declared in the first position is expected to state its value in the first position of every instance. The format for its declaration is:

@attribute <attribute-name> <data type>

Weka supports several data types:

i) NUMERIC: all real numbers, where the separator between the decimal and integer parts is represented by a point, not a comma.

ii) INTEGER: treated as NUMERIC.

iii) NOMINAL: provides a list of possible values, for example {good, bad}. These express the possible values the attribute can take, in the following format:

@attribute <attribute-name> {<nominal1>, <nominal2>, <nominal3>, ...}

iv) STRING: a sequence of text values. These attributes are declared as follows:

@attribute <attribute-name> string

v) DATE: dates and times are declared as:

@attribute <name> date [<date-format>]


Where <name> is the name of the attribute and <date-format> is an optional string, consisting of characters, hyphens, spaces, and time units, that specifies how date values should be parsed. The default format accepts the combined ISO-8601 format: yyyy-MM-dd'T'HH:mm:ss. Example:

@attribute timestamp DATE "yyyy-MM-dd HH:mm:ss"

vi) RELATIONAL: attributes for multi-instance data, declared in the following way:

@attribute <name> relational
  <further attribute definitions>
@end <name>

There are rules on the attribute declarations:

a) Relation names and strings must be enclosed in double quotes (") if they include spaces.

b) Attribute and relation names cannot start with a character before \u0021 in ASCII, nor with '{', '}', ',', or '%'.

c) Values that contain spaces must be quoted.

d) The keywords numeric, real, integer, string, and date are case insensitive.

e) Relational data must be enclosed in double quotes (").

The second section is the data declaration. It is declared as @data on one line. Each line below represents an instance, defining the attribute values separated by commas. The attribute values must be in the same order in which the attributes were declared in the header section. Missing values are represented with a single question mark "?". String values and nominal attributes are case sensitive. Any value that contains a space must be quoted. Comments start with the delimiter character "%" and run to the end of the line.

In text classification, ARFF files represent the entire document as a single text attribute of type string. The second attribute to consider is the class attribute. This defines the class the instance belongs to. This attribute can be of type string or nominal. An example of the resulting file, with the document as a string attribute and a nominal class of two values, is:

@relation language

@attribute DocumentText string

@attribute class {English, Spanish}

@data

'texto a clasificar aquí... ', Spanish

'Classify text here ...', English
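A file with this layout can also be generated programmatically. The following Python sketch (a hypothetical stand-alone helper, not the ARFF generation tool developed in this project) builds the same two-attribute structure:

```python
def escape(text):
    """Quote a string value and escape embedded single quotes, as ARFF requires."""
    return "'" + text.replace("'", "\\'") + "'"

def make_arff(relation, classes, documents):
    """Build an ARFF string from a list of (text, class_label) pairs."""
    lines = ["@relation " + relation,
             "@attribute DocumentText string",
             "@attribute class {" + ", ".join(classes) + "}",
             "@data"]
    for text, label in documents:
        # One instance per line: quoted document text, then the class value.
        lines.append(escape(text) + ", " + label)
    return "\n".join(lines)

arff = make_arff("language",
                 ["English", "Spanish"],
                 [("texto a clasificar aqui...", "Spanish"),
                  ("Classify text here...", "English")])
print(arff)
```

The output can be saved with a .arff extension and loaded directly in the Preprocess panel.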

Data preprocessing.

In this window, data are loaded and may be edited. Data can be modified manually with the editor or by filtering. Filters are learning-technique methods that modify the data set. Weka has a variety of filters structured hierarchically into supervised and unsupervised, where the root is weka. These filters are further divided into two categories, attribute filters and instance filters, according to the way they operate on the data.

As pointed out earlier, these techniques are classified in a way that depends on the input data relationships. Unsupervised learning techniques, as descriptive inductive models, do not know the correct classification. This means that the instances do not require an attribute that declares the class. Inductive techniques of predictive supervised learning depend on the class values; this means the instances must contain a class attribute stating the class they belong to.


In the Current relation module, the data set that has been loaded is described by its name and its number of instances and attributes. Attributes allows selecting attributes using the options All, None, and Invert, and further provides the option to enter a regular expression. The Selected attribute part displays information about the selected attribute. At the bottom, a histogram of the attribute selected in Attributes is illustrated.

Preprocessing for classifying documents

In Weka is possible to create documents classification models into categories previously

analyzed. The documents in Weka usually need to be converted into "vectors text" before

applying machine learning techniques. For this the easiest way to render text is as bag of words

or word vector. [34] Namee, B. (2012). StringToWordVector filter performs the process of

converting the string attribute to a set of attributes that represent the occurrence of words of the

full text. The document is represented as a text string in a single attribute type string.

StringToWordVector Filter

This is the fundamental text analysis filter in WEKA. This class offers abundant choices of natural language processing, including the use of stemming suited to the corpus, custom tokenizers, and various stopword lists. At the same time, it calculates term weights such as TF, IDF, and TF-IDF.

StringToWordVector places the class attribute at the top of the list of attributes. To change the order, the Reorder filter can be used. In this filter, all the natural language processing techniques applied to attributes can be configured. The StringToWordVector filter can be applied in batch mode from the command line as follows:


java -cp /Applications/weka-3-6-2/weka.jar weka.filters.unsupervised.attribute.StringToWordVector -b -i datos_entrenamiento.arff -o vector_datos_entrenamiento.arff -r datos_prueba.arff -s vector_datos_prueba.arff

Here datos_entrenamiento is the training set, vector_datos_entrenamiento is the training set vector, datos_prueba is the test set, and vector_datos_prueba is the test set vector. The -cp option puts the Weka jar in the class path, -b indicates batch mode, -i specifies the training data file, -o the output file after processing the first file, -r the test file, and -s the output file for the test file.

Options can be modified in the user interface by clicking on the filter name beside the Choose button, having previously selected the filter with the Choose button. With the weka.filters.unsupervised.attribute.StringToWordVector window open, the following options can be modified according to the needs of the documents to be classified. The options are:

IDFTransform

TFTransform

attributeIndices

attributeNamePrefix

doNotOperateOnPerClassBasis

invertSelection

lowerCaseTokens

minTermFreq

normalizeDocLength

outputWordCounts

periodicPruning

stemmer

stopwords

tokenizer

useStoplist

wordsToKeep


Weka.sourcearchive.com [39] presents a mind map of the Weka options, shown in the following illustration:


wordsToKeep

Defines the limit N of words to keep per class, if there is a class attribute. In this case only the N most common terms among all values of the string attribute will remain. Higher values mean lower efficiency, because learning the model will take more time.

doNotOperateOnPerClassBasis

A flag that is set to keep all relevant words for all classes. It is set to true when the maximum number of words and the minimum term frequency should not be applied per class value, but to all classes combined.

TFTransform

Term frequency (TF) transformation: when the flag is set to true, this filter executes the term-frequency transformation; to represent textual data in a vector space, the term frequency (TF) is used. TF is a numerical measure of a word's relevance within the text. It considers not only the relevance of a single term by itself, but also its relevance in the entire collection of documents. Mathematically it is represented as the function TF(t, d), which for term t in document d is: log(1 + frequency of word t in the instance or document d). The inverse document frequency (IDF), described next, considers the number of documents in which the term t appears.

IDFTransform

Inverse document frequency (IDF) transformation: setting the flag to "true" defines the use of the following equation, with f_td the frequency of word t in instance d:

f_td * log(number of documents or instances / number of documents with word t)

This is explained taking into account the set D that includes all documents in the collection, represented as D = {d1, d2, ..., dn}. It singles out the terms most characteristic of a document: f_ij * log(number of documents / number of documents with word i), where f_ij is the frequency of word i in document (instance) j.

By multiplying IDF by the TF, the result assigns more weight to the terms with greater frequency in a document that are at the same time relatively rare in the collection of documents [33] Salton, G., Wong, A., & Yang, C. (1975).
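The TF and IDF computations above can be illustrated outside Weka. The following is a small Python sketch of the formulas, an illustration only, not Weka's implementation:

```python
import math

def tf(term, doc):
    # TFTransform: log(1 + raw frequency of the term in the document).
    return math.log(1 + doc.count(term))

def idf(term, docs):
    # IDFTransform factor: log(number of documents / documents containing the term).
    df = sum(1 for d in docs if term in d)
    return math.log(len(docs) / df) if df else 0.0

# Toy collection of three tokenized documents.
docs = [["weka", "mines", "data"],
        ["weka", "classifies", "text"],
        ["birds", "like", "shiny", "objects"]]

# "weka" appears in 2 of 3 documents, so its IDF is modest;
# a word that appears in only one document would weigh more.
weight = tf("weka", docs[0]) * idf("weka", docs)
```

The product is the TF-IDF weight that the filter stores in the word-vector attribute for that document.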

outputWordCounts

Counts the occurrences of words in the string; the default setting only reports presence or absence as 0/1. The result is a vector where each dimension is a different word. The value in each dimension is a binary 0 or 1, that is, whether or not the word occurs in that document.

The frequency of the word in the document is represented as an integer number with the options IDFTransform and TFTransform set to "False" and outputWordCounts set to "True". This enables an explicit word count. It is left as "false" when only the presence of a term matters, not its frequency.

To calculate TF*IDF, IDFTransform must be set to True, TFTransform to False, and outputWordCounts to True. To achieve log(1 + TF) * log(IDF), TFTransform must also be set to True.

normalizeDocLength

Set to true to determine whether the word frequencies in an instance must be normalized. Normalization is calculated as: actual value * average document length / document length.

This option has three sub-options. The first is "No normalization". The second is "Normalize all data", which puts the measures taken in the various documents on a common scale. The third is "Normalize test data only". A word ends up with a real value, the TF-IDF result for the word in that document, with the settings IDFTransform and TFTransform set to "True" and normalizeDocLength set to "Normalize all data".

Stemmer

Selects the stemming algorithm to use on the words. Weka supports four stemming algorithms by default and also supports Snowball stemmers. The IteratedLovinsStemmer algorithm is an iterated version of the LovinsStemmer algorithm, which is a set of transformation rules for changing word endings such as present participles, irregular plurals, and other English morphology. The NullStemmer algorithm performs no stemming at all. The SnowballStemmer algorithm comes with standard vocabularies of words and their root equivalents.

Weka can easily add new stemming algorithms because it contains a wrapper class for Snowball stemmers, such as the Spanish one. Weka does not ship with all the Snowball algorithms, but they can be easily included at the location of the Weka class weka.core.stemmers.SnowballStemmer.

Snowball is a string processing language designed for creating stemmers. There are three ways to get these algorithms: the first is to install the unofficial package; the second is to add the pre-compiled snowball-20051019.jar to the class location; the third is to compile the latest stemmers yourself from snowball-20051019.zip. The algorithms are at snowball.tartarus.org, which has a stemmer in Spanish. At the following link you can see examples and download this stemmer:

http://snowball.tartarus.org/algorithms/spanish/stemmer.html


The Snowball Spanish stemming algorithm comes from Snowball.tartarus.org. It defines the usual R1 and R2 regions. Furthermore, RV is defined as the region after the next vowel if the second letter is a consonant; or the region after the next consonant if the first two letters are vowels; or otherwise the region after the third letter. If none of these exist, RV is the end of the word.

Step 0: Search for the longest attached pronoun among suffixes such as "me se sela selo selas selos la le lo las les los nos" and remove it, if it comes after one of "iéndo ándo ár ér ír ando iendo ar er ir".

Step 1: Look for the longest standard suffix and delete it.

Step 2: If no suffix was removed in step 1, seek to eliminate verb suffixes.

Step 3: Find the longest among the residual suffixes "os a o á í ó e é" in RV and eliminate it.

Step 4: Remove acute accents. [36]

For more information about the suffixes in steps 1 and 2, go to the Snowball page http://snowball.tartarus.org/algorithms/spanish/stemmer.html.

The previous algorithm will be available in Weka when the following command is applied. For Windows:

java -classpath "weka.jar;snowball-20051019.jar" weka.gui.GUIChooser

For Linux:

java -classpath "weka.jar:snowball-20051019.jar" weka.gui.GUIChooser

[37] Weka.wikispaces.com, (2015).


The jar snowball-20051019.jar must previously be compiled and stored in the location where the Weka application resides on the computer. The installation may be confirmed with the command:

java weka.core.SystemInfo

As shown in the figure below.

Stopwords

These are terms that are widespread, appear most frequently, and do not provide information about a text. This option determines whether a substring of the text is a stopword. Stopword terms come from a predefined list. This option converts all words to lowercase before term removal. Removing stopwords is pertinent to eliminate meaningless words within the text and to keep frequent but useless words out of decision trees. Weka's default stopwords are based on the Rainbow lists found at the next link:

http://www.cs.cmu.edu/~mccallum/bow/rainbow/.


Rainbow is a program that performs statistical text classification. It is based on the Bow library [38] Cs.cmu.edu, (2015). The format of these lists is one word per line, and comment lines must start with '#' to be omitted. WEKA is configured with an English stopword list, but you can set different stopword lists. You can change this list from the user interface by clicking on the option; by default Weka uses the weka-3-6 list, but any location that points to a desired list can be chosen. Rainbow has separate lists for English and Spanish; to cover both languages, the "ES-stopwords" list combines both lists from Rainbow.

useStoplist:

A flag to use stopwords. If set to "True", the filter ignores the words that are in the predefined stopword list from the previous option.
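The effect of useStoplist can be sketched in a few lines of Python; the three-word list here is a toy stand-in for the Rainbow-based lists, not Weka's actual list:

```python
# Toy stopword list; Weka's default comes from the Rainbow lists.
stopwords = {"the", "a", "of"}

def remove_stopwords(tokens):
    # As with useStoplist, each token is lowercased before the lookup,
    # then dropped if it appears in the list.
    return [t for t in tokens if t.lower() not in stopwords]

kept = remove_stopwords(["The", "flight", "of", "a", "weka"])
print(kept)  # ['flight', 'weka']
```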

Tokenizer:

Chooses the unit used to separate each text attribute of the ARFF into tokens. This has three sub-options. The first is AlphabeticTokenizer, where tokens are only continuous sequences of alphabetic symbols, and which cannot be configured; its tokenization only considers the English alphabet. The second is the WordTokenizer option, which establishes a list of delimiters. As was referenced previously, punctuation in Spanish includes "; : . ¡ ! ¿ ? - ( ) [ ] ' " << >>"; Spanish, unlike English, uses an opening sign as well as a closing one in exclamations and questions.

The third is NGramTokenizer, which divides the original text string into subsets of consecutive words that form a pattern with unique meaning. Its parameters are "delimiters", whose default is ' \r\n\t.,;:'"()?!', NGramMaxSize, the maximum size of the n-gram, with a default value of 3, and NGramMinSize, the minimum size of the n-gram, with a default value of 1. N-grams can help uncover patterns of words that together represent a meaningful context.

minTermFreq:

Sets the minimum frequency each word or term must possess to be considered as an attribute; the default is 1. When there is a class attribute and the flag "doNotOperateOnPerClassBasis" has not been set to true, the text of the entire string attribute for a particular class value is tokenized together, and the frequency of each token is calculated based on its frequency within that class. In contrast, if there is no class attribute, the filter calculates a single dictionary, and the frequency is calculated based on the entire set of values of the chosen attribute, not only those related to a particular class value.

periodicPruning

Eliminates low-frequency words. It uses a numeric value, a percentage of the size of the input data set, that sets how often the dictionary is pruned. The default value is -1, meaning no periodic pruning. The periodic pruning rate is specified as a percentage of the data set: for example, a value of 15 specifies that after each 15% of the input data set, the dictionary is pruned, instead of pruning only after creating one comprehensive dictionary, for which there may not be enough memory.

attributeNamePrefix

Sets the prefix for the names of the created attributes; by default it is "". This only provides a prefix to be added to the names of the attributes that the StringToWordVector filter creates when the document is fragmented.


lowerCaseTokens

A flag that, when set to "True", converts all words in the document to lowercase before they are added to the dictionary. Setting it to true eliminates the distinction given by names that begin with an uppercase letter. Acronyms can be kept distinct when this option is set to "False".

attributeIndices

Sets the range of attributes the filter acts on. The default is first-last, which ensures that all attributes are treated as if they were a single chain from first to last. A range is written as a comma-separated list of indices.

invertSelection

A flag to invert the attributes selected in the range. It is set to true to work only with the unselected attributes; the default value is "False", that is, to work with the selected attributes.

After cleaning the data in the "Preprocess" tab, the attribute vectors are analyzed in the "Classify" tab to obtain the desired knowledge.

Classification

The second panel of the Explorer is "Classify", the classification model generated by machine learning from the training data. These models serve as a clear explanation of the structure found in the analyzed information. Weka especially considers the J48 decision tree model, the most popular for text classification. J48 is the Java implementation of the C4.5 algorithm, previously described as the algorithm in which each branch represents one of the possible choices, in the if-then format the tree offers, and each leaf represents a result. The C4.5 algorithm can be summarized as measuring the amount of information contained in a data set and grouping attributes by importance, which gives an idea of the importance of a given attribute in a data set. J48 prints the tree structure recursively into a variable of type string by accessing the information stored in each attribute's nodes.

To create a classification, you must first choose the classifier algorithm with the "Choose" button located in the upper left side of the window. This button displays a tree where the root is weka and a sub-folder is "classifiers". Within the sub-folder weka.classifiers.trees, tree models such as J48 and REPTree are found. REPTree combines the standard decision tree with reduced-error pruning. To access the classifier's options, double-click the name of the selected classifier.

"Test Options".

The classification has four main modes, and other options to manage the training data. These are found in the section "Test Options", with the following options:

a) Use training set: trains the method with all available data and applies the results on the same data collection.

b) Supplied test set: selects a test data set from a file or URL. This set must be compatible with the initial data and is selected by pressing the "Set" button.

c) Cross-validation: performs a cross-validation depending on the number of "Folds" selected. Cross-validation specifies a number of partitions, which determines how many temporary models will be created (folds). First one part is selected, then a classifier is built from all parts except the selected one, which remains for testing [32] Garcia, D., (2006).

d) Percentage split: defines the percentage of the total input from which the classifier model is built; the remaining part is used for testing.
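The fold mechanism in option (c) can be sketched as follows; this is a toy Python illustration of the splitting idea, not Weka's stratified implementation:

```python
def folds(instances, k):
    # Each fold takes every k-th instance for testing; the classifier
    # would be trained on the rest, so each instance is tested exactly once.
    for i in range(k):
        test = instances[i::k]
        train = [x for x in instances if x not in test]
        yield train, test

# 10 instances, 5 folds: each fold trains on 8 and tests on 2.
splits = list(folds(list(range(10)), 5))
```

The final evaluation reported by Weka averages the results over all k temporary models.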


Weka allows us to select several further options for defining the test method with the "More Options" button. These are:

Output model: shows the classifier model in the output window.

Output per-class stats: displays statistics for each class.

Output entropy evaluation measures: displays entropy measures in the evaluation output.

Output confusion matrix: displays the confusion matrix resulting from the classifier.

Store predictions for visualization: Weka will keep the classifier model's predictions on the test data. When this option is used, the J48 classifier will show the tree errors.

Output predictions: shows a table of the actual and predicted values for each instance of the test data. It states the relation between the classifier and each instance in the test data.

Output additional attributes: set to display the values of additional attributes, not just those of the class. A range is specified to be included along with the actual and predicted values of the class.

Cost-sensitive evaluation: produces additional information in the evaluation output: the total cost and average cost of misclassification.

Random seed for xval / % Split: specifies the random seed used when the data are divided for evaluation purposes.

Preserve order for % Split: retains the order of the data in the percentage split instead of randomizing it first; the default seed value is 1.

Output source code: generates the Java code of the model produced by the classifier.


When no independent evaluation data set is available, it is still necessary to obtain a reasonably accurate idea of the quality of the generated model, so the appropriate option must be selected. For document classification it is recommended to use cross-validation with at least 10 "Folds" as the assessment approach. It is also recommended to allocate a large percentage to "Percentage Split".

Below "Test Options" there is a menu listing all attributes. It allows you to select the attribute that acts as the target of the classification; for document classification this is the class to which each instance belongs.

The classification method starts by pressing the "Start" button. The image of the Weka bird found in the bottom right will dance until the classifier completes.

WEKA creates a graphical representation of the J48 classification tree. This tree can be viewed by right-clicking on the last entry in the "Result list" and selecting the "Visualize tree" option. The window size can be adjusted by right-clicking and selecting "Fit to screen".

The J48 classifier for document classification

The J48 model uses the C4.5 decision tree algorithm to build a model from the selected training data. The algorithm is found in weka.classifiers.trees. The J48 classifier has several parameters that can be edited by double-clicking on the name of the selected classifier.

J48 employs two pruning methods (reduced-error pruning is a separate option). The main objectives of pruning are to make the tree easier to understand and to reduce the risk of overfitting the training data, where the tree learns the specific properties of the training examples rather than the underlying concept in its drive to classify them almost perfectly.


The first J48 pruning method is known as subtree replacement. Nodes in a decision tree can be replaced with a leaf, reducing the number of nodes in a branch. This process starts from the fully formed leaves and works upward towards the root.

The second is subtree raising. A node is moved towards the tree root and replaces other nodes in the branch. This process is normally not negligible in cost, and it is wise to turn it off when the induction process takes a long time.

Clicking on the name of the J48 classifier, which is located right next to the "Choose" button, displays a window with the following editable options:

confidenceFactor: sets the amount of pruning. Lower values produce more pruning. Reducing this value may reduce the size of the trees and also helps remove irrelevant nodes that generate misclassification. [40] Drazin, S., & Montag, M. (2015).

minNumObj: sets the minimum number of instances per leaf, useful for trees with many branches.

unpruned: flag to disable pruning. When "True" the tree is left unpruned. The default is "False", which means that pruning is carried out.

reducedErrorPruning: flag to use reduced-error pruning instead of C4.5's standard error-based pruning. When enabled, errors are estimated on a hold-out set after pruning, and the confidence level is not used for pruning.

Seed: seed used to shuffle the data randomly for reduced-error pruning. It is considered only when the reducedErrorPruning flag is set to "True". The default seed is 1.

numFolds: sets the number of folds retained for reduced-error pruning: one fold is used for pruning and the rest for training. The reducedErrorPruning flag must be set to "True" for these folds to be used.


binarySplits: when this flag is set to "True", only two branches are created for nominal attributes with multiple values instead of one branch per value. When the nominal attribute is binary there is no difference, except in how the attribute is shown in the output. The default is "False".

saveInstanceData: set to "True" to store the training data for visualization. The default is "False".

subtreeRaising: flag to prune with the subtree raising method, which moves a node towards the tree root, replacing other nodes. When "True", Weka considers subtree raising in the pruning process.

useLaplace: when "True", the counts at the leaves are smoothed using the Laplace estimate, a popular correction to probability estimates.

debug: flag to add information to the console. When "True", the classifier writes additional information to the console.

It is possible to reach 100% correct classification on the training data by disabling pruning and setting the minimum number of instances in a leaf to 1.


Weka document classification

The Weka tool was selected in order to generate a model that classifies specialized documents from two different corpora (English and Spanish). The WEKA package is a collection of machine learning algorithms for data mining tasks. Text mining uses these algorithms to learn from examples, the "training set", so that new texts can be classified into the analyzed categories. WEKA stands for Waikato Environment for Knowledge Analysis. For more information visit http://www.cs.waikato.ac.nz/~ml/weka/.

Installing WEKA

Weka can be downloaded from:

http://www.cs.waikato.ac.nz/ml/weka/downloading.html.

The version used in this tutorial is Weka 3.6.12.

For Windows: WEKA appears in the program launcher inside a Weka folder. The Weka default directory is the directory from which the file is loaded.

For Linux: open a terminal and type: java -jar /installation/directory/weka.jar.


Based on the text mining methodology, Weka is presented in a framework with four stages: data acquisition, document preprocessing, information extraction and evaluation.

Data Acquisition

ARFF files are the primary format for any classification task in WEKA. These files hold the basic input data (concepts, instances and attributes) for data mining. An Attribute-Relation File Format file describes a list of instances of a concept with their respective attributes.

The documents selected for the training data set were found in the Thompson Rivers University library, at the following link: http://www.tru.ca/library.html. 71 medical academic articles in English and Spanish were randomly selected. These documents are stored in Portable Document Format (PDF). Based on the TRU library, a classification of these documents into six categories was recognized: Hemodialysis, Nutrition, Cancer, Obesity, Diet and Diabetes. The documents are stored in directories named after their categories within the main folder, called Medicine. As shown in the figure below.

In order to create the arff file, an application was developed in Microsoft Visual Studio Professional C# 2012 that generates the arff from a directory containing a collection of documents organized by category name. This application was built with the help of a library called iTextSharp PDF for text extraction from the portable document format.

Documents Directory to ARFF lets you specify the name of the relation to define, the location of the home directory that contains all documents subdivided into category directories, and any comments required. It also specifies the name of the generated file, with the arff extension, and its location. At the bottom of the application there are two buttons, one to exit and another to generate the arff file with the information described.

The application can be downloaded from http://www.scientificdatabases.ca under current projects for Text Mining.

The resulting arff contains a string attribute called "textoDocumento" that holds all text found in the document, and the nominal attribute "docClass" that defines the class to which it belongs. As a note, in recent versions of Weka, such as 3.6.12 used here, the class attribute can never be named "class".


The file will be generated as follows:

% Weka tutorial for document classification.

@RELATION Medicina

@attribute textoDocumento string

@attribute docClass {Hemodialysis, Nutrition, Cancer, Obesity, Diet, Diabetes}

@data

"texto…", Hemodialysis

"texto…", Nutrition

"texto….", Cancer

"texto…", Obesity

"texto…", Diet

"texto…", Diabetes

Document Preprocessing

Weka contains tools for data preprocessing, classification, regression, clustering, association rules, and visualization.

"Applications" is the first screen in Weka, where the desired sub-tool is selected. Here "Explorer" is selected. It consists of six panels: Preprocess, Classify, Cluster, Associate, Select attributes and Visualize.

Preprocess

Preprocessing for the classification of documents.

To load the generated arff, click on the "Open file..." button at the top of the Preprocess panel. Select the created file "medicinaWeka.arff".

"Current relation" describes the dataset that has been loaded: the relation name medicina, the number of instances (71) and the total number of attributes (2). Below, in the "Attributes" section, the attributes are listed. This frame allows you to select the attributes; in this case "textoDocumento" and "docClass" are shown.

When "docClass" is selected, the "Selected attribute" frame describes the nominal attribute with its 6 labels and their instance counts: 11 instances of Hemodialysis and 12 instances of each of the others: Nutrition, Cancer, Obesity, Diet and Diabetes. At the bottom of this section a histogram of the "docClass" labels is illustrated; hovering over the graph shows the attribute name, as the following figure illustrates.

Weka uses the StringToWordVector filter to convert the "textoDocumento" attribute into a set of attributes that represent the occurrence of words in the full text. This filter is an unsupervised technique: it operates on a set of observations without knowing the correct classification.
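The effect of the filter on a single document can be illustrated with a small sketch in plain Java: a word-count map, with lowercasing and simple delimiters. This is an approximation for illustration only, not Weka's code; the delimiter set is an assumption chosen to match the tokenizer discussed later.

```java
import java.util.Map;
import java.util.TreeMap;

public class WordVectorSketch {
    // Builds a word-count vector for one document: lowercase the text,
    // split on whitespace and punctuation, and count each token.
    static Map<String, Integer> wordCounts(String text) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String token : text.toLowerCase().split("[\\s.,;:'\"()?!\\[\\]<>¿¡-]+")) {
            if (!token.isEmpty()) counts.merge(token, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // A short Spanish example; "dieta" occurs twice.
        System.out.println(wordCounts("La dieta y la nutrición: dieta sana."));
    }
}
```

In the real filter, one such vector is built per document and the union of all words becomes the new attribute set.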

The filters are found by clicking the "Choose" button in the "Filter" section. This button opens a window with root weka. From there select filters, then the unsupervised folder, then attribute, and finally StringToWordVector.


The StringToWordVector filter can be configured with language processing techniques. To edit the filter, simply click on the filter name; a window will open showing the following options.

A set of optimal options was derived from different combinations applied to the same training data. For each resulting model the F-measure, which balances the proportions of correctly and incorrectly predicted instances, was calculated. The options that generated the greatest number of correctly predicted instances are as follows:

a) wordsToKeep: set to 1000, the limit of words per class to keep, with the doNotOperateOnPerClassBasis flag left "False" so that the limit is applied per class.

b) TFTransform set to "True", IDFTransform set to "True", outputWordCounts set to "True" and normalizeDocLength set to "No normalization".

The values are not normalized so that the filter can find documents that are more interrelated and count how often a word appears in a document, rather than only considering whether the term is present. outputWordCounts is the flag that switches from recording whether a word exists in the document to recording its count, and normalizeDocLength controls whether a word's tf-idf value is adjusted for document length; with no normalization the value is kept regardless of how short or long the document is.

c) lowerCaseTokens: set to "True" to convert all words to lowercase before they are added to the dictionary, so that the same word in lowercase and uppercase is not analyzed separately.


d) Stemmer: selects the algorithm that removes morphemes in a given language in order to reduce each word to its root. No stemmer is selected here, because the classification of texts is multilingual and stemming would apply to only one language. No stemmer is configured by clicking the "Choose" button and selecting "NullStemmer" from the menu that is displayed.

Weka ships with a standard algorithm for English from snowball.tartarus.org. Snowball is a string processing language designed for creating stemmers, and it features a stemming algorithm for Spanish. To use the Spanish algorithm, download the jar snowball-20051019.jar from https://weka.wikispaces.com/Stemmers. Store it in the location where the Weka application is. Finally, the algorithm is added when Weka is launched from the command line as follows.

For Windows: java -classpath "weka.jar;snowball-20051019.jar" weka.gui.GUIChooser

For Linux: java -classpath "weka.jar:snowball-20051019.jar" weka.gui.GUIChooser

This can be confirmed by verifying the java.class.path parameter with the command:

java weka.core.SystemInfo

As shown in the following figure:


Having set up the SnowballStemmer, select it by clicking the "Choose" button. This button displays a menu; navigate to weka > core > stemmers and choose SnowballStemmer.

Click on the stemmer name and a window where the language can be set will appear. For Spanish, in the field labeled "stemmer", type "spanish" in place of "porter" and click "OK".

e) Stopwords determines whether a substring in a text is a word that provides no information about the text. These words come from a predefined Rainbow list, where the default is Weka-3-6. Rainbow is a program that performs statistical text classification based on the Bow library. Rainbow has separate lists for English and Spanish; in order to handle both languages, the "ES-stopwords" file, which contains both Rainbow lists, is used. The "ES-stopwords" list can be downloaded from http://www.scientificdatabases.ca/current-projects/english-spanish-text-data-mining/.

To change the list, click on Weka-3-6, next to the stopwords label, and choose the previously downloaded "ES-stopwords". Set the useStoplist option to "True" to ignore the words that appear in the "ES-stopwords" list.

f) Tokenizer: chooses the unit used to split the "textoDocumento" attribute. Clicking the "Choose" button displays a menu; select "WordTokenizer". Set the "delimiters" for English and Spanish by clicking on the name, and the following window will appear. The delimiters used are .,;:'"()?!¿¡-[]<>, which include the inverted opening marks Spanish uses for exclamation and interrogation. As shown in the figure below.

Another option is to choose NGramTokenizer, which divides the original text string into subsets of consecutive words that form a pattern with unique meaning. Its default "delimiters" are ' \r\n\t.,;:"()?!'. This is useful to help uncover patterns of words that together represent a meaningful context.

g) minTermFreq: the default is 1, the minimum frequency a word must have to be considered as an attribute. For this to operate per class, the "doNotOperateOnPerClassBasis" flag should be "False".

h) periodicPruning: set to -1 for no pruning, so that low-frequency words are not periodically removed.


i) attributeNamePrefix: left empty so that no prefix is added to the generated attributes.

j) attributeIndices: kept as first-last to ensure that all attributes are treated as a single range from first to last.

k) invertSelection: kept as "False" to work with the selected attributes.
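The term-frequency and document-frequency transforms chosen in option b) above correspond to a log-scaled term frequency multiplied by an inverse document frequency. A rough sketch of the combined weight is below; the exact formula Weka uses may differ slightly, and the class and method names are invented for the example.

```java
public class TfIdfSketch {
    // Log-scaled term frequency times inverse document frequency.
    // termCountInDoc: occurrences of the word in this document,
    // docsContainingTerm: number of documents containing the word,
    // totalDocs: size of the collection (71 in this tutorial).
    static double tfIdf(int termCountInDoc, int docsContainingTerm, int totalDocs) {
        double tf = Math.log(1 + termCountInDoc);                       // TFTransform-style weight
        double idf = Math.log((double) totalDocs / docsContainingTerm); // IDFTransform-style weight
        return tf * idf;
    }

    public static void main(String[] args) {
        // A term appearing 3 times in a document, present in 5 of 71 documents.
        System.out.printf("%.4f%n", tfIdf(3, 5, 71));
    }
}
```

Note that a word present in every document gets an idf of log(1) = 0, which is exactly why stopword-like terms contribute little under this weighting.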

At the end you can save, cancel or apply the settings. The window should look as follows:


To save the algorithm with these options, click the "Save..." button and then select the location and name.

To apply the algorithm with these options, click the "OK" button. This returns to the "Preprocess" window, where the "textoDocumento" attribute must be selected in the "Attributes" frame.

Click the "Apply" button, located in the upper right of the "Filter" module. The Weka image located in the lower right corner will dance until the process is complete.

Information extraction

After the data cleaning on the "Preprocess" tab, information extraction proceeds. Click on the "Classify" tab, the second panel of the Explorer.

This stage analyzes the attribute vectors to create the classification model that will define the structure found in the analyzed information.

Weka's J48 decision tree model is considered the most popular for text classification. J48 is the Java implementation of the C4.5 algorithm, in which each node represents one of the possible decisions to be taken and each leaf represents the predicted class.

First, choose the classification algorithm with the "Choose" button located in the upper left side of the window.


This button displays a tree whose root is weka, with a "classifiers" subfolder. Within weka.classifiers.trees, select the J48 tree model, as shown in the following figure.

Double-click on the name of the J48 classifier, located next to the "Choose" button, to access its options.


It is possible to reach 100% correct classification by disabling pruning and setting the minimum number of instances in a leaf to 1. In this case the parameter changed is:

a) minNumObj: set to 1, leaving the other parameters in the default configuration.

In the "Test Options" module the evaluation method is set. Select "Use training set" to train the method with all available data and apply the results to the same input data collection.


Additionally, you can apply a partitioning percentage to the input data by selecting the "Percentage Split" option and defining the percentage of the total input data used to build the classifier model, leaving the remaining part for testing.

Under "Test Options" there is a menu that displays a list with all attributes. In this case select "docClass", because this is the attribute that acts as the target of the classification in this example.

The classification method is started by pressing the "Start" button. The Weka bird image found in the bottom right will begin to dance until the end of the classification process.


WEKA creates a graphical representation of the J48 classification tree. This tree can be viewed by right-clicking on the last entry in the "Result list" and selecting the "Visualize tree" option. The window size can be adjusted to make the tree more readable by right-clicking and selecting "Fit to screen", as shown in the image below.

Results Evaluation

Weka summarizes the model's predictive quality with the Fβ score. The value combines precision and recall. Precision measures the percentage of positive predictions that are truly positive. Recall is the ability to detect positive cases out of the total of all positive cases.


With these percentages, the best model is expected to be the one whose F-measure is closest to 1. The following table shows some combinations that are significant in the data preprocessing for model generation. This comparison table reports the precision and recall measures as well as the F-measure.

First, the best filter options are analyzed with default values for the J48 classifier, and the best parameters are selected. Then the best settings for the J48 classifier algorithm are selected using the best configuration of the StringToWordVector filter.

Comparison table: Document classification models.

Features | Precision | Recall | F-Measure
Word Tokenizer English Spanish (E&S) | 0.810 | 0.803 | 0.800
Word Tokenizer E&S + Lower Case Conversion | 0.863 | 0.859 | 0.860
Trigrams E&S + Lower Case Conversion | 0.823 | 0.775 | 0.754
Stemming + Word Tokenizer E&S + Lower Case Conversion | 0.864 | 0.817 | 0.823
Stopwords + Word Tokenizer E&S + Lower Case Conversion | 0.976 | 0.972 | 0.972
Stopwords + Stemming + Word Tokenizer E&S + Lower Case Conversion | 0.974 | 0.972 | 0.971
Stopwords + Word Tokenizer E&S + Lower Case Conversion + J48 minNumObj = 1 | 1 | 1 | 1

In conclusion, the best model combines the options Stopwords + Word Tokenizer E&S + Lower Case Conversion, applied to the filter during data preprocessing, with minNumObj additionally adjusted to 1 in the J48 classifier algorithm.


The following confusion matrix is the result of the combination of Stopwords + Word Tokenizer E&S + Lower Case Conversion with minNumObj adjusted to 1 in the J48 algorithm. This generates the following values in the confusion matrix.

a b c d e f <-- classified as

11 0 0 0 0 0 a = Hemodialysis

0 12 0 0 0 0 b = Nutrition

0 0 12 0 0 0 c = Cancer

0 0 0 12 0 0 d = Obesity

0 0 0 0 12 0 e = Diet

0 0 0 0 0 12 f = Diabetes

The matrix shows that every class has precision and recall of 100%. The accuracy values for each class are as follows:

Class TP Rate FP Rate Precision Recall F-Measure

Hemodialysis 1 0 1 1 1

Nutrition 1 0 1 1 1

Cancer 1 0 1 1 1

Obesity 1 0 1 1 1

Diet 1 0 1 1 1

Diabetes 1 0 1 1 1

Weighted Avg. 1 0 1 1 1


Conclusion

Document classification in Spanish was analyzed using text mining through Weka, an open source software package. This software analyzes large amounts of data and decides which parts are the most important. It aims to make automatic predictions that support decision making. When comparing WEKA with other data mining tools such as RapidMiner, IBM Cognos Business Intelligence, Microsoft SharePoint and Pentaho, Weka provides a friendly, easy to understand interface, loads data efficiently and has data mining as its main objective.

Text mining seeks to extract patterns from the analysis of large collections of documents in order to gain new knowledge. Its purpose is the discovery of interesting groups, trends and associations, and the visualization of new findings.

Text mining is considered a subset of data mining. For this reason, text mining adopts the data mining techniques that use machine learning algorithms. Computational linguistics also provides techniques for text mining; this science studies natural language with computational methods to make it understandable to computer systems.

Automatic categorization determines the subject matter of a document collection. Unlike clustering, it chooses the class to which a document belongs from a list of predefined classes. Each category is trained through a prior manual categorization process.

Classification starts with a set of training texts that have been previously categorized, and then generates a classification model based on this set of examples. The model is then able to assign the correct class to a new text. A decision tree is a classification technique that represents knowledge through an if-else statement structure laid out in the branches of a tree.
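The if-else structure of a decision tree can be made concrete with a tiny hand-written sketch in Java. The word counts and thresholds here are invented for illustration; a real J48 tree learns its tests and leaves from the training data.

```java
public class DecisionTreeSketch {
    // A hand-written two-level decision tree: each if-test is an internal
    // node on one word count, each return is a leaf naming a predicted class.
    static String classify(int countDiabetes, int countDieta) {
        if (countDiabetes > 0) {
            return "Diabetes";        // leaf
        } else if (countDieta > 2) {
            return "Diet";            // leaf
        } else {
            return "Nutrition";       // leaf
        }
    }

    public static void main(String[] args) {
        System.out.println(classify(1, 0)); // prints "Diabetes"
        System.out.println(classify(0, 3)); // prints "Diet"
    }
}
```

J48 produces exactly this kind of nested structure, only with tests and classes induced by the C4.5 algorithm rather than written by hand.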


The text mining methodology provides a framework performed in four stages: data acquisition, document preprocessing, information extraction and evaluation of results. Witten, Frank and Hall mention these steps in their work on the use of WEKA.

Data should be collected in a way that makes it possible to create a training dataset. Witten, Frank and Hall consider three kinds of input data for text mining: concepts, instances and attributes. The concepts specify what we want to learn. An instance represents the data of one example to be classified, containing a set of specific characteristics called attributes. An attribute value represents the measurement of that characteristic in that instance. In the case of document classification, the classes are nominal attributes, because the categories need not represent an order between them (as ordinal attributes would).

WEKA uses a standard format called Attribute-Relation File Format (ARFF) to represent the collection of documents as instances that share an ordered set of attributes, divided into 3 sections: relation, attributes and data.

Data preprocessing prepares the text by applying a series of operations over it, generating some kind of structured or semi-structured information for analysis. The most popular way to represent documents is with a vector that contains all words found in the text together with their occurrences. Important preprocessing tasks for document categorization are stemming, lemmatization, removal of empty words (stopwords), tokenization and conversion to lowercase.

A stemming algorithm eliminates morphemes, finding the relationship between words and their lexeme. Stopword removal excludes the words that do not help to generate knowledge from the text. Tokenization separates the text into words using punctuation. Spanish punctuation includes "; . : ? ! - () [] ' << >>", where the dot and dash are ambiguous; in addition, unlike English, Spanish uses inverted opening marks for exclamation and interrogation. Conversion to lowercase treats all terms equally regardless of letter case.

After data preprocessing, the next step is knowledge extraction. Document classification in Weka seeks to learn a predictive classification model. These models are used to predict the class to which an instance belongs. The model is created using the C4.5 decision tree algorithm, as it is simple and widely used for the classification task.

Weka generates a confusion matrix for the generated model. This shows, in an easy way, how many of the model's predictions were correct. The four possible outcomes are: TP, true positive: a positive instance predicted as positive. TN, true negative: a negative instance correctly classified as negative. FP, false positive: a negative instance incorrectly classified as positive. FN, false negative: a positive instance incorrectly classified as negative.

Precision and recall are relevant metrics for document classification. The classification model reports its results in a confusion matrix, from which the predictive efficiency is calculated. Precision is the percentage of predicted positive cases that are correct: TP / (TP + FP). Recall, or sensitivity, is the ability to predict positive instances out of the total of all positive instances: TP / (TP + FN). These measures are balanced in the F-measure. The F1 measure is calculated with the following equation: F1 = (2 * precision * recall) / (precision + recall).
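These formulas can be checked with a short sketch in Java. The counts used below are made up for illustration and are not taken from the tutorial's confusion matrix.

```java
public class FMeasureSketch {
    // Precision, recall and F1 for one class, from confusion-matrix counts.
    static double precision(int tp, int fp) { return tp / (double) (tp + fp); }
    static double recall(int tp, int fn)    { return tp / (double) (tp + fn); }
    static double f1(double p, double r)    { return 2 * p * r / (p + r); }

    public static void main(String[] args) {
        // Hypothetical counts: 10 true positives, 2 false positives, 1 false negative.
        double p = precision(10, 2);
        double r = recall(10, 1);
        System.out.printf("P=%.3f R=%.3f F1=%.3f%n", p, r, f1(p, r));
        // prints P=0.833 R=0.909 F1=0.870
    }
}
```

Because F1 is a harmonic mean, it sits below the arithmetic mean of precision and recall and is dragged toward whichever of the two is lower.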

The selected training data set was found in the Thompson Rivers University library. 71 medical academic articles in English and Spanish, stored in PDF format, were randomly selected. Based on the TRU library, these documents were classified into six categories: Hemodialysis, Nutrition, Cancer, Obesity, Diet and Diabetes. The documents are stored in directories named after their categories within the main folder, called Medicine.

To build the ARFF input file, an application was developed that generates the ARFF from a directory-based document collection. It was implemented with the help of the iTextSharp library, which extracts the text from PDF (Portable Document Format) files. The application is named Documents Directory to ARFF.

The resulting ARFF contains a string attribute called "DocumentText", holding all the text found in the document, and a nominal attribute "docClass", defining the class to which the document belongs. Note that in recent versions of Weka (3.6.12 in this case) the class attribute can no longer be named "class".
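The tool itself is a .NET application that relies on iTextSharp for PDF text extraction, but the directory-to-ARFF logic can be sketched in Python for plain-text files (the function name, relation name, and escaping scheme are illustrative assumptions):

```python
import os

def directory_to_arff(root, out_path):
    """Each subdirectory of `root` is a class; every file inside it
    becomes one instance: its full text plus the directory's name."""
    classes = sorted(d for d in os.listdir(root)
                     if os.path.isdir(os.path.join(root, d)))
    lines = ["@relation Medicine",
             "@attribute DocumentText string",
             # The class attribute must not be named "class".
             "@attribute docClass {%s}" % ",".join(classes),
             "@data"]
    for cls in classes:
        folder = os.path.join(root, cls)
        for name in sorted(os.listdir(folder)):
            with open(os.path.join(folder, name), encoding="utf-8") as f:
                # Escape quotes and flatten newlines so each instance
                # stays on a single ARFF data line.
                text = f.read().replace("'", "\\'").replace("\n", " ")
            lines.append("'%s',%s" % (text, cls))
    with open(out_path, "w", encoding="utf-8") as f:
        f.write("\n".join(lines) + "\n")
```

Running it over the Medicine folder yields one `'document text',category` line per article, ready to load into the Weka Explorer.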

Various tests were applied to the same set of texts to assess the predictive accuracy of the model. A set of candidate configurations was generated from different combinations of options applied to the same training data, and for each resulting model the F-measure was calculated.

First, the best structure for the StringToWordVector filter was analyzed with the J48 classifier options left unadjusted, and the best filter parameters were selected. That configuration was then used to assess the best settings for the J48 classifier algorithm. A comparison chart showed that combining Stopwords + Word Tokenizer E&S + Lower Case Conversion in the filter, while adjusting minNumObj to 1 in the J48 algorithm, yields precision and recall values of 1.

The conclusion is that the best model combines the Stopwords + Word Tokenizer E&S + Lower Case Conversion options in the data preprocessing filter with minNumObj adjusted to 1 in the J48 classifier algorithm.


References

[1] Witten, I. H., Frank, E., & Hall, M. A. (2011). Data Mining: Practical Machine Learning Tools and Techniques (3rd ed.). Elsevier.

[2] Berry, M. W., & Kogan, J. (2010). Text Mining: Applications and Theory. Hoboken, NJ: John Wiley & Sons.

[3] Hearst, M. A. (1999). Untangling Text Data Mining. Proc. of ACL'99: The 37th Annual Meeting of the Association for Computational Linguistics, University of Maryland, June 20–26, 1999.

[4] Kodratoff, Y. (1999). Knowledge Discovery in Texts: A Definition and Applications. Proc. of the 11th International Symposium on Foundations of Intelligent Systems (ISMIS-99), 1999.

[5] Montes-y-Gómez, M. Minería de texto: Un nuevo reto computacional. Laboratorio de Lenguaje Natural, Centro de Investigación en Computación, Instituto Politécnico Nacional.

[6] Ethnologue. (2015). Summary by language size. Retrieved 23 June 2015, from https://www.ethnologue.com/statistics/size

[7] Brun, R. E., & Senso, J. A. (2004). Minería textual. El profesional de la información, 3.

[8] Hotho, A., Nürnberger, A., & Paaß, G. (2005). A Brief Survey of Text Mining. LDV Forum - GLDV Journal for Computational Linguistics and Language Technology, 20, 19–62.

[9] Streibel, O. (2010). Mining Trends in Texts on the Web. DC-FIS 2010 Doctoral Consortium of the Future Internet Symposium 2010.

[10] Gémar, G., & Jiménez-Quintero, J. A. (2015). Text mining social media for competitive analysis. Tourism & Management Studies, 11(1), 84–90.

[11] Quinlan, J. R. (1986). Induction of decision trees. Machine Learning, 1(1), 81–106.

[12] Quinlan, J. R. (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann.

[13] Hernández, J., Ramírez, M. J., & Ferri, C. (2004). Introducción a la minería de datos. Pearson.

[14] Ye, N. (Ed.). (2003). The Handbook of Data Mining (Vol. 24). Mahwah, NJ/London: Lawrence Erlbaum Associates.

[15] Miner, G. (2012). Practical Text Mining and Statistical Analysis for Non-structured Text Data Applications. Waltham, MA: Academic Press.

[16] Stevens, S. (1946). On the Theory of Scales of Measurement. Science, 103(2684), 677–680.

[17] Uysal, A. K., & Gunal, S. (2014). The impact of preprocessing on text classification. Information Processing and Management, 50(1), 104–112. doi:10.1016/j.ipm.2013.08.006

[18] Salton, G., Wong, A., & Yang, C. S. (1975). A Vector Space Model for Automatic Indexing. Communications of the ACM, 18(11), 613–620. doi:10.1145/361219.361220

[19] Ning Liu, Benyu Zhang, Jun Yan, Zheng Chen, Wenyin Liu, Fengshan Bai, & Leefeng Chien. (2005). Text Representation: From Vector to Tensor. In IEEE International Conference on Data Mining, ICDM 2005, 725–728.

[20] Munková, D., Munk, M., & Vozár, M. (2013). Data Pre-processing Evaluation for Text Mining: Transaction/Sequence Model. Procedia Computer Science, 18 (2013 International Conference on Computational Science), 1198–1207. doi:10.1016/j.procs.2013.05.286

[21] Muñoz, A., & Álvarez, I. (2014). Esteganografía lingüística en lengua española basada en modelo N-gram y ley de Zipf. Arbor.

[22] Ramesh, B., Xiang, C., & Lee, T. H. (2015). Shape classification using invariant features and contextual information in the bag-of-words model. Pattern Recognition, 48, 894–906. doi:10.1016/j.patcog.2014.09.019

[23] Ferilli, S., Esposito, F., & Grieco, D. (2014). Automatic Learning of Linguistic Resources for Stopword Removal and Stemming from Text. Procedia Computer Science, 38 (10th Italian Research Conference on Digital Libraries, IRCDL 2014), 116–123. doi:10.1016/j.procs.2014.10.019

[24] C., K. C., Anzola, J. P., & B., G. T. (2015). Classification Methodology of Research Topics Based in Decision Trees: J48 and Random Tree. International Journal of Applied Engineering Research, 10(8), 19413–19424.

[25] Yan-yan, S., & Ying, L. (2015). Decision tree methods: applications for classification and prediction. Shanghai Archives of Psychiatry, 27(2), 130–135. doi:10.11919/j.issn.1002-0829.215044

[26] Spasić, I., Livsey, J., Keane, J., & Nenadić, G. (2014). Text mining of cancer-related information: Review of current status and future directions. Retrieved 12 May 2015, from http://www.sciencedirect.com/science/article/pii/S1386505614001105

[27] Ostrand, T., & Weyuker, E. (2007). How to measure success of fault prediction models. Fourth International Workshop on Software Quality Assurance in Conjunction with the 6th ESEC/FSE Joint Meeting - SOQUA '07.

[28] Bowes, D., Hall, T., & Gray, D. (2013). DConfusion: A technique to allow cross study performance evaluation of fault prediction studies. Automated Software Engineering, 287–313.

[29] Spasić, I., Livsey, J., Keane, J., & Nenadić, G. (2014). Text mining of cancer-related information: Review of current status and future directions. Retrieved 12 May 2015, from http://www.sciencedirect.com/science/article/pii/S1386505614001105

[30] Newzealand.com. (2015). Plantas y animales de Nueva Zelanda | Ruapehu, Nueva Zelanda. Retrieved 11 July 2015, from http://www.newzealand.com/ar/feature/new-zealand-flora-and-fauna/

[31] Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., & Witten, I. H. (2009). The WEKA Data Mining Software: An Update. SIGKDD Explorations, 11(1).

[32] García, D. (2006). Weka Tutorial (Spanish). Retrieved 12 July 2015, from http://www.metaemotion.com/diego.garcia.morate/download/weka.pdf

[33] Salton, G., Wong, A., & Yang, C. (1975). A vector space model for automatic indexing. Communications of the ACM, 613–620.

[34] Namee, B. (2012). DIT MSc in Computing (Data Analytics): Text Analytics in Weka. Retrieved from http://ditmscda.blogspot.ca/2012/03/text-analytics-in-weka.html

[35] Snowball.tartarus.org. (2015). Defining R1 and R2. Retrieved 24 July 2015, from http://snowball.tartarus.org/texts/r1r2.html

[36] Snowball.tartarus.org. (2015). Spanish stemming algorithm. Retrieved from http://snowball.tartarus.org/algorithms/spanish/stemmer.html

[37] Weka.wikispaces.com. (2015). weka - GenericObjectEditor (book version). Retrieved from https://weka.wikispaces.com/GenericObjectEditor+%28book+version%29

[38] Cs.cmu.edu. (2015). Rainbow. Retrieved from http://www.cs.cmu.edu/~mccallum/bow/rainbow/

[39] Weka.sourcearchive.com. (2015). weka 3.6.0-3, StringToWordVector class documentation. Retrieved from http://weka.sourcearchive.com/documentation/3.6.0-3/classweka_1_1filters_1_1unsupervised_1_1attribute_1_1StringToWordVector_a4ad7e64ecb476e527a19afee2c96aea6.html

[40] Drazin, S., & Montag, M. (2015). Decision Tree Analysis using Weka. University of Miami, Machine Learning – Project II. Retrieved from http://ww.samdrazin.com/classes/een548/project2report.pdf

[41] Spasić, I., Livsey, J., Keane, J., & Nenadić, G. (2014). Text mining of cancer-related information: Review of current status and future directions. Retrieved 12 May 2015, from http://www.sciencedirect.com/science/article/pii/S1386505614001105

[42] Jindal, R., & Taneja, S. (2015). A Lexical Approach for Text Categorization of Medical Documents. Procedia Computer Science, 46, 314–320.

[43] Bui, D., & Zeng-Treitler, Q. (2014). Learning regular expressions for clinical text classification. Journal of the American Medical Informatics Association, 21(5), 850–857. doi:10.1136/amiajnl-2013-002411

[44] Pérez, A., Gojenola, K., Casillas, A., Oronoz, M., & Díaz de Ilarraza, A. (2015). Computer aided classification of diagnostic terms in Spanish. Expert Systems with Applications, 42, 2949–2958. doi:10.1016/j.eswa.2014.11.035

[45] Vilares, D., Alonso, M. A., & Gómez, C. (2015). A syntactic approach for opinion mining on Spanish reviews. Natural Language Engineering, 21(1), 139.

[46] Pérez Abelleira, M. A., & Cardoso, A. C. (2010). Minería de texto para la categorización automática de documentos. Cuadernos de la Facultad, 5.

[47] Shams, R. (2015). Weka Tutorial 31: Document Classification 1 (Application). YouTube. Retrieved 15 May 2015, from https://www.youtube.com/watch?v=jSZ9jQy1sfE

[48] Shams, R. (2015). Weka Tutorial 32: Document Classification 2 (Application). YouTube. Retrieved 15 May 2015, from https://www.youtube.com/watch?v=zlVJ2_N_Olo

[46] Rodríguez, J., Calot, E., & Merlino, H. (2014). Clasificación de prescripciones médicas en español. Retrieved 15 May 2015, from http://sedici.unlp.edu.ar/handle/10915/42402

[49] Weinberg, B. (2015). Weka Text Classification for First Time & Beginner Users. YouTube. Retrieved 15 May 2015, from https://www.youtube.com/watch?v=IY29uC4uem8

[50] Nlm.nih.gov. (2015). PubMed Tutorial - Building the Search - How It Works - Stopwords. Retrieved 18 May 2015, from http://www.nlm.nih.gov/bsd/disted/pubmedtutorial/020_170.html