WEKA 1
Weka
Valeria Guevara
Thompson Rivers University
Author Note
This is the final project for COMP 4910, for the Bachelor of Computing Science at
Thompson Rivers University, supervised by Mila Kwiatkowska.
Abstract
This project focuses on document classification using text mining, through a
classification model generated by the open source software “WEKA”. This software is a
collection of machine learning algorithms for knowledge discovery. Weka makes it easy to
preprocess the training documents and to compare different algorithm configurations. The
accuracy of the generated predictive model is measured with a confusion matrix. This project
helps to illustrate text mining preprocessing and classification using WEKA. The results are a
tool that generates the ARFF input data files and a video tutorial on document classification
in Weka, in English and Spanish.
Keywords: Weka, document classification, ARFF, stopwords, tokenizer, pruning,
decision tree C4.5, word vector, text mining, F-measure, machine learning, text
classification, stemming, knowledge society.
Weka
The weka is a flightless bird native to New Zealand that has a penchant for shiny objects.
[30] Newzealand.com. (2015). Old legends from New Zealand tell that these birds steal shiny
items. The University of Waikato in New Zealand gave that name to a tool it began to develop,
because the tool would contain algorithms for data analysis. Currently the WEKA package is a
collection of machine learning algorithms for data mining tasks. The package, whose name
stands for Waikato Environment for Knowledge Analysis, contains tools for data pre-processing,
classification, regression, clustering, association rules, and visualization. [31] Hall, M.,
Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., & Witten, I. H. (2009). This software
analyzes large amounts of data and decides which is the most important. It aims to make
automatic predictions that support decision making.
Weka VS Other Machine Learning Tools
There are other tools for data mining, such as RapidMiner, IBM Cognos Business Intelligence,
Microsoft SharePoint and Pentaho. IBM Cognos Business Intelligence provides a display that is
not very user-friendly. Microsoft SharePoint creates predictive models for business mining,
but data mining is not its main objective. RapidMiner offers a great display of results, but
datasets load more slowly than in Weka. Pentaho's graphical interface is not difficult to
understand, but it does not describe its options as Weka does.
Weka implements machine learning techniques in Java, which is easy to learn, under the
GNU General Public License. WEKA can be used in three ways: through its graphical interface,
through the command line interface, and from application code through its Java API.
Although WEKA has not been used primarily for solving prediction problems in business, it
helps in the construction of new algorithms. It therefore turns out to be the most suitable
software for initial data analysis, classification, clustering algorithms and research.
In this project the Weka tool is used to create a predictive model using machine learning
algorithms for text classification.
Installation
Weka can be downloaded at: http://www.cs.waikato.ac.nz/~ml/weka/. Here we refer to the
latest version, Weka 3.6.12. The same URL has installation instructions for different
platforms.
On Windows, Weka is found in the program launcher, in a folder named after the downloaded
version of Weka, in this case weka-3-6. The Weka default directory is the directory from
which the file is loaded.
On Linux, open a terminal and type: java -jar /installation/directory/weka.jar.
An out-of-memory error is common. It is solved by assigning more memory to the Java virtual
machine, for example "-Xmx2048m" for 2 GB, in the setup files. Further information can be
found at weka.wikispaces.com/OutOfMemoryException. The -Xms and -Xmx parameters indicate the
minimum and maximum RAM respectively.
On Windows you can edit the file RunWeka.bat or RunWeka.ini in the Weka installation
directory and change the line maxheap=128m to maxheap=1024m. You cannot assign more
than 1.4G to the JVM. You can also assign memory to the virtual machine with the command:
java -Xms<minimum-memory>M -Xmx<maximum-memory>M -jar weka.jar
[32] Garcia, D., (2006).
On Linux the -XmxMemorySizem option is used, replacing MemorySize with the required size in
megabytes. For instance:
java -Xmx512m -jar /installation/directory/weka.jar
Execution
The first Weka screen shows a chooser with the available interfaces, called "Applications",
which in this version offers the Explorer, Experimenter, KnowledgeFlow and Simple CLI tools.
Explorer is responsible for conducting exploration operations on a data set.
Experimenter runs experiments and statistical tests, applying different algorithms to
different data in an automated manner. KnowledgeFlow shows the operation of Weka graphically
as a workflow panel. Simple CLI, or simple client, provides a command line interface to enter
commands.
The main user interface, "Explorer", consists of six panels. Preprocess is the first
window that opens in this interface. In this window, the data are loaded. Weka accepts data
sets loaded from a URL, a database, or CSV or ARFF files. The ARFF file is the primary format
for any classification task in WEKA.
Input data.
As described previously, three kinds of data input are considered in data mining: concepts,
instances and attributes. An Attribute-Relation File Format (ARFF) file describes a list of
instances of a concept with their respective attributes. Weka uses these files for text
classification and clustering applications.
ARFF files.
These files have two parts, the header information and the data information. The first section
contains the name of the relation and the attributes (name and type). The relation name
is defined in the first line of the ARFF file, where <relation-name> is a string, with the
following format:
@relation <relation-name>
Next come the attribute declarations. This is an ordered sequence of statements, one for
each attribute of the instances. Each statement uniquely defines an attribute name and its
data type. The order in which the attributes are declared indicates their position in the
instances. For example, the value of the attribute declared first is expected at the first
position of every instance. The format of a declaration is:
@attribute <attribute-name> <data type>
Weka supports several data types:
i) NUMERIC: all real numbers, where the integer and decimal parts are separated by a
point and not a comma.
ii) INTEGER: treated as numeric.
iii) NOMINAL: provides a list of the possible values that the attribute can take, for
example {good, bad}, in the following format:
@attribute <attribute-name> {<nominal1>, <nominal2>, <nominal3>, ...}
iv) STRING: a sequence of text values. These attributes are declared as follows:
@attribute <attribute-name> string
v) DATE: dates and times are declared as:
@attribute <name> date [<date format>]
where <name> is the name of the attribute and <date format> is an optional string
consisting of characters, hyphens, spaces and time units. The date format specifies how
the date values should be parsed. The default format accepts the ISO-8601 combined
format: yyyy-MM-dd'T'HH:mm:ss. Example:
@attribute timestamp DATE "yyyy-MM-dd HH:mm:ss"
vi) RELATIONAL: attributes for multi-instance data, declared in the following way:
@attribute <name> relational
<further attribute definitions>
@end <name>
There are rules for the attribute statements:
a) Relation names and other strings must be enclosed in double quotes if they include
spaces.
b) Attribute and relation names cannot start with a character below \u0021 in ASCII,
nor with '{', '}', ',' or '%'.
c) Values that contain spaces must be quoted.
d) The keywords numeric, real, integer, string and date are case insensitive.
e) Relational data must be enclosed in double quotes.
The second section is the data declaration. It is declared as @data on a single line.
Each line below it represents one instance, with its attribute values separated by commas.
The attribute values must appear in the same order in which the attributes were declared in
the header section. Missing values are represented by a single question mark "?". String
values and nominal attribute values are case sensitive. Any value that contains a space must
be quoted. Comments start with the delimiter character "%" and run to the end of the line.
In text classification, ARFF files represent the entire document as a single text attribute
of type string. The second attribute to consider is the class attribute, which defines the
class to which the instance belongs. This attribute can be of type string or nominal. An
example of the resulting file, with a document attribute of type string and a nominal class
of two values, is:
@relation language
@attribute DocumentText string
@attribute class {English, Spanish}
@data
'texto a clasificar aquí...', Spanish
'Classify text here ...', English
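A file with this layout could be generated from (text, class) pairs along the following lines. This is only a minimal sketch and not the tool developed in this project: the function names build_arff and quote_arff are hypothetical, and escaping quotes with a backslash is an assumption about what the ARFF reader accepts.

```python
def quote_arff(value):
    """Wrap a string in single quotes, escaping backslashes and single quotes."""
    escaped = value.replace("\\", "\\\\").replace("'", "\\'")
    return "'" + escaped + "'"


def build_arff(relation, documents, classes):
    """Return ARFF text with one string attribute and one nominal class attribute.

    documents is a list of (text, label) pairs; classes lists the allowed labels.
    """
    lines = [
        "@relation " + relation,
        "@attribute DocumentText string",
        "@attribute class {" + ", ".join(classes) + "}",
        "@data",
    ]
    for text, label in documents:
        if label not in classes:
            raise ValueError("unknown class label: " + label)
        # Collapse line breaks and repeated spaces so each instance stays on one line.
        one_line = " ".join(text.split())
        lines.append(quote_arff(one_line) + ", " + label)
    return "\n".join(lines) + "\n"


docs = [
    ("texto a clasificar aquí...", "Spanish"),
    ("Classify text here ...", "English"),
]
arff_text = build_arff("language", docs, ["English", "Spanish"])
print(arff_text)
```

Rejecting labels that are not in the declared class list catches the most common mistake in hand-written files, a class value that does not match any nominal value in the header.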
Data preprocessing.
In this window, data are loaded and may be edited. Data can be modified manually through
editing or through filtering. Filters are methods that modify the data set. Weka has a
variety of filters organized hierarchically into supervised and unsupervised, where the root
is weka. Within these, the filters are divided into two categories according to what they
operate on: attribute and instance.
As pointed out earlier, learning techniques are classified according to the relationships in
the input data. Unsupervised learning techniques, used as descriptive inductive models, do not
know the correct classification of the data. This means that the instances do not require an
attribute that declares the class. Supervised learning techniques, used as predictive
inductive models, depend on the class values. This means that the instances contain a class
attribute that states the class to which they belong.
The Current relation module describes the loaded dataset: its name and its number of
instances. The Attributes module allows attributes to be selected with the options All, None
and Invert, and it also provides the option to enter a regular expression. The Selected
attribute part displays information about the selected attribute. At the bottom, a histogram
of the attribute selected in Attributes is shown.
Preprocessing for classifying documents
In Weka it is possible to create models that classify documents into previously analyzed
categories. Documents in Weka usually need to be converted into "text vectors" before
machine learning techniques are applied. The easiest way to represent text for this purpose
is as a bag of words or word vector. [34] Namee, B. (2012). The StringToWordVector filter
converts a string attribute into a set of attributes that represent the occurrence of the
words of the full text. The document is represented as a text string in a single attribute of
type string.
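The bag-of-words idea can be illustrated outside Weka with a short sketch. It is not the StringToWordVector implementation: the function name bag_of_words and the simple letter-based tokenization are assumptions made for the example.

```python
import re


def bag_of_words(documents):
    """Return (vocabulary, vectors), where each vector marks word presence with 0/1."""
    # Lowercase each document and keep runs of letters, including Spanish ones.
    tokenized = [set(re.findall(r"[a-záéíóúüñ]+", d.lower())) for d in documents]
    vocabulary = sorted(set().union(*tokenized)) if tokenized else []
    vectors = [[1 if word in tokens else 0 for word in vocabulary]
               for tokens in tokenized]
    return vocabulary, vectors


vocabulary, vectors = bag_of_words(["the cat sat", "the dog sat down"])
print(vocabulary)
for vector in vectors:
    print(vector)
```

Each document becomes one row and each distinct word one column, which is the shape of data that a classifier can learn from.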
StringToWordVector Filter
This is the fundamental text analysis filter in WEKA. This class offers abundant natural
language processing options, including stemming suited to the corpus, custom tokenizers and
various lists of stopwords. It also calculates weights such as term frequency and TF-IDF.
StringToWordVector places the class attribute at the top of the list of attributes. The
Reorder filter can be used to change the order. All the natural language processing
techniques applied to the attributes can be configured in this filter. The
StringToWordVector filter can be applied in batch mode from the command line as follows:
java -cp /Aplicaciones/weka-3-6-2/weka.jar
weka.filters.unsupervised.attribute.StringToWordVector -b -i
datos_entrenamiento.arff -o vector_datos_entrenamiento.arff -r
datos_prueva.arff -s vector_datos_prueva.arff
Here datos_entrenamiento is the training set, vector_datos_entrenamiento is the training set
vector, datos_prueva is the test set and vector_datos_prueva is the test set vector. The -cp
option puts the Weka jar in the class path, -b indicates batch mode, -i specifies the
training data file, -o is the output file produced by processing the first file, -r is the
test file and -s is the output file for that test file.
Options can be modified in the user interface by clicking on the filter name beside the
Choose button, after the filter has been selected with that button.
The window weka.filters.unsupervised.attribute.StringToWordVector then opens and shows
the following options, to be modified according to the needs of the documents to be
classified. The options are:
IDFTransform
TFTransform
attributeIndices
attributeNamePrefix
doNotOperateOnPerClassBasis
invertSelection
lowerCaseTokens
minTermFreq
normalizeDocLength
outputWordCounts
periodicPruning
stemmer
stopwords
tokenizer
useStoplist
wordsToKeep
Weka.sourcearchive.com [39] provides a mind map of the Weka options, shown in the
following illustration:
wordsToKeep
Defines the limit N on the number of words to keep per class, if there is a class attribute.
In this case only the N most common terms among all the values of the string attribute are
kept. Higher values reduce efficiency because learning the model takes more time.
doNotOperateOnPerClassBasis
Flag set to keep all relevant words for all classes. When it is set to true, the maximum
number of words and the minimum term frequency are not applied per class; instead they are
based on all classes together.
TFTransform
Term frequency (TF) transformation: when the flag is set to true, the filter applies the
term frequency transformation. The term frequency (TF) is used to represent textual data in
a vector space. TF is a numerical measure of the relevance of a word within a text. Combined
with IDF, the weight considers not only the relevance of a single term in its own document
but also its relevance across the entire collection of documents.
Mathematically it is represented as the function TF(t, d), which for term t in document d
is log(1 + f_td), where f_td is the frequency of word t in the instance or document d. The
inverse document frequency (IDF) is based on the number of documents in the collection that
contain the term t, where t is the same term used in the TF. With this flag, word
frequencies are transformed into log(1 + f_ij), where f_ij is the frequency of word i in
document (instance) j.
IDFTransform
Inverse document frequency (IDF) transformation: setting the flag to "true" defines the use
of the following equation, with the frequency of word t in instance d written as f_td:
f_td * log(number of documents or instances / number of documents with word t)
To explain this, consider the set D that includes all documents in the collection,
represented as D = {d1, d2, ..., dn}. The filter weights each word as
f_ij * log(number of documents / number of documents with word i), where f_ij is the
frequency of word i in document j.
Multiplying the TF by the IDF assigns more weight to terms that are frequent in a document
but at the same time relatively rare in the collection of documents.
[33] Salton, G., Wong, A., & Yang, C. (1975).
outputWordCounts
Counts the occurrences of each word in the string; the default setting only reports presence
or absence as 0/1. The result is a vector where each dimension is a different word. By
default the value in a dimension is a binary 0 or 1, that is, whether or not the word is in
that document.
The frequency of the word in the document is represented as an integer with the options
IDFTransform and TFTransform set to "False" and outputWordCounts set to "True".
This enables an explicit word count. The option is set to "False" when only the presence of
a term matters, not its frequency.
To calculate tf * idf, where idf is the logarithmic factor defined under IDFTransform, set
IDFTransform to "True", TFTransform to "False" and outputWordCounts to "True".
To obtain log(1 + tf) * idf, TFTransform must also be set to "True".
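The effect of these flag combinations can be checked with a small calculation. The sketch below follows the formulas given in this section, log(1 + f) for the TF transformation and f * log(N / n) for the IDF transformation. It is an illustration, not Weka's code, and the function name word_weights and the sample documents are hypothetical.

```python
import math
from collections import Counter


def word_weights(documents, tf_transform=False, idf_transform=False):
    """Return one {word: weight} dict per document.

    documents is a list of token lists. With both flags off the weight is the
    raw word count, as with outputWordCounts set to True.
    """
    n_docs = len(documents)
    # Number of documents that contain each word.
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))

    weighted = []
    for tokens in documents:
        weights = {}
        for word, count in Counter(tokens).items():
            weight = float(count)
            if tf_transform:
                weight = math.log(1 + weight)
            if idf_transform:
                weight = weight * math.log(n_docs / doc_freq[word])
            weights[word] = weight
        weighted.append(weights)
    return weighted


docs = [["weka", "bird", "weka"], ["weka", "tool"], ["bird", "song"]]
counts = word_weights(docs)
tf_idf = word_weights(docs, idf_transform=True)
log_tf_idf = word_weights(docs, tf_transform=True, idf_transform=True)
print(counts[0], tf_idf[0], log_tf_idf[0])
```

A word that occurs in every document gets weight 0 under the IDF transformation, because log(N / N) is 0, which is how very common terms lose influence.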
normalizeDocLength
It is set to determine whether the word frequencies in an instance must be normalized.
Normalization is calculated as Actual Value * Average Document Length / Document Length.
This option has three sub-options. The first is "No normalization". The second is
"Normalize all data", which brings the measures taken in the various documents to a common
scale. The third is "Normalize test data only". A word gets a real-valued tf-idf weight for
its document with the following settings: "IDFTransform" and "TFTransform" set to "True"
and "normalizeDocLength" set to "Normalize all data".
Stemmer
Selects the stemming algorithm to apply to the words. Weka supports four stemmers by
default: LovinsStemmer, its iterated version IteratedLovinsStemmer, NullStemmer, and a
wrapper for the Snowball stemmers. IteratedLovinsStemmer is a version of LovinsStemmer,
which is a set of transformation rules for changing word endings such as present
participles, irregular plurals and other English morphology. NullStemmer performs no
stemming at all. The Snowball stemmers come with standard vocabularies of words and their
root equivalents.
New stemming algorithms can easily be added to Weka because it contains a wrapper class for
Snowball stemmers, such as the one for Spanish. Weka does not contain all the Snowball
algorithms, but they can easily be included in the class location of
weka.core.stemmers.SnowballStemmer.
Snowball is a string processing language designed for creating stemmers. There are three
ways to get these algorithms: the first is to install the unofficial package; the second is
to add the pre-compiled snowball-20051019.jar to the class path; the third is to compile the
latest stemmers yourself from snowball-20051019.zip. The algorithms are at
snowball.tartarus.org, which has a stemmer for Spanish. Examples and the stemmer itself are
available at the following link:
http://snowball.tartarus.org/algorithms/spanish/stemmer.html
The Snowball Spanish stemming algorithm comes from Snowball.tartarus.org. It defines the
usual regions R1 and R2. In addition, RV is defined as follows: if the second letter is a
consonant, RV is the region after the next following vowel; if the first two letters are
vowels, RV is the region after the next consonant; otherwise RV is the region after the
third letter. If these positions do not exist, RV is the end of the word.
Step 0: Search for the longest attached pronoun among the suffixes "me se sela selo selas
selos la le lo las les los nos" and remove it if it comes after one of "iéndo ándo ár ér ír
ando iendo ar er ir yendo".
Step 1: Search for the longest common suffix and delete it.
Step 2: If no suffix was removed in step 1, try to remove other suffixes.
Step 3: Search for the longest among the residual suffixes “os a o á í ó e é” in RV and
remove it.
Step 4: Remove acute accents. [36]
For more information about the suffixes in steps 1 and 2, see the Snowball page
http://snowball.tartarus.org/algorithms/spanish/stemmer.html.
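Two parts of this algorithm are simple enough to sketch: finding the RV region and removing acute accents (step 4). The code below is an illustration only, not the Snowball implementation. It assumes the rule that RV starts after the next vowel when the second letter is a consonant, after the next consonant when the first two letters are vowels, and after the third letter otherwise. The function names rv_region and remove_acute_accents are hypothetical.

```python
VOWELS = set("aeiouáéíóúü")


def rv_region(word):
    """Return the RV region of a lowercase Spanish word (empty if none exists)."""
    n = len(word)
    if n < 2:
        return ""
    if word[1] not in VOWELS:
        # Second letter is a consonant: RV starts after the next vowel.
        for i in range(2, n):
            if word[i] in VOWELS:
                return word[i + 1:]
        return ""
    if word[0] in VOWELS:
        # First two letters are vowels: RV starts after the next consonant.
        for i in range(2, n):
            if word[i] not in VOWELS:
                return word[i + 1:]
        return ""
    # Consonant followed by vowel: RV starts after the third letter.
    return word[3:]


def remove_acute_accents(word):
    """Step 4: replace accented vowels with their plain forms."""
    return word.translate(str.maketrans("áéíóú", "aeiou"))


for sample in ["macho", "oliva", "trabajo", "áureo"]:
    print(sample, "->", rv_region(sample))
print(remove_acute_accents("canción"))
```

The suffix-removal steps 0 to 3 only delete a suffix when it lies inside the relevant region, so computing the region first is the prerequisite for every later step.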
The algorithm above is added to Weka by starting it with the following command on
Windows:
java -classpath "weka.jar;snowball-20051019.jar" weka.gui.GUIChooser
On Linux:
java -classpath "weka.jar:snowball-20051019.jar" weka.gui.GUIChooser
[37] Weka.wikispaces.com ,. (2015).
The previously compiled snowball-20051019.jar must be stored in the location where the
Weka application is installed on the computer.
This can be confirmed with the command:
java weka.core.SystemInfo
as shown in the figure below.
Stopwords
These are terms that are widespread, appear very frequently and do not provide
information about a text. This option determines whether a substring in the text is a
stopword. The stopwords come from a predefined list. This option converts all words to
lowercase before removal. Stopword removal is useful to eliminate meaningless words from the
text and to eliminate frequent words that are not useful for decision trees. Weka's default
stopwords are based on the Rainbow lists, found at the following link:
http://www.cs.cmu.edu/~mccallum/bow/rainbow/.
Rainbow is a program that performs statistical text classification. It is based on the Bow
library. [38] Cs.cmu.edu, (2015). The format of these lists is one word per line, and each
comment line must start with '#' to be skipped. WEKA is configured with a list of English
stopwords, but different stopword lists can be set. The list can be changed from the user
interface by clicking on this option: by default Weka uses the Weka-3-6 list, but any
location that points to a desired list can be chosen. Rainbow has separate lists for English
and Spanish; to cover both languages, the "ES-stopwords" list adds together both lists from
Rainbow.
useStoplist:
Flag to use stopwords. If it is set to "True", the words that are in the predefined
stopwords list from the previous option are ignored.
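The list format described for the stopwords option (one word per line, comment lines starting with '#') and the filtering step can be sketched as follows. This is an illustration rather than Weka's own code; the function names parse_stoplist and remove_stopwords and the sample list are hypothetical.

```python
def parse_stoplist(text):
    """Read a stopword list: one word per line, lines starting with '#' are comments."""
    words = set()
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            words.add(line.lower())
    return words


def remove_stopwords(tokens, stopwords):
    """Drop every token whose lowercase form is in the stopword set."""
    return [token for token in tokens if token.lower() not in stopwords]


# A tiny combined English and Spanish list, for illustration only.
stoplist_text = "# sample list\nthe\nof\nel\nde\n"
stopwords = parse_stoplist(stoplist_text)
kept = remove_stopwords(["The", "price", "of", "el", "libro"], stopwords)
print(kept)
```

Comparing in lowercase mirrors the behaviour described above, where words are converted to lowercase before removal.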
Tokenizer:
Chooses the unit into which each text attribute from the ARFF file is split. It has three
sub-options. The first is AlphabeticTokenizer, where tokens are formed only from contiguous
sequences of alphabetic characters and the tokenizer cannot be edited. This tokenizer only
considers the English alphabet. There is also the WordTokenizer option, which lets you set a
list of delimiters. As referenced previously, the punctuation marks in Spanish are
"; : . ¿ ? ¡ ! - ( ) [ ] ' " « »". Spanish, unlike English, uses one sign at the beginning
and another at the end of a question or an exclamation.
The third is NGramTokenizer, which splits the original text string into subsets of
consecutive words that form a pattern with a unique meaning. Its parameters are the
"delimiters" to use, whose default is ' \r\n\t.,;:'"()?!', GramMaxSize, which is the maximum
size of the n-gram with a default value of 3, and GramMinSize, which is the minimum size of
the n-gram with a default value of 1. N-grams can help uncover patterns of words that
together represent a meaningful context.
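A rough sketch of n-gram tokenization with a delimiter list and minimum and maximum sizes is given below. It is not the NGramTokenizer source; the function name ngram_tokens is hypothetical, the delimiter string is assumed to match the default quoted above, and the order in which tokens are produced is a choice made for the example.

```python
import re

# Assumed to correspond to the default delimiters: space, line breaks, tab and punctuation.
DEFAULT_DELIMITERS = " \r\n\t.,;:'\"()?!"


def ngram_tokens(text, min_size=1, max_size=3, delimiters=DEFAULT_DELIMITERS):
    """Split text on the delimiters and return all n-grams of min_size..max_size words."""
    pattern = "[" + re.escape(delimiters) + "]+"
    words = [word for word in re.split(pattern, text) if word]
    tokens = []
    for size in range(min_size, max_size + 1):
        for start in range(len(words) - size + 1):
            tokens.append(" ".join(words[start:start + size]))
    return tokens


print(ngram_tokens("text mining with Weka", min_size=1, max_size=2))
# Spanish opening marks are not in the default list, so they can be added explicitly.
print(ngram_tokens("¡Hola, mundo!", 1, 1))
print(ngram_tokens("¡Hola, mundo!", 1, 1, DEFAULT_DELIMITERS + "¡¿"))
```

The last two calls show why the delimiter list matters for Spanish: without the opening marks, "¡Hola" and "Hola" would become different attributes.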
minTermFreq:
Sets the minimum frequency that each word or term must have to be considered as an
attribute; the default is 1. When there is a class attribute and the flag
"doNotOperateOnPerClassBasis" has not been set to true, the text of all the strings of the
selected attribute that belong to a particular class is tokenized, and the frequency of each
token is calculated based on its frequency in that class. In contrast, if there is no class,
the filter builds a single dictionary and the frequency is calculated from all the string
values of the chosen attribute, not only those related to a particular class value.
periodicPruning
Eliminates low-frequency words. It takes a numerical value, a percentage of the size of the
data set, that sets how often the dictionary is pruned. The default value is -1, meaning no
periodic pruning. For example, a value of 15 specifies that the dictionary is pruned after
every 15% of the input data set is processed, instead of pruning after a complete dictionary
has been created, since there may not be enough memory for that approach.
attributeNamePrefix
Sets the prefix for the names of the created attributes; the default is "". It only provides
a prefix that is added to the names of the attributes that the StringToWordVector filter
creates when the document is split into words.
lowerCaseTokens
Flag that, when set to "True", converts all the words in the document to lowercase before
they are added to the dictionary. Setting it to true removes the distinction between words
that begin with an uppercase letter, such as proper names, and the same words in lowercase.
Acronyms can be distinguished when this option is set to "False".
attributeIndices
Sets the range of attributes to act on. The default is first-last, which ensures that all
attributes are treated as if they were a single string from first to last. The range is
given as a string containing a comma-separated list of ranges.
invertSelection
Flag that determines which attributes in the range are processed. When set to "True" the
filter works only with the unselected attributes. The default value is "False", which works
with the selected attributes.
After the data have been cleaned in the "Preprocess" tab, the attribute vectors are analyzed
in the "Classify" tab to obtain the desired knowledge.
Classification
The second panel of Explorer is "Classify", where a classification model is generated by
machine learning from the training data. These models serve as a clear explanation of the
structure found in the analyzed information. Weka offers in particular the J48 decision
tree, the most popular model for text classification. J48 is the Java implementation of the
C4.5 algorithm. As described previously, in this algorithm each branch represents one of the
possible choices in if-then format, and the tree presents the results in its leaves. The
C4.5 algorithm can be summarized as measuring the amount of information contained in a data
set and grouping it by importance, that is, by the importance of a given attribute in the
dataset. J48 recursively prints the tree structure into a variable of type string by
accessing the information stored in the attributes of each node.
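The idea of measuring how much information an attribute carries can be illustrated with entropy and information gain. The sketch below is a simplified illustration and not the J48 code: C4.5 builds on these quantities with further refinements, and the function names and sample data are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum((count / total) * math.log2(count / total)
                for count in Counter(labels).values())


def information_gain(values, labels):
    """Reduction in class entropy obtained by splitting on an attribute's values."""
    total = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(group) / total * entropy(group) for group in groups.values())
    return entropy(labels) - remainder


labels = ["English", "English", "Spanish", "Spanish"]
contains_the = [1, 1, 0, 0]   # word present only in the English documents
contains_no = [1, 0, 1, 0]    # word spread evenly over both classes
print(information_gain(contains_the, labels))
print(information_gain(contains_no, labels))
```

An attribute that separates the classes perfectly removes all uncertainty, while one that is spread evenly over the classes removes none, so a tree learner would prefer to split on the first.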
To create a classification, first choose the classifier algorithm with the “Choose”
button located in the upper left side of the window. This button displays a tree where the
root is weka and the sub folder is "classifiers". Within the sub folder
weka.classifiers.trees, tree models such as J48 and REPTree are found. REPTree builds a
decision tree and prunes it using reduced-error pruning. To access a classifier's options,
double-click the name of the selected classifier.
"Test Options".
The classification has four main modes, plus others, for handling the training data. They
are found in the "Test Options" section, with the following options:
a) Use training set: trains the model with all available data and applies the results to
the same data set.
b) Supplied test set: selects the test data set from a file or URL. This set must be
compatible with the initial data and is selected by pressing the "Set" button.
c) Cross-validation: performs a cross-validation according to the number of "Folds"
selected. The number of folds specifies how many partitions, and therefore temporary
models, are created. In each round one part is selected, and a classifier is built
from all the parts except the selected one, which is kept for testing. [32] Garcia, D., (2006).
d) Percentage Split: defines the percentage of the total input from which the classifier
model is built; the remaining part is used for testing.
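How options c) and d) divide the data can be sketched with index lists. This is a plain, non-stratified illustration and not Weka's evaluation code; the function names, the default of 66% for the split and the shuffling scheme are assumptions made for the example.

```python
import random


def cross_validation_folds(n_instances, n_folds=10, seed=1):
    """Yield (train_indices, test_indices) once per fold."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into n_folds parts of nearly equal size.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for k, test in enumerate(folds):
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, test


def percentage_split(n_instances, train_percent=66, seed=1):
    """Return (train_indices, test_indices) for a single random split."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    cut = round(n_instances * train_percent / 100)
    return indices[:cut], indices[cut:]


splits = list(cross_validation_folds(20, n_folds=10))
print(len(splits), "temporary models, test size", len(splits[0][1]))
train, test = percentage_split(100, train_percent=66)
print(len(train), "training instances,", len(test), "test instances")
```

With cross-validation every instance is used for testing exactly once, which is why it gives a more reliable estimate than a single split when data are scarce.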
Weka allows further options for defining the test method to be selected with the
"More Options" button. These are:
Output Model: shows the classifier model in the output window.
Output per-class stats: displays statistics for each class.
Output entropy evaluation measures: displays entropy-based evaluation measures in the
results.
Output confusion matrix: displays the confusion matrix resulting from the classifier.
Store predictions for visualization: Weka keeps the predictions of the classifier model on
the test data. With this option the J48 classifier can show the tree errors.
Output predictions: shows a table of the actual and predicted values for each instance of
the test data. It states how the classifier relates to each instance in the test data.
Output additional attributes: set to display the values of attributes other than the
class. A range is specified of attributes to be included along with the actual and
predicted values of the class.
Cost-sensitive evaluation: produces additional information in the evaluation output,
the total cost and average cost of misclassification.
Random seed for XVal / % Split: specifies the random seed used to shuffle the data before
they are divided for evaluation purposes.
Preserve order for % Split: retains the order of the data in the percentage split instead
of randomizing them first. The default seed value is 1.
Output source code: generates the Java code of the model produced by the classifier.
When no independent evaluation data set is available, it is necessary to select the right
option to obtain a reasonably accurate idea of the quality of the generated model. For
document classification, selecting at least 10 "Folds" for the cross-validation approach is
recommended. Allocating a large percentage to "Percentage Split" is also recommended.
Below the "Test Options" there is a menu with a list of all attributes. It allows you to
select the attribute that acts as the result of the classification. For document
classification this is the class to which the instance belongs.
The classification starts when the "Start" button is pressed. The image of the weka bird
at the bottom right dances until the classifier finishes.
WEKA creates a graphical representation of the J48 classification tree. The tree can be
viewed by right-clicking on the last result set in the "Result List" and selecting the
"Visualize tree" option. The window size can be adjusted by right-clicking and selecting
“Fit to Screen”.
Classifier for classifying documents J48
The J48 model uses the C4.5 decision tree algorithm to build a model from the selected
training data. This algorithm is found in weka.classifiers.trees. The J48 classifier has
different parameters that can be edited by double-clicking on the name of the selected
classifier.
J48 employs two pruning methods, although it does not perform reduced-error pruning by
default. The main objectives of pruning are to make the tree easier to understand and to
reduce the risk of overfitting the training data, that is, of classifying the training data
almost perfectly because the tree has learned their specific properties and not the
underlying concept.
The first J48 pruning method is known as subtree replacement. Nodes in a decision
tree can be replaced with a leaf, reducing the number of nodes in a branch. This process
starts from the leaves of the fully formed tree and works up towards the root.
The second is subtree raising. A node is moved up towards the root of the tree and replaces
other nodes in the branch. Normally the cost of this process is not negligible, and it is
wise to turn it off when the induction process takes a long time.
By clicking on the name of the J48 classifier which is located right next to the “Choose"
will display a window with the following editable options:
confidenceFactor sets the number of pruning. Lower values experience more pruning.
Reducing this value may reduce the size of the trees and also helps in removing irrelevant
nodes that generate misclassification. [40] Drazin, S., & Montag, M. (2015).
minNumObj: Sets the minimum number of instances separation per leaf in the case of
trees with many branches.
unpruned: flag to preform pruning. In true the tree is pruned. Default is "False" which
means that pruning is not carried out.
reducedErrorPruning: flag to use pruning error reduction in C.4.5 tree. Method after
pruning using a resistance to the errors estimations. Similarly, it is for breeding hives and
throw an exception not the confidence level used for pruning.
Seed: Seed number shuffle data randomly and reduce error pruning. This is considered
when reducedErrorPruning flag is set to "True". The default seed is 1.
numFolds: number of pruning to reduce error. Sets the number of folds that are retained
for pruning, with a set used for pruning and the rest for training. To use these Folds
reducedErrorPruning flag must be set to "True".
WEKA 24
binarySplits: when this flag is set "True", it creates only two branches for nominal
attributes with multiple values instead of a branch for each value. When the nominal
attribute is binary there is no difference, except in how this attribute is shown in the
output result. The default is "False".
saveInstanceData: flag set to "True" to store training data for its visualization. The
default is "False".
subtreeRaising: flag to perform pruning with the subtree raising method. This moves a
node up towards the root of the tree, replacing other nodes. When set to "True", Weka
considers subtree raising in the pruning process.
useLaplace: flag to smooth the counts at the leaves with the Laplace estimate. When set to
"True", Weka adjusts the leaf counts using Laplace, a popular correction for probability
estimates.
debug: flag to add information to the console. When set to "True", the classifier writes
additional information to the console.
J48 can reach 100% correct classification on the training data by disabling pruning and
setting the minimum number of instances per leaf to 1.
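The effect of the useLaplace option can be illustrated with a small calculation. The following is a minimal sketch in Python, not Weka's own code: it assumes the usual Laplace estimate, in which one is added to every class count at a leaf, and the function name and the example counts are hypothetical.

```python
def leaf_probabilities(class_counts, use_laplace=False):
    """Return class probability estimates for a single leaf.

    class_counts maps each class label to the number of training
    instances of that class that reached the leaf (hypothetical input).
    """
    total = sum(class_counts.values())
    k = len(class_counts)
    if use_laplace:
        # Laplace estimate: add one to every class count.
        return {c: (n + 1) / (total + k) for c, n in class_counts.items()}
    if total == 0:
        # An empty leaf carries no evidence; fall back to a uniform guess.
        return {c: 1 / k for c in class_counts}
    return {c: n / total for c, n in class_counts.items()}


# Hypothetical leaf reached by three "Diet" documents and no "Diabetes" ones.
raw = leaf_probabilities({"Diet": 3, "Diabetes": 0})
smoothed = leaf_probabilities({"Diet": 3, "Diabetes": 0}, use_laplace=True)
```

Without smoothing the leaf assigns probability 0 to "Diabetes"; with the Laplace estimate no class receives a probability of exactly 0 or 1.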
Weka document classification
The Weka tool was selected in order to generate a model that classifies specialized documents
from two different corpora (English and Spanish). The WEKA package is a collection of machine
learning algorithms for data mining tasks. Text mining uses these algorithms to learn from
examples, the "training set", so that new texts can be classified into the categories analyzed.
WEKA stands for Waikato Environment for Knowledge Analysis. For more information visit
http://www.cs.waikato.ac.nz/~ml/weka/.
Installing WEKA
Weka can be downloaded from:
http://www.cs.waikato.ac.nz/ml/weka/downloading.html.
This tutorial uses Weka version 3.6.12.
For Windows:
WEKA is started from the program launcher located in the Weka folder. The Weka
default directory is the directory from which the file is loaded.
For Linux:
Open a terminal and type: java -jar /installation/directory/weka.jar
Based on the text mining methodology, the work in Weka is organized in a framework with four
stages: data acquisition, document preprocessing, information extraction and evaluation.
Data Acquisition
ARFF files are the primary input format for any classification task in WEKA. These files
describe the basic input data (concepts, instances and attributes) for data mining. An
Attribute-Relation File Format file describes a list of instances of a concept with their
respective attributes.
The documents selected for the training data set were found in the Thompson Rivers
University library, available at the following link: http://www.tru.ca/library.html. A total of
71 medical academic articles in English and Spanish were randomly selected. These documents
are stored in Portable Document Format (PDF). Based on the TRU library, the documents were
classified into six categories: Hemodialysis, Nutrition, Cancer, Obesity, Diet and Diabetes.
The documents are stored in directories named after their categories within the main
folder called Medicine, as shown in the figure below.
In order to build the arff file, an application was created in C# with Microsoft Visual Studio
Professional 2012 that generates the arff from a directory containing a collection of
documents organized in subdirectories named after their categories. This application was
developed with the help of a library called iTextSharp, which extracts text from Portable
Document Format files.
In Documents Directory to ARFF the user can specify the name of the relation to define, the
location of the home directory that contains all documents subdivided into category directories,
and any comments required. The user also specifies the name of the generated file, with the
arff extension, and its location. At the bottom of the application are two buttons, one to exit
and another to generate the arff file with the information described.
The application can be downloaded from http://www.scientificdatabases.ca under current
projects for Text Mining.
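The idea behind such a tool can be sketched in a few lines. The following Python sketch is not the author's C# application: it assumes the PDF text has already been extracted to plain .txt files (one directory per category), the function names are hypothetical, and the attribute names textoDocumento and docClass follow the arff file used in this tutorial.

```python
from pathlib import Path


def arff_quote(text):
    """Quote a string value for ARFF, escaping backslashes and double quotes."""
    flat = " ".join(text.split())  # collapse newlines and repeated whitespace
    escaped = flat.replace("\\", "\\\\").replace('"', '\\"')
    return '"' + escaped + '"'


def build_arff(relation, docs, classes):
    """Build ARFF text from a list of (document_text, class_label) pairs."""
    lines = [
        "@RELATION " + relation,
        "@attribute textoDocumento string",
        "@attribute docClass {" + ", ".join(classes) + "}",
        "@data",
    ]
    for text, label in docs:
        lines.append(arff_quote(text) + ", " + label)
    return "\n".join(lines) + "\n"


def read_category_dirs(root):
    """Collect (text, category) pairs from root/<category>/*.txt files."""
    docs = []
    for category_dir in sorted(Path(root).iterdir()):
        if category_dir.is_dir():
            for txt in sorted(category_dir.glob("*.txt")):
                docs.append((txt.read_text(encoding="utf-8"), category_dir.name))
    return docs


# Two made-up documents, only to show the shape of the output.
sample = build_arff(
    "Medicina",
    [('dice "hola"\nmundo', "Diet"), ("otro texto", "Cancer")],
    ["Diet", "Cancer"],
)
```

For a real collection, the output of read_category_dirs("Medicine") would be passed to build_arff and the result written to a file with the .arff extension.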
The resulting arff contains a string attribute called "textoDocumento" that holds
all the text found in the document and the nominal attribute "docClass" that defines the class
to which it belongs. As a note, in recent versions of Weka, such as the 3.6.12 used in this
case, the class attribute can never be named "class".
The file will be generated as follows:
% tutorial de Weka para la Clasificación de Documentos.
@RELATION Medicina
@attribute textoDocumento string
@attribute docClass {Hemodialysis, Nutrition, Cancer, Obesity, Diet, Diabetes}
@data
"texto…", Hemodialysis
"texto…", Nutrition
"texto….", Cancer
"texto…", Obesity
"texto…", Diet
"texto…", Diabetes
Document Preprocess
Weka contains tools for data pre-processing, classification, regression, clustering,
association rules, and visualization.
The first screen in Weka, "Applications", is used to select the desired sub-tool. In this
case "Explorer" is selected. It consists of six panels: Preprocess, Classify, Cluster,
Associate, Select attributes and Visualize.
Preprocess
This panel carries out the preprocessing for the classification of documents.
To load the generated arff, click on the button "Open file ..." at the top right.
Select the created file "medicinaWeka.arff".
"Current relation" describes the dataset that has been loaded: the relation name,
Medicina, the number of instances, 71, and the total number of attributes, 2.
The "Attributes" section below it lists the attributes and allows them to be selected; in
this case "textoDocumento" and "docClass" are shown.
When "docClass" is selected, the "Selected attribute" section describes the nominal attribute
with its 6 labels and the number of instances of each. There are 11 instances of Hemodialysis
and 12 instances of each of the others: Nutrition, Cancer, Obesity, Diet and Diabetes. At the
bottom of this section a histogram of the "docClass" labels is shown; hovering over the graph
displays the attribute name, as the following figure illustrates.
Weka uses the StringToWordVector filter to convert the string attribute "textoDocumento"
into a set of attributes that represent the occurrence of the words of the full text, while
"docClass" remains the class attribute. This filter is classed as unsupervised. Unsupervised
techniques are inductive techniques designed to detect clusters and label entries from a set
of observations without knowing the correct classification.
The filters are found by clicking the "Choose" button under the "Filter" section. This button
opens a window whose root is weka. From there select filters, then the unsupervised folder,
then attribute, and finally StringToWordVector.
The StringToWordVector filter can be configured with language processing
techniques. To edit the filter it is only necessary to click on the filter name. This opens a
window that shows the following options.
A set of optimal options was obtained by trying different combinations of options
on the same training data. For each resulting model the F-measure was calculated, which
summarizes how well instances are predicted by combining precision and recall. The options
that generated the greatest number of correctly predicted instances are as follows:
a) wordsToKeep: left at 1000, since it defines the limit of words to keep per class.
The doNotOperateOnPerClassBasis flag is set to "False" so that wordsToKeep is applied
per class.
b) TFTransform as "True", IDFTransform as "True", outputWordCounts as "True" and
normalizeDocLength set to "No normalization".
outputWordCounts is the flag that makes the filter count how often a word occurs in the
document instead of only recording whether the term is in the document. With no
normalization of document length, a word keeps the actual value of its tf-idf result in
the document, no matter how short or long the document is.
c) lowerCaseTokens: as "True" to convert all words to lowercase before they are added to
the dictionary, so that the same word in lowercase and uppercase is not analyzed separately.
d) stemmer: selects the algorithm that removes the morphemes of a word in a given language
in order to reduce the word to its root. No stemmer is selected here, because the
classification of texts is multilingual and stemming would only apply to one language. To
configure no stemmer, click on the "Choose" button and select "NullStemmer" from the menu.
Weka provides a standard algorithm for English from snowball.tartarus.org. Snowball is a
string processing language designed for creating stemmers, and it features a stemming
algorithm for Spanish. To use the Spanish algorithm, download the jar
snowball-20051019.jar from https://weka.wikispaces.com/Stemmers and store it
in the location where the Weka application is. Finally, the algorithm is added by starting
Weka with the following command from the command line.
For Windows: java -classpath "weka.jar;snowball-20051019.jar" weka.gui.GUIChooser
For Linux: java -classpath "weka.jar:snowball-20051019.jar" weka.gui.GUIChooser
The java.class.path parameter can be verified with the command
java weka.core.SystemInfo
As shown in the following figure:
Once the SnowballStemmer is available, select it by clicking the "Choose" button.
This button displays a menu; navigate to weka > core > stemmers and choose
SnowballStemmer.
Click on the stemmer name and a window in which the language can be set will appear. For
Spanish, in the field labeled "stemmer" type "spanish" in place of "porter" and
click "OK".
e) stopwords: determines whether a substring in a text is a word that does not provide
information about the text. These words come from a predefined Rainbow list, where the
default is Weka-3-6. Rainbow is a program that performs statistical text
classification based on the Bow library. Rainbow has separate lists for English and Spanish;
in order to cover both languages, the "ES-stopwords" file is used, which contains both lists
from Rainbow. The "ES-stopwords" list can be downloaded from
http://www.scientificdatabases.ca/current-projects/english-spanish-text-data-mining/.
To change the list, click on Weka-3-6, next to the stopwords label, and
choose the previously downloaded "ES-stopwords" file. Set the useStoplist option to
"True" to ignore the words that are in the "ES-stopwords" list selected in the stopwords
option.
f) tokenizer: chooses the unit into which the "textoDocumento" attribute is separated.
Click the "Choose" button and select "WordTokenizer" from the menu that is displayed. To
set the "delimiters" for English and Spanish, click on the tokenizer name and the following
window will appear. The delimiters used are .,;:'"()?!“¿!-[]’<>“ which include the
opening marks that Spanish, unlike English, uses for exclamations and questions.
As shown in the figure below:
Another option is to choose NGramTokenizer, which divides the original text string into
subsets of consecutive words that form a pattern with a unique meaning. With its
default "delimiters", ' \r\n\t.,;:'"()?!', this is useful to help uncover patterns of
words that together represent a meaningful context.
g) minTermFreq: left at the default of 1, the minimum frequency that a word must have to be
considered as an attribute. For this, the "doNotOperateOnPerClassBasis" flag should be
"False".
h) periodicPruning: left at -1 for no periodic pruning, so that low-frequency
words are not removed.
i) attributeNamePrefix: left empty so that no prefix is added to the generated
attributes.
j) attributeIndices: left as first-last to ensure that all attributes, from the first
to the last, are processed.
k) invertSelection: kept as "False" to work with the selected attributes.
At the bottom of the window are the options to save, cancel and apply. The window should
look as follows:
To save the filter with these options, click on the "Save ..." button and then select the
location and name.
To apply the filter with these options, click the "OK" button. This returns to the
"Preprocess" window, where the "textoDocumento" attribute must be selected in the
"Attributes" section.
Click the "Apply" button, located in the upper right of the "Filter" module. The Weka
image in the lower right corner will start to dance until the process is complete.
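The filter options chosen above can be summarized in code. The following Python sketch is only an approximation of what StringToWordVector does, not its implementation: it assumes the transforms log(1 + f) for TFTransform and f * log(N / df) for IDFTransform as described in the Weka documentation, the delimiter set is an assumption based on the list in option f), and all function names are hypothetical. Per-class word limits (wordsToKeep) are left out for brevity.

```python
import math
import re
from collections import Counter

# Assumed delimiter characters for English and Spanish (see option f).
DELIMITERS = " \r\n\t.,;:'\"()?!¿¡-[]<>“”’"


def tokenize(text, delimiters=DELIMITERS, lower=True):
    """Split text on any delimiter character, similar to a word tokenizer."""
    if lower:
        text = text.lower()  # corresponds to lowerCaseTokens = True
    pattern = "[" + re.escape(delimiters) + "]+"
    return [tok for tok in re.split(pattern, text) if tok]


def ngrams(tokens, n):
    """Return the sequences of n consecutive tokens, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def word_vectors(docs, stopwords=frozenset()):
    """Return (vocabulary, rows) of tf-idf weighted word counts per document."""
    counts = []
    for text in docs:
        tokens = [t for t in tokenize(text) if t not in stopwords]
        counts.append(Counter(tokens))  # word counts, not just presence
    vocabulary = sorted(set().union(*counts))
    n_docs = len(docs)
    doc_freq = {w: sum(1 for c in counts if w in c) for w in vocabulary}
    rows = []
    for c in counts:
        row = []
        for w in vocabulary:
            tf = math.log(1 + c[w])               # TF transform
            idf = math.log(n_docs / doc_freq[w])  # IDF transform
            row.append(tf * idf)                  # no length normalization
        rows.append(row)
    return vocabulary, rows


# Made-up mini corpus; "la" and "the" stand in for a stopword list.
vocab, rows = word_vectors(
    ["The diet, the DIET and cancer", "¿La dieta? Diet!"],
    stopwords={"the", "la", "and"},
)
```

In this toy corpus "diet" occurs in both documents, so its idf factor is log(2/2) = 0 and it carries no weight, while words that occur in only one document keep a positive weight.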
Information extraction
After the data cleaning on the "Preprocess" tab, the next step is the extraction of
information. Click on the "Classify" tab, the second panel of the Explorer.
This stage analyzes the attribute vectors to create the classification model that
defines the structure found in the analyzed information.
The decision tree model J48 is considered the most popular for text classification in Weka.
J48 is the Java implementation of the C4.5 algorithm, in which each node represents one of
the possible decisions to be taken and each leaf represents the predicted class.
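The structure of such a tree can be illustrated with a short sketch. The Python code below is not J48 and does not learn anything; it only shows how a finished tree is read as nested decisions. The words, thresholds and tree shape are invented for illustration and do not come from the model trained in this tutorial.

```python
# A tree is either a class label (a leaf) or a tuple
# (word, threshold, subtree_if_at_most, subtree_if_greater).
HYPOTHETICAL_TREE = (
    "dialysis", 0.0,
    ("tumor", 0.0, "Nutrition", "Cancer"),
    "Hemodialysis",
)


def classify(tree, word_weights):
    """Follow the decisions from the root until a leaf (class label) is reached."""
    while not isinstance(tree, str):
        word, threshold, low, high = tree
        # Words missing from the document have weight 0.
        tree = low if word_weights.get(word, 0.0) <= threshold else high
    return tree


label = classify(HYPOTHETICAL_TREE, {"dialysis": 1.2, "tumor": 0.0})
```

Each tuple plays the role of an internal node that tests one word attribute, and each string plays the role of a leaf.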
First, choose the classification algorithm with the "Choose" button located in the upper left
side of the window.
This button displays a tree whose root is weka and whose sub folder is "classifiers".
Within the sub folder trees, located at weka.classifiers.trees, select the tree model J48, as
shown in the following figure:
Click on the name of the J48 classifier, located next to the "Choose"
button, to access its options.
J48 can reach 100% correct classification by disabling pruning and setting the
minimum number of instances in a leaf to 1. In this case the parameter changed
is:
a) minNumObj: set to 1, leaving the other parameters in the default configuration.
The data used for testing is set in the "Test options" module.
Select "Use training set" to train the method with all available data and evaluate the
result on the same input data collection.
Additionally, the input data can be partitioned by selecting the
"Percentage split" option and defining the percentage of the total input data used to build
the classifier model, leaving the remaining part for testing.
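A percentage split can be sketched as follows. This Python sketch only illustrates the idea and is not Weka's code; the function name, the shuffling with a fixed seed and the example value of 66% are assumptions made for the illustration.

```python
import random


def percentage_split(instances, train_percent, seed=1):
    """Shuffle the instances and split them into training and test parts."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    cut = round(len(shuffled) * train_percent / 100)
    return shuffled[:cut], shuffled[cut:]


# 71 instance indices, as in the collection of this tutorial; 66% is an example value.
train, test = percentage_split(range(71), 66)
```

With 71 instances and 66%, 47 instances are used to build the model and the remaining 24 are used to test it.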
Below "Test options" is a menu that displays a list of all attributes. In this case
select "docClass", because this is the attribute that acts as the classification result in
this example.
The classification method is started by pressing the "Start" button.
The Weka bird image at the bottom right will dance until the end of the
classification process.
WEKA creates a graphical representation of the J48 classification tree. This tree can be
viewed by right-clicking on the last result set in the "Result list" and selecting the
"Visualize tree" or "tree Display" option.
The window size can be adjusted to make the tree easier to read by right-clicking and
selecting "Fit to Screen", as shown in the image below.
Results Evaluation
Weka summarizes how well instances are predicted with the Fβ
score. Its value combines precision and recall. Precision is the
percentage of positive predictions that are truly positive. Recall is the ability to detect
positive cases out of the total of all positive cases.
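These measures can be written out as a short calculation. The following Python sketch uses the standard definitions of precision, recall and the Fβ score; the function names and the example counts are hypothetical and are not taken from the experiments in this document.

```python
def precision(tp, fp):
    """Fraction of positive predictions that are truly positive."""
    return tp / (tp + fp) if tp + fp else 0.0


def recall(tp, fn):
    """Fraction of all positive cases that were detected."""
    return tp / (tp + fn) if tp + fn else 0.0


def f_beta(p, r, beta=1.0):
    """Weighted harmonic mean of precision p and recall r."""
    b2 = beta * beta
    denominator = b2 * p + r
    return (1 + b2) * p * r / denominator if denominator else 0.0


# Hypothetical counts: 10 true positives, 2 false positives, 1 false negative.
p = precision(10, 2)
r = recall(10, 1)
f1 = f_beta(p, r)  # with beta = 1 this is the F1 measure
```

With beta = 1 precision and recall are weighted equally; a value of beta greater than 1 gives more weight to recall.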
With these measures, the best model is expected to be the one with the F-measure value
closest to 1. The following table shows some combinations that are significant in the data
preprocessing for model generation. This comparison table reports their precision and recall
as well as their F-measure.
First the filter options are compared with the default values for the J48 classifier, and
the best filter parameters are selected. Then the best settings for the J48 classifier
algorithm are selected using the best configuration of the StringToWordVector filter.
Comparison table: Documents classification models.
Features                                                                    Precision  Recall  F-Measure
Word Tokenizer English Spanish (E&S)                                        0.810      0.803   0.800
Word Tokenizer E&S + Lower Case Conversion                                  0.863      0.859   0.860
Trigrams E&S + Lower Case Conversion                                        0.823      0.775   0.754
Stemming + Word Tokenizer E&S + Lower Case Conversion                       0.864      0.817   0.823
Stopwords + Word Tokenizer E&S + Lower Case Conversion                      0.976      0.972   0.972
Stopwords + Stemming + Word Tokenizer E&S + Lower Case Conversion           0.974      0.972   0.971
Stopwords + Word Tokenizer E&S + Lower Case Conversion + J48 minNumObj = 1  1          1       1
In conclusion, the best model is the combination of the options Stopwords + Word Tokenizer
E&S + Lower Case Conversion applied to the filter in the data preprocessing, with
minNumObj further adjusted to 1 in the J48 classifier algorithm.
The next confusion matrix is the result of the combination of Stopwords + Word
Tokenizer E&S + Lower Case Conversion with minNumObj adjusted to 1 in the J48 algorithm.
This generates the following values in the confusion matrix.
 a  b  c  d  e  f   Classified as
11  0  0  0  0  0   a = Hemodialysis
 0 12  0  0  0  0   b = Nutrition
 0  0 12  0  0  0   c = Cancer
 0  0  0 12  0  0   d = Obesity
 0  0  0  0 12  0   e = Diet
 0  0  0  0  0 12   f = Diabetes
In this matrix every class is classified with precision and recall at 100%. The accuracy
values for each class are as follows:
Class TP Rate FP Rate Precision Recall F-Measure
Hemodialysis 1 0 1 1 1
Nutrition 1 0 1 1 1
Cancer 1 0 1 1 1
Obesity 1 0 1 1 1
Diet 1 0 1 1 1
Diabetes 1 0 1 1 1
Weighted Avg. 1 0 1 1 1
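The per-class values in this table can be derived from the confusion matrix. The following Python sketch recomputes them under the usual definitions, with rows taken as actual classes and columns as predicted classes; it is an illustration with hypothetical function names, not the code Weka uses.

```python
CLASSES = ["Hemodialysis", "Nutrition", "Cancer", "Obesity", "Diet", "Diabetes"]

# Confusion matrix from the text: rows are actual classes, columns are predictions.
MATRIX = [
    [11, 0, 0, 0, 0, 0],
    [0, 12, 0, 0, 0, 0],
    [0, 0, 12, 0, 0, 0],
    [0, 0, 0, 12, 0, 0],
    [0, 0, 0, 0, 12, 0],
    [0, 0, 0, 0, 0, 12],
]


def per_class_metrics(matrix):
    """Return (TP rate, FP rate, precision, recall, F-measure) for each class."""
    n = len(matrix)
    total = sum(map(sum, matrix))
    results = []
    for i in range(n):
        tp = matrix[i][i]
        fn = sum(matrix[i]) - tp                        # actual i, predicted other
        fp = sum(matrix[r][i] for r in range(n)) - tp   # predicted i, actual other
        tn = total - tp - fn - fp
        rec = tp / (tp + fn) if tp + fn else 0.0
        fp_rate = fp / (fp + tn) if fp + tn else 0.0
        prec = tp / (tp + fp) if tp + fp else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        results.append((rec, fp_rate, prec, rec, f1))   # TP rate equals recall
    return results


def weighted_average(matrix, results):
    """Average each measure, weighting every class by its number of instances."""
    total = sum(map(sum, matrix))
    weights = [sum(row) / total for row in matrix]
    return tuple(sum(w * r[k] for w, r in zip(weights, results)) for k in range(5))


metrics = per_class_metrics(MATRIX)
average = weighted_average(MATRIX, metrics)
```

Because all 71 instances lie on the diagonal, every class obtains a TP rate, precision, recall and F-measure of 1 and an FP rate of 0, which matches the table.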
Conclusion
Document classification in Spanish is analyzed using text mining through Weka, an open
source software. This software analyzes large amounts of data and decides which is the most
important. It aims to make automatic predictions that help decision making. Compared
with other data mining tools such as RapidMiner, IBM Cognos Business Intelligence,
Microsoft SharePoint and Pentaho, Weka provides a friendly interface that is easy to
understand, loads data efficiently and has data mining as its main objective.
Text mining seeks patterns extraction from the analysis of large collections of documents
in order to gain new knowledge. Its purpose is the discovery of interesting groups, trends,
associations and the visualization of new findings.
Text mining is considered a subset of data mining. For this reason, text mining
adopts the data mining techniques that use machine learning algorithms. Computational
linguistics also provides techniques to text mining. This science studies natural
language with computational methods to make it understandable by the computer.
Automatic categorization determines the subject matter of each document in a collection.
Unlike clustering, it chooses the class to which a document belongs from a list of predefined
classes. Each category is trained through a previous manual process of categorization.
The classification starts with a set of previously categorized training texts and then
generates a classification model based on this set of examples. The model is able to assign
the correct class to a new text. A decision tree is a classification technique that represents
knowledge through an if-else statement structure represented in the branches of a tree.
The text mining methodology provides a framework with four stages: data
acquisition, document preprocessing, information extraction and evaluation of results. Witten,
Frank and Hall mention these steps in their work on the use of WEKA.
Data should be collected in a way that allows a training dataset to be created. Witten, Frank
and Hall consider three kinds of input data for text mining: concepts, instances and
attributes. The concept specifies what is to be learned. An instance represents the data of an
individual example to be classified. It contains a set of specific characteristics called
attributes. An attribute value represents a measurement of that attribute for the instance. In
the case of document classification, the classes are nominal attributes, because the
categories do not have an order between them (as ordinal attributes would).
WEKA uses a standard format called Attribute-Relation File Format (ARFF) to represent the
collection of documents as instances that share an ordered set of attributes. The file is
divided into three sections: relation, attributes and data.
Data preprocessing prepares the text through a series of operations
that generate some kind of structured or semi-structured information for analysis. The
most popular way to represent documents is with a vector that contains all the words found
in the text and indicates their occurrence. Important preprocessing tasks for categorizing
documents are stemming, lemmatization, stopword removal, tokenization and conversion to
lowercase.
A stemming algorithm removes morphemes and finds the relationships between words and
their lexemes. Stopword removal excludes the words that do not help to generate knowledge
about the text. Tokenization separates the text into words using punctuation. In Spanish the
punctuation marks are "; . :? ! - -. () [] '"<< >>", where the dot and dash are ambiguous.
Unlike English, Spanish marks both the beginning and the end of exclamations and questions.
Conversion to lowercase treats all terms equally regardless of their letter case.
After data preprocessing, the next step is knowledge extraction. Document classification in
Weka seeks to learn a predictive classification model. Such models are used to predict the
class to which an instance belongs. The model is created using the C4.5 decision tree
algorithm, as it is the simplest and most widely used for the classification task.
Weka generates a confusion matrix for the generated model. It shows in an easy way
how many of the model's predictions were correct. The four possible
outcomes are true positives, false positives, true negatives and false negatives. TP - true
positive: positive instance correctly classified as positive. TN - true negative: negative
instance correctly classified as negative. FP - false positive: negative instance incorrectly
classified as positive. FN - false negative: positive instance incorrectly classified as
negative.
Precision and recall are relevant metrics for document classification. The classification
model reports its results in a confusion matrix, from which the predictive efficiency
is calculated. Precision is the percentage of positive predictions that are correct:
TP / (TP + FP). Recall, or sensitivity, is the proportion of positive instances predicted out
of the total of all positive instances: TP / (TP + FN). These measures are balanced in the
F-measure, which summarizes how well instances are predicted. The resulting F1 measure is
calculated by the following equation: (2 * Precision * Recall) / (Precision + Recall).
The training data set was selected from the Thompson Rivers University library.
A total of 71 medical academic articles in English and Spanish, stored in PDF
format, were randomly selected. Based on the TRU library, the documents were classified into
six categories: Hemodialysis, Nutrition, Cancer, Obesity, Diet and Diabetes. The documents are
stored in directories named after their categories within the main folder called Medicine.
In order to build the arff file, an application was created that generates the arff from a
directory-based collection of documents. This application was developed with the help of a
library called iTextSharp, which extracts text from Portable Document Format files. The
application is named Documents Directory to ARFF.
The resulting arff contains a string attribute called "textoDocumento" that holds
all the text found in the document and the nominal attribute "docClass" that defines the class
to which it belongs. As a note, in recent versions of Weka, such as the 3.6.12 used in this
case, the class attribute can never be named "class".
Various tests were applied to the same set of texts to assess the predictive accuracy of the
model. A set of optimal options was obtained from different combinations of options
applied to the same training data. For each resulting model the F-measure was calculated,
which summarizes how well its instances are predicted.
First the best structure for the filter was analyzed, with the J48 classifier options
unadjusted, and the best parameters for the filter were selected. With this configuration, the
best settings for the J48 classifier algorithm were then assessed. Based on a comparison
table, it was found that the combination of Stopwords + Word Tokenizer E&S + Lower Case
Conversion, with minNumObj adjusted to 1 in the J48 algorithm, provides values of 1 for recall
and precision.
In conclusion, the best model is the combination of the options Stopwords + Word Tokenizer
E&S + Lower Case Conversion applied to the data preprocessing filter, with
minNumObj further adjusted to 1 in the J48 classifier algorithm.
References
[1] Witten, I. H., Frank, E., & Hall, M. A. (2011). Data Mining: Practical Machine Learning
Tools and Techniques (3rd ed.). Elsevier.
[2] Berry, M. W., & Kogan, J. (2010). Text mining. [electronic resource] : applications and
theory. Hoboken, NJ : John Wiley & Sons, 2010.
[3] Hearst (1999). Untangling Text Data Mining, Proc. of ACL’99: The 37th Annual Meeting of
the Association for Computational Linguistics, University of Maryland, June 20-26,
1999.
[4] Kodratoff (1999). Knowledge Discovery in Texts: A Definition and Applications, Proc. of
the 11th International Symposium on Foundations of Intelligent Systems (ISMIS-99),
1999
[5] Montes-y-Gómez, M. Minería de texto: Un nuevo reto computacional. Laboratorio de
Lenguaje Natural, Centro de Investigación en Computación, Instituto Politécnico
Nacional.
[6] Ethnologue,. (2015). Summary by language size. Retrieved 23 June 2015, from
https://www.ethnologue.com/statistics/size
[7] Brun, R.E., & Senso, J.A. (2004). Minería Textual. El profesional de la información, 3.
[8] Hotho, A., Nürnberger, A. & Paaß, G. (2005). A Brief Survey of Text Mining. LDV Forum -
GLDV Journal for Computational Linguistics and Language Technology, 20, 19-62.
[9] Streibel, O. (2010). Mining Trends in Texts on the Web. DC-FIS 2010 Doctoral Consortium
of the Future Internet Symposium 2010.
[10] Gémar, G., & Jiménez-Quintero, J. A. (2015). Text mining social media for competitive
analysis. Tourism & Management Studies, 11(1), 84-90.
[11] Quinlan, J. R. (1986). Induction of decision trees. Machine Learning, 1(1), 81–106.
[12] Quinlan (1993) C4.5: Programs for Machine Learning Morgan Kaufmann.
[13] Hernández, J., Ramírez, M.J., & Ferri, C. (2004). INTRODUCCIÓN A LA MINERÍA DE
DATOS. Pearson.
[14] Ye, N. (Ed.). (2003). The handbook of data mining (Vol. 24). Mahwah, NJ/London:
Lawrence Erlbaum Associates, Publishers.
[15] Miner, G. (2012). Practical text mining and statistical analysis for non-structured text data
applications. Waltham, MA: Academic Press.
[16] Stevens, S. (1946). On The Theory Of Scales Of Measurement. Science, 677-680.
[17] Uysal, A. K., & Gunal, S. (2014). The impact of preprocessing on text classification.
Information Processing And Management, 50, 104-112. doi:10.1016/j.ipm.2013.08.006
[18] Salton, G., Wong, A., & Yang, C. S. (1975). A Vector Space Model for Automatic
Indexing. Communications Of The ACM, 18(11), 613-620. doi:10.1145/361219.361220
[19] Ning Liu, Benyu Zhang, Jun Yan, Zheng Chen, Wenyin Liu, Fengshan Bai, Leefeng Chien.
2005. Text Representation: From Vector to Tensor. In: IEEE International Conference on
Data Mining, ICDM, 2005. p.725-728.
[20] Munková, D., Munk, M., & Vozár, M. (2013). Data Pre-processing Evaluation for Text
Mining: Transaction/Sequence Model. Procedia Computer Science, 18(2013 International
Conference on Computational Science), 1198-1207. doi:10.1016/j.procs.2013.05.286
[21] Muñoz, A., & Álvarez, I. (2014). Esteganografía linguística en lengua española basada en
modelo N-gram y ley de Zipf. Arbor.
[22] Ramesh, B., Xiang, C., & Lee, T. H. (2015). Shape classification using invariant features
and contextual information in the bag-of-words model. Pattern Recognition, 48, 894-906.
doi:10.1016/j.patcog.2014.09.019
[23] Ferilli, S., Esposito, F., & Grieco, D. (2014). Automatic Learning of Linguistic Resources
for Stopword Removal and Stemming from Text. Procedia Computer Science, 38(10th
Italian Research Conference on Digital Libraries, IRCDL 2014), 116-123.
doi:10.1016/j.procs.2014.10.019
[24] C., K. C., Anzola, J. P., & B., G. T. (2015). Classification Methodology Of Research
Topics Based In Decision Trees: J48 And Random tree. International Journal Of Applied
Engineering Research, 10(8), 19413-19424.
[25] Yan-yan, S., & Ying, L. (2015). Decision tree methods: applications for classification and
prediction. Shanghai Archives Of Psychiatry, 27(2), 130-135. doi:10.11919/j.issn.1002-
0829.215044
[26] Spasić, I., Livsey, J., Keane, J., & Nenadić, G. (2014). Text mining of cancer-related
information: Review of current status and future directions. Sciencedirect.com. Retrieved
12 May 2015, from
http://www.sciencedirect.com/science/article/pii/S1386505614001105
[27] Ostrand, T., & Weyuker, E. (2007). How to measure success of fault prediction models.
Fourth International Workshop on Software Quality Assurance in Conjunction with the
6th ESEC/FSE Joint Meeting - SOQUA '07.
[28] Bowes, D., Hall, T., & Gray, D. (2013). DConfusion: A technique to allow cross study
performance evaluation of fault prediction studies. Autom Softw Eng Automated
Software Engineering, 287-313.
[29] Spasić, I., Livsey, J., Keane, J., & Nenadić, G. (2014). Text mining of cancer-related
information: Review of current status and future directions. Sciencedirect.com. Retrieved
12 May 2015, from
http://www.sciencedirect.com/science/article/pii/S1386505614001105
[30] Newzealand.com, (2015). Plantas y animales de Nueva Zelanda | Ruapehu, Nueva Zelanda.
Retrieved 11 July 2015, from http://www.newzealand.com/ar/feature/new-zealand-flora-
and-fauna/
[31] Hall, M., Frank, E., Geoffrey, H., Pfahringer, B., Reutemann, P., & Witten, I. H. (2009).
The WEKA Data Mining Software: An Update; SIGKDD Explorations, Volume 11,
Issue 1.
[32] Garcia, D., (2006). Weka Tutorial (Spanish). Retrieved 12 July 2015, from
http://www.metaemotion.com/diego.garcia.morate/download/weka.pdf
[33] Salton, G., Wong, A., & Yang, C. (1975). A vector space model for automatic indexing.
Communications of the ACM, 613-620.
[34] Namee, B. (2012). DIT MSc in Computing (Data Analytics): Text Analytics in Weka.
Ditmscda.blogspot.ca. Retrieved from http://ditmscda.blogspot.ca/2012/03/text-analytics-
in-weka.html
[35] Snowball.tartarus.org,. (2015). Defining R1 and R2. Retrieved 24 July 2015, from
http://snowball.tartarus.org/texts/r1r2.html
[36] Snowball.tartarus.org,. (2015). Spanish stemming algorithm. Retrieved from
http://snowball.tartarus.org/algorithms/spanish/stemmer.html
[37] Weka.wikispaces.com,. (2015). weka - GenericObjectEditor (book version). Retrieved from
https://weka.wikispaces.com/GenericObjectEditor+%28book+version%29
[38] Cs.cmu.edu, (2015). Rainbow. Retrieved from
http://www.cs.cmu.edu/~mccallum/bow/rainbow/
[39] Weka.sourcearchive.com,. (2015). weka 3.6.0-3,
classweka_1_1filters_1_1unsupervised_1_1attribute_1_1StringToWordVector_a4ad7e64
ecb476e527a19afee2c96aea6.html. Retrieved from
http://weka.sourcearchive.com/documentation/3.6.0-
3/classweka_1_1filters_1_1unsupervised_1_1attribute_1_1StringToWordVector_a4ad7e
64ecb476e527a19afee2c96aea6.html
[40] Drazin, S., & Montag, M. (2015). Decision Tree Analysis using Weka. University of
Miami. Machine Learning – Project II. Retrieved from
http://ww.samdrazin.com/classes/een548/project2report.pdf
[41] Spasić, I., Livsey, J., Keane, J., & Nenadić, G. (2014). Text mining of cancer-related
information: Review of current status and future directions. Sciencedirect.com. Retrieved
12 May 2015, from
http://www.sciencedirect.com/science/article/pii/S1386505614001105
[42] Jindala, R., & Tanejab, S. (2015). A Lexical Approach for Text Categorization of Medical
Documents. Procedia Computer Science 46 314 – 320.
[43] Bui, D., & Zeng-Treitler, Q. (2014). Learning regular expressions for clinical text
classification. J Am Med Inform Assoc. 2014 Sep-Oct;21(5):850-7. doi: 10.1136/amiajnl-
2013-002411.
[44] Pérez, A., Gojenola, K., Casillas, A., Oronoz, M., & Díaz de Ilarraza, A. (2015). Computer
aided classification of diagnostic terms in spanish. Expert Systems With Applications,
42, 2949-2958. doi:10.1016/j.eswa.2014.11.035
[45] Vilares, D., Alonso, M. A., & Gómez, C. (2015). A syntactic approach for opinion mining
on Spanish reviews. Natural Language Engineering, 21(1), 139.
[46] Pérez Abelleira, M. Alicia, & Cardoso, Alejandra Carolina. (2010). Minería de texto para la
categorización automática de documentos. Cuadernos de la Facultad 5.
[47] Shams, R. (2015). Weka Tutorial 31: Document Classification 1 (Application). YouTube.
Retrieved 15 May 2015, from https://www.youtube.com/watch?v=jSZ9jQy1sfE
[48] Shams, R. (2015). Weka Tutorial 32: Document classification 2 (Application). YouTube.
Retrieved 15 May 2015, from https://www.youtube.com/watch?v=zlVJ2_N_Olo
[46] Rodríguez, J., Calot, E., & Merlino, H. (2014). Clasificación de prescripciones médicas en
español. Sedici.unlp.edu.ar. Retrieved 15 May 2015, from
http://sedici.unlp.edu.ar/handle/10915/42402
[49] Weinberg, B. (2015). Weka Text Classification for First Time & Beginner Users. YouTube.
Retrieved 15 May 2015, from https://www.youtube.com/watch?v=IY29uC4uem8.
[50] Nlm.nih.gov,. (2015). PubMed Tutorial - Building the Search - How It Works - Stopwords.
Retrieved 18 May 2015, from
http://www.nlm.nih.gov/bsd/disted/pubmedtutorial/020_170.html