TRANSCRIPT
Introduction to Data Mining
Kai Koenig (@AgentK)
Web/Mobile Developer since the late 1990s
Interested in: Java & JVM, CFML, Functional
Programming, Go, Android, Data Science
And this is my view of the world…
Me
1. What is Data Mining?
2. Concepts and Terminology
3. Weka
4. Algorithms
5. Dealing with Text
6. Java integration
Agenda
We are overwhelmed with data.
1. What is Data Mining?
Fundamentals
Why do we nowadays have SO MUCH data?
Reasons include:
- Cheap storage and better processing power
- Legal & Business requirements
- Digital hoarding
Fundamentals
Data Mining is all about going from data to useful and meaningful information.
- Recommendation in online shops
- Finding an “optimal” partner
- Weather prediction
- Judgement decisions (credit applications)
Fundamentals
A better definition
“Data Mining is defined as the process of discovering patterns in data. The process must be automatic or (more usually) semiautomatic. The patterns discovered must be meaningful in that they lead to some advantage, often an economic one.”
(Prof. Dr. Ian Witten)
How can you express patterns?
Finding and applying rules
Tear Production Rate == reduced → none
Finding and applying rules
Age == young && Astigmatism == no → soft
A Result: Decision lists
If outlook = sunny and humidity = high then play = no
If outlook = rainy and windy = true then play = no
If outlook = overcast then play = yes
If humidity = normal then play = yes
If none of the above then play = yes
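A decision list like this maps directly onto a chain of if-checks where the first matching rule wins. A minimal sketch in plain Java (the `play` helper and string encoding are illustrative, not Weka API):

```java
public class DecisionList {
    // Rules are checked top to bottom; the first one that matches
    // decides the class, exactly like a decision list.
    static String play(String outlook, String humidity, boolean windy) {
        if (outlook.equals("sunny") && humidity.equals("high")) return "no";
        if (outlook.equals("rainy") && windy) return "no";
        if (outlook.equals("overcast")) return "yes";
        if (humidity.equals("normal")) return "yes";
        return "yes"; // "if none of the above"
    }

    public static void main(String[] args) {
        System.out.println(play("sunny", "high", false));   // no
        System.out.println(play("overcast", "high", true)); // yes
    }
}
```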
Not all rules are equal
Classification rules: predict an outcome
Association rules: rules that strongly associate different attribute values
If temperature = cool then humidity = normal
If humidity = normal and windy = false then play = yes
If outlook = sunny and play = no then humidity = high
2. Concepts and Terminology
Learning
What is Learning? And what is Machine Learning?
A good approach is:
“Things learn when they change their behaviour in a way that makes them perform better in the future”
Learning types
Classification learning
Association learning
Clustering
Numerical Prediction
Some basic terminology
The thing to be learned is the concept.
The output of a learning scheme is the concept description.
Classification learning is sometimes called supervised learning. The outcome is the class.
Examples are called instances.
Some more basic terminology
Discrete attribute values are usually called nominal values; continuous attribute values are simply called numeric values.
Algorithms used to process data and find patterns are often called classifiers. There are lots of them and all of them can be heavily configured.
3. Weka
What is Weka?
Waikato Environment for Knowledge Analysis
Developed by a group in the Dept. of Computer Science at the University of Waikato in New Zealand.
Also, the weka is a bird found only in New Zealand.
What is Weka?
Download for Mac OS X, Linux and Windows:
http://www.cs.waikato.ac.nz/~ml/weka/index.html
Weka is written in Java, comes either as a native application or as an executable .jar file, and is licensed under GPL v3.
Getting data into Weka
Easiest and most common for experimenting: .arff
Also supported: CSV, JSON, XML, JDBC connections etc.
Filters in Weka can then be used to preprocess data.
Features
50+ Preprocessing tools
75+ Classification/Regression algorithms
~10 clustering algorithms
… and a package manager to load and install more if you want.
4. Algorithms
Classifiers
There are literally hundreds with lots of tuning options.
Main Categories:
- Rule-based (ZeroR, OneR, PART etc.)
- Tree-based (J48, J48graft, CART etc.)
- Bayes-based (NaiveBayes etc.)
- Functions-based (LR, Logistic etc.)
- Lazy (IB1, IBk etc.)
OneR
Very simplistic classifier and based on a single attribute.
For each attribute,
  For each value of that attribute, make a rule as follows:
    count how often each class appears
    find the most frequent class
    make the rule assign that class to this attribute value
  Calculate the error rate of the rules.
Choose the rules with the smallest error rate.
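The pseudocode above can be sketched in plain Java. This is a hand-rolled illustration of the error-counting step, not Weka's own OneR implementation, and the tiny weather-style dataset is made up:

```java
import java.util.*;

public class OneR {
    // For one attribute column: map each attribute value to its most
    // frequent class; every instance outside that majority class is an error.
    static int errors(String[] attr, String[] cls) {
        Map<String, Map<String, Integer>> counts = new HashMap<>();
        for (int i = 0; i < attr.length; i++)
            counts.computeIfAbsent(attr[i], k -> new HashMap<>())
                  .merge(cls[i], 1, Integer::sum);
        int err = 0;
        for (Map<String, Integer> perValue : counts.values()) {
            int total = 0, best = 0;
            for (int c : perValue.values()) { total += c; best = Math.max(best, c); }
            err += total - best;
        }
        return err;
    }

    public static void main(String[] args) {
        String[] outlook = {"sunny", "sunny", "overcast", "rainy", "rainy"};
        String[] windy   = {"false", "true", "false", "false", "true"};
        String[] play    = {"no", "no", "yes", "yes", "yes"};
        // OneR evaluates each attribute and keeps the one with the fewest errors.
        System.out.println("outlook errors: " + errors(outlook, play));
        System.out.println("windy errors:   " + errors(windy, play));
    }
}
```

Here `outlook` produces zero errors while `windy` produces two, so OneR would build its single rule from `outlook`.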
C4.5 (J48)
Produces a decision tree, derived from divide-and-conquer tree building techniques.
Decision trees are often verbose and need to be pruned. J48 uses post-pruning, which can in some instances be costly.
J48 usually provides a good balance regarding quality vs. cost (execution times etc.)
NaiveBayes
Very good and popular for document (text) classification.
Based on statistical modelling (Bayes formula of conditional probability)
In document classification we treat the existence or absence of a word as a Boolean attribute.
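The Bayes formula of conditional probability can be made concrete with a tiny sketch: given how likely a word is in each class and a prior, compute the posterior probability of the class given the word. All numbers here are made up purely for illustration; this is not Weka's NaiveBayes code:

```java
public class BayesDemo {
    // Bayes' rule: P(spam | word) = P(word | spam) * P(spam) / P(word),
    // where P(word) = P(word | spam) * P(spam) + P(word | ham) * P(ham).
    static double posterior(double pWordGivenSpam, double pWordGivenHam, double pSpam) {
        double pHam = 1 - pSpam;
        double pWord = pWordGivenSpam * pSpam + pWordGivenHam * pHam;
        return pWordGivenSpam * pSpam / pWord;
    }

    public static void main(String[] args) {
        // Illustrative numbers: the word appears in 80% of spam,
        // 10% of ham, and 30% of all documents are spam.
        System.out.printf("P(spam|word) = %.3f%n", posterior(0.8, 0.1, 0.3));
    }
}
```

NaiveBayes applies exactly this rule per attribute, naively assuming the attributes are independent given the class.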
Training and Testing
We implicitly trained and tested our classifiers in the previous examples using Cross-Validation.
Training and Testing
Test data and Training data NEED to be different.
If you have only one dataset, split it up.
n-fold Cross-Validation:
- Divides your dataset into n parts, holds out each part in turn
- Trains with n-1 parts, tests with the held-out part
- Stratified CV is even better
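The fold-splitting step can be sketched in a few lines of plain Java (a hand-rolled illustration of plain, non-stratified CV, not Weka's implementation):

```java
import java.util.*;

public class CrossVal {
    // Split n instance indices into k folds; each fold is held out once
    // while the other k-1 folds are used for training.
    static List<List<Integer>> folds(int n, int k, long seed) {
        List<Integer> idx = new ArrayList<>();
        for (int i = 0; i < n; i++) idx.add(i);
        Collections.shuffle(idx, new Random(seed)); // randomise before splitting
        List<List<Integer>> folds = new ArrayList<>();
        for (int f = 0; f < k; f++) folds.add(new ArrayList<>());
        for (int i = 0; i < n; i++) folds.get(i % k).add(idx.get(i));
        return folds;
    }

    public static void main(String[] args) {
        List<List<Integer>> f = folds(10, 3, 1L);
        for (int i = 0; i < f.size(); i++)
            System.out.println("held-out fold " + i + ": " + f.get(i));
    }
}
```

Stratified CV additionally keeps the class distribution roughly the same in every fold.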
5. Dealing with Text
Bag of Words
Generally for document classification we treat a document as a bag of words, and the existence or absence of a word is a Boolean attribute.
This results in a problem with very many attributes, each having just two values.
This is quite a bit different from the usual classification problem.
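The bag-of-words representation can be sketched directly: build a vocabulary, then encode each document as a Boolean vector over it. A minimal sketch in plain Java (the vocabulary and documents are made up; Weka's StringToWordVector filter does this job for real):

```java
import java.util.*;

public class BagOfWords {
    // Represent a document as a Boolean vector over the vocabulary:
    // v[i] is true iff the i-th vocabulary word occurs in the document.
    static boolean[] vector(String doc, List<String> vocab) {
        Set<String> words = new HashSet<>(Arrays.asList(doc.toLowerCase().split("\\s+")));
        boolean[] v = new boolean[vocab.size()];
        for (int i = 0; i < vocab.size(); i++) v[i] = words.contains(vocab.get(i));
        return v;
    }

    public static void main(String[] args) {
        List<String> vocab = Arrays.asList("corn", "price", "weather");
        System.out.println(Arrays.toString(vector("Corn price rises", vocab)));
        // -> [true, true, false]
    }
}
```

With a real corpus the vocabulary runs into the tens of thousands, which is exactly the "very many attributes with 2 values each" problem.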
Filtered Classifiers
First step: use Filtered classifier with J48 and StringToWordVector filter.
Example: Reuters Corn datasets (train/test)
We get 97% accuracy, but there’s still an issue here: investigate the confusion matrix.
Is accuracy the best way to evaluate quality?
Better approaches to evaluation
Accuracy: (a+d)/(a+b+c+d)
Recall: R = d/(c+d)
Precision: P = d/(b+d)
F-Measure: 2PR/(P+R)

False positive rate FP: b/(a+b)
True negative rate TN: a/(a+b)
False negative rate FN: c/(c+d)

              predicted
               –     +
true    –      a     b
        +      c     d
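Plugging in concrete counts makes the difference between the measures visible. The numbers below are derived from the J48 figures quoted in this deck (38/57 grain docs correctly classified, so c = 19; 544/547 non-grain docs, so b = 3):

```java
public class Metrics {
    // Computes the four evaluation measures from a 2x2 confusion matrix
    // laid out as on the slide: a = true-/pred-, b = true-/pred+,
    // c = true+/pred-, d = true+/pred+.
    static double[] metrics(double a, double b, double c, double d) {
        double accuracy  = (a + d) / (a + b + c + d);
        double recall    = d / (c + d);
        double precision = d / (b + d);
        double fMeasure  = 2 * precision * recall / (precision + recall);
        return new double[]{accuracy, recall, precision, fMeasure};
    }

    public static void main(String[] args) {
        double[] m = metrics(544, 3, 19, 38);
        System.out.printf("acc=%.3f recall=%.3f precision=%.3f F=%.3f%n",
                m[0], m[1], m[2], m[3]);
    }
}
```

Accuracy comes out around 96%, yet recall on the minority (grain) class is only about 67%; this is why accuracy alone can mislead.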
ROC (threshold) curves
Area under the threshold curve determines the overall quality of a classifier.
NaiveBayesMultinomial
Often the best classifier for document classification. In particular:
- good ROC
- good results on minority class (often what we want)
NaiveBayesMultinomial
J48: 96% accuracy, 38/57 on grain docs, 544/547 on non-grain docs, ROC 0.91
NaiveBayes: 80% accuracy, 46/57 on grain docs, 439/547 on non-grain docs, ROC 0.885
NaiveBayesMultinomial: 91% accuracy, 52/57 on grain docs, 496/547 on non-grain docs, ROC 0.973
NaiveBayesMultinomial
NaiveBayesMultinomial with stoplist, lowerCase and outputWords: 94% accuracy, 56/57 on grain docs, 504/547 on non-grain docs, ROC 0.978
Why? NBM is designed for text:
- based solely on word appearance
- can deal with multiple repetitions of a word
- faster than NB
6. Java integration
Weka is written in Java
The UI is essentially making use of a vast underlying data mining and machine learning API.
Obviously this fact invites us to use the API directly :)
Setting up a project (IntelliJ IDEA)
Create new Java project in IntelliJ
Import weka.jar
Import weka-src.jar
Off you go!
The main classes/packages you need…
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
Getting stuff done
// Load the training data (bReader is a java.io.BufferedReader over an .arff file)
Instances train = new Instances(bReader);
train.setClassIndex(train.numAttributes() - 1);

// Build a J48 decision tree on the training data
J48 j48 = new J48();
j48.buildClassifier(train);

// Evaluate it with 10-fold cross-validation
Evaluation eval = new Evaluation(train);
eval.crossValidateModel(j48, train, 10, new Random(1));
You can also grab Java code off Weka UI
Photo Credits
https://www.flickr.com/photos/johnnystiletto/3339808858/
https://www.flickr.com/photos/theequinest/5056055144/
https://www.flickr.com/photos/flyingkiwigirl/17385243168
https://www.flickr.com/photos/x6e38/3440973490/
https://www.flickr.com/photos/42931449@N07/5418402840/
https://www.flickr.com/photos/gerardstolk/12194108005/
https://www.flickr.com/photos/zzpza/3269784239/in/
https://www.flickr.com/photos/internationaltransportforum/14258907973/
Get in touch
Kai Koenig
Email: [email protected]
www.ventego-creative.co.nz
Blog: www.bloginblack.de
Twitter: @AgentK