Introduction to Data Mining

Dr Arulsivanathan Naidoo

Statistics South Africa

OECD Conference Cape Town

8-10 December 2010

Outline

• Introduction

• Statistics vs Knowledge Discovery

• Predictive Modeling

• Data Mining Examples

• Census 2011

• ROC

• Conclusions

Introduction

• What is Data Mining?

• Data Mining is a general term.

• Data mining is defined as the application of intelligent techniques such as decision trees, neural networks, fuzzy logic, genetic algorithms, the nearest neighbour method, rule induction and data visualization to large quantities of data in order to discover hidden trends, patterns and relationships (Lam and Kamber 2006).

Origins of Data Mining

Artificial Intelligence

Databases

Statistics

Data Mining

KDD

Pattern Recognition

Machine Learning

Neurocomputing

Hypothesis Testing

Statement

Hypothesis

Analysis

Decision

Accept H0

Top-down

Knowledge Discovery

Question?

Data

What item is purchased with disposable baby napkins?

Answer: Beer

Statement

Bottom-up

Unsupervised learning

Data

• Association: items bought together

• Disassociation: items not bought together

• Sequential: items bought in order

• Cluster / SOM: grouping into segments
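
The association idea (items bought together) is commonly quantified with support, confidence and lift. The following is a minimal sketch in Python, assuming a handful of invented shopping baskets; the basket contents and the function name `rule_metrics` are illustrative and do not come from the presentation.

```python
def rule_metrics(baskets, antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    n = len(baskets)
    n_a = sum(1 for b in baskets if antecedent in b)
    n_c = sum(1 for b in baskets if consequent in b)
    n_both = sum(1 for b in baskets if antecedent in b and consequent in b)
    support = n_both / n                               # share of baskets with both items
    confidence = n_both / n_a if n_a else 0.0          # P(consequent | antecedent)
    lift = confidence / (n_c / n) if n_c else 0.0      # confidence relative to base rate
    return support, confidence, lift


# Hypothetical transactions, invented for illustration only.
baskets = [
    {"napkins", "beer", "milk"},
    {"napkins", "beer"},
    {"napkins", "bread"},
    {"beer", "crisps"},
    {"milk", "bread"},
]

support, confidence, lift = rule_metrics(baskets, "napkins", "beer")
print(f"support={support:.2f} confidence={confidence:.2f} lift={lift:.2f}")
```

A lift above 1 suggests the two items appear together more often than their individual frequencies would predict.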

Supervised Learning

Data

Target Variable

Decision Tree

Regression

Neural Network

Two Stage

What is a Model?

One Word

Equation

Straight Line

Y = mX + c

Example: Countryside
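
To make the straight-line model Y = mX + c concrete, here is a small sketch that estimates m and c by ordinary least squares. The data points and the function name `fit_line` are assumptions made for illustration; the presentation does not specify a fitting method or a data set.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one input: returns slope m and intercept c."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    m = sxy / sxx
    c = mean_y - m * mean_x
    return m, c


# Hypothetical observations, invented for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.1, 4.9, 7.2, 8.8, 11.0]

m, c = fit_line(xs, ys)
print(f"Y = {m:.2f} X + {c:.2f}")
```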

Decision Tree

A decision tree model is constructed by segmenting a dataset using a series of simple rules, resulting in a hierarchy of segments within segments. Algorithms such as CHAID (chi-squared automatic interaction detection) can be used to decide how to split the segments. The hierarchy is called a tree and each segment a node.
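
The split-selection step can be sketched as follows: compute the Pearson chi-squared statistic for each candidate split against the target and choose the largest. This is only the core idea, not a full CHAID implementation (which also merges categories and adjusts p-values). The counts and variable names below are hypothetical.

```python
def chi_squared(table):
    """Pearson chi-squared statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            if expected > 0:
                stat += (observed - expected) ** 2 / expected
    return stat


# Hypothetical counts. Rows are the segments produced by a split,
# columns are the target classes (men, women).
candidate_splits = {
    "hair_length": [[80, 30], [20, 70]],   # short hair, long hair
    "earrings": [[10, 60], [90, 40]],      # earrings, no earrings
}

scores = {name: chi_squared(table) for name, table in candidate_splits.items()}
best = max(scores, key=scores.get)
print(best, round(scores[best], 2))
```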

Decision Tree

100 M, 100 W

Short Hair / Long Hair

Earrings / No Earrings

Predict that everyone with short hair and earrings is female
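
A fitted tree is simply a set of nested rules. The sketch below writes the example as code; only the "short hair and earrings" leaf is stated on the slide, and the other two leaves, as well as the function name `predict_gender`, are assumptions added so that the function returns a value on every path.

```python
def predict_gender(hair, earrings):
    """Walk a two-level tree: split on hair length first, then on earrings."""
    if hair == "short":
        if earrings:
            return "female"   # leaf stated on the slide
        return "male"         # assumed leaf, not given on the slide
    return "female"           # assumed leaf for long hair, not given on the slide


print(predict_gender("short", True))
```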

Regression

x1

x2

x3

Y

Neural Networks

x1

x2

x3

Y

H1

H2

Inputs

Black Box

Outputs
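
The diagram (inputs x1, x2, x3, hidden units H1 and H2, output Y) corresponds to a forward pass like the one sketched below. The sigmoid activation and all weight values are assumptions chosen for illustration; in practice the weights are learned from training data.

```python
import math


def sigmoid(z):
    """Logistic activation, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, hidden_weights, hidden_bias, output_weights, output_bias):
    """Forward pass: 3 inputs -> 2 hidden units (H1, H2) -> 1 output Y."""
    hidden = [
        sigmoid(sum(w * xi for w, xi in zip(weights, x)) + b)
        for weights, b in zip(hidden_weights, hidden_bias)
    ]
    y = sigmoid(sum(w * h for w, h in zip(output_weights, hidden)) + output_bias)
    return hidden, y


# Hypothetical weights, invented for illustration only.
hidden_weights = [[0.5, -0.2, 0.1], [-0.3, 0.8, 0.4]]
hidden_bias = [0.0, 0.1]
output_weights = [1.2, -0.7]
output_bias = 0.05

hidden, y = forward([1.0, 2.0, 3.0], hidden_weights, hidden_bias, output_weights, output_bias)
print(hidden, y)
```

The "black box" label reflects that the hidden-unit values have no direct interpretation, unlike the rules of a decision tree.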

Two Stage

Buy from every Catalogue R100

Buy from Catalogue once/year R5000
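
One common reading of a two-stage model is that the first stage estimates whether a customer buys and the second estimates how much is spent, with the product giving the expected value per mailing. The sketch below applies that reading to the two customers on the slide. Treating R100 and R5000 as purchase amounts, and the figure of 12 catalogues per year, are assumptions not stated in the presentation.

```python
def expected_spend(p_buy, amount_if_buy):
    """Two-stage estimate: P(buy) from stage one times amount from stage two."""
    return p_buy * amount_if_buy


CATALOGUES_PER_YEAR = 12  # assumption, not stated on the slide

# Customer who buys from every catalogue and spends R100 each time.
every_time = expected_spend(1.0, 100.0)

# Customer who buys from one catalogue a year and spends R5000.
once_a_year = expected_spend(1.0 / CATALOGUES_PER_YEAR, 5000.0)

print(f"every catalogue: R{every_time:.2f} per mailing")
print(f"once a year:     R{once_a_year:.2f} per mailing")
```

Under these assumptions a model that predicts only the probability of purchase would favour the first customer, while the combined estimate favours the second.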

Eurostat Funding

KESO (Knowledge Extraction for Statistical Offices): a Eurostat project with the goal of constructing a versatile, efficient, industrial-strength data mining system that satisfies the needs of providers of large-scale databases.

SPIN (Spatial Mining for Data of Public Interest): developed to support statistical offices in the timely and cost-effective dissemination of statistical data by integrating state-of-the-art GIS and data mining functionality in an open, highly extensible, internet-enabled plug-in architecture.

IDSA (Intelligent Data Control System), Hassain et al 2010: an application of data mining to official statistics.

NASS Decision Trees

Census Non Response Weighting

Census Mail List Trimming

Analysis of reporting Errors

Allocation of Survey Incentives

Prediction of Survey Non Respondents

NASS

Association Analysis

• Survey Data Edit Design

Cluster Analysis

• 2007 Census Donor Pool Screening

• Questionnaire Design and Construction

• Identifying Subtypes of Records Missing from the Census Mail List

Examples

• Absa Branch Robberies

• Old Mutual Policies

• MTN prepaid

• HSBC Bank Credit Cards

• Royal Saudi Air Force

• Census 2011

Census 2011

[Flow diagram. Labels: Census 2001; Sample; Data; Model A; Model B; Model C; Assess; Score; Results (Ranking); Will Respond; High Wall Areas; Informal Areas]

Prediction Types

Training Data

Case 1: inputs, target
Case 2: inputs, target
Case 3: inputs, target
Case 4: inputs, target
Case 5: inputs, target

Predictions

• Decisions

• Rankings

• Estimates

Prediction Types

Training Data              Decisions
Case 1: inputs, target     Success
Case 2: inputs, target     Failure
Case 3: inputs, target     Failure
Case 4: inputs, target     Success
Case 5: inputs, target     Success

Prediction Types

Training Data              Rankings
Case 1: inputs, target     680
Case 2: inputs, target     720
Case 3: inputs, target     640
Case 4: inputs, target     582
Case 5: inputs, target     635

Prediction Types

Training Data              Estimates
Case 1: inputs, target     0.45
Case 2: inputs, target     0.53
Case 3: inputs, target     0.62
Case 4: inputs, target     0.55
Case 5: inputs, target     0.47
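
The three prediction types are related: an estimate can be turned into a ranking by sorting and into a decision by applying a cut-off. The sketch below uses the five estimates from this slide; the 0.5 cut-off is an assumption. The slides for each prediction type appear to be independent illustrations, so the output is not expected to reproduce the decision or ranking values shown on them.

```python
# Estimates taken from the slide, keyed by case.
estimates = {
    "Case 1": 0.45,
    "Case 2": 0.53,
    "Case 3": 0.62,
    "Case 4": 0.55,
    "Case 5": 0.47,
}

THRESHOLD = 0.5  # assumed cut-off, not given in the slides

# Rankings: only the order of the scores matters.
ranking = sorted(estimates, key=estimates.get, reverse=True)

# Decisions: compare each estimate with the cut-off.
decisions = {
    case: ("Success" if p >= THRESHOLD else "Failure")
    for case, p in estimates.items()
}

print(ranking)
print(decisions)
```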

Prediction Type    Validation Fit Statistic          Direction
Decisions          Misclassification                 Smallest
Decisions          Average Profit/Loss               Largest/Smallest
Decisions          Kolmogorov-Smirnov Statistic      Largest
Rankings           ROC Index (Concordance)           Largest
Rankings           Gini Coefficient                  Largest
Estimates          Average Square Error              Smallest
Estimates          Schwarz’s Bayesian Criterion      Smallest
Estimates          Log-likelihood                    Largest
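
Two of these fit statistics are simple enough to sketch directly: the misclassification rate for decisions and the average squared error for estimates, both of which should be as small as possible on validation data. The validation outcomes, the estimates and the 0.5 cut-off below are assumptions made for illustration.

```python
def misclassification_rate(targets, estimates, threshold=0.5):
    """Share of cases where the thresholded estimate disagrees with the target."""
    wrong = sum(1 for t, p in zip(targets, estimates) if (p >= threshold) != bool(t))
    return wrong / len(targets)


def average_squared_error(targets, estimates):
    """Mean of the squared differences between target and estimate."""
    return sum((t - p) ** 2 for t, p in zip(targets, estimates)) / len(targets)


# Hypothetical validation data, invented for illustration only.
targets = [1, 0, 1, 1, 0]
estimates = [0.8, 0.3, 0.4, 0.9, 0.6]

print(misclassification_rate(targets, estimates))
print(average_squared_error(targets, estimates))
```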

Confusion Matrix

                   Predicted positive      Predicted negative
Actual positive    a (True Positive)       b (False Negative)
Actual negative    c (False Positive)      d (True Negative)

Classes in the example: male and female.

ROC

• Sensitivity = a / (a + b)

• Specificity = d / (c + d)
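
As a worked illustration, the sketch below counts the four confusion-matrix cells a, b, c and d from lists of actual and predicted classes and then applies the two formulas. The labels, and the choice of "female" as the positive class, are assumptions; the slides do not say which class is treated as positive.

```python
def confusion_counts(actual, predicted, positive):
    """Return (a, b, c, d) = (true pos, false neg, false pos, true neg)."""
    a = b = c = d = 0
    for act, pred in zip(actual, predicted):
        if act == positive and pred == positive:
            a += 1   # true positive
        elif act == positive:
            b += 1   # false negative
        elif pred == positive:
            c += 1   # false positive
        else:
            d += 1   # true negative
    return a, b, c, d


# Hypothetical labels, invented for illustration only.
actual = ["female", "female", "female", "male", "male", "male", "male", "female"]
predicted = ["female", "female", "male", "male", "female", "male", "male", "female"]

a, b, c, d = confusion_counts(actual, predicted, positive="female")
sensitivity = a / (a + b)
specificity = d / (c + d)
print(a, b, c, d, sensitivity, specificity)
```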

ROC

The ROC (Receiver Operating Characteristic) curve was first used during World War II, following the attack on Pearl Harbor in 1941. The US Army researched how reliably Japanese aircraft could be correctly detected from their radar signals.

ROC Curve
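
An ROC curve plots sensitivity against 1 - specificity as the decision cut-off is varied. The sketch below builds the curve points, computes the area under the curve (the ROC index) with the trapezoidal rule, and derives the Gini coefficient as 2 x ROC index - 1. The outcomes and scores are hypothetical, and the function names are illustrative.

```python
def roc_points(targets, scores):
    """Return (1 - specificity, sensitivity) pairs, one per distinct cut-off."""
    positives = sum(targets)
    negatives = len(targets) - positives
    points = [(0.0, 0.0)]
    for cutoff in sorted(set(scores), reverse=True):
        tp = sum(1 for t, s in zip(targets, scores) if s >= cutoff and t == 1)
        fp = sum(1 for t, s in zip(targets, scores) if s >= cutoff and t == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Hypothetical outcomes (1 = event) and model scores, invented for illustration.
targets = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

points = roc_points(targets, scores)
roc_index = auc(points)
gini = 2 * roc_index - 1
print(points)
print(roc_index, gini)
```

A model that ranks no better than chance gives a ROC index near 0.5 and a Gini coefficient near 0.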

Conclusion

Data mining is a growing discipline which originated outside statistics, in the database community, mainly for commercial purposes. Today data mining can be considered a branch of exploratory statistics in which useful models and patterns are uncovered through the extensive use of algorithms.

Finally, who should analyse huge data sets: the national statistics offices or other research institutions?

Data mining techniques use individual records, not aggregate data, and these records are protected by law under a confidentiality clause. The national statistics offices (NSOs) are therefore best placed to do this work, which will imply new directions of research.

Conclusion

Official statistics should be a field for data mining, giving new life and value to its huge databases, but this may imply a redefinition of the visions and missions of official statistics offices. South Africa changed its vision and mission this year.

At Statistics South Africa we have acquired data mining software and started a data mining user group of over 100 researchers. We hope to start a working paper series in which some of this research will be published on our website for comment.

Thank you