data mining part i iiit allahabad

64
DATA MINING Part I IIIT Allahabad Margaret H. Dunham Department of Computer Science and Engineering Southern Methodist University Dallas, Texas 75275, USA [email protected] http://lyle.smu.edu/~ mhd/dmiiit.html Some slides extracted from Data Mining, Introductory and Advanced Topics, Prentice Hall, 2002. Support provided by Fulbright Grant and IIIT Allahabad IIIT Allahabad 1

Upload: jude

Post on 09-Feb-2016

60 views

Category:

Documents


1 download

DESCRIPTION

DATA MINING Part I IIIT Allahabad. Margaret H. Dunham Department of Computer Science and Engineering Southern Methodist University Dallas, Texas 75275, USA [email protected] http://lyle.smu.edu/~ mhd/dmiiit.html - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: DATA MINING  Part I IIIT Allahabad

1

DATA MINING Part I

IIIT Allahabad

Margaret H. DunhamDepartment of Computer Science and Engineering

Southern Methodist UniversityDallas, Texas 75275, USA

[email protected]://lyle.smu.edu/~mhd/dmiiit.html

Some slides extracted from Data Mining, Introductory and Advanced Topics, Prentice Hall, 2002.

Support provided by Fulbright Grant and IIIT AllahabadIIIT Allahabad

Page 2: DATA MINING  Part I IIIT Allahabad

2IIIT Allahabad

Page 3: DATA MINING  Part I IIIT Allahabad

3

Data Mining Outline

Part I: Introduction (19/1 – 20/1) Part II: Classification (24/1 – 27/1) Part III: Clustering (31/1 – 3/2) Part IV: Association Rules (7/2 – 10/2) Part V: Applications (14/2 – 17/2)

IIIT Allahabad

Page 4: DATA MINING  Part I IIIT Allahabad

Class Structure

Each class is two hours Tuesday/Wednesday presentation Thursday/Friday Lab

4IIIT Allahabad

Page 5: DATA MINING  Part I IIIT Allahabad

5

Data Mining Part I Introduction Outline

Lecture– Define data mining– Data mining vs. databases– Basic data mining tasks– Data mining development– Data mining issues

Lab– Download XLMiner and Weka– Analyze simple dataset

IIIT Allahabad

Goal: Provide an overview of data mining.

Page 6: DATA MINING  Part I IIIT Allahabad

6

Introduction

Data is growing at a phenomenal rate Users expect more sophisticated

information How?

IIIT Allahabad

UNCOVER HIDDEN INFORMATIONDATA MINING

Page 7: DATA MINING  Part I IIIT Allahabad

7

Data Mining Definition

Finding hidden information in a database

Fit data to a model Similar terms

– Exploratory data analysis– Data driven discovery– Deductive learning

IIIT Allahabad

Page 8: DATA MINING  Part I IIIT Allahabad

8

Data Mining Algorithm

Objective: Fit Data to a Model– Descriptive– Predictive

Preference – Technique to choose the best model

Search – Technique to search the data– “Query”

IIIT Allahabad

Page 9: DATA MINING  Part I IIIT Allahabad

9

Database Processing vs. Data Mining Processing

Query– Well defined– SQL

Query– Poorly defined– No precise query language

IIIT Allahabad

Data– Operational data

Output– Precise– Subset of database

Data– Not operational data

Output– Fuzzy– Not a subset of database

Page 10: DATA MINING  Part I IIIT Allahabad

10

Query Examples Database

Data Mining

IIIT Allahabad

– Find all customers who have purchased milk

– Find all items which are frequently purchased with milk. (association rules)

– Find all credit applicants with last name of Smith.– Identify customers who have purchased more

than $10,000 in the last month.

– Find all credit applicants who are poor credit risks. (classification)

– Identify customers with similar buying habits. (Clustering)

Page 11: DATA MINING  Part I IIIT Allahabad

11

Basic Data Mining Tasks Classification maps data into predefined

groups or classes– Supervised learning– Prediction– Regression

Clustering groups similar data together into clusters.– Unsupervised learning– Segmentation– Partitioning

IIIT Allahabad

Page 12: DATA MINING  Part I IIIT Allahabad

12

Basic Data Mining Tasks (cont’d)

Link Analysis uncovers relationships among data.– Affinity Analysis– Association Rules– Sequential Analysis determines sequential

patterns.

IIIT Allahabad

Page 13: DATA MINING  Part I IIIT Allahabad

13

CLASSIFICATION

Assign data into predefined groups or classes.

IIIT Allahabad

Page 14: DATA MINING  Part I IIIT Allahabad

14

But it isn’t Magic

You must know what you are looking for You must know how to look for you

IIIT Allahabad

Suppose you knew that a specific cave had gold:

What would you look for?

How would you look for it?

Might need an expert miner

Page 15: DATA MINING  Part I IIIT Allahabad

15

“If it looks like a duck, walks like a duck, and quacks like a duck, then

it’s a duck.”

IIIT Allahabad

Description Behavior Associations

Classification Clustering Link Analysis (Profiling) (Similarity)

“If it looks like a terrorist, walks like a terrorist, and quacks like a terrorist, then

it’s a terrorist.”

Page 16: DATA MINING  Part I IIIT Allahabad

16

Classification Ex: Grading

IIIT Allahabad

>=90<90x

>=80<80

x

>=70<70

x

F

B

A

>=60<50

x C

D

Page 17: DATA MINING  Part I IIIT Allahabad

17IIIT Allahabad

Grasshoppers

KatydidsGiven a collection of annotated data. (in this case 5 instances of Katydids and five of Grasshoppers), decide what type of insect the unlabeled example is.

(c) Eamonn Keogh, [email protected]

user
Page 18: DATA MINING  Part I IIIT Allahabad

18IIIT Allahabad

Insect ID Abdomen Length

Antennae Length

Insect Class

1 2.7 5.5 Grasshopper

2 8.0 9.1 Katydid

3 0.9 4.7 Grasshopper

4 1.1 3.1 Grasshopper

5 5.4 8.5 Katydid

6 2.9 1.9 Grasshopper

7 6.1 6.6 Katydid

8 0.5 1.0 Grasshopper

9 8.3 6.6 Katydid

10 8.1 4.7 Katydid

11 5.1 7.0 ???????

The classification problem can now be expressed as:

Given a training database predict the class label of a previously unseen instance

previously unseen instance = (c) Eamonn Keogh, [email protected]

Page 19: DATA MINING  Part I IIIT Allahabad

19IIIT Allahabad

Ant

enna

Len

gth

10

1 2 3 4 5 6 7 8 9 10

123456789

Grasshoppers KatydidsAbdomen Length

(c) Eamonn Keogh, [email protected]

Page 20: DATA MINING  Part I IIIT Allahabad

20IIIT Allahabad

Facial Recognition

(c) Eamonn Keogh, [email protected]

Page 21: DATA MINING  Part I IIIT Allahabad

21IIIT Allahabad

Handwriting

Recognition

George Washington Manuscript

0 50 100 150 200 250 300 350 400 4500

0.5

1

(c) Eamonn Keogh, [email protected]

Page 22: DATA MINING  Part I IIIT Allahabad

22

Anomaly Detection

IIIT Allahabad

Page 23: DATA MINING  Part I IIIT Allahabad

23IIIT Allahabad

Page 24: DATA MINING  Part I IIIT Allahabad

24

CLUSTERING

Partition data into previously undefined groups.

IIIT Allahabad

Page 25: DATA MINING  Part I IIIT Allahabad

25IIIT Allahabad

http://149.170.199.144/multivar/ca.htm

Page 26: DATA MINING  Part I IIIT Allahabad

26IIIT Allahabad

What is Similarity?

(c) Eamonn Keogh, [email protected]

Page 27: DATA MINING  Part I IIIT Allahabad

27

Two Types of Clustering

IIIT Allahabad

Hierarchical Partitional

(c) Eamonn Keogh, [email protected]

Page 28: DATA MINING  Part I IIIT Allahabad

28

Hierarchical Clustering Example

Iris Data Set

IIIT Allahabad

Setosa

Versicolor

VirginicaThe data originally appeared in Fisher, R. A. (1936). "The Use of Multiple Measurements in Axonomic Problems," Annals of Eugenics 7, 179-188.Hierarchical Clustering Explorer Version 3.0, Human-Computer Interaction Lab, University of Maryland, http://www.cs.umd.edu/hcil/multi-cluster .

Page 29: DATA MINING  Part I IIIT Allahabad

29

http://www.time.com/time/magazine/article/0,9171,1541283,00.html

IIIT Allahabad

Page 30: DATA MINING  Part I IIIT Allahabad

30

Microarray Data Analysis Each probe location associated with gene Color indicates degree of gene expression Compare different samples (normal/disease) Track same sample over time Questions

– Which genes are related to this disease?– Which genes behave in a similar manner?– What is the function of a gene?

Clustering– Hierarchical– K-means

IIIT Allahabad

Page 31: DATA MINING  Part I IIIT Allahabad

31

Microarray Data - Clustering

IIIT Allahabad

"Gene expression profiling identifies clinically relevant subtypes of prostate cancer"Proc. Natl. Acad. Sci. USA, Vol. 101, Issue 3, 811-816, January 20, 2004

Page 32: DATA MINING  Part I IIIT Allahabad

32

ASSOCIATION RULES/ LINK ANALYSIS

Find relationships between data

IIIT Allahabad

Page 33: DATA MINING  Part I IIIT Allahabad

33

ASSOCIATION RULES EXAMPLES

People who buy diapers also buy beer If gene A is highly expressed in this

disease then gene A is also expressed Relationships between people Book Stores Department Stores Advertising Product PlacementIIIT Allahabad

Page 34: DATA MINING  Part I IIIT Allahabad

34IIIT Allahabad

Data Mining Introductory and Advanced Topics, by Margaret H. Dunham, Prentice Hall, 2003.DILBERT reprinted by permission of United Feature Syndicate, Inc.

Page 35: DATA MINING  Part I IIIT Allahabad

35IIIT Allahabad

Joshua Benton and Holly K. Hacker, “At Charters, Cheating’s off the Charts:, Dallas Morning News, June 4, 2007.

Page 36: DATA MINING  Part I IIIT Allahabad

36

No/Little Cheating

IIIT Allahabad

Joshua Benton and Holly K. Hacker, “At Charters, Cheating’s off the Charts:, Dallas Morning News, June 4, 2007.

Page 37: DATA MINING  Part I IIIT Allahabad

37

Rampant Cheating

IIIT Allahabad

Joshua Benton and Holly K. Hacker, “At Charters, Cheating’s off the Charts:, Dallas Morning News, June 4, 2007.

Page 38: DATA MINING  Part I IIIT Allahabad

38IIIT Allahabad

Jialun Qin, Jennifer J. Xu, Daning Hu, Marc Sageman and Hsinchun Chen, “Analyzing Terrorist Networks: A Case Study of the Global Salafi Jihad Network”  Lecture Notes in Computer Science, Publisher: Springer-Verlag GmbH, Volume 3495 / 2005 , p. 287.

Page 39: DATA MINING  Part I IIIT Allahabad

39

Ex: Stock Market Analysis

Example: Stock Market Predict future values Determine similar patterns over time Classify behavior

IIIT Allahabad

Page 40: DATA MINING  Part I IIIT Allahabad

40

Ex: Stock Market Analysis

IIIT Allahabad

Page 41: DATA MINING  Part I IIIT Allahabad

41

Data Mining vs. KDD

Knowledge Discovery in Databases (KDD): process of finding useful information and patterns in data.

Data Mining: Use of algorithms to extract the information and patterns derived by the KDD process.

IIIT Allahabad

Page 42: DATA MINING  Part I IIIT Allahabad

42

KDD Process

Selection: Obtain data from various sources. Preprocessing: Cleanse data. Transformation: Convert to common format.

Transform to new format. Data Mining: Obtain desired results. Interpretation/Evaluation: Present results to

user in meaningful manner.IIIT Allahabad

Modified from [FPSS96C]

Page 43: DATA MINING  Part I IIIT Allahabad

43

KDD Process Ex: Web Log Selection:

– Select log data (dates and locations) to use Preprocessing:

– Remove identifying URLs; Remove error logs Transformation:

– Sessionize (sort and group) Data Mining:

– Identify and count patterns; Construct data structure Interpretation/Evaluation:

– Identify and display frequently accessed sequences. Potential User Applications:

– Cache prediction– Personalization

IIIT Allahabad

Page 44: DATA MINING  Part I IIIT Allahabad

Related Topics

Databases OLTP OLAP Information Retrieval

44IIIT Allahabad

Page 45: DATA MINING  Part I IIIT Allahabad

45

DB & OLTP Systems Schema

– (ID,Name,Address,Salary,JobNo) Data Model

– ER– Relational

Transaction Query:

SELECT NameFROM TWHERE Salary > 100000

DM: Only imprecise queriesIIIT Allahabad

Page 46: DATA MINING  Part I IIIT Allahabad

46

Classification/Prediction is Fuzzy

Loan

Amnt

Simple Fuzzy

Accept Accept

RejectReject

IIIT Allahabad

Page 47: DATA MINING  Part I IIIT Allahabad

47

Information Retrieval Information Retrieval (IR): retrieving desired

information from textual data. Library Science Digital Libraries Web Search Engines Traditionally keyword based Sample query:

Find all documents about “data mining”.

DM: Similarity measures; Mine text/Web data.

IIIT Allahabad

Page 48: DATA MINING  Part I IIIT Allahabad

48

Information Retrieval (cont’d) Similarity: measure of how close a

query is to a document. Documents which are “close enough”

are retrieved. Metrics:

–Precision = |Relevant and Retrieved| |Retrieved|

–Recall = |Relevant and Retrieved| |Relevant|

IIIT Allahabad

Page 49: DATA MINING  Part I IIIT Allahabad

49

IR Query Result Measures and Classification

IR ClassificationIIIT Allahabad

Page 50: DATA MINING  Part I IIIT Allahabad

50

OLAP Online Analytic Processing (OLAP): provides more

complex queries than OLTP. OnLine Transaction Processing (OLTP): traditional

database/transaction processing. Dimensional data; cube view Visualization of operations:

– Slice: examine sub-cube.– Dice: rotate cube to look at another dimension.– Roll Up/Drill Down

DM: May use OLAP queries.

IIIT Allahabad

Page 51: DATA MINING  Part I IIIT Allahabad

51

DM vs. Related TopicsArea Query Data Results Output

DB/OLTP Precise Database Precise DB Objects or Aggregation

IR Precise Documents Vague Documents OLAP Analysis Multidimensional Precise DB Objects

or Aggregation

DM Vague Preprocessed Vague KDD Objects

IIIT Allahabad

Page 52: DATA MINING  Part I IIIT Allahabad

52

Data Mining Development

IIIT Allahabad

• Similarity Measures• Hierarchical Clustering• IR Systems• Imprecise Queries• Textual Data• Web Search Engines

• Bayes Theorem• Regression Analysis• EM Algorithm• K-Means Clustering• Time Series Analysis

• Neural Networks• Decision Tree Algorithms

• Algorithm Design Techniques• Algorithm Analysis• Data Structures

• Relational Data Model• SQL• Association Rule

Algorithms• Data Warehousing• Scalability Techniques

Page 53: DATA MINING  Part I IIIT Allahabad

53

KDD Issues

Human Interaction Overfitting Outliers Interpretation Visualization Large Datasets High Dimensionality

IIIT Allahabad

Page 54: DATA MINING  Part I IIIT Allahabad

Overfitting Suppose we want to predict whether an individual is

short, medium, or tall in height. What is wrong with this data?

54IIIT Allahabad

Name Gender Height OutputMary F 1.6 ShortMaggie F 1.9 MediumMartha F 1.88 MediumStephanie F 1.7 ShortBob M 1.85 MediumKathy F 1.6 ShortGeorge M 1.7 ShortDebbie F 1.8 MediumTodd M 1.95 MediumKim F 1.9 MediumAmy F 1.8 MediumWynette F 1.75 Medium

Page 55: DATA MINING  Part I IIIT Allahabad

55

KDD Issues (cont’d)

Multimedia Data Missing Data Irrelevant Data Noisy Data Changing Data Integration Application

IIIT Allahabad

Page 56: DATA MINING  Part I IIIT Allahabad

56

WARNING With data mining you don’t always know

what you are looking for. There is not one right answer. The data you are using is noisy Data Mining is a very applied discipline. A data mining course provides you tools

to use to analyze data. Experience provides you knowledge of

how to use these tools.IIIT Allahabad

Page 58: DATA MINING  Part I IIIT Allahabad

58IIIT Allahabad

Page 59: DATA MINING  Part I IIIT Allahabad

59

Social Implications of DM

Privacy Profiling Unauthorized use Invalid results and claims

IIIT Allahabad

Page 60: DATA MINING  Part I IIIT Allahabad

60

Data Mining Metrics

Usefulness Return on Investment (ROI) Accuracy … Space/Time

IIIT Allahabad

Page 61: DATA MINING  Part I IIIT Allahabad

61

Visualization Techniques

Graphical Geometric Icon-based Pixel-based Hierarchical Hybrid

IIIT Allahabad

Page 62: DATA MINING  Part I IIIT Allahabad

62

Models Based on Summarization

Visualization: Frequency distribution, mean, variance, median, mode, etc.

Box Plot:

IIIT Allahabad

Page 63: DATA MINING  Part I IIIT Allahabad

63

Scatter Diagram

IIIT Allahabad

Page 64: DATA MINING  Part I IIIT Allahabad

DM Tools XLMiner – Easy addin to Excelhttp://www.solver.com/xlminer/index.html Weka – Open Source; Visualization,

Functionality, Interfacehttp://www.cs.waikato.ac.nz/ml/weka/ SAS (JMP) – Commercial Product SPSS – Commercial Product MATLAB – Statistical/Math Applications R – Programming

64IIIT Allahabad