1
Chapter 2
Data Mining Tasks
2
Data Mining Tasks
Prediction methods: use some variables to predict unknown or future values of the same or other variables. Inference is performed on the current data in order to make predictions.
Description methods: find human-interpretable patterns that describe the data and characterize its general properties in the database.
Descriptive mining is complementary to predictive mining, but it is closer to decision support than to decision making.
3
Cont’d
Association Rule Mining (descriptive)
Classification and Prediction (predictive)
Clustering (descriptive)
Sequential Pattern Discovery (descriptive)
Regression (predictive)
Deviation Detection (predictive)
4
Association Rule Mining
Initially developed for market basket analysis
Goal is to discover relationships between attributes
Data is typically stored in very large databases, sometimes in flat files or images
Uses include decision support, classification and clustering
Application areas include business, medicine and engineering
5
Association Rule Mining
Given a set of transactions, each of which is a set of items, find all rules (X → Y) that satisfy user-specified minimum support and confidence constraints
Support = (#T containing X and Y) / (#T)
Confidence = (#T containing X and Y) / (#T containing X)
Applications: cross selling and up selling, supermarket shelf management

Transaction  Items
T1           Bread, Jelly, Jem
T2           Bread, Jem
T3           Bread, Milk, Jem
T4           Coffee, Bread
T5           Coffee, Milk

Some rules discovered:
Bread → Jem    Sup=60%, conf=75%
Jelly → Bread  Sup=20%, conf=100%
Jelly → Jem    Sup=20%, conf=100%
Jelly → Milk   Sup=0%
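The support and confidence formulas above can be sketched in Python over the example transactions; the function names are illustrative, not from the slides:

```python
# Sketch: computing support and confidence for a rule X -> Y
# over the five example transactions from this slide.

transactions = [
    {"Bread", "Jelly", "Jem"},   # T1
    {"Bread", "Jem"},            # T2
    {"Bread", "Milk", "Jem"},    # T3
    {"Coffee", "Bread"},         # T4
    {"Coffee", "Milk"},          # T5
]

def support(x, y):
    """Fraction of transactions containing both X and Y."""
    both = sum(1 for t in transactions if x <= t and y <= t)
    return both / len(transactions)

def confidence(x, y):
    """Count of transactions with X and Y, divided by count with X alone."""
    both = sum(1 for t in transactions if x <= t and y <= t)
    contains_x = sum(1 for t in transactions if x <= t)
    return both / contains_x

print(support({"Bread"}, {"Jem"}))      # 0.6
print(confidence({"Bread"}, {"Jem"}))   # 0.75
```

Running this reproduces the Bread → Jem figures quoted on the slide (support 60%, confidence 75%).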
6
Association Rule Mining: Definition
Given a set of records, each of which contains some number of items from a given collection, produce dependency rules that predict the occurrence of an item based on occurrences of other items
Example: {Bread} → {Jem}, {Jelly} → {Jem}
7
Association Rule Mining: Marketing and Sales Promotion
Say the rule discovered is
{Bread, …} → {Jem}
Jem as a consequent: can be used to determine what products will boost its sales.
Bread as antecedent: can be used to see which products will be impacted if the store stops selling bread
Bread as an antecedent and Jem as a consequent: can be used to see what products should be stocked along with Bread to promote the sale of Jem.
8
Association Rule Mining: Supermarket Shelf Management
Goal: to identify items that are bought together by a reasonable fraction of customers, so that they can be shelved accordingly.
Data used: point-of-sale data collected with barcode scanners, used to find dependencies among products.
Example: if a customer buys Jelly, then he is very likely to buy Jem. So don't be surprised if you find Jem next to Jelly on an aisle in the supermarket. Also salsa next to tortilla chips.
9
Association Rule Mining
Association rule mining will produce LOTS of rules
How can you tell which ones are important?
High support
High confidence
Rules involving certain attributes of interest
Rules with a specific structure
Rules with support / confidence higher than expected
Completeness – generating all interesting rules
Efficiency – generating only rules that are interesting
10
Clustering
Determine object groupings such that objects within the same cluster are similar to each other, while objects in different groups are not
Typically objects are represented by data points in a multidimensional space, with each dimension corresponding to one or more attributes. The clustering problem in this case reduces to the following: given a set of data points, each having a set of attributes, and a similarity measure, find clusters such that
Data points in one cluster are more similar to one another
Data points in separate clusters are less similar to one another
11
Cont’d
Similarity measures: Euclidean distance (continuous attributes); other problem-specific measures
Types of clustering: group-based clustering, hierarchical clustering
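A minimal sketch of the Euclidean distance measure and of assigning points to their nearest cluster centre; the 2D point values and function names are illustrative assumptions:

```python
import math

def euclidean(p, q):
    """Euclidean distance between two points in n-dimensional space."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def assign(points, centroids):
    """Assign each point to the index of its nearest centroid."""
    return [min(range(len(centroids)),
                key=lambda i: euclidean(p, centroids[i]))
            for p in points]

# Two tight groups of points and one centroid near each group.
points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5)]
centroids = [(1.0, 1.5), (8.5, 9.0)]
print(assign(points, centroids))  # [0, 0, 1, 1]
```

This is the assignment step of group-based methods such as k-means; a full algorithm would then recompute centroids and repeat.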
12
Clustering Example
Euclidean distance based clustering in 3D space
Intra-cluster distances are minimised
Inter-cluster distances are maximised
13
Clustering: Market Segmentation
Goal: to subdivide a market into distinct subsets of customers, where each subset can be targeted with a distinct marketing mix
Approach:
Collect different attributes of customers based on their geographical and lifestyle-related information
Find clusters of similar customers
Measure the clustering quality by observing the buying patterns of customers in the same cluster vs. those from different clusters
14
Clustering: Document Clustering
Goal: to find groups of documents that are similar to each other based on important terms appearing in them
Approach: identify frequently occurring terms in each document; form a similarity measure based on frequencies of different terms; use it to generate clusters
Gain: information retrieval can utilize the clusters to relate a new document or search to clustered documents
15
Clustering: Document Clustering Example
Clustering points: 3204 articles of LA Times
Similarity measure: Number of common words in documents (after some word filtering)
Category        Total articles   Correctly placed articles
Financial       555              364
Foreign         341              260
National        273              36
Metro           943              746
Sports          738              573
Entertainment   354              278
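The "number of common words" similarity used in this example can be sketched as follows; the stop-word list and the two sample sentences are illustrative assumptions, not taken from the LA Times data:

```python
# Sketch of a common-words document similarity measure:
# count distinct shared words after filtering out very common words.

STOP_WORDS = {"the", "a", "of", "and", "in", "to"}  # illustrative list

def common_words(doc_a, doc_b):
    """Similarity = count of distinct non-stop words shared by both docs."""
    words_a = set(doc_a.lower().split()) - STOP_WORDS
    words_b = set(doc_b.lower().split()) - STOP_WORDS
    return len(words_a & words_b)

a = "the dollar fell against the yen in early trading"
b = "the yen rose as the dollar weakened in trading"
print(common_words(a, b))  # 3  (dollar, yen, trading)
```

A clustering algorithm would use this score pairwise to group the articles, as in the table above.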
16
Classification: Definition
Given a set of records (called the training set)
Each record contains a set of attributes; one of the attributes is the class
Find a model for the class attribute as a function of the values of the other attributes
Goal: previously unseen records should be assigned to a class as accurately as possible
Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it. The accuracy of the model is determined on the test set.
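The train/test methodology can be sketched as follows; the record format and the trivial always-"no" model are illustrative assumptions, standing in for a real learned classifier:

```python
import random

def train_test_split(records, test_fraction=0.3, seed=0):
    """Shuffle labeled records and divide them into training and test sets."""
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def accuracy(model, test_set):
    """Fraction of test records whose class the model predicts correctly."""
    correct = sum(1 for attrs, label in test_set if model(attrs) == label)
    return correct / len(test_set)

# Toy labeled data: (attributes, class).
records = [({"refund": "Yes"}, "no"), ({"refund": "No"}, "yes"),
           ({"refund": "No"}, "no"), ({"refund": "Yes"}, "no")]
train, test = train_test_split(records, test_fraction=0.5)

# A placeholder "model" that always predicts "no"; a real one
# would be learned from the training set.
print(accuracy(lambda attrs: "no", test))
```

The key point the slide makes survives in the sketch: the model is never evaluated on the records it was built from.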
17
Classification: cont’d
Classifiers are created using labeled training samples
Classifiers are evaluated using independent labeled samples (the test set)
Training samples are created by ground truth / experts
The classifier is later used to classify unknown samples
Measurements must be able to predict the phenomenon!
Examples: direct marketing, fraud detection, customer churn, sky survey cataloging, classifying galaxies
18
Classification Example
Training set:

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Training Set → Learn Classifier → Model
The learned model is then applied to a test set of records with the same attributes in order to predict their class.
19
Classification: Direct Marketing
Goal: reduce the cost of mailing by targeting a set of consumers likely to buy a new cell phone product
Approach:
Use the data collected for a similar product introduced in the recent past
Use the profiles of consumers along with their {buy, didn't buy} decision; the latter becomes the class attribute
The profile information may consist of demographic, lifestyle and company-interaction attributes
Demographic – age, gender, geography, salary
Psychographic – hobbies
Company interaction – recentness, frequency, monetary
Use this information as input attributes to learn a classifier model
20
Classification: Fraud Detection
Goal: predict fraudulent cases in credit card transactions
Approach:
Use credit card transactions and the information on their account holders as attributes (important: when and where the card was used)
Label past transactions as {fraud, fair} transactions; this forms the class attribute
Learn a model for the class of transactions
Use this model to detect fraud by observing credit card transactions on an account
21
Regression
Predict the value of a given continuous-valued variable based on the values of other variables, assuming a linear or non-linear model of dependency
Extensively studied in the fields of statistics and neural networks
Examples:
Predicting sales numbers of a new product based on advertising expenditure
Predicting wind velocities based on temperature, humidity, air pressure, etc.
Time series prediction of stock market indices
22
Deviation/Anomaly Detection
Some data objects do not comply with the general behavior or model of the data. Data objects that are different from or inconsistent with the remaining set are called outliers
Outliers can be caused by measurement or execution errors, or they may represent some kind of fraudulent activity
The goal of deviation/anomaly detection is to detect significant deviations from normal behavior
23
Deviation/Anomaly Detection: Definition
Given a set of n points or objects, and k, the expected number of outliers, find the top k objects that are considerably dissimilar, exceptional or inconsistent with respect to the remaining data
This can be viewed as two sub-problems:
Define what data can be considered inconsistent in a given data set
Find an efficient method to mine the outliers
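One simple way to realize this definition is distance-based: rank objects by their average distance to all other objects and return the top k. A minimal one-dimensional sketch (the data values and function name are illustrative):

```python
def top_k_outliers(points, k):
    """Rank points by average distance to all other points and return
    the k most dissimilar ones (a simple distance-based notion of
    'outlier')."""
    n = len(points)
    def avg_dist(i):
        return sum(abs(points[i] - points[j]) for j in range(n) if j != i) / (n - 1)
    ranked = sorted(range(n), key=avg_dist, reverse=True)
    return [points[i] for i in ranked[:k]]

data = [10.0, 11.0, 10.5, 9.5, 50.0]
print(top_k_outliers(data, 1))  # [50.0]
```

This addresses both sub-problems at once in the crudest way; real methods define inconsistency statistically and avoid the O(n²) distance scan.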
24
Deviation: Credit Card Fraud Detection
Goal: to detect fraudulent credit card transactions
Approach:
Based on past usage patterns, develop a model for authorized credit card transactions
Check for deviation from the model before authenticating new credit card transactions
Hold payment and verify the authenticity of "doubtful" transactions by other means (phone call, etc.)
25
Anomaly Detection: Network Intrusion Detection
Goal: to detect intrusions into a computer network
Approach:
Define and develop a model for normal user behavior on the computer network
Continuously monitor the behavior of users to check if it deviates from the defined normal behavior
Raise an alarm if such a deviation is found
26
Sequential Pattern Discovery: Definition
Given a set of objects, each associated with its own timeline of events, find rules that predict strong sequential dependencies among different events
Sequence discovery aims at extracting sets of events that commonly occur over a period of time
(A B) → (C) → (D E)
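Checking whether a sequential pattern such as (A B) → (C) → (D E) occurs in one object's timeline can be sketched as follows; the function name and event sets are illustrative:

```python
def contains_pattern(sequence, pattern):
    """Check whether `pattern` (a list of event sets) occurs in order
    within `sequence` (a list of event sets, one per time step).
    Each pattern element must be a subset of some later sequence step."""
    pos = 0
    for step in sequence:
        if pos < len(pattern) and pattern[pos] <= step:
            pos += 1
    return pos == len(pattern)

# Timeline of events for one object; does (A B) -> (C) -> (D E) occur?
timeline = [{"A", "B", "F"}, {"G"}, {"C"}, {"D", "E"}]
print(contains_pattern(timeline, [{"A", "B"}, {"C"}, {"D", "E"}]))  # True
```

A sequential pattern miner would count, across many objects' timelines, how often each candidate pattern is contained this way.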
27
Sequential Pattern Discovery: Telecommunication Alarm Logs
Telecommunication alarm logs:
(Inverter_Problem, Excessive_Line_Current) → (Rectifier_Alarm) → (Fire_Alarm)
28
Sequential Pattern Discovery: Point of Sale Up Sell / Cross Sell
Point-of-sale transaction sequences
Computer bookstore:
(Intro_to_Visual_C) (C++ Primer) → (Perl_For_Dummies, Tcl_Tk)
60% of customers who buy Intro to Visual C and C++ Primer also buy Perl for Dummies and Tcl Tk within a month
Athletic apparel store:
(Shoes) (Racket, Racketball) → (Sport_Jacket)
29
Example: Data Mining (Weather Data)
By applying various data mining techniques, we can find associations and regularities in our data
Extract knowledge in the form of rules, decision trees, etc.
Predict the value of the dependent variable in new situations
Some examples: mining association rules, classification by decision trees and rules, prediction methods
30
Mining association rules
First, discretize the numeric attributes (a part of the data preprocessing stage)
Group the temperature values into three intervals (hot, mild, cool) and the humidity values into two (high, normal)
Substitute the values in the data with the corresponding names
Apply the Apriori algorithm and get the following rules
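The discretization step can be sketched as follows; the numeric cut-off points are illustrative assumptions, since the slides give only the interval names:

```python
def discretize_temperature(t):
    """Map a numeric temperature to hot / mild / cool.
    Cut-offs are illustrative assumptions, not from the slides."""
    if t >= 80:
        return "hot"
    elif t >= 70:
        return "mild"
    return "cool"

def discretize_humidity(h):
    """Map numeric humidity to high / normal (cut-off is an assumption)."""
    return "high" if h > 75 else "normal"

print(discretize_temperature(85), discretize_humidity(90))  # hot high
print(discretize_temperature(64), discretize_humidity(65))  # cool normal
```

After this substitution every attribute is nominal, which is what Apriori requires.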
31
Discretized weather data
Day  outlook   temperature  humidity  windy  play
1    sunny     hot          high      false  no
2    sunny     hot          high      true   no
3    overcast  hot          high      false  yes
4    rainy     mild         high      false  yes
5    rainy     cool         normal    false  yes
6    rainy     cool         normal    true   no
7    overcast  cool         normal    true   yes
8    sunny     mild         high      false  no
9    sunny     cool         normal    false  yes
10   rainy     mild         normal    false  yes
11   sunny     mild         normal    true   yes
12   overcast  mild         high      true   yes
13   overcast  hot          normal    false  yes
14   rainy     mild         high      true   no
32
Cont’d
1. humidity=normal windy=false → play=yes (4, 1)
2. temperature=cool → humidity=normal (4, 1)
3. outlook=overcast → play=yes (4, 1)
4. temperature=cool play=yes → humidity=normal (3, 1)
5. outlook=rainy windy=false → play=yes (3, 1)
6. outlook=rainy play=yes → windy=false (3, 1)
7. outlook=sunny humidity=high → play=no (3, 1)
8. outlook=sunny play=no → humidity=high (3, 1)
9. temperature=cool windy=false → humidity=normal play=yes (2, 1)
10. temperature=cool humidity=normal windy=false → play=yes (2, 1)
33
Cont’d
These rules show some attribute-value sets (itemsets) that appear frequently in the data
Support: the number of occurrences of the itemset in the data
Confidence: the accuracy of the rule
Rule 3 is the same as the one produced by observing the data cube
34
Classification by Decision Trees and Rules
Using the ID3 algorithm, the following decision tree is produced:
outlook = sunny
    humidity = high: no
    humidity = normal: yes
outlook = overcast: yes
outlook = rainy
    windy = true: no
    windy = false: yes
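This decision tree translates directly into nested conditionals; a minimal sketch (function name is illustrative):

```python
def classify(outlook, humidity, windy):
    """The ID3 decision tree above, written as nested conditionals."""
    if outlook == "sunny":
        # Sunny days: the humidity test decides.
        return "no" if humidity == "high" else "yes"
    if outlook == "overcast":
        # Overcast is a leaf: always play.
        return "yes"
    # Remaining branch: outlook == "rainy"; the windy test decides.
    return "no" if windy else "yes"

print(classify("sunny", "normal", False))  # yes
print(classify("rainy", "high", True))     # no
```

Each path from the root to a leaf corresponds to one return statement, which is exactly the tree-to-rules correspondence described on the next slide.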
35
Cont’d
A decision tree consists of decision nodes that test the values of their corresponding attribute
Each value of this attribute leads to a subtree, and so on, until the leaves of the tree are reached
The leaves determine the value of the dependent variable
Using a decision tree we can classify new tuples
36
Cont’d
A decision tree can be presented as a set of rules
Each rule represents a path through the tree from the root to a leaf
Other data mining techniques can produce rules directly, e.g. the Prism algorithm:
if outlook=overcast then yes
if humidity=normal and windy=false then yes
if temperature=mild and humidity=normal then yes
if outlook=rainy and windy=false then yes
if outlook=sunny and humidity=high then no
if outlook=rainy and windy=true then no
37
Prediction methods
DM offers techniques to predict the value of the dependent variable directly, without first generating a model
The most popular approach is based on statistical methods
It uses the Bayes rule to predict the probability of each value of the dependent variable given the values of the independent variables
38
Cont’d
E.g. applying Bayes to the new tuple (sunny, mild, normal, false, ?):
P(play=yes | outlook=sunny, temperature=mild, humidity=normal, windy=false) = 0.8
P(play=no | outlook=sunny, temperature=mild, humidity=normal, windy=false) = 0.2
The predicted value is therefore "yes"
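These probabilities can be reproduced from the discretized weather table using naive Bayes, i.e. Bayes' rule under the usual attribute-independence assumption (function name is illustrative):

```python
# The 14 discretized weather records from the table above:
# (outlook, temperature, humidity, windy, play)
data = [
    ("sunny", "hot", "high", False, "no"),
    ("sunny", "hot", "high", True, "no"),
    ("overcast", "hot", "high", False, "yes"),
    ("rainy", "mild", "high", False, "yes"),
    ("rainy", "cool", "normal", False, "yes"),
    ("rainy", "cool", "normal", True, "no"),
    ("overcast", "cool", "normal", True, "yes"),
    ("sunny", "mild", "high", False, "no"),
    ("sunny", "cool", "normal", False, "yes"),
    ("rainy", "mild", "normal", False, "yes"),
    ("sunny", "mild", "normal", True, "yes"),
    ("overcast", "mild", "high", True, "yes"),
    ("overcast", "hot", "normal", False, "yes"),
    ("rainy", "mild", "high", True, "no"),
]

def naive_bayes(outlook, temp, humidity, windy):
    """P(play | attributes), assuming the attributes are independent
    given the class."""
    scores = {}
    for cls in ("yes", "no"):
        rows = [r for r in data if r[4] == cls]
        p = len(rows) / len(data)          # prior P(play=cls)
        for i, v in enumerate((outlook, temp, humidity, windy)):
            p *= sum(1 for r in rows if r[i] == v) / len(rows)
        scores[cls] = p
    total = sum(scores.values())
    return {cls: p / total for cls, p in scores.items()}

probs = naive_bayes("sunny", "mild", "normal", False)
print(round(probs["yes"], 2), round(probs["no"], 2))  # 0.8 0.2
```

Running this yields the 0.8 / 0.2 split quoted on the slide (after rounding to two decimals).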
39
Data Mining : Problems and Challenges
Noisy data
Difficult Training Set
Incomplete Data
Dynamic Databases
Large Databases
40
Noisy data
Many attribute values will be inexact or incorrect
Erroneous instruments measuring some property
Human errors occurring at data entry
Two forms of noise in the data:
Corrupted values – some of the values in the training set are altered from their original form
Missing values – one or more of the attribute values may be missing, both for examples in the training set and for objects which are to be classified
41
Difficult Training Set
Non-representative data
Learning is based on only a few examples
Using a large database, the rules are probably representative
Absence of boundary cases
Boundary cases are needed to find the real differences between two classes
Limited information
Two objects to be classified may give the same conditional attributes but be classified in different classes
We do not have enough information to distinguish the two types of objects
42
Dynamic databases
Databases change continually
Rules that reflect the content of the database at all times are preferred
If some changes are made, the whole learning process may have to be conducted again
43
Large databases
The size of databases is ever increasing
Machine learning algorithms were designed for handling small training sets (a few hundred examples)
Much care is needed when using similar techniques on larger databases
A large database may provide more knowledge (e.g. the number of rules mined may be enormous)
44
Data Mining – Issues in Data Mining
User Interaction / Visualization
Incorporation of Background Knowledge
Noisy or Incomplete Data
Determining Interestingness of Patterns
Efficiency and Scalability
Parallel and Distributed Mining
Incremental Learning / Mining Time-Changing Phenomena
Mining from Image / Video / Audio Data
Mining Unstructured Data