1
Chapter 2
Data Mining Tasks
2
Data Mining Tasks
Prediction methods: use some variables to predict unknown or future values of the same or other variables. Inference is performed on the current data in order to make predictions.
Description methods: find human-interpretable patterns that describe the data and characterize its general properties in the database.
Descriptive mining is complementary to predictive mining, but it is closer to decision support than to decision making.
3
Cont’d
Association Rule Mining (descriptive)
Classification and Prediction (predictive)
Clustering (descriptive)
Sequential Pattern Discovery (descriptive)
Regression (predictive)
Deviation Detection (predictive)
4
Association Rule Mining
Initially developed for market basket analysis
Goal is to discover relationships between attributes
Data is typically stored in very large databases, sometimes in flat files or images
Uses include decision support, classification and clustering
Application areas include business, medicine and engineering
5
Association Rule Mining
Given a set of transactions, each of which is a set of items, find all rules (X → Y) that satisfy user-specified minimum support and confidence constraints
Support = (#T containing X and Y) / (#T)
Confidence = (#T containing X and Y) / (#T containing X)
Applications: cross selling and up selling, supermarket shelf management

Transaction  Items
T1           Bread, Jelly, Jem
T2           Bread, Jem
T3           Bread, Milk, Jem
T4           Coffee, Bread
T5           Coffee, Milk

Some rules discovered:
Bread → Jem    Sup=60%, conf=75%
Jelly → Bread  Sup=20%, conf=100%
Jelly → Jem    Sup=20%, conf=100%
Jelly → Milk   Sup=0%
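The support and confidence formulas above can be sketched in Python over the example transactions; the function names are illustrative, not from the slides:

```python
# Sketch: computing support and confidence for a rule X -> Y
# over the five example transactions from this slide.

transactions = [
    {"Bread", "Jelly", "Jem"},   # T1
    {"Bread", "Jem"},            # T2
    {"Bread", "Milk", "Jem"},    # T3
    {"Coffee", "Bread"},         # T4
    {"Coffee", "Milk"},          # T5
]

def support(x, y):
    """Fraction of transactions containing both X and Y."""
    both = sum(1 for t in transactions if x <= t and y <= t)
    return both / len(transactions)

def confidence(x, y):
    """Count of transactions with X and Y, divided by count with X alone."""
    both = sum(1 for t in transactions if x <= t and y <= t)
    contains_x = sum(1 for t in transactions if x <= t)
    return both / contains_x

print(support({"Bread"}, {"Jem"}))      # 0.6
print(confidence({"Bread"}, {"Jem"}))   # 0.75
```

Running this reproduces the Bread → Jem figures quoted on the slide (support 60%, confidence 75%).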
6
Association Rule Mining: Definition
Given a set of records, each of which contains some number of items from a given collection, produce dependency rules that predict the occurrence of an item based on occurrences of other items
Example: {Bread} → {Jem}, {Jelly} → {Jem}
7
Association Rule Mining: Marketing and Sales Promotion
Say the rule discovered is
{Bread, …} → {Jem}
Jem as a consequent: can be used to determine what products will boost its sales.
Bread as antecedent: can be used to see which products will be impacted if the store stops selling bread
Bread as an antecedent and Jem as a consequent: can be used to see what products should be stocked along with Bread to promote the sale of Jem.
8
Association Rule Mining: Supermarket Shelf Management
Goal: to identify items that are bought together by a reasonable fraction of customers, so that they can be shelved accordingly.
Data used: point-of-sale data collected with barcode scanners, used to find dependencies among products.
Example: if a customer buys Jelly, then he is very likely to buy Jem. So don't be surprised if you find Jem next to Jelly on an aisle in the supermarket. Also salsa next to tortilla chips.
9
Association Rule Mining
Association rule mining will produce LOTS of rules
How can you tell which ones are important?
High support
High confidence
Rules involving certain attributes of interest
Rules with a specific structure
Rules with support / confidence higher than expected
Completeness – generating all interesting rules
Efficiency – generating only rules that are interesting
10
Clustering
Determine object groupings such that objects within the same cluster are similar to each other, while objects in different groups are not
Typically objects are represented by data points in a multidimensional space, with each dimension corresponding to one or more attributes. The clustering problem in this case reduces to the following: given a set of data points, each having a set of attributes, and a similarity measure, find clusters such that
Data points in one cluster are more similar to one another
Data points in separate clusters are less similar to one another
11
Cont’d
Similarity measures: Euclidean distance (continuous attributes); other problem-specific measures
Types of clustering: group-based clustering, hierarchical clustering
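A minimal sketch of the Euclidean distance measure and of assigning points to their nearest cluster centre; the 2D point values and function names are illustrative assumptions:

```python
import math

def euclidean(p, q):
    """Euclidean distance between two points in n-dimensional space."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def assign(points, centroids):
    """Assign each point to the index of its nearest centroid."""
    return [min(range(len(centroids)),
                key=lambda i: euclidean(p, centroids[i]))
            for p in points]

# Two tight groups of points and one centroid near each group.
points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5)]
centroids = [(1.0, 1.5), (8.5, 9.0)]
print(assign(points, centroids))  # [0, 0, 1, 1]
```

This is the assignment step of group-based methods such as k-means; a full algorithm would then recompute centroids and repeat.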
12
Clustering Example
Euclidean distance based clustering in 3D space
Intra-cluster distances are minimised
Inter-cluster distances are maximised
13
Clustering: Market Segmentation
Goal: to subdivide a market into distinct subsets of customers, where each subset can be targeted with a distinct marketing mix
Approach:
Collect different attributes of customers based on their geographical and lifestyle-related information
Find clusters of similar customers
Measure the clustering quality by observing the buying patterns of customers in the same cluster vs. those from different clusters
14
Clustering: Document Clustering
Goal: to find groups of documents that are similar to each other based on important terms appearing in them
Approach: identify frequently occurring terms in each document; form a similarity measure based on frequencies of different terms; use it to generate clusters
Gain: information retrieval can utilize the clusters to relate a new document or search to clustered documents
15
Clustering: Document Clustering Example
Clustering points: 3204 articles of LA Times
Similarity measure: Number of common words in documents (after some word filtering)
Category        Total articles   Correctly placed articles
Financial       555              364
Foreign         341              260
National        273              36
Metro           943              746
Sports          738              573
Entertainment   354              278
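The "number of common words" similarity used in this example can be sketched as follows; the stop-word list and the two sample sentences are illustrative assumptions, not taken from the LA Times data:

```python
# Sketch of a common-words document similarity measure:
# count distinct shared words after filtering out very common words.

STOP_WORDS = {"the", "a", "of", "and", "in", "to"}  # illustrative list

def common_words(doc_a, doc_b):
    """Similarity = count of distinct non-stop words shared by both docs."""
    words_a = set(doc_a.lower().split()) - STOP_WORDS
    words_b = set(doc_b.lower().split()) - STOP_WORDS
    return len(words_a & words_b)

a = "the dollar fell against the yen in early trading"
b = "the yen rose as the dollar weakened in trading"
print(common_words(a, b))  # 3  (dollar, yen, trading)
```

A clustering algorithm would use this score pairwise to group the articles, as in the table above.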
16
Classification: Definition
Given a set of records (called the training set)
Each record contains a set of attributes; one of the attributes is the class
Find a model for the class attribute as a function of the values of the other attributes
Goal: previously unseen records should be assigned to a class as accurately as possible
Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it. The accuracy of the model is determined on the test set.
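The train/test methodology can be sketched as follows; the record format and the trivial always-"no" model are illustrative assumptions, standing in for a real learned classifier:

```python
import random

def train_test_split(records, test_fraction=0.3, seed=0):
    """Shuffle labeled records and divide them into training and test sets."""
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def accuracy(model, test_set):
    """Fraction of test records whose class the model predicts correctly."""
    correct = sum(1 for attrs, label in test_set if model(attrs) == label)
    return correct / len(test_set)

# Toy labeled data: (attributes, class).
records = [({"refund": "Yes"}, "no"), ({"refund": "No"}, "yes"),
           ({"refund": "No"}, "no"), ({"refund": "Yes"}, "no")]
train, test = train_test_split(records, test_fraction=0.5)

# A placeholder "model" that always predicts "no"; a real one
# would be learned from the training set.
print(accuracy(lambda attrs: "no", test))
```

The key point the slide makes survives in the sketch: the model is never evaluated on the records it was built from.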
17
Classification: cont’d
Classifiers are created using labeled training samples
Classifiers are evaluated using independent labeled samples (the test set)
Training samples are created by ground truth / experts
The classifier is later used to classify unknown samples
Measurements must be able to predict the phenomenon!
Examples: direct marketing, fraud detection, customer churn, sky survey cataloging, classifying galaxies
18
Classification Example
Training set:

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Training Set → Learn Classifier → Model
The learned model is then applied to a test set of records with the same attributes in order to predict their class.
19
Classification: Direct Marketing
Goal: reduce the cost of mailing by targeting a set of consumers likely to buy a new cell phone product
Approach:
Use the data collected for a similar product introduced in the recent past
Use the profiles of consumers along with their {buy, didn't buy} decision; the latter becomes the class attribute
The profile information may consist of demographic, lifestyle and company-interaction attributes
Demographic – age, gender, geography, salary
Psychographic – hobbies
Company interaction – recentness, frequency, monetary
Use this information as input attributes to learn a classifier model
20
Classification: Fraud Detection
Goal: predict fraudulent cases in credit card transactions
Approach:
Use credit card transactions and the information on their account holders as attributes (important: when and where the card was used)
Label past transactions as {fraud, fair} transactions; this forms the class attribute
Learn a model for the class of transactions
Use this model to detect fraud by observing credit card transactions on an account
21
Regression
Predict the value of a given continuous-valued variable based on the values of other variables, assuming a linear or non-linear model of dependency
Extensively studied in the fields of statistics and neural networks
Examples:
Predicting sales numbers of a new product based on advertising expenditure
Predicting wind velocities based on temperature, humidity, air pressure, etc.
Time series prediction of stock market indices
22
Deviation/Anomaly Detection
Some data objects do not comply with the general behavior or model of the data. Data objects that are different from or inconsistent with the remaining set are called outliers
Outliers can be caused by measurement or execution errors, or they may represent some kind of fraudulent activity
The goal of deviation/anomaly detection is to detect significant deviations from normal behavior
23
Deviation/Anomaly Detection: Definition
Given a set of n points or objects, and k, the expected number of outliers, find the top k objects that are considerably dissimilar, exceptional or inconsistent with respect to the remaining data
This can be viewed as two sub-problems:
Define what data can be considered inconsistent in a given data set
Find an efficient method to mine the outliers
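One simple way to realize this definition is distance-based: rank objects by their average distance to all other objects and return the top k. A minimal one-dimensional sketch (the data values and function name are illustrative):

```python
def top_k_outliers(points, k):
    """Rank points by average distance to all other points and return
    the k most dissimilar ones (a simple distance-based notion of
    'outlier')."""
    n = len(points)
    def avg_dist(i):
        return sum(abs(points[i] - points[j]) for j in range(n) if j != i) / (n - 1)
    ranked = sorted(range(n), key=avg_dist, reverse=True)
    return [points[i] for i in ranked[:k]]

data = [10.0, 11.0, 10.5, 9.5, 50.0]
print(top_k_outliers(data, 1))  # [50.0]
```

This addresses both sub-problems at once in the crudest way; real methods define inconsistency statistically and avoid the O(n²) distance scan.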
24
Deviation: Credit Card Fraud Detection
Goal: to detect fraudulent credit card transactions
Approach:
Based on past usage patterns, develop a model for authorized credit card transactions
Check for deviation from the model before authenticating new credit card transactions
Hold payment and verify the authenticity of "doubtful" transactions by other means (phone call, etc.)
25
Anomaly Detection: Network Intrusion Detection
Goal: to detect intrusions into a computer network
Approach:
Define and develop a model for normal user behavior on the computer network
Continuously monitor the behavior of users to check if it deviates from the defined normal behavior
Raise an alarm if such a deviation is found
26
Sequential Pattern Discovery: Definition
Given a set of objects, each associated with its own timeline of events, find rules that predict strong sequential dependencies among different events
Sequence discovery aims at extracting sets of events that commonly occur over a period of time
(A B) → (C) → (D E)
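Checking whether a sequential pattern such as (A B) → (C) → (D E) occurs in one object's timeline can be sketched as follows; the function name and event sets are illustrative:

```python
def contains_pattern(sequence, pattern):
    """Check whether `pattern` (a list of event sets) occurs in order
    within `sequence` (a list of event sets, one per time step).
    Each pattern element must be a subset of some later sequence step."""
    pos = 0
    for step in sequence:
        if pos < len(pattern) and pattern[pos] <= step:
            pos += 1
    return pos == len(pattern)

# Timeline of events for one object; does (A B) -> (C) -> (D E) occur?
timeline = [{"A", "B", "F"}, {"G"}, {"C"}, {"D", "E"}]
print(contains_pattern(timeline, [{"A", "B"}, {"C"}, {"D", "E"}]))  # True
```

A sequential pattern miner would count, across many objects' timelines, how often each candidate pattern is contained this way.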
27
Sequential Pattern Discovery: Telecommunication Alarm Logs
Telecommunication alarm logs:
(Inverter_Problem, Excessive_Line_Current) → (Rectifier_Alarm) → (Fire_Alarm)
28
Sequential Pattern Discovery: Point of Sale Up Sell / Cross Sell
Point-of-sale transaction sequences
Computer bookstore:
(Intro_to_Visual_C) (C++ Primer) → (Perl_For_Dummies, Tcl_Tk)
60% of customers who buy Intro to Visual C and C++ Primer also buy Perl for Dummies and Tcl Tk within a month
Athletic apparel store:
(Shoes) (Racket, Racketball) → (Sport_Jacket)
29
Example: Data Mining (Weather Data)
By applying various data mining techniques, we can find associations and regularities in our data
Extract knowledge in the form of rules, decision trees, etc.
Predict the value of the dependent variable in new situations
Some examples: mining association rules, classification by decision trees and rules, prediction methods
30
Mining association rules
First, discretize the numeric attributes (a part of the data preprocessing stage)
Group the temperature values into three intervals (hot, mild, cool) and the humidity values into two (high, normal)
Substitute the values in the data with the corresponding names
Apply the Apriori algorithm and get the following rules
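The discretization step can be sketched as follows; the numeric cut-off points are illustrative assumptions, since the slides give only the interval names:

```python
def discretize_temperature(t):
    """Map a numeric temperature to hot / mild / cool.
    Cut-offs are illustrative assumptions, not from the slides."""
    if t >= 80:
        return "hot"
    elif t >= 70:
        return "mild"
    return "cool"

def discretize_humidity(h):
    """Map numeric humidity to high / normal (cut-off is an assumption)."""
    return "high" if h > 75 else "normal"

print(discretize_temperature(85), discretize_humidity(90))  # hot high
print(discretize_temperature(64), discretize_humidity(65))  # cool normal
```

After this substitution every attribute is nominal, which is what Apriori requires.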
31
Discretized weather data
Day  outlook   temperature  humidity  windy  play
1    sunny     hot          high      false  no
2    sunny     hot          high      true   no
3    overcast  hot          high      false  yes
4    rainy     mild         high      false  yes
5    rainy     cool         normal    false  yes
6    rainy     cool         normal    true   no
7    overcast  cool         normal    true   yes
8    sunny     mild         high      false  no
9    sunny     cool         normal    false  yes
10   rainy     mild         normal    false  yes
11   sunny     mild         normal    true   yes
12   overcast  mild         high      true   yes
13   overcast  hot          normal    false  yes
14   rainy     mild         high      true   no
32
Cont’d
1. humidity=normal windy=false → play=yes (4, 1)
2. temperature=cool → humidity=normal (4, 1)
3. outlook=overcast → play=yes (4, 1)
4. temperature=cool play=yes → humidity=normal (3, 1)
5. outlook=rainy windy=false → play=yes (3, 1)
6. outlook=rainy play=yes → windy=false (3, 1)
7. outlook=sunny humidity=high → play=no (3, 1)
8. outlook=sunny play=no → humidity=high (3, 1)
9. temperature=cool windy=false → humidity=normal play=yes (2, 1)
10. temperature=cool humidity=normal windy=false → play=yes (2, 1)
33
Cont’d
These rules show some attribute-value sets (itemsets) that appear frequently in the data
Support: the number of occurrences of the itemset in the data
Confidence: the accuracy of the rule
Rule 3 is the same as the one produced by observing the data cube
34
Classification by Decision Trees and Rules
Using the ID3 algorithm, the following decision tree is produced:
outlook = sunny
    humidity = high: no
    humidity = normal: yes
outlook = overcast: yes
outlook = rainy
    windy = true: no
    windy = false: yes
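This decision tree translates directly into nested conditionals; a minimal sketch (function name is illustrative):

```python
def classify(outlook, humidity, windy):
    """The ID3 decision tree above, written as nested conditionals."""
    if outlook == "sunny":
        # Sunny days: the humidity test decides.
        return "no" if humidity == "high" else "yes"
    if outlook == "overcast":
        # Overcast is a leaf: always play.
        return "yes"
    # Remaining branch: outlook == "rainy"; the windy test decides.
    return "no" if windy else "yes"

print(classify("sunny", "normal", False))  # yes
print(classify("rainy", "high", True))     # no
```

Each path from the root to a leaf corresponds to one return statement, which is exactly the tree-to-rules correspondence described on the next slide.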
35
Cont’d
A decision tree consists of decision nodes that test the values of their corresponding attribute
Each value of this attribute leads to a subtree, and so on, until the leaves of the tree are reached
The leaves determine the value of the dependent variable
Using a decision tree we can classify new tuples
36
Cont’d
A decision tree can be presented as a set of rules
Each rule represents a path through the tree from the root to a leaf
Other data mining techniques can produce rules directly, e.g. the Prism algorithm:
if outlook=overcast then yes
if humidity=normal and windy=false then yes
if temperature=mild and humidity=normal then yes
if outlook=rainy and windy=false then yes
if outlook=sunny and humidity=high then no
if outlook=rainy and windy=true then no
37
Prediction methods
DM offers techniques to predict the value of the dependent variable directly, without first generating a model
The most popular approach is based on statistical methods
It uses the Bayes rule to predict the probability of each value of the dependent variable given the values of the independent variables
38
Cont’d
E.g. applying Bayes to the new tuple (sunny, mild, normal, false, ?):
P(play=yes | outlook=sunny, temperature=mild, humidity=normal, windy=false) = 0.8
P(play=no | outlook=sunny, temperature=mild, humidity=normal, windy=false) = 0.2
The predicted value is therefore "yes"
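These probabilities can be reproduced from the discretized weather table using naive Bayes, i.e. Bayes' rule under the usual attribute-independence assumption (function name is illustrative):

```python
# The 14 discretized weather records from the table above:
# (outlook, temperature, humidity, windy, play)
data = [
    ("sunny", "hot", "high", False, "no"),
    ("sunny", "hot", "high", True, "no"),
    ("overcast", "hot", "high", False, "yes"),
    ("rainy", "mild", "high", False, "yes"),
    ("rainy", "cool", "normal", False, "yes"),
    ("rainy", "cool", "normal", True, "no"),
    ("overcast", "cool", "normal", True, "yes"),
    ("sunny", "mild", "high", False, "no"),
    ("sunny", "cool", "normal", False, "yes"),
    ("rainy", "mild", "normal", False, "yes"),
    ("sunny", "mild", "normal", True, "yes"),
    ("overcast", "mild", "high", True, "yes"),
    ("overcast", "hot", "normal", False, "yes"),
    ("rainy", "mild", "high", True, "no"),
]

def naive_bayes(outlook, temp, humidity, windy):
    """P(play | attributes), assuming the attributes are independent
    given the class."""
    scores = {}
    for cls in ("yes", "no"):
        rows = [r for r in data if r[4] == cls]
        p = len(rows) / len(data)          # prior P(play=cls)
        for i, v in enumerate((outlook, temp, humidity, windy)):
            p *= sum(1 for r in rows if r[i] == v) / len(rows)
        scores[cls] = p
    total = sum(scores.values())
    return {cls: p / total for cls, p in scores.items()}

probs = naive_bayes("sunny", "mild", "normal", False)
print(round(probs["yes"], 2), round(probs["no"], 2))  # 0.8 0.2
```

Running this yields the 0.8 / 0.2 split quoted on the slide (after rounding to two decimals).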
39
Data Mining : Problems and Challenges
Noisy data
Difficult Training Set
Incomplete Data
Dynamic Databases
Large Databases
40
Noisy data
Many attribute values will be inexact or incorrect
Erroneous instruments measuring some property
Human errors occurring at data entry
Two forms of noise in the data:
Corrupted values – some of the values in the training set are altered from their original form
Missing values – one or more of the attribute values may be missing, both for examples in the training set and for objects which are to be classified
41
Difficult Training Set
Non-representative data
Learning is based on only a few examples
Using a large database, the rules are probably representative
Absence of boundary cases
Boundary cases are needed to find the real differences between two classes
Limited information
Two objects to be classified may give the same conditional attributes but be classified in different classes
We do not have enough information to distinguish the two types of objects
42
Dynamic databases
Databases change continually
Rules that reflect the content of the database at all times are preferred
If some changes are made, the whole learning process may have to be conducted again
43
Large databases
The size of databases is ever increasing
Machine learning algorithms were designed for handling small training sets (a few hundred examples)
Much care is needed when using similar techniques on larger databases
A large database may provide more knowledge (e.g. the number of rules mined may be enormous)
44
Data Mining – Issues in Data Mining
User Interaction / Visualization
Incorporation of Background Knowledge
Noisy or Incomplete Data
Determining Interestingness of Patterns
Efficiency and Scalability
Parallel and Distributed Mining
Incremental Learning / Mining Time-Changing Phenomena
Mining from Image / Video / Audio Data
Mining Unstructured Data