data mining introductory

CS 483 Introduction to Data Mining

DATA MINING
Introductory

Dr. Mohammed Alhaddad

College of Information Technology

King AbdulAziz University


Data Mining Outline

PART I
– Introduction
– Related Concepts
– Data Mining Techniques

PART II
– Classification
– Clustering
– Association Rules

PART III
– Web Mining
– Spatial Mining
– Temporal Mining


Goal: Provide an overview of data mining

Define data mining

Data mining vs. databases

Basic data mining tasks

Data mining development

Data mining issues


Introduction

Data is growing at a phenomenal rate

Users expect more sophisticated information

How?

UNCOVER HIDDEN INFORMATION

DATA MINING


Data Mining Definition

Finding hidden information in a database.

Fit data to a model

Similar terms
– Exploratory data analysis
– Data-driven discovery
– Deductive learning


What is (not) Data Mining?

What is Data Mining?

– Certain names are more prevalent in certain US locations (O’Brien, O’Rourke, O’Reilly… in the Boston area)

– Group together similar documents returned by a search engine according to their context (e.g. Amazon rainforest, Amazon.com)

What is not Data Mining?

– Look up a phone number in a phone directory

– Query a Web search engine for information about “Amazon”


Data Mining Algorithm

Objective: Fit Data to a Model
– Descriptive
– Predictive

Preference – Technique to choose the best model

Search – Technique to search the data
– “Query”


DB Processing vs. Data Mining Processing

DB Processing

Query
– Well defined
– SQL

Data
– Operational data

Output
– Precise
– Subset of database

Data Mining Processing

Query
– Poorly defined
– No precise query language

Data
– Not operational data

Output
– Fuzzy
– Not a subset of database


Query Examples

Database

– Find all credit applicants with last name of Smith.
– Identify customers who have purchased more than $10,000 in the last month.
– Find all customers who have purchased milk.

Data Mining

– Find all credit applicants who are poor credit risks. (classification)
– Identify customers with similar buying habits. (clustering)
– Find all items which are frequently purchased with milk. (association rules)


Data Mining Models and Tasks


Data Mining Tasks

Prediction Methods
– Use some variables to predict unknown or future values of other variables.

Description Methods
– Find human-interpretable patterns that describe the data.

From [Fayyad, et al.] Advances in Knowledge Discovery and Data Mining, 1996


Data Mining Tasks...

1. Classification [Predictive]

2. Clustering [Descriptive]

3. Association Rule Discovery [Descriptive]

4. Sequential Pattern Discovery [Descriptive]

5. Regression [Predictive]

6. Deviation Detection [Predictive]


Classification: Definition

Given a collection of records (training set)
– Each record contains a set of attributes; one of the attributes is the class.

Find a model for the class attribute as a function of the values of the other attributes.

Goal: previously unseen records should be assigned a class as accurately as possible.
– A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.


Classification Example

Training Set
(Refund: categorical; Marital Status: categorical; Taxable Income: continuous; Cheat: class)

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Test Set

Refund  Marital Status  Taxable Income  Cheat
No      Single          75K             ?
Yes     Married         50K             ?
No      Married         150K            ?
Yes     Divorced        90K             ?
No      Single          40K             ?
No      Married         80K             ?

[Figure: Training Set -> Learn Classifier -> Model; the Model is then applied to the Test Set]
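
The learn-then-apply flow of this example can be sketched in a few lines of Python. This is a minimal illustration, not the method used in the course: it assumes a small binary decision tree grown greedily with Gini impurity, and the function names (`gini`, `learn`, `classify`) are made up for this sketch. Incomes are stored in thousands.

```python
from collections import Counter

# Training records from the slide: (refund, marital_status, income_in_K, cheat)
TRAIN = [
    ("Yes", "Single", 125, "No"), ("No", "Married", 100, "No"),
    ("No", "Single", 70, "No"), ("Yes", "Married", 120, "No"),
    ("No", "Divorced", 95, "Yes"), ("No", "Married", 60, "No"),
    ("Yes", "Divorced", 220, "No"), ("No", "Single", 85, "Yes"),
    ("No", "Married", 75, "No"), ("No", "Single", 90, "Yes"),
]
# Unlabeled test records from the slide: (refund, marital_status, income_in_K)
TEST = [
    ("No", "Single", 75), ("Yes", "Married", 50), ("No", "Married", 150),
    ("Yes", "Divorced", 90), ("No", "Single", 40), ("No", "Married", 80),
]

def gini(rows):
    """Gini impurity of the class labels (last field) in rows."""
    counts = Counter(r[-1] for r in rows)
    return 1.0 - sum((c / len(rows)) ** 2 for c in counts.values())

def passes(record, test):
    """A test is (attribute_index, value): '<=' for numbers, '==' for strings."""
    i, v = test
    return record[i] <= v if isinstance(v, (int, float)) else record[i] == v

def learn(rows):
    """Grow a tree until a node is pure or no test separates its rows."""
    labels = Counter(r[-1] for r in rows)
    if len(labels) == 1:
        return rows[0][-1]  # pure leaf
    best = None
    for i in range(len(rows[0]) - 1):
        for v in sorted({r[i] for r in rows}):
            left = [r for r in rows if passes(r, (i, v))]
            right = [r for r in rows if not passes(r, (i, v))]
            if not left or not right:
                continue  # this test does not split the rows
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if best is None or score < best[0]:
                best = (score, (i, v), left, right)
    if best is None:
        return labels.most_common(1)[0][0]  # majority-class leaf
    _, test, left, right = best
    return (test, learn(left), learn(right))

def classify(tree, record):
    """Walk from the root to a leaf; leaves are plain class labels."""
    while isinstance(tree, tuple):
        test, yes_branch, no_branch = tree
        tree = yes_branch if passes(record, test) else no_branch
    return tree

tree = learn(TRAIN)
train_accuracy = sum(classify(tree, r) == r[-1] for r in TRAIN) / len(TRAIN)
predictions = [classify(tree, r) for r in TEST]
```

A tree grown until every leaf is pure reproduces the training labels exactly, which is why accuracy should be measured on records the model has not seen. The test records here carry no labels, so `predictions` can be inspected but not scored.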


Classification: Application 1

Direct Marketing
– Goal: Reduce cost of mailing by targeting a set of consumers likely to buy a new cell-phone product.
– Approach:

Use the data for a similar product introduced before.

We know which customers decided to buy and which decided otherwise. This {buy, don’t buy} decision forms the class attribute.

Collect various demographic, lifestyle, and company-interaction related information about all such customers.

– Type of business, where they stay, how much they earn, etc.

Use this information as input attributes to learn a classifier model.

From [Berry & Linoff] Data Mining Techniques, 1997



Classification: Application 2

Fraud Detection
– Goal: Predict fraudulent cases in credit card transactions.
– Approach:

Use credit card transactions and information about the account holder as attributes.

– When a customer buys, what they buy, how often they pay on time, etc.

Label past transactions as fraud or fair transactions. This forms the class attribute.

Learn a model for the class of the transactions.

Use this model to detect fraud by observing credit card transactions on an account.



Classification: Application 3

Customer Attrition/Churn:
– Goal: To predict whether a customer is likely to be lost to a competitor.
– Approach:

Use detailed records of transactions with each of the past and present customers to find attributes.

– How often the customer calls, where they call, what time of day they call most, their financial status, marital status, etc.

Label the customers as loyal or disloyal.

Find a model for loyalty.

From [Berry & Linoff] Data Mining Techniques, 1997



Classification: Application 4

Sky Survey Cataloging
– Goal: To predict class (star or galaxy) of sky objects, especially visually faint ones, based on the telescopic survey images (from Palomar Observatory).

– 3000 images with 23,040 x 23,040 pixels per image.

– Approach:

Segment the image.

Measure image attributes (features), 40 of them per object.

Model the class based on these features.

Success Story: Could find 16 new high red-shift quasars, some of the farthest objects that are difficult to find!

From [Fayyad, et al.] Advances in Knowledge Discovery and Data Mining, 1996


Clustering Definition

Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that
– Data points in one cluster are more similar to one another.
– Data points in separate clusters are less similar to one another.

Similarity Measures:
– Euclidean Distance if attributes are continuous.
– Other Problem-specific Measures.
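
For continuous attributes, Euclidean distance and a clustering built on it can be sketched as follows. This is a minimal illustration only. It assumes plain k-means (listed later under Data Mining Development) as the clustering method and seeds the centroids with the first k points; both choices, and the sample points, are assumptions of this sketch.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two equal-length numeric points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def centroid(points):
    """Component-wise mean of a non-empty list of points."""
    return tuple(sum(c) / len(points) for c in zip(*points))

def k_means(points, k, iterations=20):
    """Plain k-means; the first k points seed the centroids (an assumption)."""
    centers = [tuple(p) for p in points[:k]]
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: each point joins its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: euclidean(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: keep the old center if a cluster ends up empty.
        new_centers = [centroid(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        if new_centers == centers:
            break  # assignments are stable
        centers = new_centers
    return centers, clusters

# Hypothetical 3-D points forming two well-separated groups.
POINTS = [(1, 1, 1), (9, 9, 9), (1, 2, 1), (2, 1, 1), (9, 8, 9), (8, 9, 9)]
centers, clusters = k_means(POINTS, k=2)
```

With these sample points, the three points near (1, 1, 1) end up in one cluster and the three near (9, 9, 9) in the other, so distances within a cluster are small compared with distances between clusters.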


Illustrating Clustering

Euclidean Distance Based Clustering in 3-D space.

[Figure: clusters of points in 3-D space; intracluster distances are minimized, intercluster distances are maximized]


Clustering: Application 1

Market Segmentation:
– Goal: subdivide a market into distinct subsets of customers where any subset may conceivably be selected as a market target to be reached with a distinct marketing mix.

– Approach:

Collect different attributes of customers based on their geographical and lifestyle related information.

Find clusters of similar customers.

Measure the clustering quality by observing buying patterns of customers in the same cluster vs. those from different clusters.


Clustering: Application 2

Document Clustering:
– Goal: To find groups of documents that are similar to each other based on the important terms appearing in them.

– Approach: To identify frequently occurring terms in each document. Form a similarity measure based on the frequencies of different terms. Use it to cluster.

– Gain: Information Retrieval can utilize the clusters to relate a new document or search term to clustered documents.
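
One way to turn term frequencies into a similarity measure is cosine similarity between term-frequency vectors. The slide does not name a specific measure, so treat this as one possible choice. The stop-word list and the three sample documents below are hypothetical.

```python
import math
from collections import Counter

STOP_WORDS = {"the", "a", "of", "in", "and", "to", "is"}  # hypothetical filter list

def term_frequencies(text):
    """Lowercase, split on whitespace, drop stop words, count the rest."""
    return Counter(w for w in text.lower().split() if w not in STOP_WORDS)

def cosine_similarity(tf_a, tf_b):
    """Cosine of the angle between two term-frequency vectors (0.0 if either is empty)."""
    dot = sum(tf_a[t] * tf_b[t] for t in tf_a.keys() & tf_b.keys())
    norm_a = math.sqrt(sum(c * c for c in tf_a.values()))
    norm_b = math.sqrt(sum(c * c for c in tf_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical documents: two about the rainforest, one about the retailer.
docs = {
    "d1": "the amazon rainforest is home to many species",
    "d2": "species in the amazon rainforest",
    "d3": "amazon online store sells books",
}
tf = {name: term_frequencies(text) for name, text in docs.items()}
sim_12 = cosine_similarity(tf["d1"], tf["d2"])
sim_13 = cosine_similarity(tf["d1"], tf["d3"])
```

Here `sim_12` comes out higher than `sim_13`: the two rainforest documents share three terms, while the retailer document shares only "amazon" with them. A clustering algorithm fed with this measure would therefore tend to group d1 with d2.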


Illustrating Document Clustering

Clustering Points: 3204 Articles of Los Angeles Times.

Similarity Measure: How many words are common in these documents (after some word filtering).

Category       Total Articles  Correctly Placed
Financial      555             364
Foreign        341             260
National       273             36
Metro          943             746
Sports         738             573
Entertainment  354             278
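
The table is easier to read as placement rates. The short sketch below only does arithmetic on the figures above; the variable and function names are made up for illustration.

```python
# (category, total articles, correctly placed), copied from the table
RESULTS = [
    ("Financial", 555, 364),
    ("Foreign", 341, 260),
    ("National", 273, 36),
    ("Metro", 943, 746),
    ("Sports", 738, 573),
    ("Entertainment", 354, 278),
]

def placement_rate(total, correct):
    """Fraction of a category's articles that landed in the right cluster."""
    return correct / total

rates = {name: placement_rate(total, correct) for name, total, correct in RESULTS}
total_articles = sum(total for _, total, _ in RESULTS)
total_correct = sum(correct for _, _, correct in RESULTS)
overall_rate = total_correct / total_articles

for name, rate in rates.items():
    print(f"{name:<14}{rate:6.1%}")
print(f"{'Overall':<14}{overall_rate:6.1%}")
```

The category totals add up to the 3204 articles stated on the slide, and 2257 of them (about 70%) are placed correctly. National stands out: only 36 of its 273 articles (about 13%) are placed correctly, while every other category is above 65%.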


Clustering of S&P 500 Stock Data

Discovered Clusters and Industry Groups

Cluster 1 (Industry Group: Technology1-DOWN)
Applied-Matl-DOWN, Bay-Network-Down, 3-COM-DOWN, Cabletron-Sys-DOWN, CISCO-DOWN, HP-DOWN, DSC-Comm-DOWN, INTEL-DOWN, LSI-Logic-DOWN, Micron-Tech-DOWN, Texas-Inst-Down, Tellabs-Inc-Down, Natl-Semiconduct-DOWN, Oracl-DOWN, SGI-DOWN, Sun-DOWN

Cluster 2 (Industry Group: Technology2-DOWN)
Apple-Comp-DOWN, Autodesk-DOWN, DEC-DOWN, ADV-Micro-Device-DOWN, Andrew-Corp-DOWN, Computer-Assoc-DOWN, Circuit-City-DOWN, Compaq-DOWN, EMC-Corp-DOWN, Gen-Inst-DOWN, Motorola-DOWN, Microsoft-DOWN, Scientific-Atl-DOWN

Cluster 3 (Industry Group: Financial-DOWN)
Fannie-Mae-DOWN, Fed-Home-Loan-DOWN, MBNA-Corp-DOWN, Morgan-Stanley-DOWN

Cluster 4 (Industry Group: Oil-UP)
Baker-Hughes-UP, Dresser-Inds-UP, Halliburton-HLD-UP, Louisiana-Land-UP, Phillips-Petro-UP, Unocal-UP, Schlumberger-UP

Observe stock movements every day.

Clustering points: Stock-{UP/DOWN}

Similarity Measure: Two points are more similar if the events described by them frequently happen together on the same day.

We used association rules to quantify a similarity measure.


Association Rule Discovery: Definition

Given a set of records, each of which contains some number of items from a given collection:
– Produce dependency rules which will predict the occurrence of an item based on occurrences of other items.

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

Rules Discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}
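
How strongly the five transactions back each rule can be checked with the two standard rule measures, support and confidence. The slide does not define these measures, so the definitions in the comments are the conventional ones and are an addition of this sketch.

```python
# Transactions from the table: TID -> set of items
TRANSACTIONS = {
    1: {"Bread", "Coke", "Milk"},
    2: {"Beer", "Bread"},
    3: {"Beer", "Coke", "Diaper", "Milk"},
    4: {"Beer", "Bread", "Diaper", "Milk"},
    5: {"Coke", "Diaper", "Milk"},
}

def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for items in TRANSACTIONS.values() if itemset <= items)
    return hits / len(TRANSACTIONS)

def confidence(antecedent, consequent):
    """Among transactions with the antecedent, the fraction that also hold the consequent."""
    return support(antecedent | consequent) / support(antecedent)

rules = [({"Milk"}, {"Coke"}), ({"Diaper", "Milk"}, {"Beer"})]
for lhs, rhs in rules:
    print(sorted(lhs), "-->", sorted(rhs),
          f"support={support(lhs | rhs):.2f}",
          f"confidence={confidence(lhs, rhs):.2f}")
```

For {Milk} --> {Coke}, three of the five transactions contain both items (support 0.60) and three of the four Milk transactions contain Coke (confidence 0.75). For {Diaper, Milk} --> {Beer}, the support is 0.40 and the confidence is about 0.67.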


Association Rule Discovery: Application 1

Marketing and Sales Promotion:
– Let the rule discovered be {Bagels, … } --> {Potato Chips}
– Potato Chips as consequent => Can be used to determine what should be done to boost its sales.
– Bagels in the antecedent => Can be used to see which products would be affected if the store discontinues selling bagels.
– Bagels in antecedent and Potato chips in consequent => Can be used to see what products should be sold with Bagels to promote sale of Potato chips!



Association Rule Discovery: Application 2

Supermarket shelf management.
– Goal: To identify items that are bought together by sufficiently many customers.
– Approach: Process the point-of-sale data collected with barcode scanners to find dependencies among items.

– A classic rule:

If a customer buys diapers and milk, then they are very likely to buy beer.

So, don’t be surprised if you find six-packs stacked next to diapers!



Association Rule Discovery: Application 3

Inventory Management:
– Goal: A consumer appliance repair company wants to anticipate the nature of repairs on its consumer products and keep the service vehicles equipped with the right parts to reduce the number of visits to consumer households.

– Approach: Process the data on tools and parts required in previous repairs at different consumer locations and discover the co-occurrence patterns.


Regression

Predict a value of a given continuous valued variable based on the values of other variables, assuming a linear or nonlinear model of dependency.

Extensively studied in the statistics and neural network fields.

Examples:
– Predicting sales amounts of a new product based on advertising expenditure.
– Predicting wind velocities as a function of temperature, humidity, air pressure, etc.
– Time series prediction of stock market indices.
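
The first example can be sketched with the simplest linear model, a straight line fitted by ordinary least squares. The advertising and sales figures below are hypothetical and the function names are made up for illustration. Real problems usually involve more variables and may need a nonlinear model.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def predict(model, x):
    """Evaluate the fitted line at x."""
    slope, intercept = model
    return slope * x + intercept

# Hypothetical data: advertising spend vs. sales, both in arbitrary units.
advertising = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [3.1, 4.9, 7.2, 8.8, 11.0]

model = fit_line(advertising, sales)
forecast = predict(model, 6.0)  # predicted sales for an unseen spend level
```

On this sample the fitted slope is about 1.97 and the intercept about 1.09, so the forecast for a spend of 6.0 is about 12.91.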


Basic Data Mining Tasks

Classification maps data into predefined groups or classes
– Supervised learning
– Pattern recognition
– Prediction

Regression is used to map a data item to a real valued prediction variable.

Clustering groups similar data together into clusters.
– Unsupervised learning
– Segmentation
– Partitioning


Basic Data Mining Tasks (cont’d)

Summarization maps data into subsets with associated simple descriptions.
– Characterization
– Generalization

Link Analysis uncovers relationships among data.
– Affinity Analysis
– Association Rules
– Sequential Analysis determines sequential patterns.


Ex: Time Series Analysis

Example: Stock Market

Predict future values

Determine similar patterns over time

Classify behavior


Data Mining vs. KDD

Knowledge Discovery in Databases (KDD): process of finding useful information and patterns in data.

Data Mining: Use of algorithms to extract the information and patterns derived by the KDD process.


KDD Process

Selection: Obtain data from various sources.

Preprocessing: Cleanse data.

Transformation: Convert to common format. Transform to new format.

Data Mining: Obtain desired results.

Interpretation/Evaluation: Present results to user in meaningful manner.

Modified from [FPSS96C]


KDD Process Ex: Web Log

Selection:
– Select log data (dates and locations) to use

Preprocessing:
– Remove identifying URLs
– Remove error logs

Transformation:
– Sessionize (sort and group)

Data Mining:
– Identify and count patterns
– Construct data structure

Interpretation/Evaluation:
– Identify and display frequently accessed sequences.

Potential User Applications:
– Cache prediction
– Personalization
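
The web log steps can be sketched end to end on a toy log. Everything specific here is an assumption of the sketch: the log tuple layout, the 30-minute session gap, and the choice of consecutive page pairs as the pattern to count.

```python
from collections import Counter, defaultdict

# Hypothetical log entries: (user, timestamp_seconds, url, status)
LOG = [
    ("u1", 0, "/home", 200), ("u1", 30, "/products", 200),
    ("u2", 10, "/home", 200), ("u1", 60, "/cart", 200),
    ("u2", 40, "/missing", 404), ("u2", 50, "/products", 200),
    ("u1", 4000, "/home", 200), ("u1", 4020, "/products", 200),
]
SESSION_GAP = 1800  # assumed inactivity timeout in seconds

def preprocess(log):
    """Preprocessing: drop error entries (status 400 and above)."""
    return [entry for entry in log if entry[3] < 400]

def sessionize(log):
    """Transformation: sort by user and time, then split on long gaps."""
    by_user = defaultdict(list)
    for user, ts, url, _ in sorted(log):
        by_user[user].append((ts, url))
    sessions = []
    for visits in by_user.values():
        current = [visits[0][1]]
        for (prev_ts, _), (ts, url) in zip(visits, visits[1:]):
            if ts - prev_ts > SESSION_GAP:
                sessions.append(current)  # close the session at a long gap
                current = []
            current.append(url)
        sessions.append(current)
    return sessions

def count_page_pairs(sessions):
    """Data mining: count consecutive page pairs across all sessions."""
    pairs = Counter()
    for pages in sessions:
        pairs.update(zip(pages, pages[1:]))
    return pairs

sessions = sessionize(preprocess(LOG))
frequent = count_page_pairs(sessions).most_common(1)
```

On this toy log the most frequent pair is ("/home", "/products"), seen in three sessions. A cache-prediction application could use such counts to prefetch the page most likely to be requested next.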


Data Mining Development

•Similarity Measures
•Hierarchical Clustering
•IR Systems
•Imprecise Queries
•Textual Data
•Web Search Engines

•Bayes Theorem
•Regression Analysis
•EM Algorithm
•K-Means Clustering
•Time Series Analysis

•Neural Networks
•Decision Tree Algorithms

•Algorithm Design Techniques
•Algorithm Analysis
•Data Structures

•Relational Data Model
•SQL
•Association Rule Algorithms
•Data Warehousing
•Scalability Techniques


KDD Issues

Human Interaction

Overfitting

Outliers

Interpretation

Visualization

Large Datasets

High Dimensionality


KDD Issues (cont’d)

Multimedia Data

Missing Data

Irrelevant Data

Noisy Data

Changing Data

Integration

Application


Challenges of Data Mining

Scalability

Dimensionality

Complex and Heterogeneous Data

Data Quality

Data Ownership and Distribution

Privacy Preservation

Streaming Data
