
January 20, 2006 Data Mining: Concepts and Techniques 1

Data Mining: Concepts and Techniques

— Slides for Textbook —— potpourri —

©Jiawei Han and Micheline Kamber

http://www.cs.sfu.ca

Potpourri composed by

Yannis Theodoridis (May 2001)


Contents

Introduction
Data Warehouses
Data Preprocessing
Data Mining Functionality
Association Rules
Classification
Clustering
Trend Analysis
Social Impact
A prototype system: DBMiner



What Is Data Mining?

Data mining (knowledge discovery in databases): extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information or patterns from data in large databases

Alternative names and their “inside stories”:
  Data mining: a misnomer?
  Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.

What is not data mining?
  Expert systems or small ML/statistical programs


Data Mining Applications

Data mining is a young discipline with wide and diverse applications
  A nontrivial gap exists between general principles of data mining and domain-specific, effective data mining tools for particular applications

Some application domains (covered in this chapter)
  Biomedical and DNA data analysis
  Financial data analysis
  Retail industry
  Telecommunication industry



Commercial Data Mining Tools

Commercial data mining systems have little in common
  Different data mining functionality or methodology
  May even work with completely different kinds of data sets

A multi-dimensional view is needed when selecting a system
  Data types: relational, transactional, text, time sequence, spatial?
  System issues
    Running on only one or on several operating systems?
    A client/server architecture?
    Provide Web-based interfaces and allow XML data as input and/or output?


Commercial Data Mining Tools

Data sources
  ASCII text files, multiple relational data sources
  Support ODBC connections (OLE DB, JDBC)?

Data mining functions and methodologies
  One vs. multiple data mining functions
  One vs. variety of methods per function
    More data mining functions and methods per function provide the user with greater flexibility and analysis power

Coupling with DB and/or data warehouse systems
  Four forms of coupling: no coupling, loose coupling, semitight coupling, and tight coupling
    Ideally, a data mining system should be tightly coupled with a database system



Commercial Data Mining Tools

Scalability
  Row (or database size) scalability
  Column (or dimension) scalability
  Curse of dimensionality: it is much more challenging to make a system column scalable than row scalable

Visualization tools
  “A picture is worth a thousand words”
  Visualization categories: data visualization, mining result visualization, mining process visualization, and visual data mining

Data mining query language and graphical user interface
  Easy-to-use and high-quality graphical user interface
  Essential for user-guided, highly interactive data mining


Examples of Data Mining Systems (1)

IBM Intelligent Miner
  A wide range of data mining algorithms
  Scalable mining algorithms
  Toolkits: neural network algorithms, statistical methods, data preparation, and data visualization tools
  Tight integration with IBM's DB2 relational database system

Microsoft SQL Server 2000
  Integrate DB and OLAP with mining
  Support OLEDB for DM standard



Examples of Data Mining Systems (2)

SGI MineSet
  Multiple data mining algorithms and advanced statistics
  Advanced visualization tools

SAS Enterprise Miner
  A variety of statistical analysis tools
  Data warehouse tools and multiple data mining algorithms

Clementine (SPSS)
  An integrated data mining development environment for end-users and developers
  Multiple data mining algorithms and visualization tools


Data Mining: A KDD Process

Data mining: the core of knowledge discovery process.

[Figure: the KDD process as a pipeline: Databases → Data Cleaning and Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation]



Data Mining and Business Intelligence

[Figure: layered pyramid; the potential to support business decisions increases from bottom to top]

Layers, from top to bottom:
  Making Decisions
  Data Presentation: Visualization Techniques
  Data Mining: Information Discovery
  Data Exploration: Statistical Analysis, Querying and Reporting
  Data Warehouses / Data Marts: OLAP, MDA
  Data Sources: Paper, Files, Information Providers, Database Systems, OLTP

Roles, from top to bottom: End User, Business Analyst, Data Analyst, DBA


Architecture of a Typical DM System

[Figure: layered architecture; components from top to bottom]

Graphical user interface
Pattern evaluation
Data mining engine
Database or data warehouse server
Data cleaning & data integration, filtering
Databases and Data Warehouse

Knowledge-base



Data Mining Functionalities (1)

Concept description: characterization and discrimination
  Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet regions

Association (correlation and causality)
  Multi-dimensional vs. single-dimensional association
  age(X, “20..29”) ^ income(X, “20..29K”) ⇒ buys(X, “PC”) [support = 2%, confidence = 60%]
  contains(T, “computer”) ⇒ contains(T, “software”) [1%, 75%]

Classification and Prediction
  Finding models (functions) that describe and distinguish classes or concepts for future prediction
  E.g., classify countries based on climate, or classify cars based on gas mileage
  Presentation: decision-tree, classification rule, neural network
  Prediction: predict some unknown or missing numerical values

Cluster analysis
  Class label is unknown: group data to form new classes, e.g., cluster houses to find distribution patterns
  Clustering based on the principle: maximizing the intra-class similarity and minimizing the interclass similarity




Outlier analysis
  Outlier: a data object that does not comply with the general behavior of the data
  It can be considered as noise or exception but is quite useful in fraud detection, rare events analysis

Trend and evolution analysis
  Trend and deviation: regression analysis
  Sequential pattern mining, periodicity analysis
  Similarity-based analysis

Other pattern-directed or statistical analyses


Why Data Preprocessing?

Data in the real world is dirty
  incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
  noisy: containing errors or outliers
  inconsistent: containing discrepancies in codes or names

No quality data, no quality mining results!
  Quality decisions must be based on quality data
  Data warehouse needs consistent integration of quality data



Association Rule: Basic Concepts

Given: (1) database of transactions, (2) each transaction is a list of items (purchased by a customer in a visit)

Find: all rules that correlate the presence of one set of items with that of another set of items
  E.g., 98% of people who purchase tires and auto accessories also get automotive services done

Applications
  * ⇒ Maintenance Agreement (What should the store do to boost Maintenance Agreement sales?)
  Home Electronics ⇒ * (What other products should the store stock up on?)
  Attached mailing in direct marketing
  Detecting “ping-pong”ing of patients, faulty “collisions”


Rule Measures: Support & Confidence

Find all the rules X & Y ⇒ Z with minimum confidence and support
  support, s: probability that a transaction contains {X, Y, Z}
  confidence, c: conditional probability that a transaction having {X, Y} also contains Z

Transaction ID   Items Bought
2000             A, B, C
1000             A, C
4000             A, D
5000             B, E, F

With minimum support 50% and minimum confidence 50%, we have
  A ⇒ C (50%, 66.6%)
  C ⇒ A (50%, 100%)
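The two measures can be checked directly against the four transactions. The following is a minimal sketch; the function names `support` and `confidence` are chosen here for illustration and are not part of the slides.

```python
# Transactions from the table above, keyed by transaction ID.
transactions = {
    2000: {"A", "B", "C"},
    1000: {"A", "C"},
    4000: {"A", "D"},
    5000: {"B", "E", "F"},
}

def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for items in transactions.values() if itemset <= items)
    return hits / len(transactions)

def confidence(antecedent, consequent):
    """Conditional probability that a transaction with `antecedent` also has `consequent`."""
    return support(set(antecedent) | set(consequent)) / support(antecedent)

print("A => C", support({"A", "C"}), confidence({"A"}, {"C"}))
print("C => A", support({"A", "C"}), confidence({"C"}, {"A"}))
```

Both rules share the same support because support depends only on the combined itemset {A, C}; the confidences differ because the denominators, support({A}) and support({C}), differ.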

[Figure: Venn diagram with regions “Customer buys diaper”, “Customer buys both”, and “Customer buys beer”]



Visualization of Association Rule Using Plane Graph


Visualization of Association Rule Using Rule Graph



Rule Mining: A Road Map

Boolean vs. quantitative associations (based on the types of values handled)
  buys(x, “SQLServer”) ^ buys(x, “DMBook”) ⇒ buys(x, “UMiner”) [0.2%, 60%]
  age(x, “30..39”) ^ income(x, “42..48K”) ⇒ buys(x, “PC”) [1%, 75%]

Single dimension vs. multiple dimensional associations (see examples above)

Single level vs. multiple-level analysis
  What brands of beers are associated with what brands of diapers?

Various extensions
  Correlation, causality analysis
    Association does not necessarily imply correlation or causality
  Maxpatterns and closed itemsets
  Constraints enforced
    E.g., small sales (sum < 100) trigger big buys (sum > 1,000)?


Mining Association Rules: An Example

Min. support 50%
Min. confidence 50%

Transaction ID   Items Bought
2000             A, B, C
1000             A, C
4000             A, D
5000             B, E, F

Frequent Itemset   Support
{A}                75%
{B}                50%
{C}                50%
{A, C}             50%

For rule A ⇒ C:
  support = support({A, C}) = 50%
  confidence = support({A, C}) / support({A}) = 66.6%

The Apriori principle:
  Any subset of a frequent itemset must be frequent



Mining Frequent Itemsets: the Key Step

Find the frequent itemsets: the sets of items that have minimum support
  A subset of a frequent itemset must also be a frequent itemset
    i.e., if {A, B} is a frequent itemset, both {A} and {B} must be frequent itemsets
  Iteratively find frequent itemsets with cardinality from 1 to k (k-itemsets)

Use the frequent itemsets to generate association rules.


The Apriori Algorithm

Join Step: C_k is generated by joining L_{k-1} with itself
Prune Step: Any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset

Pseudo-code:
  C_k: candidate itemsets of size k
  L_k: frequent itemsets of size k

  L_1 = {frequent items};
  for (k = 1; L_k != ∅; k++) do begin
      C_{k+1} = candidates generated from L_k;
      for each transaction t in database do
          increment the count of all candidates in C_{k+1} that are contained in t
      L_{k+1} = candidates in C_{k+1} with min_support
  end
  return ∪_k L_k;



The Apriori Algorithm: Example

Database D
TID   Items
100   1 3 4
200   2 3 5
300   1 2 3 5
400   2 5

Scan D → C1
itemset   sup.
{1}       2
{2}       3
{3}       3
{4}       1
{5}       3

L1
itemset   sup.
{1}       2
{2}       3
{3}       3
{5}       3

C2 (generated from L1)
itemset
{1 2}
{1 3}
{1 5}
{2 3}
{2 5}
{3 5}

Scan D → C2 with counts
itemset   sup
{1 2}     1
{1 3}     2
{1 5}     1
{2 3}     2
{2 5}     3
{3 5}     2

L2
itemset   sup
{1 3}     2
{2 3}     2
{2 5}     3
{3 5}     2

C3 (generated from L2)
itemset
{2 3 5}

Scan D → L3
itemset   sup
{2 3 5}   2
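The level-wise loop can be traced on database D with a short Python sketch. Two assumptions are made here: the minimum support count of 2 is inferred from the example (it is what drops {4} and keeps {1}), and candidates are formed by a simple pairwise-union join rather than the ordered self-join of the slides. The function name `apriori` and its parameters are illustrative only.

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Return {frozenset: support count} for every frequent itemset."""
    transactions = [frozenset(t) for t in transactions]
    # L1: count single items and keep the frequent ones.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    level = {s: c for s, c in counts.items() if c >= min_count}
    frequent = dict(level)
    k = 1
    while level:
        # Join (simplified): unions of two frequent k-itemsets with exactly k+1 items.
        candidates = {a | b for a in level for b in level if len(a | b) == k + 1}
        # Prune: every k-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in level for s in combinations(c, k))}
        # Scan the database to count the surviving candidates.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {s: n for s, n in counts.items() if n >= min_count}
        frequent.update(level)
        k += 1
    return frequent

D = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
result = apriori(D, min_count=2)
for itemset in sorted(result, key=lambda s: (len(s), sorted(s))):
    print(sorted(itemset), result[itemset])
```

The printed itemsets should match L1, L2 and L3 in the tables above; the loop stops once a level produces no frequent itemsets.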


Candidates Generation

Suppose the items in L_{k-1} are listed in an order

Step 1: self-joining L_{k-1}
  insert into C_k
  select p.item_1, p.item_2, …, p.item_{k-1}, q.item_{k-1}
  from L_{k-1} p, L_{k-1} q
  where p.item_1 = q.item_1, …, p.item_{k-2} = q.item_{k-2}, p.item_{k-1} < q.item_{k-1}

Step 2: pruning
  forall itemsets c in C_k do
      forall (k-1)-subsets s of c do
          if (s is not in L_{k-1}) then delete c from C_k



How to Count Supports of Candidates?

Why is counting supports of candidates a problem?
  The total number of candidates can be huge
  One transaction may contain many candidates

Method:
  Candidate itemsets are stored in a hash-tree
  Leaf node of hash-tree contains a list of itemsets and counts
  Interior node contains a hash table
  Subset function: finds all the candidates contained in a transaction


Example of Generating Candidates

L3={abc, abd, acd, ace, bcd}

Self-joining: L3*L3

abcd from abc and abd

acde from acd and ace

Pruning:

acde is removed because ade is not in L3

C4={abcd}
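The self-join and prune steps can be reproduced for this example in a few lines of Python. This is a sketch only: itemsets are represented as sorted tuples of single-character items, and the function name `apriori_gen` is chosen for illustration.

```python
from itertools import combinations

def apriori_gen(prev_level):
    """Build C_k from L_(k-1), with each itemset given as a sorted tuple of items."""
    prev = sorted(prev_level)
    prev_set = set(prev)
    joined, kept = [], []
    # Step 1 (self-join): pair itemsets that share their first k-2 items and
    # whose last items are in increasing order.
    for p in prev:
        for q in prev:
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                joined.append(p + (q[-1],))
    # Step 2 (prune): drop a candidate if any (k-1)-subset is missing from L_(k-1).
    for c in joined:
        if all(s in prev_set for s in combinations(c, len(c) - 1)):
            kept.append(c)
    return joined, kept

L3 = [tuple(s) for s in ("abc", "abd", "acd", "ace", "bcd")]
joined, C4 = apriori_gen(L3)
print("after join: ", ["".join(c) for c in joined])
print("after prune:", ["".join(c) for c in C4])
```

Returning the joined list alongside the pruned one makes it easy to see that acde is produced by the join and then removed because ade is not in L3.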



Mining Distance-based Association Rules

Binning methods do not capture the semantics of interval data

Distance-based partitioning, more meaningful discretization considering:
  density/number of points in an interval
  “closeness” of points in an interval

Price($)   Equi-width (width $10)   Equi-depth (depth 2)   Distance-based
7          [0,10]                   [7,20]                 [7,7]
20         [11,20]                  [22,50]                [20,22]
22         [21,30]                  [51,53]                [50,53]
50         [31,40]
51         [41,50]
53         [51,60]
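The contrast between the columns can be illustrated with a small Python sketch. The equi-depth function follows the table directly. The second function is an assumption: it uses a simple gap threshold (`max_gap`, a hypothetical parameter) as a stand-in for distance-based partitioning, which is not the clustering method the slides refer to but reproduces the same intervals on this data.

```python
def equi_depth(values, depth):
    """Split the sorted values into consecutive groups of `depth` points each."""
    ordered = sorted(values)
    groups = [ordered[i:i + depth] for i in range(0, len(ordered), depth)]
    return [(g[0], g[-1]) for g in groups]

def gap_based(values, max_gap):
    """Start a new interval whenever the gap to the previous point exceeds `max_gap`."""
    ordered = sorted(values)
    intervals = [[ordered[0], ordered[0]]]
    for v in ordered[1:]:
        if v - intervals[-1][1] <= max_gap:
            intervals[-1][1] = v        # close enough: extend the current interval
        else:
            intervals.append([v, v])    # too far: open a new interval
    return [tuple(i) for i in intervals]

prices = [7, 20, 22, 50, 51, 53]
print(equi_depth(prices, depth=2))
print(gap_based(prices, max_gap=5))
```

Equi-depth groups 22 with 50 even though they are far apart, while the gap-based grouping keeps only nearby prices together, which is the point the table makes.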


Clusters and Distance Measurements

S[X] is a set of N tuples t_1, t_2, …, t_N, projected on the attribute set X

The diameter of S[X]:

  d(S[X]) = \frac{\sum_{i=1}^{N} \sum_{j=1}^{N} dist_X(t_i[X], t_j[X])}{N(N-1)}

dist_X: distance metric, e.g., Euclidean distance or Manhattan distance
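The diameter is the average pairwise distance between the projected tuples, which a short Python sketch makes concrete. The sample cluster and the choice to return 0 for a single point are assumptions made for illustration.

```python
def diameter(points, dist):
    """Average pairwise distance: sum over all ordered pairs, divided by N * (N - 1)."""
    n = len(points)
    if n < 2:
        return 0.0  # assumption: a single point is given zero diameter
    total = sum(dist(p, q) for p in points for q in points)
    return total / (n * (n - 1))

def manhattan(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))

def euclidean(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

cluster = [(50,), (51,), (53,)]   # hypothetical one-attribute tuples
print(diameter(cluster, manhattan))
```

Pairs with i = j contribute zero distance, so summing over all N² ordered pairs and dividing by N(N-1) gives the mean over distinct pairs.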



Clusters and Distance Measurements (Cont.)

The diameter, d, assesses the density of a cluster C_X, where

  d(C_X) \le d_0^X

  |C_X| \ge s_0

Finding clusters and distance-based rules
  the density threshold, d_0, replaces the notion of support
  modified version of the BIRCH clustering algorithm


Interestingness Measurements

Objective measures
  Two popular measurements: support and confidence

Subjective measures (Silberschatz & Tuzhilin, KDD95)
  A rule (pattern) is interesting if
    it is unexpected (surprising to the user); and/or
    actionable (the user can do something with it)



Criticism of Support and Confidence

Example 1: (Aggarwal & Yu, PODS98)
  Among 5000 students
    3000 play basketball
    3750 eat cereal
    2000 both play basketball and eat cereal
  play basketball ⇒ eat cereal [40%, 66.7%] is misleading because the overall percentage of students eating cereal is 75%, which is higher than 66.7%.
  play basketball ⇒ not eat cereal [20%, 33.3%] is far more accurate, although with lower support and confidence

             basketball   not basketball   sum(row)
cereal       2000         1750             3750
not cereal   1000         250              1250
sum(col.)    3000         2000             5000


Criticism of Support and Confidence

Example 2:
  X and Y: positively correlated
  X and Z: negatively related
  support and confidence of X ⇒ Z dominate

X   1 1 1 1 0 0 0 0
Y   1 1 0 0 0 0 0 0
Z   0 1 1 1 1 1 1 1

Rule    Support   Confidence
X ⇒ Y   25%       50%
X ⇒ Z   37.50%    75%

We need a measure of dependent or correlated events

  corr_{A,B} = \frac{P(A \cup B)}{P(A) P(B)}

  (here P(A ∪ B) is the probability that a transaction contains both A and B)

P(B|A)/P(B) is also called the lift of rule A ⇒ B



Other Interestingness Measures: Interest

Interest (correlation, lift)

  \frac{P(A \wedge B)}{P(A) P(B)}

  taking both P(A) and P(B) in consideration
  P(A ∧ B) = P(B) * P(A), if A and B are independent events
  A and B negatively correlated, if the value is less than 1; otherwise A and B positively correlated

X   1 1 1 1 0 0 0 0
Y   1 1 0 0 0 0 0 0
Z   0 1 1 1 1 1 1 1

Itemset   Support   Interest
X, Y      25%       2
X, Z      37.50%    0.9
Y, Z      12.50%    0.57
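The interest values can be recomputed from the eight transactions with a few lines of Python. This is a sketch; the helper names `prob` and `interest` are illustrative.

```python
# Rows of the table above: one list per item, one column per transaction.
X = [1, 1, 1, 1, 0, 0, 0, 0]
Y = [1, 1, 0, 0, 0, 0, 0, 0]
Z = [0, 1, 1, 1, 1, 1, 1, 1]

def prob(*columns):
    """Fraction of transactions in which all the given items are present."""
    rows = list(zip(*columns))
    return sum(1 for row in rows if all(row)) / len(rows)

def interest(a, b):
    """P(A and B) / (P(A) * P(B)); 1 means independent, below 1 negatively correlated."""
    return prob(a, b) / (prob(a) * prob(b))

for name, (a, b) in {"X,Y": (X, Y), "X,Z": (X, Z), "Y,Z": (Y, Z)}.items():
    print(name, prob(a, b), round(interest(a, b), 2))
```

For X, Z the exact value is 6/7 (about 0.857), which the table rounds to 0.9. It is below 1 even though X ⇒ Z has the higher support and confidence, which is why interest is the more informative measure here.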


Association Rules: Summary

Association rule mining
  probably the most significant contribution from the database community in KDD
  A large number of papers have been published

Many interesting issues have been explored

An interesting research direction
  Association analysis in other types of data: spatial data, multimedia data, time series data, etc.



Training Dataset

age      income   student   credit_rating   buys_computer
<=30     high     no        fair            no
<=30     high     no        excellent       no
31…40    high     no        fair            yes
>40      medium   no        fair            yes
>40      low      yes       fair            yes
>40      low      yes       excellent       no
31…40    low      yes       excellent       yes
<=30     medium   no        fair            no
<=30     low      yes       fair            yes
>40      medium   yes       fair            yes
<=30     medium   yes       excellent       yes
31…40    medium   no        excellent       yes
31…40    high     yes       fair            yes
>40      medium   no        excellent       no

This follows an example from Quinlan’s ID3
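ID3 chooses the attribute to split on by information gain. The slides do not show that computation, so the following Python sketch is an illustration under stated assumptions: the rows are the 14 training tuples above, the age value of the third row is taken to be “31…40” (written `31..40` in the code), and the function names are hypothetical.

```python
from collections import Counter
from math import log2

COLUMNS = ["age", "income", "student", "credit_rating", "buys_computer"]
ROWS = [
    ("<=30", "high", "no", "fair", "no"),
    ("<=30", "high", "no", "excellent", "no"),
    ("31..40", "high", "no", "fair", "yes"),   # assumed to be 31..40, as in the other rows
    (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),
    (">40", "low", "yes", "excellent", "no"),
    ("31..40", "low", "yes", "excellent", "yes"),
    ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),
    (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),
    ("31..40", "medium", "no", "excellent", "yes"),
    ("31..40", "high", "yes", "fair", "yes"),
    (">40", "medium", "no", "excellent", "no"),
]

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(n / total * log2(n / total) for n in Counter(labels).values())

def information_gain(attribute):
    """Entropy of the class minus the weighted entropy after splitting on `attribute`."""
    col = COLUMNS.index(attribute)
    labels = [row[-1] for row in ROWS]
    remainder = 0.0
    for value in {row[col] for row in ROWS}:
        subset = [row[-1] for row in ROWS if row[col] == value]
        remainder += len(subset) / len(ROWS) * entropy(subset)
    return entropy(labels) - remainder

gains = {a: information_gain(a) for a in COLUMNS[:-1]}
best = max(gains, key=gains.get)
print(best, {a: round(g, 3) for a, g in gains.items()})
```

On this data the attribute with the largest gain is age, which is consistent with age appearing at the root of the decision tree for buys_computer.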


Output: A Decision Tree for “buys_computer”

[Figure: decision tree, written out as nested branches]

age?
  <=30: student?
    no: no
    yes: yes
  31..40: yes
  >40: credit rating?
    excellent: no
    fair: yes



Presentation of Classification Results


AGNES (Agglomerative Nesting)

Introduced in Kaufmann and Rousseeuw (1990)

Implemented in statistical analysis packages, e.g., Splus

Use the Single-Link method and the dissimilarity matrix.

Merge nodes that have the least dissimilarity

Go on in a non-descending fashion

Eventually all nodes belong to the same cluster
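The merge procedure can be sketched in plain Python. This is a naive illustration, not the implementation found in Splus or any other package: the sample points are hypothetical, and distances are recomputed on every pass instead of being read from a stored dissimilarity matrix.

```python
def single_link_agnes(points, dist):
    """Merge the two closest clusters until one remains; return the merge history."""
    clusters = [[p] for p in points]
    history = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single link: cluster dissimilarity is that of the closest pair of members.
                d = min(dist(p, q) for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merged = clusters[i] + clusters[j]
        history.append((d, sorted(merged)))
        clusters = [c for n, c in enumerate(clusters) if n not in (i, j)] + [merged]
    return history

def euclidean(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

points = [(1, 1), (1, 2), (5, 5), (6, 5), (10, 9)]   # hypothetical sample
history = single_link_agnes(points, euclidean)
for d, members in history:
    print(round(d, 2), members)
```

The merge distances come out in non-descending order, and the last entry of the history contains every point, matching the last two bullets above.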

[Figure: three scatter-plot panels, both axes running from 0 to 10]


DBSCAN: Density Based Spatial Clustering of Applications with Noise

Relies on a density-based notion of cluster: a cluster is defined as a maximal set of density-connected points

Discovers clusters of arbitrary shape in spatial databases with noise

[Figure: core, border, and outlier points, with Eps = 1cm and MinPts = 5]
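The roles of Eps and MinPts can be seen in a compact Python sketch of the procedure. It is a simplified illustration with a brute-force neighbourhood search; the sample points and the parameter values used in the call are hypothetical, not those of the figure.

```python
def dbscan(points, eps, min_pts, dist):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    def neighbours(i):
        # Eps-neighbourhood of point i, including the point itself.
        return [j for j in range(len(points)) if dist(points[i], points[j]) <= eps]

    labels = [None] * len(points)    # None = not visited yet
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = -1           # provisional noise; may later become a border point
            continue
        cluster += 1                 # i is a core point: start a new cluster
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # reached from a core point: border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reach = neighbours(j)
            if len(reach) >= min_pts:
                queue.extend(reach)  # j is also a core point: keep expanding
    return labels

def euclidean(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

points = [(0, 0), (0, 1), (1, 0), (1, 1),     # dense group
          (5, 5), (5, 6), (6, 5), (6, 6),     # second dense group
          (10, 0)]                            # isolated point
labels = dbscan(points, eps=1.5, min_pts=3, dist=euclidean)
print(labels)
```

Each dense group becomes one cluster because its points are density-connected through core points, while the isolated point has too few neighbours within Eps and stays labelled as noise.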


[Figure: reachability plot: reachability-distance (with thresholds ε and ε′; some values undefined) against the cluster order of the objects]



Constraint-Based Clustering Analysis

Clustering analysis: fewer parameters but more user-desired constraints, e.g., an ATM allocation problem


Mining Time-Series and Sequence Data

Time-series plot



Mining Time-Series and Sequence Data: Trend analysis

A time series can be illustrated as a time-series graph which describes a point moving with the passage of time

Categories of Time-Series Movements
  Long-term or trend movements (trend curve)
  Cyclic movements or cycle variations, e.g., business cycles
  Seasonal movements or seasonal variations
    i.e., almost identical patterns that a time series appears to follow during corresponding months of successive years
  Irregular or random movements
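One common way to estimate the trend curve is to smooth out the shorter-term movements with a moving average. The slide does not name a method, so the following Python sketch is an assumption for illustration; the series values and the window length are hypothetical.

```python
def centered_moving_average(series, window):
    """Smooth a series with an odd-length centered window; the ends are left as None."""
    if window % 2 == 0:
        raise ValueError("window must be odd")
    half = window // 2
    out = [None] * len(series)
    for i in range(half, len(series) - half):
        out[i] = sum(series[i - half:i + half + 1]) / window
    return out

series = [3, 7, 2, 0, 4, 5, 9, 7, 2]   # hypothetical observations
trend = centered_moving_average(series, 3)
print(trend)
```

A longer window removes more of the seasonal and irregular movement but leaves more positions at the ends of the series without an estimate.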


Social Impacts: Threat to Privacy and Data Security?

Is data mining a threat to privacy and data security?
  “Big Brother”, “Big Banker”, and “Big Business” are carefully watching you

Profiling information is collected every time
  You use your credit card, debit card, supermarket loyalty card, or frequent flyer card, or apply for any of the above
  You surf the Web, reply to an Internet newsgroup, subscribe to a magazine, rent a video, join a club, fill out a contest entry form
  You pay for prescription drugs, or present your medical care number when visiting the doctor

While collection of personal data may be beneficial for companies and consumers, there is also potential for misuse



Protect Privacy and Data Security

Fair information practices
  International guidelines for data privacy protection
  Cover aspects relating to data collection, purpose, use, quality, openness, individual participation, and accountability
  Purpose specification and use limitation
  Openness: Individuals have the right to know what information is collected about them, who has access to the data, and how the data are being used

Develop and use data security-enhancing techniques
  Blind signatures
  Biometric encryption
  Anonymous databases


OLAP (Summarization) Display Using MS/Excel 2000



Market-Basket-Analysis (Association)—Ball graph


Display of Association Rules in Rule Plane Form



Display of Decision Tree (Classification Results)


Display of Clustering (Segmentation) Results



3D Cube Browser


Trends in Data Mining (1)

Scalable data mining methods
  Constraint-based mining: use of constraints to guide data mining systems in their search for interesting patterns

Application exploration
  Development of application-specific data mining systems
  Invisible data mining (mining as built-in function)

Integration of data mining with database systems, data warehouse systems, and Web database systems

Quality assessment



Trends in Data Mining (2)

Standardization of data mining language
  A standard will facilitate systematic development, improve interoperability, and promote the education and use of data mining systems in industry and society

Visual data mining

Uncertainty handling

New methods for mining complex types of data
  More research is required towards the integration of data mining methods with existing data analysis techniques for the complex types of data

Web mining

Privacy protection and information security in data mining
