
DATA MINING
Introductory and Advanced Topics
Part II

Margaret H. Dunham
Department of Computer Science and Engineering
Southern Methodist University

Companion slides for the text by Dr. M. H. Dunham, Data Mining: Introductory and Advanced Topics, Prentice Hall, 2002.

Data Mining Outline

PART I
– Introduction
– Related Concepts
– Data Mining Techniques

PART II
– Classification
– Clustering
– Association Rules

PART III
– Web Mining
– Spatial Mining
– Temporal Mining

Classification Outline

Classification Problem Overview
Classification Techniques
– Regression
– Distance
– Decision Trees
– Rules
– Neural Networks

Goal: Provide an overview of the classification problem and introduce some of the basic algorithms.

Classification Problem

Given a database D = {t1, t2, …, tn} and a set of classes C = {C1, …, Cm}, the Classification Problem is to define a mapping f: D → C where each ti is assigned to one class.

The mapping actually divides D into equivalence classes.

Prediction is similar, but may be viewed as having an infinite number of classes.

Classification Examples

Teachers classify students' grades as A, B, C, D, or F.
Identify mushrooms as poisonous or edible.
Predict when a river will flood.
Identify individuals who are credit risks.
Speech recognition
Pattern recognition

Classification Ex: Grading

If x >= 90 then grade = A.
If 80 <= x < 90 then grade = B.
If 70 <= x < 80 then grade = C.
If 60 <= x < 70 then grade = D.
If x < 60 then grade = F.

[Figure: decision tree for grading that tests x against 90, 80, 70, and 60 in turn, with leaves A, B, C, D, and F.]
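These rules translate directly into code. A minimal sketch (the cut-offs are exactly those listed above):

```python
def grade(x):
    """Assign a letter grade from a numeric score using the rules above."""
    if x >= 90:
        return "A"
    if x >= 80:
        return "B"
    if x >= 70:
        return "C"
    if x >= 60:
        return "D"
    return "F"

print([grade(s) for s in (95, 83, 71, 65, 40)])  # ['A', 'B', 'C', 'D', 'F']
```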

Classification Ex: Letter Recognition

View letters as constructed from 5 components.

[Figure: the letters A, B, C, D, E, and F, each built from the five components.]

Classification Techniques

Approach:
1. Create a specific model by evaluating training data (or using domain experts' knowledge).
2. Apply the model developed to new data.

Classes must be predefined.
Most common techniques use DTs, NNs, or are based on distances or statistical methods.

Defining Classes

[Figures: classes defined as Partitioning Based vs. Distance Based.]

Issues in Classification

Missing Data
– Ignore
– Replace with assumed value

Measuring Performance
– Classification accuracy on test data
– Confusion matrix
– OC Curve

Height Example Data

Name       Gender  Height  Output1  Output2
Kristina   F       1.6m    Short    Medium
Jim        M       2m      Tall     Medium
Maggie     F       1.9m    Medium   Tall
Martha     F       1.88m   Medium   Tall
Stephanie  F       1.7m    Short    Medium
Bob        M       1.85m   Medium   Medium
Kathy      F       1.6m    Short    Medium
Dave       M       1.7m    Short    Medium
Worth      M       2.2m    Tall     Tall
Steven     M       2.1m    Tall     Tall
Debbie     F       1.8m    Medium   Medium
Todd       M       1.95m   Medium   Medium
Kim        F       1.9m    Medium   Tall
Amy        F       1.8m    Medium   Medium
Wynette    F       1.75m   Medium   Medium

Classification Performance

[Figure: the four possible outcomes of a classification decision: True Positive, False Positive, True Negative, False Negative.]

Confusion Matrix Example

Using the height data example, with Output1 the correct classification and Output2 the actual assignment:

                     Assignment
Actual Membership    Short   Medium   Tall
Short                0       4        0
Medium               0       5        3
Tall                 0       1        2

Operating Characteristic Curve

[Figure: operating characteristic (OC) curve.]

Regression

Assume the data fits a predefined function.
Determine the best values for the regression coefficients c0, c1, …, cn.
Assume an error term: y = c0 + c1 x1 + … + cn xn + ε
Estimate the error using the mean squared error over the training set.
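As an illustration (not from the slides), a one-variable version of this: fit y = c0 + c1 x by least squares and report the mean squared error on a small made-up training set.

```python
def fit_line(xs, ys):
    """Least-squares estimates of c0 and c1 for y = c0 + c1*x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    c1 = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
          / sum((x - mean_x) ** 2 for x in xs))
    c0 = mean_y - c1 * mean_x
    return c0, c1

def mean_squared_error(xs, ys, c0, c1):
    """Average squared difference between observed y and predicted c0 + c1*x."""
    return sum((y - (c0 + c1 * x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

xs, ys = [1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.9]
c0, c1 = fit_line(xs, ys)
print(c0, c1, mean_squared_error(xs, ys, c0, c1))
```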

Linear Regression Poor Fit

[Figure: data for which a single regression line is a poor fit.]

Classification Using Regression

Division: Use the regression function to divide the area into regions.

Prediction: Use the regression function to predict a class membership function. Input includes the desired class.

Division

[Figure: regression line dividing the space into class regions.]

Prediction

[Figure: regression used to predict class membership.]

Classification Using Distance

Place items in the class to which they are "closest".
Must determine the distance between an item and a class.
Classes represented by
– Centroid: central value
– Medoid: representative point
– Individual points

Algorithm: KNN

K Nearest Neighbor (KNN)

Training set includes classes.
Examine the K items nearest to the item to be classified.
The new item is placed in the class containing the most of these K items.
O(q) for each tuple to be classified. (Here q is the size of the training set.)

KNN

[Figure: classifying a new point by its K nearest neighbors.]

KNN Algorithm
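A minimal sketch of the idea (majority vote among the K closest training tuples); the training pairs below are a hand-picked subset of the height data, not the book's pseudocode.

```python
from collections import Counter

def knn_classify(training, new_item, k):
    """training: list of (value, class_label) pairs.
    Classify new_item by the majority class among its k nearest items."""
    neighbors = sorted(training, key=lambda t: abs(t[0] - new_item))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

train = [(1.6, "Short"), (1.9, "Medium"), (1.88, "Medium"),
         (1.7, "Short"), (2.2, "Tall"), (2.1, "Tall"), (1.95, "Medium")]
print(knn_classify(train, 1.72, k=3))  # Short
```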

Classification Using Decision Trees

Partitioning based: Divide the search space into rectangular regions.
A tuple is placed into a class based on the region within which it falls.
DT approaches differ in how the tree is built: DT Induction.
Internal nodes are associated with an attribute and arcs with values for that attribute.
Algorithms: ID3, C4.5, CART

Decision Tree

Given:
– D = {t1, …, tn} where ti = <ti1, …, tih>
– Database schema contains {A1, A2, …, Ah}
– Classes C = {C1, …, Cm}

A Decision or Classification Tree is a tree associated with D such that
– Each internal node is labeled with an attribute Ai
– Each arc is labeled with a predicate which can be applied to the attribute at its parent
– Each leaf node is labeled with a class Cj

DT Induction

DT Splits Area

[Figure: how the decision tree splits the Gender/Height space into rectangular regions, first on Gender (F, M) and then on Height.]

Comparing DTs

[Figure: two decision trees for the same data, one balanced and one deep.]

DT Issues

Choosing splitting attributes
Ordering of splitting attributes
Splits
Tree structure
Stopping criteria
Training data
Pruning

Decision Tree Induction is often based on Information Theory.

Information

DT Induction

When all the marbles in the bowl are mixed up, little information is given.
When the marbles in the bowl are all from one class and those in the other two classes are set aside on either side, more information is given.
Use this approach with DT Induction!

Information/Entropy

Given probabilities p1, p2, …, ps whose sum is 1, Entropy is defined as:
H(p1, p2, …, ps) = Σi pi log(1/pi)

Entropy measures the amount of randomness or surprise or uncertainty.
Goal in classification:
– no surprise
– entropy = 0
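A small sketch of this computation (log base 10, which is also what the ID3 example below uses); the probability vectors are illustrative.

```python
import math

def entropy(probs, base=10):
    """H(p1, ..., ps) = sum of p * log(1/p); terms with p == 0 contribute 0."""
    return sum(p * math.log(1.0 / p, base) for p in probs if p > 0)

print(entropy([1.0]))                 # 0.0 -> one class, no surprise
print(entropy([4/15, 8/15, 3/15]))    # ~0.4384, the starting entropy below
```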

Entropy

[Figure: log(1/p) and H(p, 1-p) plotted as functions of p.]

ID3

Creates the tree using information theory concepts and tries to reduce the expected number of comparisons.
ID3 chooses the split attribute with the highest information gain:
Gain(D, S) = H(D) − Σi P(Di) H(Di), where splitting D on attribute S produces the subsets D1, …, Ds.

ID3 Example (Output1)

Starting state entropy:
4/15 log(15/4) + 8/15 log(15/8) + 3/15 log(15/3) = 0.4384

Gain using gender:
– Female: 3/9 log(9/3) + 6/9 log(9/6) = 0.2764
– Male: 1/6 log(6/1) + 2/6 log(6/2) + 3/6 log(6/3) = 0.4392
– Weighted sum: (9/15)(0.2764) + (6/15)(0.4392) = 0.34152
– Gain: 0.4384 – 0.34152 = 0.09688

Gain using height:
0.4384 – (2/15)(0.301) = 0.3983

Choose height as the first splitting attribute.
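The gender computation can be reproduced with a short script over the Output1 labels from the height table (a sketch, using log base 10 as in the slide):

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy (log base 10) of a list of class labels."""
    counts, n = Counter(labels), len(labels)
    return sum((c / n) * math.log(n / c, 10) for c in counts.values())

# Output1 labels split by gender, taken from the height example data.
female = ["Short"] * 3 + ["Medium"] * 6
male = ["Short"] * 1 + ["Medium"] * 2 + ["Tall"] * 3
everyone = female + male

weighted = (len(female) / 15) * entropy(female) + (len(male) / 15) * entropy(male)
print(entropy(everyone))                # ~0.4384 starting entropy
print(entropy(female), entropy(male))   # ~0.2764  ~0.4392
print(entropy(everyone) - weighted)     # gain ~0.0969
```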

C4.5

ID3 favors attributes with a large number of divisions.
C4.5 is an improved version of ID3:
– Missing data
– Continuous data
– Pruning
– Rules
– GainRatio: the gain divided by the entropy of the split proportions, GainRatio(D, S) = Gain(D, S) / H(P(D1), …, P(Ds))

CART

Creates a binary tree.
Uses entropy.
Uses a formula to choose the split point, s, for node t, in terms of PL and PR, the probabilities that a tuple in the training set will be on the left or right side of the tree.

CART Example

At the start, there are six choices for the split point (right branch on equality):
– P(Gender) = 2(6/15)(9/15)(2/15 + 4/15 + 3/15) = 0.224
– P(1.6) = 0
– P(1.7) = 2(2/15)(13/15)(0 + 8/15 + 3/15) = 0.169
– P(1.8) = 2(5/15)(10/15)(4/15 + 6/15 + 3/15) = 0.385
– P(1.9) = 2(9/15)(6/15)(4/15 + 2/15 + 3/15) = 0.256
– P(2.0) = 2(12/15)(3/15)(4/15 + 8/15 + 3/15) = 0.32

Split at 1.8, the largest value.

Classification Using Neural Networks

Typical NN structure for classification:
– One output node per class
– Output value is the class membership function value

Supervised learning.
For each tuple in the training set, propagate it through the NN. Adjust the weights on edges to improve future classification.
Algorithms: Propagation, Backpropagation, Gradient Descent

NN Issues

Number of source nodes
Number of hidden layers
Training data
Number of sinks
Interconnections
Weights
Activation functions
Learning technique
When to stop learning

Decision Tree vs. Neural Network

[Figure: the same classification problem represented as a decision tree and as a neural network.]

Propagation

[Figure: a tuple's input values propagated through the network to an output.]

NN Propagation Algorithm
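A minimal sketch of propagating one tuple through a two-layer network with sigmoid activations; the weights below are arbitrary illustrative values, not the example from the slides.

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def propagate(inputs, hidden_weights, output_weights):
    """Feed one tuple forward: weighted sum plus activation at each layer."""
    hidden = [sigmoid(sum(w * x for w, x in zip(ws, inputs)))
              for ws in hidden_weights]
    return [sigmoid(sum(w * h for w, h in zip(ws, hidden)))
            for ws in output_weights]

# Two input attributes, two hidden nodes, one output node per class.
print(propagate([0.5, 1.0],
                hidden_weights=[[0.2, -0.1], [0.4, 0.3]],
                output_weights=[[1.0, -1.0], [0.5, 0.5]]))
```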

Example Propagation

[Figure: worked propagation example.]

NN Learning

Adjust weights to perform better with the associated test data.
Supervised: Use feedback from knowledge of the correct classification.
Unsupervised: No knowledge of the correct classification is needed.

NN Supervised Learning

Supervised Learning

Possible error values, assuming the output from node i is yi but should be di (for example |di − yi| or (di − yi)^2).
Change the weights on arcs based on the estimated error.

NN Backpropagation

Propagate changes to the weights backward from the output layer to the input layer.
Delta Rule: Δwij = c xij (dj − yj)
Gradient Descent: a technique to modify the weights in the graph.
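A sketch of one delta-rule update for the arcs feeding a single output node j (c is the learning rate; the numbers are illustrative):

```python
def delta_rule_update(weights, inputs, d_j, y_j, c=0.1):
    """w_ij <- w_ij + c * x_ij * (d_j - y_j) for every arc i -> j."""
    return [w + c * x * (d_j - y_j) for w, x in zip(weights, inputs)]

weights = [0.5, -0.3, 0.8]   # current weights on the arcs into node j
inputs = [1.0, 0.2, 0.7]     # values arriving on those arcs
print(delta_rule_update(weights, inputs, d_j=1.0, y_j=0.6))
```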

Backpropagation

[Figure: the error propagated backward through the network.]

Backpropagation Algorithm

Gradient Descent

Gradient Descent Algorithm

Output Layer Learning

Hidden Layer Learning

Types of NNs

Different NN structures are used for different problems.
– Perceptron
– Self Organizing Feature Map
– Radial Basis Function Network

Perceptron

The perceptron is one of the simplest NNs.
No hidden layers.

Perceptron Example

Suppose:
– Summation: S = 3x1 + 2x2 − 6
– Activation: if S > 0 then 1 else 0
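Written out as code (a sketch of exactly this perceptron):

```python
def perceptron(x1, x2):
    """Summation S = 3*x1 + 2*x2 - 6, then threshold activation."""
    s = 3 * x1 + 2 * x2 - 6
    return 1 if s > 0 else 0

print(perceptron(1, 1))  # S = -1 -> 0
print(perceptron(2, 1))  # S =  2 -> 1
```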

Self Organizing Feature Map (SOFM)

Competitive unsupervised learning.
Observe how neurons work in the brain:
– Firing impacts the firing of those nearby
– Neurons far apart inhibit each other
– Neurons have specific nonoverlapping tasks

Ex: Kohonen Network

Kohonen Network

[Figure: input layer fully connected to a two-dimensional grid of competitive nodes.]

Competitive layer – viewed as a 2D grid.
Similarity between competitive nodes and input nodes:
– Input: X = <x1, …, xh>
– Weights: <w1i, …, whi>
– Similarity defined based on the dot product

The competitive node most similar to the input "wins".
The winning node's weights (as well as surrounding node weights) are increased.

Radial Basis Function Network

The RBF function has a Gaussian shape.
RBF Networks:
– Three layers
– Hidden layer – Gaussian activation function
– Output layer – linear activation function

[Figure: radial basis function network architecture.]

Classification Using Rules

Perform classification using If-Then rules.
Classification Rule: r = <a, c>, with antecedent a and consequent c.
Rules may be generated from other techniques (DT, NN) or generated directly.
Algorithms: Gen, RX, 1R, PRISM

Generating Rules from DTs

Each path from the root to a leaf yields one rule: the conjunction of the arc predicates along the path is the antecedent, and the leaf's class is the consequent (see the sketch below).
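A sketch of this path-walking idea; the small Gender/Height tree here is a hypothetical example, not one produced by the algorithms above.

```python
# A node is either a class label (str) or
# {"attribute": name, "branches": {predicate: subtree, ...}}.
tree = {"attribute": "Gender",
        "branches": {"= F": {"attribute": "Height",
                             "branches": {"< 1.8m": "Short", ">= 1.8m": "Medium"}},
                     "= M": {"attribute": "Height",
                             "branches": {"< 1.9m": "Medium", ">= 1.9m": "Tall"}}}}

def tree_to_rules(node, antecedent=()):
    """Return one If-Then rule per root-to-leaf path."""
    if isinstance(node, str):                       # leaf: emit one rule
        return [("IF " + " AND ".join(antecedent), "THEN class = " + node)]
    rules = []
    for predicate, subtree in node["branches"].items():
        condition = node["attribute"] + " " + predicate
        rules.extend(tree_to_rules(subtree, antecedent + (condition,)))
    return rules

for antecedent, consequent in tree_to_rules(tree):
    print(antecedent, consequent)
```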

Generating Rules Example

Generating Rules from NNs

1R Algorithm

1R builds rules based on a single attribute: for each attribute it forms one rule per attribute value (predicting the majority class for that value) and keeps the attribute whose rule set makes the fewest errors (a sketch follows).
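A compact sketch of that idea; the attribute names and toy rows are made up for illustration, not the book's example.

```python
from collections import Counter, defaultdict

def one_r(rows, attributes, class_attr):
    """For each attribute, predict the majority class per attribute value;
    return the (attribute, rules, error count) with the fewest errors."""
    best = None
    for attr in attributes:
        by_value = defaultdict(Counter)
        for row in rows:
            by_value[row[attr]][row[class_attr]] += 1
        rules = {v: c.most_common(1)[0][0] for v, c in by_value.items()}
        errors = sum(sum(c.values()) - c.most_common(1)[0][1]
                     for c in by_value.values())
        if best is None or errors < best[2]:
            best = (attr, rules, errors)
    return best

rows = [{"Outlook": "Sunny", "Windy": "No",  "Play": "No"},
        {"Outlook": "Sunny", "Windy": "Yes", "Play": "No"},
        {"Outlook": "Rain",  "Windy": "No",  "Play": "Yes"},
        {"Outlook": "Rain",  "Windy": "Yes", "Play": "No"}]
print(one_r(rows, ["Outlook", "Windy"], "Play"))
```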

1R Example

PRISM Algorithm

PRISM Example

Decision Tree vs. Rules

Tree has an implied order in which splitting is performed.
Tree is created by looking at all classes.

Rules have no ordering of predicates.
Only need to look at one class to generate its rules.

Clustering Outline

Clustering Problem Overview
Clustering Techniques
– Hierarchical Algorithms
– Partitional Algorithms
– Genetic Algorithm
– Clustering Large Databases

Goal: Provide an overview of the clustering problem and introduce some of the basic algorithms.

Clustering Examples

Segment a customer database based on similar buying patterns.
Group houses in a town into neighborhoods based on similar features.
Identify new plant species.
Identify similar Web usage patterns.

Clustering Example

[Figure: points grouped into clusters.]

Clustering Houses

[Figure: the same houses clustered two ways: size based vs. geographic distance based.]

Clustering vs. Classification

No prior knowledge
– Number of clusters
– Meaning of clusters

Unsupervised learning

Clustering Issues

Outlier handling
Dynamic data
Interpreting results
Evaluating results
Number of clusters
Data to be used
Scalability

Impact of Outliers on Clustering

[Figure: how outliers distort the clusters that are found.]

Clustering Problem

Given a database D = {t1, t2, …, tn} of tuples and an integer value k, the Clustering Problem is to define a mapping f: D → {1, …, k} where each ti is assigned to one cluster Kj, 1 <= j <= k.

A Cluster Kj contains precisely those tuples mapped to it.
Unlike the classification problem, the clusters are not known a priori.

Types of Clustering

Hierarchical – a nested set of clusters is created.
Partitional – one set of clusters is created.
Incremental – each element is handled one at a time.
Simultaneous – all elements are handled together.
Overlapping / Non-overlapping

Clustering Approaches

Clustering
– Hierarchical
  – Agglomerative
  – Divisive
– Partitional
– Categorical
– Large DB
  – Sampling
  – Compression

Cluster Parameters

Distance Between Clusters

Single Link: smallest distance between points
Complete Link: largest distance between points
Average Link: average distance between points
Centroid: distance between centroids
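A sketch of these four inter-cluster distances for clusters of 1-D points (any point-to-point distance could be substituted):

```python
def single_link(a, b):
    return min(abs(x - y) for x in a for y in b)

def complete_link(a, b):
    return max(abs(x - y) for x in a for y in b)

def average_link(a, b):
    return sum(abs(x - y) for x in a for y in b) / (len(a) * len(b))

def centroid_distance(a, b):
    return abs(sum(a) / len(a) - sum(b) / len(b))

a, b = [1.0, 2.0], [5.0, 9.0]
print(single_link(a, b), complete_link(a, b),
      average_link(a, b), centroid_distance(a, b))   # 3.0 8.0 5.5 5.5
```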

Hierarchical Clustering

Clusters are created in levels, actually creating sets of clusters at each level.

Agglomerative
– Initially each item is in its own cluster
– Iteratively clusters are merged together
– Bottom up

Divisive
– Initially all items are in one cluster
– Large clusters are successively divided
– Top down

Hierarchical Algorithms

Single Link
MST Single Link
Complete Link
Average Link

Dendrogram

Dendrogram: a tree data structure which illustrates hierarchical clustering techniques.
Each level shows the clusters for that level.
– Leaf – individual clusters
– Root – one cluster
A cluster at level i is the union of its children clusters at level i+1.

Levels of Clustering

Agglomerative Example

   A  B  C  D  E
A  0  1  2  2  3
B  1  0  2  4  3
C  2  2  0  1  5
D  2  4  1  0  3
E  3  3  5  3  0

[Figure: graph of the five items and the dendrogram obtained as the distance threshold grows from 1 to 5; A and B merge first, as do C and D, and E joins last.]

MST Example

   A  B  C  D  E
A  0  1  2  2  3
B  1  0  2  4  3
C  2  2  0  1  5
D  2  4  1  0  3
E  3  3  5  3  0

[Figure: a minimum spanning tree connecting the five items.]

Agglomerative Algorithm

Single Link

View all items with links (distances) between them.
Finds maximal connected components in this graph.
Two clusters are merged if there is at least one edge which connects them.
Uses threshold distances at each level.
Could be agglomerative or divisive.

MST Single Link Algorithm

Single Link Clustering

Partitional Clustering

Nonhierarchical.
Creates the clusters in one step as opposed to several steps.
Since only one set of clusters is output, the user normally has to input the desired number of clusters, k.
Usually deals with static sets.

Partitional Algorithms

MST
Squared Error
K-Means
Nearest Neighbor
PAM
BEA
GA

MST Algorithm

Squared Error

The squared error of a cluster is the sum of the squared distances from each point in the cluster to the cluster center; squared error clustering seeks the set of clusters that minimizes the total squared error.

Squared Error Algorithm

K-Means

Initial set of clusters randomly chosen.
Iteratively, items are moved among the sets of clusters until the desired set is reached.
A high degree of similarity among elements in a cluster is obtained.
Given a cluster Ki = {ti1, ti2, …, tim}, the cluster mean is mi = (1/m)(ti1 + … + tim).

K-Means Example

Given: {2, 4, 10, 12, 3, 20, 30, 11, 25}, k = 2
Randomly assign means: m1 = 3, m2 = 4
K1 = {2,3}, K2 = {4,10,12,20,30,11,25}; m1 = 2.5, m2 = 16
K1 = {2,3,4}, K2 = {10,12,20,30,11,25}; m1 = 3, m2 = 18
K1 = {2,3,4,10}, K2 = {12,20,30,11,25}; m1 = 4.75, m2 = 19.6
K1 = {2,3,4,10,11,12}, K2 = {20,30,25}; m1 = 7, m2 = 25
Stop, as the clusters with these means are the same.

K-Means Algorithm
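A minimal 1-D sketch that reproduces the walk-through above (start means 3 and 4, k = 2; it assumes no cluster ever becomes empty):

```python
def kmeans_1d(values, means, iterations=10):
    """Assign each value to the nearest mean, recompute the means, repeat."""
    for _ in range(iterations):
        clusters = [[] for _ in means]
        for v in values:
            closest = min(range(len(means)), key=lambda i: abs(v - means[i]))
            clusters[closest].append(v)
        new_means = [sum(c) / len(c) for c in clusters]
        if new_means == means:        # converged: clusters no longer change
            break
        means = new_means
    return clusters, means

print(kmeans_1d([2, 4, 10, 12, 3, 20, 30, 11, 25], means=[3, 4]))
# ([[2, 4, 10, 12, 3, 11], [20, 30, 25]], [7.0, 25.0])
```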

Nearest Neighbor

Items are iteratively merged into the existing clusters that are closest.
Incremental.
A threshold, t, is used to determine if items are added to existing clusters or a new cluster is created.

Nearest Neighbor Algorithm

PAM

Partitioning Around Medoids (PAM) (K-Medoids)
Handles outliers well.
Ordering of input does not impact results.
Does not scale well.
Each cluster is represented by one item, called the medoid.
Initial set of k medoids randomly chosen.

[Figure: PAM clustering example.]

PAM Cost Calculation

At each step in the algorithm, medoids are changed if the overall cost is improved.
Cjih – the cost change for an item tj associated with swapping medoid ti with non-medoid th.

PAM Algorithm

BEA

Bond Energy Algorithm
Database design (physical and logical)
Vertical fragmentation
Determine affinity (bond) between attributes based on common usage.

Algorithm outline:
1. Create affinity matrix
2. Convert to BOND matrix
3. Create regions of close bonding

BEA

[Figure: BEA example, modified from [OV99].]

Genetic Algorithm Example

{A, B, C, D, E, F, G, H}
Randomly choose an initial solution:
{A,C,E} {B,F} {D,G,H}, or as bit strings: 10101000, 01000100, 00010011
Suppose crossover at point four, choosing the 1st and 3rd individuals:
10100011, 01000100, 00011000
What should the termination criteria be?
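The crossover step above, written out (a sketch; each bit string records which of A through H belongs to that cluster):

```python
def crossover(parent1, parent2, point):
    """Single-point crossover of two bit strings at the given point."""
    return (parent1[:point] + parent2[point:],
            parent2[:point] + parent1[point:])

ind1, ind3 = "10101000", "00010011"     # {A,C,E} and {D,G,H}
print(crossover(ind1, ind3, point=4))   # ('10100011', '00011000')
```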

GA Algorithm

Clustering Large Databases

Most clustering algorithms assume a large data structure which is memory resident.
Clustering may be performed first on a sample of the database and then applied to the entire database.
Algorithms:
– BIRCH
– DBSCAN
– CURE

Desired Features for Large Databases

One scan (or less) of the DB
Online
Suspendable, stoppable, resumable
Incremental
Work with limited main memory
Different techniques to scan (e.g. sampling)
Process each tuple once

BIRCH

Balanced Iterative Reducing and Clustering using Hierarchies
Incremental, hierarchical, one scan
Saves clustering information in a tree
Each entry in the tree contains information about one cluster
New nodes are inserted in the closest entry in the tree

Clustering Feature

CF Triple: (N, LS, SS)
– N: number of points in the cluster
– LS: sum of the points in the cluster
– SS: sum of the squares of the points in the cluster

CF Tree
– Balanced search tree
– A node has a CF triple for each child
– A leaf node represents a cluster and has a CF value for each subcluster in it
– Each subcluster has a maximum diameter
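A sketch of the CF bookkeeping for 1-D points: merging two clusters just adds their triples componentwise, and summary statistics such as the mean can be read off without revisiting the points (the function names are illustrative, not BIRCH's actual data structures).

```python
def cf_of(points):
    """CF triple (N, LS, SS) for a collection of 1-D points."""
    return (len(points), sum(points), sum(p * p for p in points))

def cf_merge(cf1, cf2):
    """Merging two clusters adds their CF triples componentwise."""
    return tuple(a + b for a, b in zip(cf1, cf2))

def cf_mean(cf):
    n, ls, _ = cf
    return ls / n

a, b = cf_of([1.0, 2.0, 3.0]), cf_of([10.0, 12.0])
merged = cf_merge(a, b)
print(merged, cf_mean(merged))   # (5, 28.0, 258.0) 5.6
```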

BIRCH Algorithm

Improve Clusters

DBSCAN

Density Based Spatial Clustering of Applications with Noise
Outliers will not affect the creation of clusters.
Input
– MinPts – minimum number of points in a cluster
– Eps – for each point in a cluster, there must be another point in the cluster less than this distance away.

DBSCAN Density Concepts

Eps-neighborhood: the points within Eps distance of a point.
Core point: a point whose Eps-neighborhood is dense enough (contains at least MinPts points).
Directly density-reachable: a point p is directly density-reachable from a point q if the distance is small (at most Eps) and q is a core point.
Density-reachable: a point is density-reachable from another point if there is a path from one to the other consisting of only core points.
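A sketch of the Eps-neighborhood and core-point tests for 1-D points (here the neighborhood count includes the point itself; that is a convention choice):

```python
def eps_neighborhood(points, p, eps):
    """All points within Eps distance of p (including p itself)."""
    return [q for q in points if abs(q - p) <= eps]

def is_core_point(points, p, eps, min_pts):
    """p is a core point if its Eps-neighborhood has at least MinPts points."""
    return len(eps_neighborhood(points, p, eps)) >= min_pts

data = [1.0, 1.2, 1.4, 5.0, 5.1, 9.0]
print(eps_neighborhood(data, 1.2, eps=0.5))          # [1.0, 1.2, 1.4]
print(is_core_point(data, 1.2, eps=0.5, min_pts=3))  # True
print(is_core_point(data, 9.0, eps=0.5, min_pts=3))  # False -> candidate outlier
```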

Density Concepts

DBSCAN Algorithm

CURE

Clustering Using Representatives
Uses many points to represent a cluster instead of only one.
The representative points will be well scattered.

CURE Approach

CURE Algorithm

CURE for Large Databases

Comparison of Clustering Techniques

Association Rules Outline

Goal: Provide an overview of basic Association Rule mining techniques.

Association Rules Problem Overview
– Large itemsets

Association Rules Algorithms
– Apriori
– Sampling
– Partitioning
– Parallel Algorithms

Comparing Techniques
Incremental Algorithms
Advanced AR Techniques


Example: Market Basket Data
Items frequently purchased together: Bread ⇒ PeanutButter
Uses:
– Placement
– Advertising
– Sales
– Coupons
Objective: increase sales and reduce costs


Association Rule Definitions
Set of items: I = {I1, I2, …, Im}
Transactions: D = {t1, t2, …, tn}, tj ⊆ I
Itemset: {Ii1, Ii2, …, Iik} ⊆ I
Support of an itemset: Percentage of transactions which contain that itemset.
Large (Frequent) itemset: Itemset whose number of occurrences is above a threshold.


Association Rules Example

I = { Beer, Bread, Jelly, Milk, PeanutButter}

Support of {Bread,PeanutButter} is 60%


Association Rule Definitions
Association Rule (AR): an implication X ⇒ Y where X, Y ⊆ I and X ∩ Y = ∅
Support of AR (s), X ⇒ Y: Percentage of transactions that contain X ∪ Y
Confidence of AR (α), X ⇒ Y: Ratio of the number of transactions that contain X ∪ Y to the number that contain X
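A minimal Python sketch of these two measures. The transaction list below is hypothetical illustration data (the slides' transaction table is not reproduced here), chosen to be consistent with the 60% support quoted earlier:

    def support(itemset, transactions):
        # Fraction of transactions containing every item in itemset.
        itemset = set(itemset)
        return sum(itemset <= set(t) for t in transactions) / len(transactions)

    def confidence(lhs, rhs, transactions):
        # support(X ∪ Y) / support(X) for the rule X ⇒ Y.
        return support(set(lhs) | set(rhs), transactions) / support(lhs, transactions)

    # Hypothetical transactions over I = {Beer, Bread, Jelly, Milk, PeanutButter}.
    transactions = [
        {"Bread", "Jelly", "PeanutButter"},
        {"Bread", "PeanutButter"},
        {"Bread", "Milk", "PeanutButter"},
        {"Beer", "Bread"},
        {"Beer", "Milk"},
    ]
    print(support({"Bread", "PeanutButter"}, transactions))       # 0.6
    print(confidence({"Bread"}, {"PeanutButter"}, transactions))  # 0.75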


Association Rules Example (cont'd)


Association Rule Problem
Given a set of items I = {I1, I2, …, Im} and a database of transactions D = {t1, t2, …, tn} where ti = {Ii1, Ii2, …, Iik} and Iij ∈ I, the Association Rule Problem is to identify all association rules X ⇒ Y with a minimum support and confidence.
Also referred to as Link Analysis.
NOTE: Support of X ⇒ Y is the same as support of X ∪ Y.


Association Rule Techniques
1. Find Large Itemsets.
2. Generate rules from frequent itemsets.


Algorithm to Generate ARs
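A minimal sketch of the usual generation step, reusing the support helper sketched earlier (names and structure are illustrative): for each large itemset l and each non-empty proper subset x of l, output x ⇒ (l - x) when its confidence meets the threshold.

    from itertools import combinations

    def generate_rules(large_itemsets, transactions, min_conf):
        rules = []
        for l in large_itemsets:
            for k in range(1, len(l)):
                for x in combinations(sorted(l), k):
                    # Confidence of x ⇒ (l - x) is support(l) / support(x).
                    conf = support(l, transactions) / support(x, transactions)
                    if conf >= min_conf:
                        rules.append((set(x), set(l) - set(x), conf))
        return rules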


Apriori
Large Itemset Property:
Any subset of a large itemset is large.
Contrapositive:
If an itemset is not large, none of its supersets are large.


Large Itemset Property


Apriori Example (cont'd)
s = 30%, α = 50%


Apriori Algorithm
1. C1 = Itemsets of size one in I;
2. Determine all large itemsets of size 1, L1;
3. i = 1;
4. Repeat
5. i = i + 1;
6. Ci = Apriori-Gen(Li-1);
7. Count Ci to determine Li;
8. until no more large itemsets found;
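A minimal Python sketch that follows the numbered steps above, with min_support given as a fraction. It assumes the support helper defined earlier and an apriori_gen candidate-generation helper (sketched under Apriori-Gen below):

    def apriori(transactions, min_support):
        items = {item for t in transactions for item in t}
        candidates = [frozenset([i]) for i in items]          # C1: itemsets of size one
        large = {}                                            # every Li found, with supports
        while candidates:
            counts = {c: support(c, transactions) for c in candidates}
            level = {c: s for c, s in counts.items() if s >= min_support}   # Li
            large.update(level)
            candidates = apriori_gen(list(level))             # Ci+1 = Apriori-Gen(Li)
        return large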


Apriori-Gen
Generate candidates of size i+1 from large itemsets of size i.
Approach used: join large itemsets of size i if they agree on i-1 items.
May also prune candidates that have a subset which is not large.
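A minimal sketch of the join and prune steps (variable names are illustrative). Two size-i large itemsets are joined when their union has size i+1, i.e. they agree on i-1 items; a candidate is pruned if any of its size-i subsets is not large:

    def apriori_gen(large_prev):
        large_prev = [frozenset(l) for l in large_prev]
        prev = set(large_prev)
        i = len(large_prev[0]) if large_prev else 0
        candidates = set()
        for a in large_prev:
            for b in large_prev:
                union = a | b
                if len(union) == i + 1:                           # join step
                    if all(union - {x} in prev for x in union):   # prune step
                        candidates.add(union)
        return list(candidates)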


Apriori-Gen Example


Apriori-Gen Example (cont'd)


Apriori Adv/Disadv
Advantages:
– Uses large itemset property.
– Easily parallelized
– Easy to implement.
Disadvantages:
– Assumes transaction database is memory resident.
– Requires up to m database scans.


Sampling
Large databases
Sample the database and apply Apriori to the sample.
Potentially Large Itemsets (PL): Large itemsets from sample
Negative Border (BD-):
– Generalization of Apriori-Gen applied to itemsets of varying sizes.
– Minimal set of itemsets which are not in PL, but whose subsets are all in PL.
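A naive sketch of the negative border by enumeration (fine as an illustration, not for large item sets; the function names are assumptions). An itemset joins BD-(PL) when it is not in PL but all of its immediate subsets are:

    from itertools import combinations

    def negative_border(pl, items):
        pl = {frozenset(x) for x in pl}
        border = set()
        max_size = max((len(x) for x in pl), default=0) + 1
        for k in range(1, max_size + 1):
            for cand in map(frozenset, combinations(sorted(items), k)):
                if cand in pl:
                    continue
                # All (k-1)-subsets must be in PL (trivially true for singletons).
                if k == 1 or all(frozenset(s) in pl for s in combinations(cand, k - 1)):
                    border.add(cand)
        return border

With the PL of the Sampling Example below, this returns {{Beer}, {Milk}}, matching BD-(PL) there.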


Negative Border Example
PL    BD-(PL)


Sampling Algorithm
1. Ds = sample of Database D;
2. PL = Large itemsets in Ds using smalls;
3. C = PL ∪ BD-(PL);
4. Count C in Database using s;
5. ML = large itemsets in BD-(PL);
6. If ML = ∅ then done
7. else C = repeated application of BD-;
8. Count C in Database;


Sampling Example
Find AR assuming s = 20%
Ds = {t1, t2}
Smalls = 10%
PL = {{Bread}, {Jelly}, {PeanutButter}, {Bread,Jelly}, {Bread,PeanutButter}, {Jelly,PeanutButter}, {Bread,Jelly,PeanutButter}}
BD-(PL) = {{Beer}, {Milk}}
ML = {{Beer}, {Milk}}
Repeated application of BD- generates all remaining itemsets


Sampling Adv/Disadv
Advantages:
– Reduces number of database scans to one in the best case and two in the worst.
– Scales better.
Disadvantages:
– Potentially large number of candidates in second pass


Partitioning
Divide database into partitions D1, D2, …, Dp
Apply Apriori to each partition
Any large itemset must be large in at least one partition.


Partitioning Algorithm
1. Divide D into partitions D1, D2, …, Dp;
2. For i = 1 to p do
3. Li = Apriori(Di);
4. C = L1 ∪ … ∪ Lp;
5. Count C on D to generate L;
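A minimal sketch of these steps, reusing the apriori and support helpers sketched earlier (partitioning by equal-sized slices is an assumption for illustration):

    def partition_algorithm(transactions, p, min_support):
        size = (len(transactions) + p - 1) // p
        partitions = [transactions[i:i + size] for i in range(0, len(transactions), size)]

        candidates = set()
        for d_i in partitions:                      # pass 1: Li = Apriori(Di)
            candidates |= set(apriori(d_i, min_support))

        # pass 2: one scan of D counts the union of the local large itemsets.
        return {c: s for c in candidates
                if (s := support(c, transactions)) >= min_support}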


Partitioning Example
s = 10%
D1: L1 = {{Bread}, {Jelly}, {PeanutButter}, {Bread,Jelly}, {Bread,PeanutButter}, {Jelly,PeanutButter}, {Bread,Jelly,PeanutButter}}
D2: L2 = {{Bread}, {Milk}, {PeanutButter}, {Bread,Milk}, {Bread,PeanutButter}, {Milk,PeanutButter}, {Bread,Milk,PeanutButter}, {Beer}, {Beer,Bread}, {Beer,Milk}}


Partitioning Adv/Disadv
Advantages:
– Adapts to available main memory
– Easily parallelized
– Maximum number of database scans is two.
Disadvantages:
– May have many candidates during the second scan.


Parallelizing AR Algorithms
Based on Apriori
Techniques differ:
– What is counted at each site
– How data (transactions) are distributed
Data Parallelism
– Data partitioned
– Count Distribution Algorithm
Task Parallelism
– Data and candidates partitioned
– Data Distribution Algorithm


Count Distribution Algorithm (CDA)
1. Place data partition at each site.
2. In parallel at each site do
3. C1 = Itemsets of size one in I;
4. Count C1;
5. Broadcast counts to all sites;
6. Determine global large itemsets of size 1, L1;
7. i = 1;
8. Repeat
9. i = i + 1;
10. Ci = Apriori-Gen(Li-1);
11. Count Ci;
12. Broadcast counts to all sites;
13. Determine global large itemsets of size i, Li;
14. until no more large itemsets found;


CDA Example


Data Distribution Algorithm (DDA)
1. Place data partition at each site.
2. In parallel at each site do
3. Determine local candidates of size 1 to count;
4. Broadcast local transactions to other sites;
5. Count local candidates of size 1 on all data;
6. Determine large itemsets of size 1 for local candidates;
7. Broadcast large itemsets to all sites;
8. Determine L1;
9. i = 1;
10. Repeat
11. i = i + 1;
12. Ci = Apriori-Gen(Li-1);
13. Determine local candidates of size i to count;
14. Count, broadcast, and find Li;
15. until no more large itemsets found;


DDA Example


Comparing AR Techniques
Target
Type
Data Type
Data Source
Technique
Itemset Strategy and Data Structure
Transaction Strategy and Data Structure
Optimization
Architecture
Parallelism Strategy


Comparison of AR Techniques


Hash Tree


Incremental Association Rules
Generate ARs in a dynamic database.
Problem: algorithms assume a static database
Objective:
– Know large itemsets for D
– Find large itemsets for D ∪ ΔD
Must be large in either D or ΔD
Save Li and counts
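A simplified sketch of this idea, assuming the absolute counts over D were saved for every previously-large itemset and reusing the apriori helper sketched earlier; only candidates without a saved count trigger a rescan of D (names are illustrative):

    def incremental_large(saved_counts, old_db, delta, min_support):
        # saved_counts: {frozenset itemset: absolute count in old_db} for the old large itemsets.
        full_size = len(old_db) + len(delta)
        candidates = set(saved_counts) | set(apriori(delta, min_support))
        large = {}
        for c in candidates:
            old_count = saved_counts.get(c)
            if old_count is None:               # not saved: one scan of the old database needed
                old_count = sum(set(c) <= set(t) for t in old_db)
            count = old_count + sum(set(c) <= set(t) for t in delta)
            if count / full_size >= min_support:
                large[c] = count
        return large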


Note on ARs
Many applications outside market basket data analysis
– Prediction (telecom switch failure)
– Web usage mining
Many different types of association rules
– Temporal
– Spatial
– Causal


Advanced AR Techniques
Generalized Association Rules
Multiple-Level Association Rules
Quantitative Association Rules
Using multiple minimum supports
Correlation Rules


Measuring Quality of Rules
Support
Confidence
Interest
Conviction
Chi Squared Test
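Support and confidence were defined earlier. Minimal sketches of two of the other measures, using the standard definitions of interest (lift) and conviction and reusing the support and confidence helpers sketched above (helper names are illustrative):

    def interest(lhs, rhs, transactions):
        # Lift: how much more often X and Y occur together than expected if independent.
        return support(set(lhs) | set(rhs), transactions) / (
            support(lhs, transactions) * support(rhs, transactions))

    def conviction(lhs, rhs, transactions):
        # Conviction: (1 - support(Y)) / (1 - confidence(X ⇒ Y)); infinite for an exact rule.
        conf = confidence(lhs, rhs, transactions)
        return float("inf") if conf == 1 else (1 - support(rhs, transactions)) / (1 - conf)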