data mining: concepts and techniques - knowledge and...
January 20, 2006 Data Mining: Concepts and Techniques 1
Data Mining: Concepts and Techniques
— Slides for Textbook —— potpourri —
©Jiawei Han and Micheline Kamber
http://www.cs.sfu.ca
Potpourri composed by
Yannis Theodoridis (May 2001)
Contents
Introduction
Data Warehouses
Data Preprocessing
Data Mining Functionality
Association Rules
Classification
Clustering
Trend Analysis
Social Impact
A prototype system: DBMiner
What Is Data Mining?
Data mining (knowledge discovery in databases): Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information or patterns from data in large databases
Alternative names and their “inside stories”:
Data mining: a misnomer?
Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
What is not data mining?
Expert systems or small ML/statistical programs
Data Mining Applications
Data mining is a young discipline with wide and diverse applications
A nontrivial gap exists between general principles of data mining and domain-specific, effective data mining tools for particular applications
Some application domains (covered in this chapter)
Biomedical and DNA data analysis
Financial data analysis
Retail industry
Telecommunication industry
Commercial Data Mining tools
Commercial data mining systems have little in common
Different data mining functionality or methodology
May even work with completely different kinds of data sets
Selecting a system requires a multidimensional view
Data types: relational, transactional, text, time sequence, spatial?
System issues
Running on only one or on several operating systems?
A client/server architecture?
Provide Web-based interfaces and allow XML data as input and/or output?
Commercial Data Mining tools
Data sources
ASCII text files, multiple relational data sources
Support ODBC connections (OLE DB, JDBC)?
Data mining functions and methodologies
One vs. multiple data mining functions
One vs. variety of methods per function
More data mining functions and methods per function provide the user with greater flexibility and analysis power
Coupling with DB and/or data warehouse systems
Four forms of coupling: no coupling, loose coupling, semitight coupling, and tight coupling
Ideally, a data mining system should be tightly coupled with a database system
Commercial Data Mining tools
Scalability
Row (or database size) scalability
Column (or dimension) scalability
Curse of dimensionality: it is much more challenging to make a system column scalable than row scalable
Visualization tools
“A picture is worth a thousand words”
Visualization categories: data visualization, mining result visualization, mining process visualization, and visual data mining
Data mining query language and graphical user interface
Easy-to-use and high-quality graphical user interface
Essential for user-guided, highly interactive data mining
Examples of Data Mining Systems (1)
IBM Intelligent Miner
A wide range of data mining algorithms
Scalable mining algorithms
Toolkits: neural network algorithms, statistical methods, data preparation, and data visualization tools
Tight integration with IBM's DB2 relational database system
Microsoft SQL Server 2000
Integrate DB and OLAP with mining
Support OLE DB for DM standard
Examples of Data Mining Systems (2)
SGI MineSet
Multiple data mining algorithms and advanced statistics
Advanced visualization tools
SAS Enterprise Miner
A variety of statistical analysis tools
Data warehouse tools and multiple data mining algorithms
Clementine (SPSS)
An integrated data mining development environment for end-users and developers
Multiple data mining algorithms and visualization tools
Data Mining: A KDD Process
Data mining: the core of knowledge discovery process.
[Figure: Databases → Data Cleaning and Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation]
Data Mining and Business Intelligence
[Figure: layered pyramid; the potential to support business decisions increases toward the top]
Layers, top to bottom:
Making Decisions
Data Presentation: Visualization Techniques
Data Mining: Information Discovery
Data Exploration: Statistical Analysis, Querying and Reporting
OLAP, MDA
Data Warehouses / Data Marts
Data Sources: Paper, Files, Information Providers, Database Systems, OLTP
Roles, top to bottom: End User, Business Analyst, Data Analyst, DBA
Architecture of a Typical DM System
[Figure: layered architecture, listed from bottom to top]
Databases and Data Warehouse (data cleaning & data integration, filtering)
Database or data warehouse server
Data mining engine
Pattern evaluation
Graphical user interface
Knowledge-base
Data Mining Functionalities (1)
Concept description: Characterization and discrimination
Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet regions
Association (correlation and causality)
Multi-dimensional vs. single-dimensional association
age(X, “20..29”) ^ income(X, “20..29K”) ⇒ buys(X, “PC”) [support = 2%, confidence = 60%]
contains(T, “computer”) ⇒ contains(T, “software”) [support = 1%, confidence = 75%]
Data Mining Functionalities (2)
Classification and Prediction
Finding models (functions) that describe and distinguish classes or concepts for future prediction
E.g., classify countries based on climate, or classify cars based on gas mileage
Presentation: decision-tree, classification rule, neural network
Prediction: Predict some unknown or missing numerical values
Cluster analysis
Class label is unknown: Group data to form new classes, e.g., cluster houses to find distribution patterns
Clustering based on the principle: maximizing the intra-class similarity and minimizing the interclass similarity
Data Mining Functionalities (3)
Outlier analysis
Outlier: a data object that does not comply with the general behavior of the data
It can be considered as noise or exception but is quite useful in fraud detection and rare events analysis
Trend and evolution analysis
Trend and deviation: regression analysis
Sequential pattern mining, periodicity analysis
Similarity-based analysis
Other pattern-directed or statistical analyses
Why Data Preprocessing?
Data in the real world is dirty
incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
noisy: containing errors or outliers
inconsistent: containing discrepancies in codes or names
No quality data, no quality mining results!
Quality decisions must be based on quality data
Data warehouse needs consistent integration of quality data
Association Rule: Basic Concepts
Given: (1) database of transactions, (2) each transaction is a list of items (purchased by a customer in a visit)
Find: all rules that correlate the presence of one set of items with that of another set of items
E.g., 98% of people who purchase tires and auto accessories also get automotive services done
Applications
* ⇒ Maintenance Agreement (What should the store do to boost Maintenance Agreement sales?)
Home Electronics ⇒ * (What other products should the store stock up on?)
Attached mailing in direct marketing
Detecting “ping-pong”ing of patients, faulty “collisions”
Rule Measures: Support & Confidence
Find all the rules X & Y ⇒ Z with minimum confidence and support
support, s: probability that a transaction contains {X, Y, Z}
confidence, c: conditional probability that a transaction having {X, Y} also contains Z
Transaction ID   Items Bought
2000             A, B, C
1000             A, C
4000             A, D
5000             B, E, F
With minimum support 50% and minimum confidence 50%, we have
A ⇒ C (50%, 66.6%)
C ⇒ A (50%, 100%)
[Figure: Venn diagram of customers who buy diapers, customers who buy beer, and customers who buy both]
Visualization of Association Rule Using Plane Graph
Visualization of Association Rule Using Rule Graph
Rule Mining: A Road Map
Boolean vs. quantitative associations (based on the types of values handled)
buys(x, “SQLServer”) ^ buys(x, “DMBook”) ⇒ buys(x, “UMiner”) [0.2%, 60%]
age(x, “30..39”) ^ income(x, “42..48K”) ⇒ buys(x, “PC”) [1%, 75%]
Single dimension vs. multiple dimensional associations (see the examples above)
Single level vs. multiple-level analysis
What brands of beers are associated with what brands of diapers?
Various extensions
Correlation, causality analysis
Association does not necessarily imply correlation or causality
Maxpatterns and closed itemsets
Constraints enforced
E.g., small sales (sum < 100) trigger big buys (sum > 1,000)?
Mining Association Rules: An Example
Min. support 50%
Min. confidence 50%
Transaction ID   Items Bought
2000             A, B, C
1000             A, C
4000             A, D
5000             B, E, F
Frequent Itemset   Support
{A}                75%
{B}                50%
{C}                50%
{A,C}              50%
For rule A ⇒ C:
support = support({A, C}) = 50%
confidence = support({A, C})/support({A}) = 66.6%
The Apriori principle:
Any subset of a frequent itemset must be frequent
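The support and confidence figures in this example can be checked with a few lines of Python. This is only a sketch: the helper names `support` and `confidence` are illustrative and not part of the slides.

```python
# The four transactions from the slide, keyed by transaction ID.
transactions = {
    2000: {"A", "B", "C"},
    1000: {"A", "C"},
    4000: {"A", "D"},
    5000: {"B", "E", "F"},
}

def support(itemset, db):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(itemset <= items for items in db.values()) / len(db)

def confidence(lhs, rhs, db):
    """support(lhs and rhs together) divided by support(lhs)."""
    return support(lhs | rhs, db) / support(lhs, db)

print(support({"A", "C"}, transactions))       # support of {A, C}
print(confidence({"A"}, {"C"}, transactions))  # confidence of A => C
print(confidence({"C"}, {"A"}, transactions))  # confidence of C => A
```

Both rules clear the 50% thresholds, which matches the (50%, 66.6%) and (50%, 100%) values quoted for A ⇒ C and C ⇒ A.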
Mining Frequent Itemsets: the Key Step
Find the frequent itemsets: the sets of items that have minimum support
A subset of a frequent itemset must also be a frequent itemset
i.e., if {A, B} is a frequent itemset, both {A} and {B} must be frequent itemsets
Iteratively find frequent itemsets with cardinality from 1 to k (k-itemset)
Use the frequent itemsets to generate association rules.
The Apriori Algorithm
Join Step: Ck is generated by joining Lk-1 with itself
Prune Step: Any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset
Pseudo-code:
Ck: candidate itemsets of size k
Lk: frequent itemsets of size k
L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that are contained in t
    Lk+1 = candidates in Ck+1 with min_support
end
return ∪k Lk;
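The pseudo-code can be turned into a short Python sketch. Two simplifications are assumptions of this sketch rather than part of the slides: the join step unites any two frequent k-itemsets whose union has k+1 items (instead of the ordered prefix join), and support is given as an absolute count `min_count`. The database `D` is a small illustrative one.

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Return {itemset: count} for every itemset with count >= min_count."""
    transactions = [frozenset(t) for t in transactions]
    # L1: count single items and keep the frequent ones.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    level = {s: c for s, c in counts.items() if c >= min_count}
    frequent = dict(level)
    k = 1
    while level:
        # Join step: unite pairs of frequent k-itemsets into (k+1)-itemsets.
        candidates = {a | b for a in level for b in level if len(a | b) == k + 1}
        # Prune step: drop candidates that have an infrequent k-subset.
        candidates = {c for c in candidates
                      if all(frozenset(s) in level for s in combinations(c, k))}
        # Count step: one scan of the database per level.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {s: c for s, c in counts.items() if c >= min_count}
        frequent.update(level)
        k += 1
    return frequent

D = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
frequent = apriori(D, min_count=2)
for itemset, count in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

The loop stops as soon as a level comes back empty, which mirrors the `Lk != ∅` condition in the pseudo-code.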
The Apriori Algorithm: Example
Database D
TID   Items
100   1 3 4
200   2 3 5
300   1 2 3 5
400   2 5
Scan D → C1
itemset   sup.
{1}       2
{2}       3
{3}       3
{4}       1
{5}       3
L1
itemset   sup.
{1}       2
{2}       3
{3}       3
{5}       3
C2 (generated from L1)
itemset: {1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}
Scan D → C2 with counts
itemset   sup
{1 2}     1
{1 3}     2
{1 5}     1
{2 3}     2
{2 5}     3
{3 5}     2
L2
itemset   sup
{1 3}     2
{2 3}     2
{2 5}     3
{3 5}     2
C3 (generated from L2)
itemset: {2 3 5}
Scan D → L3
itemset   sup
{2 3 5}   2
Candidates Generation
Suppose the items in Lk-1 are listed in an order
Step 1: self-joining Lk-1
insert into Ck
select p.item1, p.item2, …, p.itemk-1, q.itemk-1
from Lk-1 p, Lk-1 q
where p.item1=q.item1, …, p.itemk-2=q.itemk-2, p.itemk-1 < q.itemk-1
Step 2: pruning
forall itemsets c in Ck do
    forall (k-1)-subsets s of c do
        if (s is not in Lk-1) then delete c from Ck
How to Count Supports of Candidates?
Why is counting supports of candidates a problem?
The total number of candidates can be huge
One transaction may contain many candidates
Method:
Candidate itemsets are stored in a hash-tree
Leaf node of hash-tree contains a list of itemsets and counts
Interior node contains a hash table
Subset function: finds all the candidates contained in a transaction
Example of Generating Candidates
L3={abc, abd, acd, ace, bcd}
Self-joining: L3*L3
abcd from abc and abd
acde from acd and ace
Pruning:
acde is removed because ade is not in L3
C4={abcd}
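The self-join and prune steps can be reproduced on this example with a small Python sketch. The function name `apriori_gen` and the representation of itemsets as sorted tuples of single-letter items are assumptions made for illustration.

```python
from itertools import combinations

def apriori_gen(prev_level):
    """Build C_k from L_{k-1}; itemsets are sorted tuples of items."""
    prev = sorted(prev_level)
    prev_set = set(prev)
    candidates = []
    for i, p in enumerate(prev):
        for q in prev[i + 1:]:
            # Join: p and q agree on their first k-2 items and p's last item is smaller.
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                c = p + (q[-1],)
                # Prune: every (k-1)-subset of c must itself be in L_{k-1}.
                if all(s in prev_set for s in combinations(c, len(c) - 1)):
                    candidates.append(c)
    return candidates

L3 = [tuple(s) for s in ("abc", "abd", "acd", "ace", "bcd")]
C4 = apriori_gen(L3)
print(["".join(c) for c in C4])
```

The join produces abcd and acde; acde is discarded in the prune step because its subset ade is missing from L3, leaving only abcd.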
Mining Distance-based Association Rules
Binning methods do not capture the semantics of interval data
Distance-based partitioning, more meaningful discretization considering:
density/number of points in an interval
“closeness” of points in an interval
Price($)   Equi-width (width $10)   Equi-depth (depth 2)   Distance-based
7          [0,10]                   [7,20]                 [7,7]
20         [11,20]                  [22,50]                [20,22]
22         [21,30]                  [51,53]                [50,53]
50         [31,40]
51         [41,50]
53         [51,60]
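The difference between the partitioning schemes can be seen by computing two of the columns for the six prices. In this sketch `equi_depth` follows the table directly, while `gap_based` is a simple stand-in for distance-based partitioning: it starts a new interval whenever two neighbouring values are more than `max_gap` apart. That rule and the value `max_gap=5` are assumptions for illustration, not the method used in the slides.

```python
def equi_depth(values, depth):
    """Split the sorted values into consecutive groups of `depth` values each."""
    vals = sorted(values)
    groups = [vals[i:i + depth] for i in range(0, len(vals), depth)]
    return [(g[0], g[-1]) for g in groups]

def gap_based(values, max_gap):
    """Start a new interval whenever the gap to the previous value exceeds max_gap."""
    vals = sorted(values)
    groups = [[vals[0]]]
    for v in vals[1:]:
        if v - groups[-1][-1] > max_gap:
            groups.append([v])
        else:
            groups[-1].append(v)
    return [(g[0], g[-1]) for g in groups]

prices = [7, 20, 22, 50, 51, 53]
print(equi_depth(prices, depth=2))   # [(7, 20), (22, 50), (51, 53)]
print(gap_based(prices, max_gap=5))  # [(7, 7), (20, 22), (50, 53)]
```

Equi-depth puts 22 and 50 in the same interval even though they are far apart, whereas the gap-based grouping keeps close prices together.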
Clusters and Distance Measurements
S[X] is a set of N tuples t1, t2, …, tN, projected on the attribute set X
The diameter of S[X]:
$$d(S[X]) = \frac{\sum_{i=1}^{N} \sum_{j=1}^{N} dist_X(t_i[X], t_j[X])}{N(N-1)}$$
dist_X: distance metric, e.g., Euclidean distance or Manhattan distance
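As a rough illustration of the diameter, the following Python sketch computes the average pairwise distance of a set of projected tuples. The function name `diameter` is an assumption; Euclidean distance (`math.dist`, Python 3.8+) is used by default, and pairs with i = j are skipped because they contribute a distance of 0.

```python
import math
from itertools import permutations

def diameter(points, dist=math.dist):
    """Sum of dist over all ordered pairs i != j, divided by N * (N - 1)."""
    n = len(points)
    if n < 2:
        return 0.0
    total = sum(dist(p, q) for p, q in permutations(points, 2))
    return total / (n * (n - 1))

def manhattan(p, q):
    """Manhattan distance, as an alternative metric."""
    return sum(abs(a - b) for a, b in zip(p, q))

print(diameter([(20,), (22,)]))                 # two prices, 2 apart
print(diameter([(50,), (51,), (53,)]))          # three close prices
print(diameter([(0, 0), (3, 4)]))               # Euclidean, 2-D
print(diameter([(0, 0), (3, 4)], manhattan))    # Manhattan, 2-D
```

A small diameter means the tuples are close to each other on the attribute set X, so the cluster is dense.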
Clusters and Distance Measurements (Cont.)
The diameter, d, assesses the density of a cluster C_X, where
$$d(C_X) \le d_0^X, \qquad |C_X| \ge s_0$$
Finding clusters and distance-based rules
the density threshold, d0, replaces the notion of support
modified version of the BIRCH clustering algorithm
Interestingness Measurements
Objective measures
Two popular measurements: support and confidence
Subjective measures (Silberschatz & Tuzhilin, KDD95)
A rule (pattern) is interesting if
it is unexpected (surprising to the user); and/or
actionable (the user can do something with it)
Criticism to Support and Confidence
Example 1: (Aggarwal & Yu, PODS98)
Among 5000 students
3000 play basketball
3750 eat cereal
2000 both play basketball and eat cereal
play basketball ⇒ eat cereal [40%, 66.7%] is misleading because the overall percentage of students eating cereal is 75%, which is higher than 66.7%.
play basketball ⇒ not eat cereal [20%, 33.3%] is far more accurate, although with lower support and confidence: 33.3% exceeds the 25% of all students who do not eat cereal.
             basketball   not basketball   sum(row)
cereal       2000         1750             3750
not cereal   1000         250              1250
sum(col.)    3000         2000             5000
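The numbers in this example can be recomputed from the contingency table. The sketch below also divides each confidence by the overall rate of the consequent (the ratio P(B|A)/P(B), which these slides call lift); the function name `rule_measures` is illustrative only.

```python
def rule_measures(n_total, n_lhs, n_rhs, n_both):
    """Return (support, confidence, lift) of a rule lhs => rhs from raw counts."""
    support = n_both / n_total
    confidence = n_both / n_lhs
    lift = confidence / (n_rhs / n_total)
    return support, confidence, lift

# play basketball => eat cereal
cereal = rule_measures(n_total=5000, n_lhs=3000, n_rhs=3750, n_both=2000)
# play basketball => not eat cereal
no_cereal = rule_measures(n_total=5000, n_lhs=3000, n_rhs=1250, n_both=1000)

for name, (s, c, l) in [("cereal", cereal), ("not cereal", no_cereal)]:
    print(f"{name}: support={s:.1%} confidence={c:.1%} lift={l:.3f}")
```

The first rule has a ratio below 1 and the second a ratio above 1, which is the sense in which the high-confidence rule is misleading.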
Criticism to Support and Confidence
Example 2:
X and Y: positively correlated
X and Z: negatively related
support and confidence of X ⇒ Z dominates
X   1 1 1 1 0 0 0 0
Y   1 1 0 0 0 0 0 0
Z   0 1 1 1 1 1 1 1
Rule     Support   Confidence
X ⇒ Y    25%       50%
X ⇒ Z    37.50%    75%
We need a measure of dependent or correlated events
$$corr_{A,B} = \frac{P(A \wedge B)}{P(A)\,P(B)}$$
P(B|A)/P(B) is also called the lift of rule A ⇒ B
Other Interestingness Measures: Interest
Interest (correlation, lift)
$$\frac{P(A \wedge B)}{P(A)\,P(B)}$$
taking both P(A) and P(B) in consideration
P(A ∧ B) = P(B) * P(A), if A and B are independent events
A and B are negatively correlated if the value is less than 1, independent if it equals 1, and positively correlated if it is greater than 1
X   1 1 1 1 0 0 0 0
Y   1 1 0 0 0 0 0 0
Z   0 1 1 1 1 1 1 1
Itemset   Support   Interest
X,Y       25%       2
X,Z       37.50%    0.86
Y,Z       12.50%    0.57
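The interest values in the table can be recomputed from the three 0/1 rows. This is a small sketch; the function name `interest` is an assumption, and each list position is treated as one transaction.

```python
X = [1, 1, 1, 1, 0, 0, 0, 0]
Y = [1, 1, 0, 0, 0, 0, 0, 0]
Z = [0, 1, 1, 1, 1, 1, 1, 1]

def interest(a, b):
    """P(A and B) / (P(A) * P(B)) over equally long 0/1 occurrence lists."""
    n = len(a)
    p_a = sum(a) / n
    p_b = sum(b) / n
    p_ab = sum(x * y for x, y in zip(a, b)) / n
    return p_ab / (p_a * p_b)

for name, (a, b) in {"X,Y": (X, Y), "X,Z": (X, Z), "Y,Z": (Y, Z)}.items():
    print(name, round(interest(a, b), 2))
```

Only the pair X,Y has a value above 1, even though X,Z has the higher support.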
Association Rules: Summary
Association rule mining
probably the most significant contribution from the database community in KDD
A large number of papers have been published
Many interesting issues have been explored
An interesting research direction
Association analysis in other types of data: spatial data, multimedia data, time series data, etc.
Training Dataset
age      income   student   credit_rating   buys_computer
<=30     high     no        fair            no
<=30     high     no        excellent       no
31…40    high     no        fair            yes
>40      medium   no        fair            yes
>40      low      yes       fair            yes
>40      low      yes       excellent       no
31…40    low      yes       excellent       yes
<=30     medium   no        fair            no
<=30     low      yes       fair            yes
>40      medium   yes       fair            yes
<=30     medium   yes       excellent       yes
31…40    medium   no        excellent       yes
31…40    high     yes       fair            yes
>40      medium   no        excellent       no
This follows an example from Quinlan’s ID3
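ID3 chooses the attribute to split on by information gain, a criterion these slides do not spell out. The sketch below computes it for the four attributes of the training dataset; the names `entropy` and `info_gain` are illustrative, and the middle age group is written as "31..40" throughout.

```python
from collections import Counter
from math import log2

ATTRS = ["age", "income", "student", "credit_rating"]
ROWS = [  # (age, income, student, credit_rating, buys_computer)
    ("<=30", "high", "no", "fair", "no"),
    ("<=30", "high", "no", "excellent", "no"),
    ("31..40", "high", "no", "fair", "yes"),
    (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),
    (">40", "low", "yes", "excellent", "no"),
    ("31..40", "low", "yes", "excellent", "yes"),
    ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),
    (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),
    ("31..40", "medium", "no", "excellent", "yes"),
    ("31..40", "high", "yes", "fair", "yes"),
    (">40", "medium", "no", "excellent", "no"),
]

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def info_gain(rows, col):
    """Entropy of the class label minus its expected entropy after splitting on col."""
    labels = [r[-1] for r in rows]
    remainder = 0.0
    for value in {r[col] for r in rows}:
        subset = [r[-1] for r in rows if r[col] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy(labels) - remainder

gains = {attr: info_gain(ROWS, i) for i, attr in enumerate(ATTRS)}
for attr, gain in sorted(gains.items(), key=lambda kv: -kv[1]):
    print(f"{attr}: {gain:.3f}")
```

On this dataset age has the largest gain, so it is the natural choice for the root test of the tree.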
Output: A Decision Tree for “buys_computer”
age?
    <=30: student?
        no: no
        yes: yes
    31…40: yes
    >40: credit rating?
        excellent: no
        fair: yes
Presentation of Classification Results
AGNES (Agglomerative Nesting)
Introduced in Kaufmann and Rousseeuw (1990)
Implemented in statistical analysis packages, e.g., S-Plus
Use the Single-Link method and the dissimilarity matrix.
Merge nodes that have the least dissimilarity
Go on in a non-descending fashion
Eventually all nodes belong to the same cluster
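The merging procedure can be sketched in plain Python as follows. This is a naive illustration, not the AGNES implementation from any statistics package: the function name `single_link`, the use of Euclidean distance, and the sample points are all assumptions.

```python
import math

def single_link(points, num_clusters=1):
    """Repeatedly merge the two least dissimilar clusters until num_clusters remain."""
    clusters = [[p] for p in points]
    merges = []  # (dissimilarity, merged cluster), in merge order
    while len(clusters) > num_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single link: cluster dissimilarity is the distance of the closest pair.
                d = min(math.dist(p, q) for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)] + [merged]
        merges.append((d, merged))
    return clusters, merges

points = [(1, 1), (1, 2), (5, 5), (6, 5), (10, 10)]
clusters, merges = single_link(points, num_clusters=2)
print(clusters)
print([round(d, 2) for d, _ in merges])
```

The recorded merge dissimilarities never decrease, and calling the function with `num_clusters=1` continues until every point ends up in a single cluster.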
[Figure: three scatter-plot panels, each with both axes running from 0 to 10]
DBSCAN: Density Based Spatial Clustering of Applications with Noise
Relies on a density-based notion of cluster: A cluster is defined as a maximal set of density-connected points
Discovers clusters of arbitrary shape in spatial databases with noise
[Figure: core, border, and outlier points, with Eps = 1 cm and MinPts = 5]
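The roles of Eps and MinPts can be illustrated by labelling each point as core, border, or outlier. The sketch below does only this labelling step, not the full cluster expansion of DBSCAN. It assumes Euclidean distance and counts a point as part of its own Eps-neighborhood; the function name `classify_points` and the sample points are made up for illustration.

```python
import math

def classify_points(points, eps, min_pts):
    """Label each point 'core', 'border', or 'outlier' for the given Eps and MinPts."""
    # Eps-neighborhood of every point (the point itself is included).
    neighbors = [[j for j, q in enumerate(points) if math.dist(p, q) <= eps]
                 for p in points]
    core = {i for i, nb in enumerate(neighbors) if len(nb) >= min_pts}
    labels = []
    for i, nb in enumerate(neighbors):
        if i in core:
            labels.append("core")
        elif any(j in core for j in nb):
            labels.append("border")   # not dense itself, but within Eps of a core point
        else:
            labels.append("outlier")
    return labels

points = [(0, 0), (0.5, 0), (0, 0.5), (0.5, 0.5), (0.25, 0.25),  # dense group
          (1.4, 0.5),                                            # near the group's edge
          (5, 5)]                                                # far from everything
labels = classify_points(points, eps=1.0, min_pts=5)
print(labels)
```

The five tightly packed points each have at least MinPts neighbors, the sixth point is within Eps of only one of them, and the last point has no neighbor at all.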
[Figure: reachability plot, with reachability-distance (including an “undefined” level and thresholds ε and ε′) plotted against the cluster-order of the objects]
Constraint-Based Clustering Analysis
Clustering analysis: fewer parameters but more user-desired constraints, e.g., an ATM allocation problem
Mining Time-Series and Sequence Data
Time-series plot
Mining Time-Series and Sequence Data: Trend analysis
A time series can be illustrated as a time-series graph which describes a point moving with the passage of time
Categories of Time-Series Movements
Long-term or trend movements (trend curve)
Cyclic movements or cycle variations, e.g., business cycles
Seasonal movements or seasonal variations
i.e., almost identical patterns that a time series appears to follow during corresponding months of successive years
Irregular or random movements
Social Impacts: Threat to Privacy and Data Security?
Is data mining a threat to privacy and data security?
“Big Brother”, “Big Banker”, and “Big Business” are carefully watching you
Profiling information is collected every time
You use your credit card, debit card, supermarket loyalty card, or frequent flyer card, or apply for any of the above
You surf the Web, reply to an Internet newsgroup, subscribe to a magazine, rent a video, join a club, or fill out a contest entry form
You pay for prescription drugs, or present your medical care number when visiting the doctor
Collection of personal data may be beneficial for companies and consumers, but there is also potential for misuse
Protect Privacy and Data Security
Fair information practices
International guidelines for data privacy protection
Cover aspects relating to data collection, purpose, use, quality, openness, individual participation, and accountability
Purpose specification and use limitation
Openness: Individuals have the right to know what information is collected about them, who has access to the data, and how the data are being used
Develop and use data security-enhancing techniques
Blind signatures
Biometric encryption
Anonymous databases
OLAP (Summarization) Display Using MS/Excel 2000
Market-Basket-Analysis (Association)—Ball graph
Display of Association Rules in Rule Plane Form
Display of Decision Tree (Classification Results)
Display of Clustering (Segmentation) Results
3D Cube Browser
Trends in Data Mining (1)
Scalable data mining methods
Constraint-based mining: use of constraints to guide data mining systems in their search for interesting patterns
Application exploration
Development of application-specific data mining systems
Invisible data mining (mining as built-in function)
Integration of data mining with database systems, data warehouse systems, and Web database systems
Quality assessment
Trends in Data Mining (2)
Standardization of data mining language
A standard will facilitate systematic development, improve interoperability, and promote the education and use of data mining systems in industry and society
Visual data mining
Uncertainty handling
New methods for mining complex types of data
More research is required towards the integration of data mining methods with existing data analysis techniques for the complex types of data
Web mining
Privacy protection and information security in data mining