Data Mining
Toon Calders
Why Data Mining?

Explosive growth of data: from terabytes to petabytes
– Data collection and data availability
– Major sources of abundant data
Why Data Mining?

We are drowning in data, but starving for knowledge!

"Necessity is the mother of invention": data mining, the automated analysis of massive data sets.

[Chart: "The Data Gap". Total new disk capacity (TB) since 1995 grows far faster than the number of analysts, 1995–1999.]
What Is Data Mining?

Data mining (knowledge discovery from data): the extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) patterns or knowledge from huge amounts of data.

Alternative names: knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
Current Applications

Data analysis and decision support
– Market analysis and management
– Risk analysis and management
– Fraud detection and detection of unusual patterns (outliers)

Other applications
– Text mining (newsgroups, email, documents) and web mining
– Stream data mining
– Bioinformatics and bio-data analysis
Ex. 3: Process Mining

Process mining can be used for:
– Process discovery (What is the process?)
– Delta analysis (Are we doing what was specified?)
– Performance analysis (How can we improve?)

[Process model: Register order, Prepare shipment, Ship goods, Receive payment, (Re)send bill, Contact customer, Archive order.]
Ex. 3: Process Mining

Event log (case : task), in order of arrival:
case 1: A, case 2: A, case 3: A, case 3: B, case 1: B, case 1: C, case 2: C, case 4: A, case 2: B, case 2: D, case 5: E, case 4: C, case 1: D, case 3: C, case 3: D, case 4: B, case 5: F, case 4: D

Per-case traces: cases 1 and 3 follow A, B, C, D; cases 2 and 4 follow A, C, B, D; case 5 follows E, F.

[Discovered process model over tasks A, B, C, D, E, F.]
Data Mining Tasks

Previous lectures:
– Classification [Predictive]
– Clustering [Descriptive]

This lecture:
– Association Rule Discovery [Descriptive]
– Sequential Pattern Discovery [Descriptive]

Other techniques:
– Regression [Predictive]
– Deviation Detection [Predictive]
Outline of Today's Lecture

Association Rule Mining
– Frequent itemsets and association rules
– Algorithms: Apriori and Eclat

Sequential Pattern Mining
– Mining frequent episodes
– Algorithms: WinEpi and MinEpi

Other types of patterns
– strings, graphs, …
– process mining
Association Rule Mining

Definition
– Frequent itemsets
– Association rules

Frequent itemset mining
– breadth-first: Apriori
– depth-first: Eclat
Association Rule Mining

Given a set of transactions, find rules that predict the occurrence of an item based on the occurrences of other items in the transaction.

Market-basket transactions:

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Example association rules:
{Diaper} → {Beer}
{Milk, Bread} → {Eggs, Coke}
{Beer, Bread} → {Milk}

Implication means co-occurrence, not causality!
Definition: Frequent Itemset

Itemset
– A collection of one or more items, e.g. {Milk, Bread, Diaper}
– k-itemset: an itemset that contains k items

Support count (σ)
– Frequency of occurrence of an itemset
– E.g. σ({Milk, Bread, Diaper}) = 2

Support (s)
– Fraction of transactions that contain an itemset
– E.g. s({Milk, Bread, Diaper}) = 2/5

Frequent itemset
– An itemset whose support is greater than or equal to a minsup threshold

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke
Definition: Association Rule

Association rule
– An implication expression of the form X → Y, where X and Y are itemsets
– Example: {Milk, Diaper} → {Beer}

Rule evaluation metrics
– Support (s): fraction of transactions that contain both X and Y
– Confidence (c): measures how often items in Y appear in transactions that contain X

Example: {Milk, Diaper} → {Beer}

s = σ(Milk, Diaper, Beer) / |T| = 2/5 = 0.4
c = σ(Milk, Diaper, Beer) / σ(Milk, Diaper) = 2/3 ≈ 0.67

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke
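As a concrete check of these two metrics, here is a small Python sketch (illustrative code, not from the slides) that computes s and c for {Milk, Diaper} → {Beer} on the table above:

```python
# Support and confidence of {Milk, Diaper} -> {Beer} on the 5-transaction
# example database from the slide.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_count(itemset, transactions):
    """sigma(itemset): number of transactions that contain the itemset."""
    return sum(1 for t in transactions if itemset <= t)

X, Y = {"Milk", "Diaper"}, {"Beer"}
s = support_count(X | Y, transactions) / len(transactions)               # 2/5 = 0.4
c = support_count(X | Y, transactions) / support_count(X, transactions)  # 2/3
```

Running this gives s = 0.4 and c ≈ 0.67, matching the slide.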
Association Rule Mining Task

Given a set of transactions T, the goal of association rule mining is to find all rules having
– support ≥ minsup threshold
– confidence ≥ minconf threshold

Brute-force approach:
– List all possible association rules
– Compute the support and confidence for each rule
– Prune rules that fail the minsup and minconf thresholds
⇒ Computationally prohibitive!
Mining Association Rules

Example rules from the itemset {Milk, Diaper, Beer}:
{Milk, Diaper} → {Beer}  (s=0.4, c=0.67)
{Milk, Beer} → {Diaper}  (s=0.4, c=1.0)
{Diaper, Beer} → {Milk}  (s=0.4, c=0.67)
{Beer} → {Milk, Diaper}  (s=0.4, c=0.67)
{Diaper} → {Milk, Beer}  (s=0.4, c=0.5)
{Milk} → {Diaper, Beer}  (s=0.4, c=0.5)

Observations:
• All the above rules are binary partitions of the same itemset: {Milk, Diaper, Beer}
• Rules originating from the same itemset have identical support but can have different confidence
• Thus, we may decouple the support and confidence requirements
Mining Association Rules

Two-step approach:
1. Frequent itemset generation: generate all itemsets whose support ≥ minsup
2. Rule generation: generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset

Frequent itemset generation is still computationally expensive.
Frequent Itemset Generation

[Itemset lattice over items A–E: the empty set (null) at the top; 1-itemsets A, …, E; 2-itemsets AB, …, DE; and so on down to ABCDE.]

Given d items, there are 2^d possible candidate itemsets.
Frequent Itemset Generation

Brute-force approach:
– Each itemset in the lattice is a candidate frequent itemset
– Count the support of each candidate by scanning the database
– Match each transaction against every candidate
– Complexity ~ O(N·M·w), where N is the number of transactions, M the number of candidates, and w the maximum transaction width
⇒ Expensive, since M = 2^d!
Frequent Itemset Generation Strategies

Reduce the number of candidates (M)
– Complete search: M = 2^d
– Use pruning techniques to reduce M

Reduce the number of transactions (N)
– Reduce the size of N as the size of the itemsets increases
– Used by DHP and vertical-based mining algorithms

Reduce the number of comparisons (N·M)
– Use efficient data structures to store the candidates or transactions
– No need to match every candidate against every transaction
Reducing the Number of Candidates

Apriori principle:
– If an itemset is frequent, then all of its subsets must also be frequent.

The Apriori principle holds due to the following property of the support measure:

∀X, Y: (X ⊆ Y) ⇒ s(X) ≥ s(Y)

– The support of an itemset never exceeds the support of its subsets
– This is known as the anti-monotone property of support
Illustrating the Apriori Principle

[Itemset lattice over A–E: once an itemset such as AB is found to be infrequent, all of its supersets (ABC, ABD, …, up to ABCDE) are pruned from the search.]
Illustrating the Apriori Principle (minimum support = 3)

Items (1-itemsets):
Bread: 4, Coke: 2, Milk: 4, Beer: 3, Diaper: 4, Eggs: 1

Pairs (2-itemsets); no need to generate candidates involving Coke or Eggs:
{Bread, Milk}: 3, {Bread, Beer}: 2, {Bread, Diaper}: 3, {Milk, Beer}: 2, {Milk, Diaper}: 3, {Beer, Diaper}: 3

Triplets (3-itemsets):
{Bread, Milk, Diaper}: 3

If every subset were considered: C(6,1) + C(6,2) + C(6,3) = 6 + 15 + 20 = 41 candidates.
With support-based pruning: 6 + 6 + 1 = 13.
Apriori: Worked Example (minsup = 2)

Database:
1: B, C
2: B, C
3: A, C, D
4: A, B, C, D
5: B, D

Level 1: candidates {A}, {B}, {C}, {D}. One scan over the DB, processing the transactions one by one, gives the supports A: 2, B: 4, C: 4, D: 3. All four items are frequent.

Level 2: candidates AB, AC, AD, BC, BD, CD. A second scan gives the supports AB: 1, AC: 2, AD: 2, BC: 3, BD: 2, CD: 2. AB is infrequent; the frequent pairs are AC, AD, BC, BD, CD.

Level 3: the only candidates all of whose 2-subsets are frequent are ACD and BCD. A third scan gives ACD: 2 and BCD: 1, so only ACD is frequent.

No candidates of size 4 remain, and the algorithm terminates.
Apriori Algorithm

k := 1
C1 := { {A} | A is an item }
Repeat until Ck = {}:
    Count the support of each candidate in Ck, in one scan over the DB
    Fk := { I ∈ Ck : I is frequent }
    Generate new candidates:
    Ck+1 := { I : |I| = k+1 and all J ⊂ I with |J| = k are in Fk }
    k := k+1
Return ∪i=1…k−1 Fi
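The pseudocode above maps almost line-for-line onto Python. The following is an illustrative sketch (not the lecturer's code), run here on the 5-transaction example database with minsup = 2:

```python
from itertools import combinations

def apriori(db, minsup):
    """db: list of sets of items. Returns {frequent itemset: support count}."""
    items = sorted(set().union(*db))
    freq = {}
    k, ck = 1, [frozenset([a]) for a in items]        # C1
    while ck:
        # count the support of each candidate in Ck in one scan over the DB
        counts = {c: sum(1 for t in db if c <= t) for c in ck}
        fk = {c: n for c, n in counts.items() if n >= minsup}
        freq.update(fk)
        # Ck+1: the (k+1)-sets all of whose k-subsets are frequent
        cand = {a | b for a in fk for b in fk if len(a | b) == k + 1}
        ck = [c for c in cand
              if all(frozenset(s) in fk for s in combinations(c, k))]
        k += 1
    return freq

db = [{"B", "C"}, {"B", "C"}, {"A", "C", "D"}, {"A", "B", "C", "D"}, {"B", "D"}]
result = apriori(db, minsup=2)   # 10 frequent itemsets, up to {A, C, D}: 2
```

On this database it reproduces the worked example: AB is counted once and dropped, BCD is pruned away after BCD's count of 1, and ACD survives with support 2.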
Depth-First Strategy

Recursive procedure
– FSET(DB) = frequent sets in DB

Based on divide-and-conquer:
– Count the frequency of all items; let D be a frequent item
– FSET(DB) = frequent sets with item D + frequent sets without item D
Depth-First Strategy

Database:
1: B, C
2: B, C
3: A, C, D
4: A, B, C, D
5: B, D

Frequent items: A, B, C, D

Frequent sets with D:
– remove transactions without D, and D itself, from the DB
– count frequent sets in the result: A, B, C, AC
– append D: AD, BD, CD, ACD

Frequent sets without D:
– remove D from all transactions in the DB
– find frequent sets: AC, BC
Depth-First Algorithm: Worked Example (minsup = 2)

Database DB:
1: B, C
2: B, C
3: A, C, D
4: A, B, C, D
5: B, D

Item counts in DB: A: 2, B: 4, C: 4, D: 3. All four items are frequent.

DB[D] (transactions containing D, with D removed): 3: A, C; 4: A, B, C; 5: B. Counts: A: 2, B: 2, C: 2, all frequent.

DB[CD] (within DB[D]: transactions containing C, with C removed): 3: A; 4: A, B. Count A: 2, so AC: 2 is frequent in DB[D].

DB[BD]: 4: A. Count A: 1, infrequent.

Appending D to the frequent sets found in DB[D] yields: AD: 2, BD: 2, CD: 2, ACD: 2.

Next, remove D from DB. DB[C]: 1: B; 2: B; 3: A; 4: A, B. Counts A: 2, B: 3, both frequent, giving AC: 2 and BC: 3.

DB[BC]: 4: A. Count A: 1, infrequent.

Finally, after removing C as well, DB[B]: 4: A. Count A: 1, infrequent.

Final set of frequent itemsets: A: 2, B: 4, C: 4, D: 3, AC: 2, AD: 2, BC: 3, BD: 2, CD: 2, ACD: 2.
Depth-First Strategy

FSET(DB):

1. Count the frequency of the items in DB

2. F := { A | A is a frequent item in DB }

3. // Remove infrequent items from DB
   DB := { T ∩ F : T ∈ DB }

4. For all frequent items D except the last one do:
   // Find frequent, strict supersets of {D} in DB:
   4a. Let DB[D] := { T \ {D} | T ∈ DB, D ∈ T }
   4b. F := F ∪ { I ∪ {D} : I ∈ FSET(DB[D]) }
   4c. // Remove D from DB
       DB := { T \ {D} : T ∈ DB }

5. Return F
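FSET translates directly into Python. Below is an illustrative sketch (not the slide's code), run on the same example database:

```python
def fset(db, minsup):
    """db: list of frozensets of items. Returns the set of frequent itemsets."""
    counts = {}
    for t in db:                                   # 1. count item frequencies
        for a in t:
            counts[a] = counts.get(a, 0) + 1
    freq_items = [a for a, n in counts.items() if n >= minsup]
    result = {frozenset([a]) for a in freq_items}  # 2. frequent singletons
    keep = frozenset(freq_items)
    db = [t & keep for t in db]                    # 3. drop infrequent items
    for d in freq_items[:-1]:                      # 4. all frequent items but the last
        db_d = [t - {d} for t in db if d in t]     # 4a. conditional DB[D]
        result |= {i | {d} for i in fset(db_d, minsup)}   # 4b. add D back on
        db = [t - {d} for t in db]                 # 4c. remove D from DB
    return result                                  # 5.

db = [frozenset(t) for t in
      [{"B", "C"}, {"B", "C"}, {"A", "C", "D"}, {"A", "B", "C", "D"}, {"B", "D"}]]
frequent = fset(db, minsup=2)   # 10 itemsets, including {A, C, D}
```

The recursion mirrors the worked example: the call on DB[D] finds A, B, C and AC, which come back prefixed with D as AD, BD, CD, ACD.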
Depth-First Strategy

All depth-first algorithms use this strategy; the difference lies in the data structure used for the DB:
– prefix tree: FPGrowth
– vertical database: Eclat
ECLAT

For each item, store a list of transaction ids (tids): the vertical data layout.

Horizontal data layout:
TID  Items
1    A, B, E
2    B, C, D
3    C, E
4    A, C, D
5    A, B, C, D
6    A, E
7    A, B
8    A, B, C
9    A, C, D
10   B

Vertical data layout (tid-lists):
A: 1, 4, 5, 6, 7, 8, 9
B: 1, 2, 5, 7, 8, 10
C: 2, 3, 4, 5, 8, 9
D: 2, 4, 5, 9
E: 1, 3, 6
ECLAT

– Support of item A = length of its tid-list
– Remove item A from the DB: remove the tid-list of A
– Create the conditional database DB[E]: intersect all other tid-lists with the tid-list of E, and only keep frequent items

Example (tid-list of E is {1, 3, 6}):
A ∩ E = {1, 6}, B ∩ E = {1}, C ∩ E = {3}, D ∩ E = {}
With a minsup of 2, only A would remain in DB[E].
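Eclat itself is short once the data is vertical. The sketch below (illustrative code, with an assumed minsup of 3) recurses on tid-list intersections, so support counting never rescans the database:

```python
def eclat(prefix, items, minsup, result):
    """items: list of (item, tidset) pairs that are already frequent."""
    items = list(items)
    while items:
        item, tids = items.pop()
        new_prefix = prefix | {item}
        result[frozenset(new_prefix)] = len(tids)   # support = tid-list length
        # conditional database: intersect the remaining tid-lists with this one
        suffix = [(o, tids & ot) for o, ot in items if len(tids & ot) >= minsup]
        eclat(new_prefix, suffix, minsup, result)

vertical = {                       # the tid-lists from the slide
    "A": {1, 4, 5, 6, 7, 8, 9},
    "B": {1, 2, 5, 7, 8, 10},
    "C": {2, 3, 4, 5, 8, 9},
    "D": {2, 4, 5, 9},
    "E": {1, 3, 6},
}
result = {}
eclat(frozenset(), sorted(vertical.items()), minsup=3, result=result)
```

For instance, the tid-list of {C, D} is {2, 4, 5, 9} (support 4), and intersecting it with A's tid-list gives {4, 5, 9}, so {A, C, D} is frequent with support 3; {A, E} is pruned because |{1, 6}| < 3.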
Association Rule Mining

Remember the original problem: find rules X → Y such that
– support(X ∪ Y) ≥ minsup
– support(X ∪ Y) / support(X) ≥ minconf

The frequent itemsets are the combinations X ∪ Y. Hence: get X → Y by splitting up the frequent itemsets I.
Rule Generation

Given a frequent itemset L, find all non-empty subsets f ⊂ L such that f → L − f satisfies the minimum confidence requirement.

If {A,B,C,D} is a frequent itemset, the candidate rules are:
ABC → D, ABD → C, ACD → B, BCD → A, A → BCD, B → ACD, C → ABD, D → ABC,
AB → CD, AC → BD, AD → BC, BC → AD, BD → AC, CD → AB

If |L| = k, then there are 2^k − 2 candidate association rules (ignoring L → ∅ and ∅ → L).
Rule Generation

How do we efficiently generate rules from frequent itemsets?
– In general, confidence does not have an anti-monotone property: c(ABC → D) can be larger or smaller than c(AB → D).
– But the confidence of rules generated from the same itemset does have an anti-monotone property. E.g., for L = {A,B,C,D}:

c(ABC → D) ≥ c(AB → CD) ≥ c(A → BCD)

– Confidence is anti-monotone w.r.t. the number of items on the RHS of the rule.
Rule Generation for the Apriori Algorithm

[Lattice of rules for {A,B,C,D}, from ABCD → {} at the top down to A → BCD, B → ACD, C → ABD, D → ABC at the bottom. If a rule such as CD → AB has low confidence, all rules below it with a larger RHS are pruned.]
Summary: Association Rule Mining

Find associations X → Y such that
– the rule appears in a sufficiently large part of the database
– the conditional probability P(Y | X) is high

This problem can be split into two sub-problems:
– find frequent itemsets
– split frequent itemsets to get association rules

Finding frequent itemsets:
– Apriori property
– breadth-first vs depth-first algorithms

From itemsets to association rules:
– split up frequent sets, using anti-monotonicity
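The splitting step is mechanical. This illustrative sketch (not the course's code) generates f → L − f for every frequent itemset L from the running example and keeps the rules meeting an assumed minconf of 0.65:

```python
from itertools import combinations

def gen_rules(freq, minconf):
    """freq: {frozenset: support count}. Returns a list of (lhs, rhs, conf)."""
    rules = []
    for L, sup in freq.items():
        if len(L) < 2:
            continue
        for k in range(1, len(L)):                    # every non-empty f != L
            for f in map(frozenset, combinations(L, k)):
                conf = sup / freq[f]                  # c(f -> L - f)
                if conf >= minconf:
                    rules.append((f, L - f, conf))
    return rules

freq = {   # frequent itemsets and support counts from the example DB (|T| = 5)
    frozenset({"Milk"}): 4, frozenset({"Diaper"}): 4, frozenset({"Beer"}): 3,
    frozenset({"Milk", "Diaper"}): 3, frozenset({"Milk", "Beer"}): 2,
    frozenset({"Diaper", "Beer"}): 3,
    frozenset({"Milk", "Diaper", "Beer"}): 2,
}
rules = gen_rules(freq, minconf=0.65)   # e.g. {Milk, Beer} -> {Diaper}, c = 1.0
```

Every subset f of a frequent L is itself frequent by the Apriori property, so the lookup freq[f] never fails.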
Outline

Association Rule Mining
– Frequent itemsets and association rules
– Algorithms: Apriori and Eclat

Sequential Pattern Mining
– Mining frequent episodes
– Algorithms: WinEpi and MinEpi

Other types of patterns
– strings, graphs, …
– process mining
Series and Sequences

In many applications, the order and transaction times are very important:
– stock prices
– events in a networking environment: crashes, starting a program, certain commands

The specific format of the data is very important.

Goal: find "temporal rules", where order is important.
Series and Sequences

Examples:
– 70% of the customers that buy shoes and socks will buy shoe polish within 5 days.
– User U1 logging on, followed by user U2 starting program P, is always followed by a crash.

Here, we will concentrate on the problem of finding frequent episodes:
– they can be used in the same way as itemsets
– split episodes to get the rules
Episode Mining

An event sequence is a sequence of pairs (e, t), where e is an event and t an integer indicating the time of occurrence of e.

A linear episode is a sequence of events <e1, …, en>.

A window of length w is an interval [s, e] with e − s + 1 = w.

An episode E = <e1, …, en> occurs in sequence S = <(s1, t1), …, (sm, tm)> within window W = [s, e] if there exist integers s ≤ i1 < … < in ≤ e such that for all j = 1…n, (ej, ij) is in S.
Episode Mining: Support Measure

Given a sequence S, find all linear episodes that occur frequently in S.

Given an integer w, the w-support of an episode E = <e1, …, en> in a sequence S = <(s1, t1), …, (sm, tm)> is the number of windows W of length w such that E occurs in S within window W.

Note: if an episode occurs in a very short time span, it will be contained in many subsequent windows, and thus contribute a lot to the support count!
Example

S = <b a a c b a a b c>
E = <b a c>

E occurs in S within window [0,4], within [1,4], within [5,9], …

The 5-support of E in S is 3, since E occurs in exactly the following windows of length 5: [0,4], [1,5], [5,9].
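The w-support can be computed directly from the definition. The following sketch (illustrative code; windows are allowed to extend past the ends of the sequence, as in the slide's window [0,4]) slides every length-w window over the sequence:

```python
def occurs(episode, events):
    """Greedy subsequence test: does `episode` occur in the time-ordered events?"""
    i = 0
    for e, _t in events:
        if i < len(episode) and e == episode[i]:
            i += 1
    return i == len(episode)

def w_support(S, episode, w):
    """Number of length-w windows [s, s+w-1] in which the episode occurs."""
    times = [t for _e, t in S]
    count = 0
    for s in range(min(times) - w + 1, max(times) + 1):
        window = [(e, t) for e, t in S if s <= t <= s + w - 1]
        if occurs(episode, window):
            count += 1
    return count

# S = <b a a c b a a b c> at times 1..9, E = <b a c>, w = 5
S = list(zip("baacbaabc", range(1, 10)))
sup = w_support(S, list("bac"), 5)   # the three windows [0,4], [1,5], [5,9]
```

Running this reproduces the slide's 5-support of 3.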
An episode E1 = <e1, …, en> is a sub-episode of E2 = <f1, …, fm>, denoted E1 ⊑ E2, if there exist integers 1 ≤ i1 < … < in ≤ m such that for all j = 1…n, ej = f_ij.

Example: <b, a, a, c> is a sub-episode of <a, b, c, a, a, b, c>.
Episode Mining Problem

Given a sequence S, a minimal support minsup, and a window width w, find all episodes that have a w-support above minsup.

Monotonicity: let S be a sequence, E1 and E2 episodes, and w a number. If E1 ⊑ E2, then w-support(E2) ≤ w-support(E1).
WinEpi Algorithm

We can again apply a level-wise algorithm like Apriori: start with small episodes, and only proceed with a larger episode if all of its sub-episodes are frequent. E.g., <a,a,b> is evaluated after <a>, <b>, <a,a>, <a,b>, and only if all these episodes were frequent.

Counting the frequency:
– slide a window over the stream
– use a smart update technique for the supports
Search Space

<a> <b> <c>
<a,a> <a,b> <a,c> <b,a> <b,b> <b,c> <c,a> <c,b> <c,c>
<a,a,a> <a,a,b> <a,a,c> <a,b,a> <a,b,b> <a,b,c> …
<a,a,a,a> <a,a,a,b> … …

Number of episodes of length k: e^k (where e is the number of events).

An episode of length k has at most k sub-episodes of length k−1.

We can count supports by sliding a window over the sequence.
![Page 78: Data Mining Toon Calders](https://reader035.vdocuments.us/reader035/viewer/2022081421/56814588550346895db26ad6/html5/thumbnails/78.jpg)
S = < (a,1),(b,2),(c,4),(b,5),(b,6),(a,7),(b,8),(b,9), (c,13), (a,14), (c,17), (c,18) >
w = 4, minsup = 3
C1 = { <a>, <b>, <c> }
a a ab b b b bc c c c
0 1 2
Example
![Page 79: Data Mining Toon Calders](https://reader035.vdocuments.us/reader035/viewer/2022081421/56814588550346895db26ad6/html5/thumbnails/79.jpg)
S = < (a,1),(b,2),(c,4),(b,5),(b,6),(a,7),(b,8),(b,9), (c,13), (a,14), (c,17), (c,18) >
w = 4, minsup = 3
C1 = { <a>, <b>, <c> }
Slide window of length 4 over S:
4-supports: <a>:12, <b>:12, <c>:14
a a ab b b b bc c c c
0 1 2
Example
![Page 80: Data Mining Toon Calders](https://reader035.vdocuments.us/reader035/viewer/2022081421/56814588550346895db26ad6/html5/thumbnails/80.jpg)
S = < (a,1),(b,2),(c,4),(b,5),(b,6),(a,7),(b,8),(b,9), (c,13), (a,14), (c,17), (c,18) >
w = 4, minsup = 3
C1 = { <a>, <b>, <c> }
Slide window of length 4 over S:
4-supports: <a>:12, <b>:12, <c>:14 C2 = { <a,a>, <a,b>, <a,c>, <b,a>, <b,b>, <b,c>, <c,a>,
<c,b>, <c,c> }
a a ab b b b bc c c c
0 1 2
Example
![Page 81: Data Mining Toon Calders](https://reader035.vdocuments.us/reader035/viewer/2022081421/56814588550346895db26ad6/html5/thumbnails/81.jpg)
S = < (a,1),(b,2),(c,4),(b,5),(b,6),(a,7),(b,8),(b,9), (c,13), (a,14), (c,17), (c,18) >
w = 4, minsup = 3
C1 = { <a>, <b>, <c> }
Slide window of length 4 over S:
4-supports: <a>:12, <b>:12, <c>:14 C2 = { <a,a>, <a,b>, <a,c>, <b,a>, <b,b>, <b,c>, <c,a>,
<c,b>, <c,c> }
4-supports: <a,a>:0 <a,b>:6 <a,c>:2 <b,a>:3
<b,b>:7 <b,c>:3 <c,a>:3 <c,b>:1 <c,c>:3
a a ab b b b bc c c c
0 1 2
Example
![Page 82: Data Mining Toon Calders](https://reader035.vdocuments.us/reader035/viewer/2022081421/56814588550346895db26ad6/html5/thumbnails/82.jpg)
C3 = { <a,b,b>, <b,a,b>, <b,b,a>, <b,b,b>, <b,b,c>, <b,c,a>, <b,c,c>, <c,c,a>, <c,c,c> }
4-supports: <a,b,b>:2, <b,a,b>:2, <b,b,a>:2, <b,b,b>:2, <b,b,c>:0, <b,c,a>:0, <b,c,c>:0, <c,c,a>:0, <c,c,c>:0
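A w-support counter for examples like the one above can be sketched as follows. Note that the window convention here is an assumption (half-open windows [t, t+w) that may extend past both ends of the sequence, so every event falls in exactly w windows); counts under other conventions can differ slightly from the table above:

```python
def w_support(sequence, episode, w):
    """Count the windows of width w in which the serial episode occurs,
    i.e. its events appear in temporal order inside the window.
    sequence: list of (event, time) pairs; windows are the half-open
    intervals [t, t+w) that overlap the sequence."""
    events = sorted(sequence, key=lambda p: p[1])
    t_min = events[0][1] - w + 1
    t_max = events[-1][1]
    count = 0
    for start in range(t_min, t_max + 1):
        window = [e for e, t in events if start <= t < start + w]
        it = iter(window)                 # greedy subsequence test
        if all(e in it for e in episode):
            count += 1
    return count
```

With the example sequence S and w = 4, this convention gives 12 windows for <a> and 6 for <a,b>, in line with the counts above.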
![Page 83: Data Mining Toon Calders](https://reader035.vdocuments.us/reader035/viewer/2022081421/56814588550346895db26ad6/html5/thumbnails/83.jpg)
MinEpi
A very similar algorithm, based on a different support measure:
– minimal occurrence of a sequence: the smallest window in which the sequence occurs
– support of E = number of minimal occurrences of E with a width less than w
S = < a b c b b a b b c a c c c b b >, window length w = 5
5-support of < a b b > :
mo-support of < a b b > :
![Page 84: Data Mining Toon Calders](https://reader035.vdocuments.us/reader035/viewer/2022081421/56814588550346895db26ad6/html5/thumbnails/84.jpg)
5-support of < a b b > : 5
![Page 85: Data Mining Toon Calders](https://reader035.vdocuments.us/reader035/viewer/2022081421/56814588550346895db26ad6/html5/thumbnails/85.jpg)
mo-support of < a b b > : 2
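The minimal-occurrence computation can be sketched as follows (a minimal sketch, assuming the sequence is given as symbols at positions 0, 1, 2, … and that width is counted in positions; under that convention it reproduces the mo-support of 2 for < a b b >):

```python
def minimal_occurrences(sequence, episode):
    """Return the minimal occurrences of a serial episode in a
    sequence of symbols: intervals [ts, te] in which the episode
    occurs in order, but no proper sub-interval does."""
    occs = []
    n = len(sequence)
    for ts in range(n):
        if sequence[ts] != episode[0]:
            continue
        # greedily match the rest of the episode after ts,
        # giving the earliest possible end for this start
        pos, ok = ts, True
        for sym in episode[1:]:
            pos += 1
            while pos < n and sequence[pos] != sym:
                pos += 1
            if pos == n:
                ok = False
                break
        if ok:
            occs.append((ts, pos))
    # keep only occurrences containing no other occurrence
    return [o for o in occs
            if not any(p != o and o[0] <= p[0] and p[1] <= o[1]
                       for p in occs)]

def mo_support(sequence, episode, w):
    """Number of minimal occurrences of width at most w."""
    return sum(1 for ts, te in minimal_occurrences(sequence, episode)
               if te - ts + 1 <= w)
```

On S above, < a b b > has minimal occurrences of widths 4, 3, and 6; with w = 5 only the first two count, giving mo-support 2.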
![Page 86: Data Mining Toon Calders](https://reader035.vdocuments.us/reader035/viewer/2022081421/56814588550346895db26ad6/html5/thumbnails/86.jpg)
Sequential Pattern Mining: Summary
Mining sequential episodes
Two definitions of support:
– w-support
– mo-support
Two algorithms:
– WinEpi
– MinEpi
Based on the monotonicity principle:
– generate candidates levelwise
– only count candidates without infrequent subsequences
![Page 87: Data Mining Toon Calders](https://reader035.vdocuments.us/reader035/viewer/2022081421/56814588550346895db26ad6/html5/thumbnails/87.jpg)
Outline
Association Rule Mining
– Frequent itemsets and association rules
– Algorithms: Apriori and Eclat
Sequential Pattern Mining
– Mining frequent episodes
– Algorithms: WinEpi and MinEpi
Other types of patterns
– strings, graphs, …
– process mining
![Page 88: Data Mining Toon Calders](https://reader035.vdocuments.us/reader035/viewer/2022081421/56814588550346895db26ad6/html5/thumbnails/88.jpg)
Other types of patterns
Sequence problems
– Strings
– Other types of sequences
– Other patterns in sequences
Graphs
– Molecules
– WWW
– Social Networks
…
![Page 89: Data Mining Toon Calders](https://reader035.vdocuments.us/reader035/viewer/2022081421/56814588550346895db26ad6/html5/thumbnails/89.jpg)
Other Types of Sequences
CGATGGGCCAGTCGATACGTCGATGCCGATGTCACGA
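For plain strings such as this DNA sequence, the patterns of interest are often contiguous substrings (k-mers); a minimal sketch of counting the frequent ones:

```python
from collections import Counter

def frequent_substrings(s, k, minsup):
    """Count all length-k substrings of s and return those that
    occur at least minsup times."""
    counts = Counter(s[i:i + k] for i in range(len(s) - k + 1))
    return {sub: c for sub, c in counts.items() if c >= minsup}
```

On the string above, for instance, the 4-mer CGAT occurs four times.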
![Page 90: Data Mining Toon Calders](https://reader035.vdocuments.us/reader035/viewer/2022081421/56814588550346895db26ad6/html5/thumbnails/90.jpg)
Other Patterns in Sequences
Substrings
Regular expressions, e.g. (bb|[^b]{2})
Partial orders
Directed Acyclic Graphs
![Page 91: Data Mining Toon Calders](https://reader035.vdocuments.us/reader035/viewer/2022081421/56814588550346895db26ad6/html5/thumbnails/91.jpg)
Graphs
![Page 92: Data Mining Toon Calders](https://reader035.vdocuments.us/reader035/viewer/2022081421/56814588550346895db26ad6/html5/thumbnails/92.jpg)
Patterns in Graphs
![Page 93: Data Mining Toon Calders](https://reader035.vdocuments.us/reader035/viewer/2022081421/56814588550346895db26ad6/html5/thumbnails/93.jpg)
Rules
[Figure: graph patterns with frequencies (f: 5, 8, 7, 4) and rules between them with confidences 0.8, 0.5, and 0.57]
![Page 94: Data Mining Toon Calders](https://reader035.vdocuments.us/reader035/viewer/2022081421/56814588550346895db26ad6/html5/thumbnails/94.jpg)
Summary
What is data mining and why is it important:
– huge volumes of data
– not enough human analysts
Pattern discovery as an important descriptive data mining task:
– association rule mining
– sequential pattern mining
Important principles:
– the Apriori principle
– breadth-first vs. depth-first algorithms
Many kinds and varieties of data types, pattern types, support measures, …