Market Basket Analysis and Association Rules
What can be inferred?
I purchase diapers
I purchase a new car
I purchase OTC cough medicine
I purchase a prescription medication
I don't show up for class
Market Basket Analysis
MBA is a set of techniques, Association Rules being the most common, that focus on point-of-sale (p-o-s) transaction data
3 types of market basket data (p-o-s data)
  Customers
  Orders (basic purchase data)
  Items (merchandise/services purchased)
Market Basket Analysis
Retail: each customer purchases a different set of products, in different quantities, at different times
MBA uses this information to
  Identify who customers are (not by name)
  Understand why they make certain purchases
  Gain insight about its merchandise (products)
    Fast and slow movers
    Products which are purchased together
    Products which might benefit from promotion
  Take action
    Store layouts
    Which products to put on special, promote, coupons...
Combining all of this with a customer loyalty card, it becomes even more valuable
Association Rules
Data mining (DM) technique most closely allied with Market Basket Analysis
Association rules (AR) can be automatically generated
  AR represent patterns in the data without a specified target variable
  Good example of undirected data mining
Market Basket Analysis Measures
Consider the association rule Y => Z, where Y and Z are two products. Y represents the antecedent and Z is called the consequent.
Support of the rule: the percentage of all baskets that contain both product Y and Z
  support = P(Y ∧ Z)
Confidence of the rule: the percentage of all the baskets containing Y that also contain Z. Hence confidence is a conditional probability, i.e. P(Z|Y)
  confidence = P(Y ∧ Z) / P(Y)
Interest of the rule: measures the statistical dependence of the rule by relating the observed frequency of co-occurrence, P(Y ∧ Z), to the expected frequency of co-occurrence under the assumption of statistical independence of Y and Z, P(Y) P(Z)
  interest = P(Y ∧ Z) / (P(Y) P(Z))
Association-rule discovery is the process of finding strong product associations with a minimum support and/or confidence and an interest of at least one
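As a minimal sketch of the three measures above, the Python below treats each basket as a set of items and estimates each probability as a simple fraction of baskets. The function names and the four example baskets are illustrative assumptions, not part of the original slides, and the antecedent is assumed to appear in at least one basket.

```python
def support(baskets, y, z):
    """Fraction of all baskets containing both y and z: P(Y ∧ Z)."""
    both = sum(1 for b in baskets if y in b and z in b)
    return both / len(baskets)


def confidence(baskets, y, z):
    """Fraction of baskets containing y that also contain z: P(Z | Y)."""
    with_y = sum(1 for b in baskets if y in b)
    both = sum(1 for b in baskets if y in b and z in b)
    return both / with_y


def interest(baskets, y, z):
    """Observed co-occurrence relative to independence: P(Y ∧ Z) / (P(Y) P(Z))."""
    n = len(baskets)
    p_y = sum(1 for b in baskets if y in b) / n
    p_z = sum(1 for b in baskets if z in b) / n
    return support(baskets, y, z) / (p_y * p_z)


# Hypothetical baskets, for illustration only.
baskets = [
    {"bread", "butter"},
    {"bread", "butter", "jam"},
    {"bread"},
    {"jam"},
]

print(support(baskets, "bread", "butter"))     # 2 of 4 baskets
print(confidence(baskets, "bread", "butter"))  # 2 of the 3 bread baskets
print(interest(baskets, "bread", "butter"))    # above 1: positive association
```

An interest above one means the two products show up together more often than independence would predict, which is the threshold used in the discovery criterion above.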
Association Rules Apply Elsewhere
Besides retail (supermarkets, etc.)...
  Purchases made using credit/debit cards
  Optional Telco Service purchases
  Banking services
  Unusual combinations of insurance claims can be a warning of fraud
  Medical patient histories
A certainty measure for association rules of the form "A => B", where A and B are sets of items, is confidence
Given a set of task
Typical Data Structure (Relational Database)
Lots of questions can be answered
  Avg # of orders/customer
  Avg # unique items/order
  Avg # of items/order
  For a product
    What % of customers have purchased it
    Avg # orders/customer that include it
    Avg quantity of it purchased/order
  Etc...
Visualization is extremely helpful...next slide
Transaction Data
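A few of these questions can be sketched directly against order-line records. The record layout (customer, order, item, quantity) and the sample rows below are hypothetical, chosen only to show the arithmetic; order ids are assumed to be unique across customers.

```python
from collections import defaultdict

# Hypothetical order lines: (customer_id, order_id, item, quantity).
order_lines = [
    ("c1", "o1", "Coke", 2),
    ("c1", "o1", "soda", 1),
    ("c1", "o2", "Coke", 1),
    ("c2", "o3", "milk", 1),
    ("c2", "o3", "Coke", 3),
    ("c2", "o3", "detergent", 1),
]

orders_by_customer = defaultdict(set)  # customer -> order ids
items_by_order = defaultdict(set)      # order -> distinct items
units_by_order = defaultdict(int)      # order -> total quantity
for customer, order, item, qty in order_lines:
    orders_by_customer[customer].add(order)
    items_by_order[order].add(item)
    units_by_order[order] += qty

n_customers = len(orders_by_customer)
n_orders = len(items_by_order)

avg_orders_per_customer = n_orders / n_customers
avg_unique_items_per_order = sum(len(s) for s in items_by_order.values()) / n_orders
avg_items_per_order = sum(units_by_order.values()) / n_orders

# Product-level question: what percentage of customers have purchased Coke?
product = "Coke"
buyers = {c for c, _, item, _ in order_lines if item == product}
pct_customers_buying = 100 * len(buyers) / n_customers

print(avg_orders_per_customer, avg_unique_items_per_order,
      avg_items_per_order, pct_customers_buying)
```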
Sales Order Characteristics
Did the order use gift wrap?
Billing address same as shipping address?
Did purchaser accept/decline a cross-sell?
What is the most common item found on a one-item order?
What is the most common item found on a multi-item order?
What is the most common item for repeat customer purchases?
How has ordering of an item changed over time?
How does the ordering of an item vary geographically?
Pivoting for Cluster Algorithms
Association Rules
Wal-Mart customers who purchase Barbie dolls have a 60% likelihood of also purchasing one of three types of candy bars [Forbes, Sept 8, 1997]
Customers who purchase maintenance agreements are very likely to purchase large appliances
When a new hardware store opens, one of the most commonly sold items is toilet bowl cleaners
Association Rules
Association rule types
  Actionable Rules: contain high-quality, actionable information
  Trivial Rules: information already well-known by those familiar with the business
  Inexplicable Rules: no explanation and do not suggest action
Trivial and Inexplicable Rules occur most often
How Good is an Association Rule?
POS Transactions
Customer   Items Purchased
1          Coke, soda
2          Milk, Coke, window cleaner
3          Coke, detergent
4          Coke, detergent, soda
5          Window cleaner, soda

Co-occurrence of Products
                 Coke   Window cleaner   Milk   Soda   Detergent
Coke             4      1                1      2      2
Window cleaner   1      2                1      1      0
Milk             1      1                1      0      0
Soda             2      1                0      3      1
Detergent        2      0                0      1      2
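The co-occurrence counts can be reproduced from the five POS transactions with a short sketch like the one below. The variable names are illustrative; each basket is modeled as a set, so an item is counted at most once per basket.

```python
transactions = [
    {"Coke", "soda"},
    {"milk", "Coke", "window cleaner"},
    {"Coke", "detergent"},
    {"Coke", "detergent", "soda"},
    {"window cleaner", "soda"},
]
items = ["Coke", "window cleaner", "milk", "soda", "detergent"]

# cooc[a][b] = number of baskets containing both a and b.
# The diagonal cooc[a][a] is the number of baskets containing a.
cooc = {a: {b: 0 for b in items} for a in items}
for basket in transactions:
    for a in basket:
        for b in basket:
            cooc[a][b] += 1

for a in items:
    print(f"{a:15s}", *(cooc[a][b] for b in items))
```

The matrix is symmetric, so only the diagonal and one triangle carry information.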
How Good is an Association Rule?
Co-occurrence of Products
                 Coke   Window cleaner   Milk   Soda   Detergent
Coke             4      1                1      2      2
Window cleaner   1      2                1      1      0
Milk             1      1                1      0      0
Soda             2      1                0      3      1
Detergent        2      0                0      1      2

Simple patterns
1. Coke and soda are purchased together as often as Coke and detergent (2 baskets each), and more often than any other pair of items
2. Detergent is never purchased with milk or window cleaner
3. Milk is never purchased with soda or detergent
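Patterns of this kind can be read off mechanically by counting item pairs. The sketch below uses the same five transactions; the variable names are illustrative. On this data the highest pair count (2) is shared by Coke with soda and Coke with detergent.

```python
from collections import Counter
from itertools import combinations

transactions = [
    {"Coke", "soda"},
    {"milk", "Coke", "window cleaner"},
    {"Coke", "detergent"},
    {"Coke", "detergent", "soda"},
    {"window cleaner", "soda"},
]

# Count every unordered pair of distinct items that shares a basket.
pair_counts = Counter()
for basket in transactions:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

all_items = sorted(set().union(*transactions))
top = max(pair_counts.values())
most_common_pairs = sorted(p for p, n in pair_counts.items() if n == top)
never_together = [p for p in combinations(all_items, 2) if pair_counts[p] == 0]

print(most_common_pairs)
print(never_together)
```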
How Good is an Association Rule?
What is the confidence for this rule?
  If a customer purchases soda, then customer also purchases Coke
  2 out of 3 soda purchases also include Coke, so 67%
What about the confidence of this rule reversed?
  2 out of 4 Coke purchases also include soda, so 50%
Confidence = Ratio of the number of transactions with all the items to the number of transactions with just the "if" items

POS Transactions
Customer   Items Purchased
1          Coke, soda
2          Milk, Coke, window cleaner
3          Coke, detergent
4          Coke, detergent, soda
5          Window cleaner, soda
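The two confidence figures can be checked with a few lines of Python. This is a sketch under the definition given on this slide; the function name and signature are assumptions, and the "if" items are assumed to occur in at least one transaction.

```python
transactions = [
    {"Coke", "soda"},
    {"milk", "Coke", "window cleaner"},
    {"Coke", "detergent"},
    {"Coke", "detergent", "soda"},
    {"window cleaner", "soda"},
]


def confidence(transactions, if_items, then_items):
    """Transactions with all the items / transactions with just the 'if' items."""
    if_items, then_items = set(if_items), set(then_items)
    n_if = sum(1 for t in transactions if if_items <= t)
    n_all = sum(1 for t in transactions if (if_items | then_items) <= t)
    return n_all / n_if


print(round(100 * confidence(transactions, {"soda"}, {"Coke"})))  # 67
print(round(100 * confidence(transactions, {"Coke"}, {"soda"})))  # 50
```

Confidence is not symmetric: reversing the rule changes the denominator from the 3 soda baskets to the 4 Coke baskets.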
How Good is an Association Rule?
How much better than chance is a rule?
  Lift (improvement) tells us how much better a rule is at predicting the result than just assuming the result in the first place
  Lift is the ratio of the records that support the entire rule to the number that would be expected, assuming there was no relationship between the products
  Calculating lift...p. 310...
  When lift > 1, the rule is better at predicting the result than guessing
  When lift < 1, the rule is doing worse than informed guessing, and using the Negative Rule produces a better rule than guessing
Co-occurrence can occur in 3, 4, or more dimensions...
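To make the lift definition concrete, the sketch below applies it to the rule "if soda then Coke" from the five POS transactions, and then to the corresponding negative rule "if soda then NOT Coke". The function names are illustrative assumptions; the expected count under no relationship is taken as (baskets with the "if" items) x (baskets with the result) / (all baskets).

```python
transactions = [
    {"Coke", "soda"},
    {"milk", "Coke", "window cleaner"},
    {"Coke", "detergent"},
    {"Coke", "detergent", "soda"},
    {"window cleaner", "soda"},
]


def lift(transactions, if_item, then_item):
    """Observed baskets supporting the rule / baskets expected if unrelated."""
    n = len(transactions)
    n_if = sum(1 for t in transactions if if_item in t)
    n_then = sum(1 for t in transactions if then_item in t)
    n_both = sum(1 for t in transactions if if_item in t and then_item in t)
    expected = n_if * n_then / n
    return n_both / expected


def negative_lift(transactions, if_item, then_item):
    """Lift of the negative rule: if if_item, then NOT then_item."""
    n = len(transactions)
    n_if = sum(1 for t in transactions if if_item in t)
    n_not_then = sum(1 for t in transactions if then_item not in t)
    n_both = sum(1 for t in transactions if if_item in t and then_item not in t)
    expected = n_if * n_not_then / n
    return n_both / expected


print(lift(transactions, "soda", "Coke"))           # below 1
print(negative_lift(transactions, "soda", "Coke"))  # above 1
```

Here the positive rule has lift 2 / 2.4, which is below 1, while the negative rule has lift 1 / 0.6, which is above 1, matching the remark about the Negative Rule.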
Creating Association Rules
1. Choosing the right set of items
2. Generating rules by deciphering the counts in the co-occurrence matrix
3. Overcoming the practical limits imposed by thousands or tens of thousands of unique items
Overcoming Practical Limits for Association Rules
1. Generate co-occurrence matrix for single items..."if Coke then soda"
2. Generate co-occurrence matrix for two items..."if Coke and Milk then soda"
3. Generate co-occurrence matrix for three items..."if Coke and Milk and Window Cleaner then soda"
4. Etc...
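One way to sketch this level-by-level counting is shown below. The minimum-count threshold and the pruning step (dropping items that no longer appear in any frequent itemset, in the spirit of the Apriori algorithm) are assumptions added for illustration; the slides do not name a specific algorithm.

```python
from collections import Counter
from itertools import combinations

transactions = [
    {"Coke", "soda"},
    {"milk", "Coke", "window cleaner"},
    {"Coke", "detergent"},
    {"Coke", "detergent", "soda"},
    {"window cleaner", "soda"},
]


def frequent_itemsets(transactions, min_count, max_size):
    """Count itemsets of size 1..max_size seen in at least min_count baskets."""
    frequent = {}
    allowed = set().union(*transactions)  # items still worth considering
    for k in range(1, max_size + 1):
        counts = Counter()
        for basket in transactions:
            for itemset in combinations(sorted(basket & allowed), k):
                counts[itemset] += 1
        level = {s: c for s, c in counts.items() if c >= min_count}
        if not level:
            break
        frequent.update(level)
        # Assumed pruning: an item in no frequent k-itemset cannot be part
        # of a frequent (k+1)-itemset, so it is dropped before the next pass.
        allowed = {item for s in level for item in s}
    return frequent


result = frequent_itemsets(transactions, min_count=2, max_size=3)
for itemset, count in sorted(result.items(), key=lambda kv: (len(kv[0]), kv[0])):
    print(itemset, count)
```

With a minimum count of 2, milk is dropped after the first pass and no three-item combination survives, so the search stops early instead of enumerating every combination.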
Final Thought on Association Rules: The Problem of Lots of Data
Fast Food Restaurant...could have 100 items on its menu
  How many combinations are there with 3 different menu items? 161,700
Supermarket...10,000 or more unique items
  About 50 million 2-item combinations
  Over 100 billion 3-item combinations
Use of product hierarchies (groupings) helps address this common issue
Finally, know that the number of transactions in a given time period could also be huge (hence expensive to analyze)
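These counts are binomial coefficients, "n choose k", and can be checked with Python's standard library (math.comb, available from Python 3.8). The variable names are illustrative.

```python
from math import comb

menu_triples = comb(100, 3)            # 3-item combinations from 100 menu items
supermarket_pairs = comb(10_000, 2)    # 2-item combinations from 10,000 items
supermarket_triples = comb(10_000, 3)  # 3-item combinations from 10,000 items

print(f"{menu_triples:,}")         # 161,700
print(f"{supermarket_pairs:,}")    # 49,995,000
print(f"{supermarket_triples:,}")  # 166,616,670,000
```

The jump from about 50 million pairs to well over 100 billion triples is why grouping items into a product hierarchy matters before counting.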
Business and other cases
General Observations
Banking case seems to provide well-defined and intelligible information of the form
  account_1 and account_2, etc., or
  activity_1 and activity_2, etc., possibly indexed by time
As such, the rules found provide a guide to action: to offer a product or service (cross-sell)
In the retailing case of items purchased together, guidance is not so clear-cut due to the extensive number of rules
Soccer event exemplifies sequencing of events towards reaching a goal. Basketball-applied software was developed years ago. Web mining shares the same principles, without the passion usually associated with sports
Challenges
A major difficulty is that a large number of the rules found may be trivial for anyone familiar with the business
The computational complexity involved in calculating the results of market basket analysis is at least the square of the number of transaction item-lines (records of every item purchased). With data warehouses storing billions of transaction lines, this yields extremely high computational requirements
Solutions
Differential market basket analysis can find interesting results and can also eliminate the problem of a potentially high volume of trivial results
Special techniques involving filtering or aggregation of the transaction database are commonly used in analysis algorithms to increase performance and allow some level of interactivity, such as in business intelligence applications
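Aggregation can be sketched as rolling each item up to a category before any counting is done. The category mapping below is a hypothetical product hierarchy invented for illustration; a real one would come from the retailer's merchandise data.

```python
from math import comb

transactions = [
    {"Coke", "soda"},
    {"milk", "Coke", "window cleaner"},
    {"Coke", "detergent"},
    {"Coke", "detergent", "soda"},
    {"window cleaner", "soda"},
]

# Hypothetical product hierarchy: item -> category.
category_of = {
    "Coke": "soft drinks",
    "soda": "soft drinks",
    "milk": "dairy",
    "window cleaner": "cleaning supplies",
    "detergent": "cleaning supplies",
}

# Replace each item by its category; duplicates collapse within a basket.
rolled_up = [{category_of[item] for item in basket} for basket in transactions]

n_items = len(set().union(*transactions))
n_categories = len(set().union(*rolled_up))

print(rolled_up)
print(comb(n_items, 2), "item pairs vs", comb(n_categories, 2), "category pairs")
```

Fewer distinct labels means fewer candidate combinations, at the cost of rules that are less specific.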
Thank You
2
What can be inferred
I purchase diapersI purchase diapers
I purchase a new carI purchase a new car
I purchase OTC cough medicineI purchase OTC cough medicine
I purchase a prescription I purchase a prescription
medicationmedication
I donrsquot show up for classI donrsquot show up for class
3
Market Basket Analysis
MBA is a set of techniques MBA is a set of techniques Association Rules being most Association Rules being most common that focus on point-of-sale common that focus on point-of-sale (p-o-s) transaction data(p-o-s) transaction data
3 types of market basket data (p-o-s 3 types of market basket data (p-o-s data)data) CustomersCustomers Orders (basic purchase data)Orders (basic purchase data) Items (merchandiseservices Items (merchandiseservices
purchased)purchased)
4
Market Basket Analysis
Retail ndash each customer purchases different set Retail ndash each customer purchases different set of products different quantities different of products different quantities different timestimes
MBA uses this information toMBA uses this information to Identify who customers are (not by name)Identify who customers are (not by name) Understand why they make certain purchasesUnderstand why they make certain purchases Gain insight about its merchandise (products)Gain insight about its merchandise (products)
Fast and slow moversFast and slow movers Products which are purchased togetherProducts which are purchased together Products which might benefit from promotionProducts which might benefit from promotion
Take actionTake action Store layoutsStore layouts Which products to put on specials promote couponshellipWhich products to put on specials promote couponshellip
Combining all of this with a customer loyalty Combining all of this with a customer loyalty card it becomes even more valuablecard it becomes even more valuable
5
Association Rules
DM technique most closely allied DM technique most closely allied with Market Basket Analysiswith Market Basket Analysis
AR can be automatically AR can be automatically generatedgenerated AR represent patterns in the data AR represent patterns in the data
without a specified target variablewithout a specified target variable Good example of undirected data Good example of undirected data
miningmining
6
7
Market Basket Analysis Measures
Consider the association rule Y 1048782 Z where Y and Z are two products Y Consider the association rule Y 1048782 Z where Y and Z are two products Y represents the antecedent en Z is called the consequentrepresents the antecedent en Z is called the consequent
Support Support of the rule the percentage of all baskets that contain both of the rule the percentage of all baskets that contain both product Y and Zproduct Y and Zsupport = P(Y Λ Z)support = P(Y Λ Z)
Confidence Confidence of the rule the percentage of all the baskets containing Y that of the rule the percentage of all the baskets containing Y that also contain Zalso contain ZHence confidence is a conditional probability ie P(Z|Y)Hence confidence is a conditional probability ie P(Z|Y)confidence = P(Y Λ Z)P(Y)confidence = P(Y Λ Z)P(Y)
Interest Interest of the rule measures the statistical dependence of the rule by of the rule measures the statistical dependence of the rule by relating the observed frequency of occurrence (P(Y Λ Z)) to the expected relating the observed frequency of occurrence (P(Y Λ Z)) to the expected frequency of co-occurrence under the assumption of conditional frequency of co-occurrence under the assumption of conditional independence of Y and Z (P(Y)P(Z))independence of Y and Z (P(Y)P(Z))interest = P(Y Λ Z)(P(Y)P(Z))interest = P(Y Λ Z)(P(Y)P(Z))
Association-rule discovery is the process of finding strong product Association-rule discovery is the process of finding strong product associations with aassociations with aminimum support andor confidence and an interest of at least oneminimum support andor confidence and an interest of at least one
8
Association Rules Apply Elsewhere
Besides retail ndash supermarkets etchellipBesides retail ndash supermarkets etchellip Purchases made using creditdebit Purchases made using creditdebit
cardscards Optional Telco Service purchasesOptional Telco Service purchases Banking servicesBanking services Unusual combinations of insurance Unusual combinations of insurance
claims can be a warning of fraudclaims can be a warning of fraud Medical patient historiesMedical patient histories
9
A certainty measure for A certainty measure for association rules of the form ldquoA association rules of the form ldquoA =gt Brdquo where A and B are sets of =gt Brdquo where A and B are sets of items is confidenceitems is confidence
Given a set of task Given a set of task
10
Typical Data Structure (Relational Database)
Lots of questions can be answeredLots of questions can be answered Avg of orderscustomerAvg of orderscustomer Avg unique itemsorderAvg unique itemsorder Avg of itemsorderAvg of itemsorder For a productFor a product
What of customers have purchasedWhat of customers have purchased Avg orderscustomer include itAvg orderscustomer include it Avg quantity of it purchasedorderAvg quantity of it purchasedorder
EtchellipEtchellip Visualization is extremely helpfulhellipnext Visualization is extremely helpfulhellipnext
slide slide
Transaction Data
11
Sales Order Characteristics
12
Sales Order Characteristics
Did the order use gift wrapDid the order use gift wrap Billing address same as Shipping addressBilling address same as Shipping address Did purchaser acceptdecline a cross-sellDid purchaser acceptdecline a cross-sell What is the most common item found on a What is the most common item found on a
one-item orderone-item order What is the most common item found on a What is the most common item found on a
multi-item ordermulti-item order What is the most common item for repeat What is the most common item for repeat
customer purchasescustomer purchases How has ordering of an item changed over How has ordering of an item changed over
timetime How does the ordering of an item vary How does the ordering of an item vary
geographicallygeographically
13
Pivoting for Cluster Algorithms
14
Association Rules
Wal-Mart customers who purchase Wal-Mart customers who purchase Barbie dolls have a 60 likelihood of Barbie dolls have a 60 likelihood of also purchasing one of three types of also purchasing one of three types of candy bars [candy bars [ForbesForbes Sept 8 1997] Sept 8 1997]
Customers who purchase maintenance Customers who purchase maintenance agreements are very likely to purchase agreements are very likely to purchase large appliances When a new hardware large appliances When a new hardware store opens one of the most commonly store opens one of the most commonly sold items is toilet bowl cleanerssold items is toilet bowl cleaners
15
Association Rules
Association rule typesAssociation rule types Actionable Rules ndash contain high-Actionable Rules ndash contain high-
quality actionable informationquality actionable information Trivial Rules ndash information already Trivial Rules ndash information already
well-known by those familiar with well-known by those familiar with the businessthe business
Inexplicable Rules ndash no explanation Inexplicable Rules ndash no explanation and do not suggest actionand do not suggest action
Trivial and Inexplicable Rules Trivial and Inexplicable Rules occur most oftenoccur most often
16
How Good is an Association Rule
CustomerCustomer Items PurchasedItems Purchased
11 Coke sodaCoke soda
22 Milk Coke window cleanerMilk Coke window cleaner
33 Coke detergentCoke detergent
44 Coke detergent sodaCoke detergent soda
55 Window cleaner sodaWindow cleaner soda
CokCokee
Window Window cleanercleaner
MilkMilk SodaSoda DetergentDetergent
CokeCoke 44 11 11 22 22
Window cleanerWindow cleaner 11 22 11 11 00
MilkMilk 11 11 11 00 00
SodaSoda 22 11 00 33 11
DetergentDetergent 22 00 00 11 22
POS Transactions
Co-occurrence ofProducts
17
How Good is an Association Rule
CokCokee
Window Window cleanercleaner
MilkMilk SodaSoda DetergentDetergent
44 11 11 22 22
Window cleanerWindow cleaner 11 22 11 11 00
MilkMilk 11 11 11 00 00
SodaSoda 22 11 00 33 11
DetergentDetergent 22 00 00 11 22
Simple patterns1 Coke and soda are more likely purchased together thanany other two items2 Detergent is never purchased with milk or window cleaner3 Milk is never purchased with soda or detergent
18
How Good is an Association Rule
What is the confidence for this ruleWhat is the confidence for this rule If a customer purchases soda then customer also purchases CokeIf a customer purchases soda then customer also purchases Coke 2 out of 3 soda purchases also include Coke so 672 out of 3 soda purchases also include Coke so 67
What about the confidence of this rule reversedWhat about the confidence of this rule reversed 2 out of 4 Coke purchases also include soda so 502 out of 4 Coke purchases also include soda so 50
Confidence Confidence = Ratio of the number of transactions with all the = Ratio of the number of transactions with all the items to the number of transactions with just the ldquoifrdquo itemsitems to the number of transactions with just the ldquoifrdquo items
Customer Items Purchased
1 Coke soda
2 Milk Coke window cleaner
3 Coke detergent
4 Coke detergent soda
5 Window cleaner soda
POS Transactions
19
How Good is an Association Rule
How much better than chance is a ruleHow much better than chance is a rule Lift (improvementa) tells us how much better a rule is at Lift (improvementa) tells us how much better a rule is at
predicting the result than just assuming the result in the first predicting the result than just assuming the result in the first placeplace
Lift Lift is the ratio of the records that support the entire rule to is the ratio of the records that support the entire rule to the number that would be expected assuming there was no the number that would be expected assuming there was no relationship between the productsrelationship between the products
Calculating lifthellipp 310hellipWhen lift gt 1 then the rule is better at Calculating lifthellipp 310hellipWhen lift gt 1 then the rule is better at predicting the result than guessingpredicting the result than guessing
When lift lt 1 the rule is doing worse than informed guessing When lift lt 1 the rule is doing worse than informed guessing and using the and using the Negative RuleNegative Rule produces a better rule than produces a better rule than guessingguessing
Co-occurrence can occur in 3 4 or more dimensionshellipCo-occurrence can occur in 3 4 or more dimensionshellip
20
Creating Association Rules
11 Choosing the right set Choosing the right set of itemsof items
22 Generating rules by Generating rules by deciphering the deciphering the counts in the co-counts in the co-occurrence matrixoccurrence matrix
33 Overcoming the Overcoming the practical limits practical limits imposed by thousands imposed by thousands or tens of thousands or tens of thousands of unique itemsof unique items
21
Overcoming Practical Limits for Association Rules
11 Generate co-occurrence matrix Generate co-occurrence matrix for single itemshelliprdquofor single itemshelliprdquoif Coke then if Coke then sodardquosodardquo
22 Generate co-occurrence matrix Generate co-occurrence matrix for two itemshelliprdquofor two itemshelliprdquoif Coke and Milk if Coke and Milk then sodardquothen sodardquo
33 Generate co-occurrence matrix Generate co-occurrence matrix for three itemshelliprdquofor three itemshelliprdquoif Coke and Milk if Coke and Milk and Windowand Window Cleanerrdquo then soda Cleanerrdquo then soda
44 EtchellipEtchellip
22
Final Thought on Association RulesThe Problem of Lots of Data
Fast Food Restauranthellipcould have 100 Fast Food Restauranthellipcould have 100 items on its menuitems on its menu How many combinations are there with 3 How many combinations are there with 3
different menu items 161700 different menu items 161700 Supermarkethellip10000 or more unique Supermarkethellip10000 or more unique
itemsitems 50 million 2-item combinations50 million 2-item combinations 100 billion 3-item combinations100 billion 3-item combinations
Use of product hierarchies (groupings) Use of product hierarchies (groupings) helps address this common issuehelps address this common issue
Finally know that the number of Finally know that the number of transactions in a given time-period could transactions in a given time-period could also be huge (hence expensive to analyze)also be huge (hence expensive to analyze)
23
Business and other cases
24
25
26
27
28
29
30
31
32
33
General Observations
Banking case seems to provide Banking case seems to provide well defined and intelligible well defined and intelligible information of the forminformation of the form account_1 and account_2 etc or account_1 and account_2 etc or
activity_1 and activity_2 etc activity_1 and activity_2 etc possibly indexed by timepossibly indexed by time
As such rules found provide guide As such rules found provide guide to action to offer product or service to action to offer product or service (cross-sell)(cross-sell)
34
In retailing case of items In retailing case of items purchased together guidance is purchased together guidance is not so clear cut due to extensive not so clear cut due to extensive number of rulesnumber of rules
Soccer event exemplifies Soccer event exemplifies sequencing of events towards sequencing of events towards reaching goal Basketball-applied reaching goal Basketball-applied software has been developed years software has been developed years ago Web mining shares the same ago Web mining shares the same principles without passion usually principles without passion usually associated with sportsassociated with sports
35
Challenges
A major difficulty is that a large number of A major difficulty is that a large number of the rules found may be trivial for anyone the rules found may be trivial for anyone familiar with the business familiar with the business
The computational complexity involved in The computational complexity involved in calculating the results of market basket calculating the results of market basket analysis is at least the square of the number analysis is at least the square of the number of transaction item-lines (records of every of transaction item-lines (records of every item purchased) With data warehouses item purchased) With data warehouses storing billions of transaction lines this storing billions of transaction lines this yields extremely high computational yields extremely high computational requirements requirements
36
Solutions
Differential market basket analysisDifferential market basket analysis can find interesting results and can also can find interesting results and can also eliminate the problem of a potentially eliminate the problem of a potentially high volume of trivial resultshigh volume of trivial results
Special techniques involving Special techniques involving filtering filtering or aggregationor aggregation of the transaction of the transaction database are commonly used to in database are commonly used to in analysis algorithms to increase analysis algorithms to increase performance and allow some level of performance and allow some level of interactivity such as in business interactivity such as in business intelligence applicationsintelligence applications
37
Thank You
3
Market Basket Analysis
MBA is a set of techniques MBA is a set of techniques Association Rules being most Association Rules being most common that focus on point-of-sale common that focus on point-of-sale (p-o-s) transaction data(p-o-s) transaction data
3 types of market basket data (p-o-s 3 types of market basket data (p-o-s data)data) CustomersCustomers Orders (basic purchase data)Orders (basic purchase data) Items (merchandiseservices Items (merchandiseservices
purchased)purchased)

Market Basket Analysis
- Retail – each customer purchases a different set of products, in different quantities, at different times
- MBA uses this information to:
  - Identify who customers are (not by name)
  - Understand why they make certain purchases
  - Gain insight about its merchandise (products):
    - Fast and slow movers
    - Products which are purchased together
    - Products which might benefit from promotion
  - Take action:
    - Store layouts
    - Which products to put on specials, promote, coupons…
- Combining all of this with a customer loyalty card, it becomes even more valuable

Association Rules
- DM (data mining) technique most closely allied with Market Basket Analysis
- AR can be automatically generated
  - AR represent patterns in the data without a specified target variable
  - Good example of undirected data mining

Market Basket Analysis Measures
Consider the association rule $Y \rightarrow Z$, where Y and Z are two products. Y is called the antecedent and Z is called the consequent.
- Support of the rule: the percentage of all baskets that contain both product Y and product Z.
  $\text{support} = P(Y \wedge Z)$
- Confidence of the rule: the percentage of all the baskets containing Y that also contain Z. Hence confidence is a conditional probability, i.e. $P(Z \mid Y)$.
  $\text{confidence} = P(Y \wedge Z) / P(Y)$
- Interest of the rule: measures the statistical dependence of the rule by relating the observed frequency of co-occurrence, $P(Y \wedge Z)$, to the expected frequency of co-occurrence under the assumption that Y and Z are independent, $P(Y)\,P(Z)$.
  $\text{interest} = P(Y \wedge Z) / (P(Y)\,P(Z))$
Association-rule discovery is the process of finding strong product associations with a minimum support and/or confidence and an interest of at least one.
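The three measures can be sketched in a few lines of Python. This is a minimal illustration, not part of the original deck: the `baskets` data and the helper names `p`, `support`, `confidence` and `interest` are invented for the example, and probabilities are estimated as simple fractions of baskets.

```python
# Hypothetical baskets, invented for illustration only.
baskets = [
    {"bread", "butter"},
    {"bread", "butter", "jam"},
    {"bread"},
    {"butter"},
    {"jam"},
    {"bread", "butter"},
    {"milk"},
    {"bread", "jam"},
]


def p(*items):
    """Estimate P(all items present) as the fraction of baskets containing them."""
    return sum(1 for basket in baskets if set(items) <= basket) / len(baskets)


def support(y, z):
    """support = P(Y and Z)"""
    return p(y, z)


def confidence(y, z):
    """confidence = P(Y and Z) / P(Y), i.e. P(Z | Y)"""
    return p(y, z) / p(y)


def interest(y, z):
    """interest = P(Y and Z) / (P(Y) * P(Z))"""
    return p(y, z) / (p(y) * p(z))


print(support("bread", "butter"))     # 3 of 8 baskets
print(confidence("bread", "butter"))  # 3 of the 5 bread baskets
print(interest("bread", "butter"))    # above 1: bought together more than chance
```

For the rule bread → butter in this toy data, support is 0.375, confidence is 0.6 and interest is 1.2, so the rule would pass an "interest of at least one" filter.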

Association Rules Apply Elsewhere
- Besides retail – supermarkets, etc.…
  - Purchases made using credit/debit cards
  - Optional Telco Service purchases
  - Banking services
  - Unusual combinations of insurance claims can be a warning of fraud
  - Medical patient histories

- A certainty measure for association rules of the form "A => B", where A and B are sets of items, is confidence
- Given a set of task

Typical Data Structure (Relational Database)
- Lots of questions can be answered:
  - Avg # of orders/customer
  - Avg # unique items/order
  - Avg # of items/order
  - For a product:
    - What % of customers have purchased
    - Avg # orders/customer include it
    - Avg quantity of it purchased/order
  - Etc…
- Visualization is extremely helpful…next slide
Transaction Data
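As a rough sketch of how such questions might be answered from order-line records, the Python below assumes a flat list of `(customer_id, order_id, item, quantity)` tuples. That layout, the sample rows and all names are assumptions made for illustration; the deck does not show its actual schema.

```python
from collections import defaultdict

# Hypothetical order lines: (customer_id, order_id, item, quantity).
order_lines = [
    ("c1", "o1", "coke", 2),
    ("c1", "o1", "soda", 1),
    ("c1", "o2", "coke", 1),
    ("c2", "o3", "milk", 1),
    ("c2", "o3", "coke", 3),
    ("c2", "o3", "detergent", 1),
]

orders_by_customer = defaultdict(set)  # customer -> order ids
items_by_order = defaultdict(set)      # order -> distinct items
units_by_order = defaultdict(int)      # order -> total quantity

for customer, order, item, quantity in order_lines:
    orders_by_customer[customer].add(order)
    items_by_order[order].add(item)
    units_by_order[order] += quantity

n_customers = len(orders_by_customer)
n_orders = len(items_by_order)

avg_orders_per_customer = n_orders / n_customers
avg_unique_items_per_order = sum(len(s) for s in items_by_order.values()) / n_orders
avg_units_per_order = sum(units_by_order.values()) / n_orders


def share_of_customers_buying(item):
    """Fraction of customers with at least one order line for `item`."""
    buyers = {c for c, _, i, _ in order_lines if i == item}
    return len(buyers) / n_customers


print(avg_orders_per_customer, avg_unique_items_per_order, avg_units_per_order)
print(share_of_customers_buying("milk"))
```

In a relational database the same aggregates would normally be computed with grouped queries; the in-memory version here only shows which groupings are involved.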

Sales Order Characteristics
- Did the order use gift wrap?
- Billing address same as Shipping address?
- Did purchaser accept/decline a cross-sell?
- What is the most common item found on a one-item order?
- What is the most common item found on a multi-item order?
- What is the most common item for repeat customer purchases?
- How has ordering of an item changed over time?
- How does the ordering of an item vary geographically?

Pivoting for Cluster Algorithms

Association Rules
- Wal-Mart customers who purchase Barbie dolls have a 60% likelihood of also purchasing one of three types of candy bars [Forbes, Sept 8, 1997]
- Customers who purchase maintenance agreements are very likely to purchase large appliances
- When a new hardware store opens, one of the most commonly sold items is toilet bowl cleaner

Association Rules
- Association rule types:
  - Actionable Rules – contain high-quality, actionable information
  - Trivial Rules – information already well-known by those familiar with the business
  - Inexplicable Rules – no explanation and do not suggest action
- Trivial and Inexplicable Rules occur most often

How Good is an Association Rule?

POS Transactions

Customer  Items Purchased
1         Coke, soda
2         Milk, Coke, window cleaner
3         Coke, detergent
4         Coke, detergent, soda
5         Window cleaner, soda

Co-occurrence of Products

                Coke  Window cleaner  Milk  Soda  Detergent
Coke               4               1     1     2          2
Window cleaner     1               2     1     1          0
Milk               1               1     1     0          0
Soda               2               1     0     3          1
Detergent          2               0     0     1          2
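The co-occurrence counts can be reproduced from the five POS transactions with a short Python sketch. The function name `co_occurrence` and the lower-cased item spellings are choices made for this example, not something the deck specifies.

```python
from collections import Counter
from itertools import combinations

# The five POS transactions from the slide.
transactions = [
    {"Coke", "soda"},
    {"milk", "Coke", "window cleaner"},
    {"Coke", "detergent"},
    {"Coke", "detergent", "soda"},
    {"window cleaner", "soda"},
]


def co_occurrence(baskets):
    """Count, for every ordered pair of items, how many baskets hold both.

    The diagonal entry (item, item) is the number of baskets containing the item.
    """
    counts = Counter()
    for basket in baskets:
        for item in basket:
            counts[(item, item)] += 1
        for a, b in combinations(sorted(basket), 2):
            counts[(a, b)] += 1
            counts[(b, a)] += 1  # keep the matrix symmetric
    return counts


matrix = co_occurrence(transactions)
items = ["Coke", "window cleaner", "milk", "soda", "detergent"]
for row in items:
    print(f"{row:<15}", *(f"{matrix[(row, col)]:>3}" for col in items))
```

A `Counter` returns 0 for pairs that never occur, which is why combinations such as milk and soda can be looked up without special handling.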

How Good is an Association Rule?
Simple patterns in the co-occurrence matrix:
1. Coke and soda are purchased together in 2 of the 5 baskets, as often as any other pair of distinct items (Coke and detergent also co-occur twice)
2. Detergent is never purchased with milk or window cleaner
3. Milk is never purchased with soda or detergent

How Good is an Association Rule?
- What is the confidence for this rule: "If a customer purchases soda, then the customer also purchases Coke"?
  - 2 out of 3 soda purchases also include Coke, so 67%
- What about the confidence of this rule reversed?
  - 2 out of 4 Coke purchases also include soda, so 50%
- Confidence = ratio of the number of transactions with all the items to the number of transactions with just the "if" items

POS Transactions

Customer  Items Purchased
1         Coke, soda
2         Milk, Coke, window cleaner
3         Coke, detergent
4         Coke, detergent, soda
5         Window cleaner, soda

How Good is an Association Rule?
- How much better than chance is a rule?
- Lift (improvement) tells us how much better a rule is at predicting the result than just assuming the result in the first place
- Lift is the ratio of the number of records that support the entire rule to the number that would be expected assuming there was no relationship between the products
- Calculating lift…p. 310…
- When lift > 1, the rule is better at predicting the result than guessing
- When lift < 1, the rule is doing worse than informed guessing, and using the Negative Rule produces a better rule than guessing
- Co-occurrence can occur in 3, 4, or more dimensions…
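To make the lift < 1 case concrete, the sketch below computes lift for "if soda then Coke" on the five POS transactions and then for the negative rule "if soda then NOT Coke". It assumes lift is the ratio P(Y ∧ Z) / (P(Y) P(Z)), the same ratio defined as interest earlier in the deck; the helper names `prob`, `lift` and `negative_lift` are made up for this example.

```python
# The five POS transactions from the slides.
transactions = [
    {"Coke", "soda"},
    {"milk", "Coke", "window cleaner"},
    {"Coke", "detergent"},
    {"Coke", "detergent", "soda"},
    {"window cleaner", "soda"},
]


def prob(baskets, *present, absent=()):
    """Fraction of baskets that contain every `present` item and no `absent` item."""
    hits = sum(
        1
        for basket in baskets
        if all(item in basket for item in present)
        and not any(item in basket for item in absent)
    )
    return hits / len(baskets)


def lift(baskets, antecedent, consequent):
    """Lift of the rule 'if antecedent then consequent'."""
    expected = prob(baskets, antecedent) * prob(baskets, consequent)
    return prob(baskets, antecedent, consequent) / expected


def negative_lift(baskets, antecedent, consequent):
    """Lift of the negative rule 'if antecedent then NOT consequent'."""
    observed = prob(baskets, antecedent, absent=(consequent,))
    expected = prob(baskets, antecedent) * prob(baskets, absent=(consequent,))
    return observed / expected


print(round(lift(transactions, "soda", "Coke"), 2))
print(round(negative_lift(transactions, "soda", "Coke"), 2))
```

On this data the rule soda → Coke has a lift of about 0.83 even though its confidence is 67%, because Coke already appears in 80% of baskets. The negative rule has a lift of about 1.67, which illustrates why the negative rule can be the more informative one.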

Creating Association Rules
1. Choosing the right set of items
2. Generating rules by deciphering the counts in the co-occurrence matrix
3. Overcoming the practical limits imposed by thousands or tens of thousands of unique items

Overcoming Practical Limits for Association Rules
1. Generate co-occurrence matrix for single items…"if Coke then soda"
2. Generate co-occurrence matrix for two items…"if Coke and Milk then soda"
3. Generate co-occurrence matrix for three items…"if Coke and Milk and Window Cleaner then soda"
4. Etc…
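The level-by-level idea above can be sketched as follows. This is a simplified, Apriori-style illustration rather than the deck's own procedure: the minimum-count threshold, the pruning rule (drop any item that appears in no frequent itemset of the previous level) and the function name `frequent_itemsets` are all assumptions made for the example.

```python
from collections import Counter
from itertools import combinations

# The five POS transactions from the slides.
transactions = [
    {"Coke", "soda"},
    {"milk", "Coke", "window cleaner"},
    {"Coke", "detergent"},
    {"Coke", "detergent", "soda"},
    {"window cleaner", "soda"},
]


def frequent_itemsets(baskets, min_count, max_size):
    """Count itemsets of size 1, 2, 3, ... keeping those seen in >= min_count baskets.

    After each level, items that appear in no frequent itemset are dropped,
    which shrinks the number of candidate combinations at the next level.
    """
    frequent = {}
    allowed = None  # None means "all items" before the first level
    for size in range(1, max_size + 1):
        counts = Counter()
        for basket in baskets:
            pool = sorted(basket if allowed is None else basket & allowed)
            counts.update(combinations(pool, size))
        level = {itemset: n for itemset, n in counts.items() if n >= min_count}
        if not level:
            break  # no larger itemset can reach the threshold
        frequent.update(level)
        allowed = {item for itemset in level for item in itemset}
    return frequent


for itemset, count in sorted(frequent_itemsets(transactions, 2, 3).items()):
    print(itemset, count)
```

With a minimum count of 2, milk is dropped after the first level, only the pairs (Coke, soda) and (Coke, detergent) survive the second, and no three-item set qualifies, so the search stops there.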

Final Thought on Association Rules: The Problem of Lots of Data
- Fast Food Restaurant…could have 100 items on its menu
  - How many combinations are there with 3 different menu items? 161,700
- Supermarket…10,000 or more unique items
  - 50 million 2-item combinations
  - More than 100 billion 3-item combinations (about 167 billion for exactly 10,000 items)
- Use of product hierarchies (groupings) helps address this common issue
- Finally, know that the number of transactions in a given time period could also be huge (hence expensive to analyze)
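These counts are binomial coefficients, "n choose k", and can be checked directly. The snippet assumes Python 3.8 or later for `math.comb`; the variable names are illustrative.

```python
from math import comb

menu_triples = comb(100, 3)            # 3-item combinations from a 100-item menu
supermarket_pairs = comb(10_000, 2)    # 2-item combinations from 10,000 items
supermarket_triples = comb(10_000, 3)  # 3-item combinations from 10,000 items

print(f"{menu_triples:,}")         # 161,700
print(f"{supermarket_pairs:,}")    # 49,995,000
print(f"{supermarket_triples:,}")  # 166,616,670,000
```

The jump from roughly 50 million pairs to well over 100 billion triples is why grouping items into a product hierarchy, which lowers the number of distinct items, has such a large effect.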

Business and other cases

General Observations
- Banking case seems to provide well-defined and intelligible information of the form account_1 and account_2, etc., or activity_1 and activity_2, etc., possibly indexed by time
- As such, the rules found provide a guide to action: to offer a product or service (cross-sell)

- In the retailing case of items purchased together, guidance is not so clear-cut due to the extensive number of rules
- The soccer event exemplifies the sequencing of events towards reaching a goal. Basketball-applied software was developed years ago. Web mining shares the same principles, without the passion usually associated with sports

Challenges
- A major difficulty is that a large number of the rules found may be trivial for anyone familiar with the business
- The computational complexity involved in calculating the results of market basket analysis is at least the square of the number of transaction item-lines (records of every item purchased). With data warehouses storing billions of transaction lines, this yields extremely high computational requirements

Solutions
- Differential market basket analysis can find interesting results and can also eliminate the problem of a potentially high volume of trivial results
- Special techniques involving filtering or aggregation of the transaction database are commonly used in analysis algorithms to increase performance and allow some level of interactivity, such as in business intelligence applications
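One way to read "differential" analysis is as a comparison of the same measure across two groups of baskets (two stores, two periods, two customer segments), reporting only the item pairs whose support differs noticeably. The sketch below follows that reading. The deck does not define the procedure, so the store data, the `min_gap` threshold and the function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def pair_support(baskets):
    """Support of every item pair: the fraction of baskets containing both items."""
    counts = Counter()
    for basket in baskets:
        counts.update(combinations(sorted(basket), 2))
    return {pair: n / len(baskets) for pair, n in counts.items()}


def differential(baskets_a, baskets_b, min_gap):
    """Pairs whose support differs between the two groups by at least `min_gap`."""
    a, b = pair_support(baskets_a), pair_support(baskets_b)
    gaps = {pair: a.get(pair, 0.0) - b.get(pair, 0.0) for pair in a.keys() | b.keys()}
    return {pair: gap for pair, gap in gaps.items() if abs(gap) >= min_gap}


# Hypothetical baskets from two stores, invented for illustration only.
store_a = [{"beer", "chips"}, {"beer", "chips"}, {"bread", "milk"}, {"bread", "milk"}]
store_b = [{"bread", "milk"}, {"bread", "milk"}, {"beer", "salsa"}, {"bread", "milk"}]

print(differential(store_a, store_b, min_gap=0.4))
```

The bread and milk pair is common in both stores, so it falls below the threshold and is not reported. Only the beer and chips pair, which is specific to store A, remains, which is the sense in which the comparison filters out results that are trivially true everywhere.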
Thank You
5
Association Rules
DM technique most closely allied DM technique most closely allied with Market Basket Analysiswith Market Basket Analysis
AR can be automatically AR can be automatically generatedgenerated AR represent patterns in the data AR represent patterns in the data
without a specified target variablewithout a specified target variable Good example of undirected data Good example of undirected data
miningmining
6
7
Market Basket Analysis Measures
Consider the association rule Y 1048782 Z where Y and Z are two products Y Consider the association rule Y 1048782 Z where Y and Z are two products Y represents the antecedent en Z is called the consequentrepresents the antecedent en Z is called the consequent
Support Support of the rule the percentage of all baskets that contain both of the rule the percentage of all baskets that contain both product Y and Zproduct Y and Zsupport = P(Y Λ Z)support = P(Y Λ Z)
Confidence Confidence of the rule the percentage of all the baskets containing Y that of the rule the percentage of all the baskets containing Y that also contain Zalso contain ZHence confidence is a conditional probability ie P(Z|Y)Hence confidence is a conditional probability ie P(Z|Y)confidence = P(Y Λ Z)P(Y)confidence = P(Y Λ Z)P(Y)
Interest Interest of the rule measures the statistical dependence of the rule by of the rule measures the statistical dependence of the rule by relating the observed frequency of occurrence (P(Y Λ Z)) to the expected relating the observed frequency of occurrence (P(Y Λ Z)) to the expected frequency of co-occurrence under the assumption of conditional frequency of co-occurrence under the assumption of conditional independence of Y and Z (P(Y)P(Z))independence of Y and Z (P(Y)P(Z))interest = P(Y Λ Z)(P(Y)P(Z))interest = P(Y Λ Z)(P(Y)P(Z))
Association-rule discovery is the process of finding strong product Association-rule discovery is the process of finding strong product associations with aassociations with aminimum support andor confidence and an interest of at least oneminimum support andor confidence and an interest of at least one
8
Association Rules Apply Elsewhere
Besides retail ndash supermarkets etchellipBesides retail ndash supermarkets etchellip Purchases made using creditdebit Purchases made using creditdebit
cardscards Optional Telco Service purchasesOptional Telco Service purchases Banking servicesBanking services Unusual combinations of insurance Unusual combinations of insurance
claims can be a warning of fraudclaims can be a warning of fraud Medical patient historiesMedical patient histories
9
A certainty measure for A certainty measure for association rules of the form ldquoA association rules of the form ldquoA =gt Brdquo where A and B are sets of =gt Brdquo where A and B are sets of items is confidenceitems is confidence
Given a set of task Given a set of task
10
Typical Data Structure (Relational Database)
Lots of questions can be answeredLots of questions can be answered Avg of orderscustomerAvg of orderscustomer Avg unique itemsorderAvg unique itemsorder Avg of itemsorderAvg of itemsorder For a productFor a product
What of customers have purchasedWhat of customers have purchased Avg orderscustomer include itAvg orderscustomer include it Avg quantity of it purchasedorderAvg quantity of it purchasedorder
EtchellipEtchellip Visualization is extremely helpfulhellipnext Visualization is extremely helpfulhellipnext
slide slide
Transaction Data
11
Sales Order Characteristics
12
Sales Order Characteristics
Did the order use gift wrap?
Billing address same as shipping address?
Did purchaser accept/decline a cross-sell?
What is the most common item found on a one-item order?
What is the most common item found on a multi-item order?
What is the most common item for repeat customer purchases?
How has ordering of an item changed over time?
How does the ordering of an item vary geographically?
13
Pivoting for Cluster Algorithms
14
Association Rules
Wal-Mart customers who purchase Barbie dolls have a 60% likelihood of also purchasing one of three types of candy bars [Forbes, Sept 8, 1997]
Customers who purchase maintenance agreements are very likely to purchase large appliances
When a new hardware store opens, one of the most commonly sold items is toilet bowl cleaner
15
Association Rules
Association rule types:
  Actionable Rules: contain high-quality, actionable information
  Trivial Rules: information already well known by those familiar with the business
  Inexplicable Rules: have no explanation and do not suggest action
Trivial and Inexplicable Rules occur most often
16
How Good is an Association Rule
POS Transactions

Customer  Items Purchased
1         Coke, soda
2         Milk, Coke, window cleaner
3         Coke, detergent
4         Coke, detergent, soda
5         Window cleaner, soda

Co-occurrence of Products

                Coke  Window cleaner  Milk  Soda  Detergent
Coke               4               1     1     2          2
Window cleaner     1               2     1     1          0
Milk               1               1     1     0          0
Soda               2               1     0     3          1
Detergent          2               0     0     1          2
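The co-occurrence counts can be reproduced from the five transactions with a short script. This is only a sketch; the function name and the use of a Counter keyed by item pairs are choices made for the example. The diagonal holds the number of baskets containing each item, and the off-diagonal cells hold the number of baskets containing both items.

```python
from collections import Counter
from itertools import combinations

BASKETS = [
    {"Coke", "soda"},
    {"milk", "Coke", "window cleaner"},
    {"Coke", "detergent"},
    {"Coke", "detergent", "soda"},
    {"window cleaner", "soda"},
]


def cooccurrence(baskets):
    """Count, for every ordered pair of items, the baskets containing both."""
    counts = Counter()
    for basket in baskets:
        for item in basket:
            counts[(item, item)] += 1  # diagonal: baskets containing the item
        for a, b in combinations(sorted(basket), 2):
            counts[(a, b)] += 1  # keep the matrix symmetric
            counts[(b, a)] += 1
    return counts


counts = cooccurrence(BASKETS)
items = ["Coke", "window cleaner", "milk", "soda", "detergent"]
for row in items:
    cells = " ".join(f"{counts[(row, col)]:3d}" for col in items)
    print(f"{row:15}{cells}")
```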
17
How Good is an Association Rule
                Coke  Window cleaner  Milk  Soda  Detergent
Coke               4               1     1     2          2
Window cleaner     1               2     1     1          0
Milk               1               1     1     0          0
Soda               2               1     0     3          1
Detergent          2               0     0     1          2

Simple patterns:
1. Coke and soda are purchased together more often than most other pairs (2 of the 5 baskets, tied with Coke and detergent)
2. Detergent is never purchased with milk or window cleaner
3. Milk is never purchased with soda or detergent
18
How Good is an Association Rule
What is the confidence for this rule: "If a customer purchases soda, then customer also purchases Coke"?
  2 out of 3 soda purchases also include Coke, so 67%
What about the confidence of this rule reversed?
  2 out of 4 Coke purchases also include soda, so 50%
Confidence = ratio of the number of transactions with all the items to the number of transactions with just the "if" items

POS Transactions

Customer  Items Purchased
1         Coke, soda
2         Milk, Coke, window cleaner
3         Coke, detergent
4         Coke, detergent, soda
5         Window cleaner, soda
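The two confidence figures can be checked directly from the transactions. The helper below is a small sketch of the ratio described above; its name and argument names are illustrative.

```python
BASKETS = [
    {"Coke", "soda"},
    {"milk", "Coke", "window cleaner"},
    {"Coke", "detergent"},
    {"Coke", "detergent", "soda"},
    {"window cleaner", "soda"},
]


def confidence(if_items, then_items, baskets):
    """Transactions with all the items divided by transactions with the "if" items."""
    if_items, then_items = set(if_items), set(then_items)
    with_if = [basket for basket in baskets if if_items <= basket]
    if not with_if:
        raise ValueError("no transaction contains the 'if' items")
    with_all = [basket for basket in with_if if then_items <= basket]
    return len(with_all) / len(with_if)


soda_then_coke = confidence({"soda"}, {"Coke"}, BASKETS)  # 2 of 3 soda baskets
coke_then_soda = confidence({"Coke"}, {"soda"}, BASKETS)  # 2 of 4 Coke baskets
print(f"soda -> Coke: {soda_then_coke:.0%}")
print(f"Coke -> soda: {coke_then_soda:.0%}")
```

The asymmetry comes only from the denominator: both rules share the same two supporting baskets, but Coke appears in more baskets than soda.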
19
How Good is an Association Rule
How much better than chance is a rule?
Lift (improvement) tells us how much better a rule is at predicting the result than just assuming the result in the first place
Lift is the ratio of the number of records that support the entire rule to the number that would be expected assuming there was no relationship between the products
Calculating lift… p. 310…
When lift > 1, the rule is better at predicting the result than guessing
When lift < 1, the rule is doing worse than informed guessing, and using the Negative Rule produces a better rule than guessing
Co-occurrence can occur in 3, 4, or more dimensions…
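As a worked illustration, lift can be computed for the soda and Coke rule from the five POS transactions of this deck. The sketch takes lift as observed co-occurrence divided by the co-occurrence expected under no relationship, which is the same ratio as the interest measure defined earlier. The function name and the `negative` flag are assumptions made for the example; the negative rule is read here as "if soda then NOT Coke".

```python
BASKETS = [
    {"Coke", "soda"},
    {"milk", "Coke", "window cleaner"},
    {"Coke", "detergent"},
    {"Coke", "detergent", "soda"},
    {"window cleaner", "soda"},
]


def lift(if_item, then_item, baskets, negative=False):
    """Lift of "if_item -> then_item", or of "if_item -> NOT then_item"."""
    n = len(baskets)

    def then_holds(basket):
        present = then_item in basket
        return not present if negative else present

    p_if = sum(if_item in b for b in baskets) / n
    p_then = sum(then_holds(b) for b in baskets) / n
    p_both = sum(if_item in b and then_holds(b) for b in baskets) / n
    return p_both / (p_if * p_then)


rule_lift = lift("soda", "Coke", BASKETS)                      # 0.4 / (0.6 * 0.8)
negative_lift = lift("soda", "Coke", BASKETS, negative=True)   # 0.2 / (0.6 * 0.2)
print(f"soda -> Coke:     lift = {rule_lift:.2f}")
print(f"soda -> not Coke: lift = {negative_lift:.2f}")
```

Here the positive rule has a lift of about 0.83, below one, while the negative rule has a lift of about 1.67, so on this tiny data set the negative rule is the more informative of the two.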
20
Creating Association Rules
1. Choosing the right set of items
2. Generating rules by deciphering the counts in the co-occurrence matrix
3. Overcoming the practical limits imposed by thousands or tens of thousands of unique items
21
Overcoming Practical Limits for Association Rules
1. Generate co-occurrence matrix for single items… "if Coke then soda"
2. Generate co-occurrence matrix for two items… "if Coke and Milk then soda"
3. Generate co-occurrence matrix for three items… "if Coke and Milk and Window Cleaner then soda"
4. Etc.…
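The level-by-level procedure above can be sketched as follows. Counting proceeds from one-item sets to two-item sets and so on. To keep the candidate lists small, the sketch discards any set whose smaller subsets were not frequent enough, in the spirit of the Apriori algorithm. That pruning step, the `min_count` threshold, and the function name are assumptions added for illustration; the slides do not name a specific algorithm.

```python
from itertools import combinations

BASKETS = [
    {"Coke", "soda"},
    {"milk", "Coke", "window cleaner"},
    {"Coke", "detergent"},
    {"Coke", "detergent", "soda"},
    {"window cleaner", "soda"},
]


def frequent_itemsets(baskets, min_count=2, max_size=3):
    """Level-wise counting: 1-item sets first, then 2-item sets, and so on."""
    levels = {}
    candidates = {frozenset([item]) for basket in baskets for item in basket}
    size = 1
    while candidates and size <= max_size:
        counts = {c: sum(c <= basket for basket in baskets) for c in candidates}
        frequent = {c: n for c, n in counts.items() if n >= min_count}
        if not frequent:
            break
        levels[size] = frequent
        size += 1
        # Build candidates one item larger; keep a candidate only if every
        # subset one item smaller was itself frequent (assumed pruning rule).
        items = sorted(set().union(*frequent))
        candidates = {
            frozenset(combo)
            for combo in combinations(items, size)
            if all(frozenset(sub) in frequent
                   for sub in combinations(combo, size - 1))
        }
    return levels


levels = frequent_itemsets(BASKETS)
for size, itemsets in levels.items():
    for itemset, count in itemsets.items():
        print(size, sorted(itemset), count)
```

With these five baskets no three-item set survives: {Coke, detergent, soda} is never counted because {detergent, soda} appears in only one basket.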
22
Final Thought on Association Rules: The Problem of Lots of Data
Fast food restaurant: could have 100 items on its menu
  How many combinations are there with 3 different menu items? 161,700
Supermarket: 10,000 or more unique items
  50 million 2-item combinations
  More than 100 billion 3-item combinations (about 167 billion for exactly 10,000 items)
Use of product hierarchies (groupings) helps address this common issue
Finally, know that the number of transactions in a given time period could also be huge (hence expensive to analyze)
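These counts are binomial coefficients, C(n, k) = n! / (k! (n - k)!), and can be checked with the standard library. The last line illustrates why a product hierarchy helps; the figure of 200 product groups is a hypothetical roll-up chosen for the example, not a number from the slides.

```python
from math import comb

fast_food_triples = comb(100, 3)        # 3-item combinations from 100 menu items
supermarket_pairs = comb(10_000, 2)     # 2-item combinations from 10,000 items
supermarket_triples = comb(10_000, 3)   # 3-item combinations from 10,000 items

# Hypothetical hierarchy: 10,000 items rolled up into 200 product groups.
grouped_triples = comb(200, 3)

print(f"{fast_food_triples:,}")     # 161,700
print(f"{supermarket_pairs:,}")     # 49,995,000 (about 50 million)
print(f"{supermarket_triples:,}")   # 166,616,670,000
print(f"{grouped_triples:,}")       # 1,313,400
```

The three-item supermarket count comes out near 167 billion, the same order of magnitude as the rounded figure usually quoted, and grouping cuts it by a factor of more than 100,000 in this hypothetical case.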
23
Business and other cases
33
General Observations
Banking case seems to provide well-defined and intelligible information of the form:
  account_1 and account_2, etc., or
  activity_1 and activity_2, etc., possibly indexed by time
As such, the rules found provide a guide to action: which product or service to offer (cross-sell)
34
In the retailing case of items purchased together, guidance is not so clear-cut, due to the extensive number of rules
The soccer event exemplifies the sequencing of events towards reaching a goal. Basketball-applied software was developed years ago. Web mining shares the same principles, without the passion usually associated with sports
35
Challenges
A major difficulty is that a large number of the rules found may be trivial for anyone familiar with the business
The computational complexity involved in calculating the results of market basket analysis is at least the square of the number of transaction item-lines (records of every item purchased). With data warehouses storing billions of transaction lines, this yields extremely high computational requirements
36
Solutions
Differential market basket analysis can find interesting results and can also eliminate the problem of a potentially high volume of trivial results
Special techniques involving filtering or aggregation of the transaction database are commonly used in analysis algorithms to increase performance and allow some level of interactivity, such as in business intelligence applications
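The slides do not spell out how differential market basket analysis works. One common reading is to compute the same rules for two segments (stores, seasons, customer groups) and keep only the rules whose strength differs between them. The sketch below follows that reading. The two stores, their baskets, the `min_gap` threshold, and the function names are all hypothetical.

```python
from itertools import permutations

# Hypothetical baskets from two stores.
STORE_A = [{"Coke", "soda"}, {"Coke", "soda"}, {"Coke", "detergent"}, {"milk", "soda"}]
STORE_B = [{"Coke", "detergent"}, {"Coke", "detergent"}, {"Coke", "soda"}, {"milk", "soda"}]


def rule_confidences(baskets):
    """Confidence of every single-item rule y -> z found in the baskets."""
    items = sorted(set().union(*baskets))
    result = {}
    for y, z in permutations(items, 2):
        with_y = [basket for basket in baskets if y in basket]
        result[(y, z)] = sum(z in basket for basket in with_y) / len(with_y)
    return result


def differential(baskets_a, baskets_b, min_gap=0.3):
    """Rules whose confidence differs between segments by at least min_gap."""
    conf_a = rule_confidences(baskets_a)
    conf_b = rule_confidences(baskets_b)
    shared = conf_a.keys() & conf_b.keys()
    gaps = {rule: conf_a[rule] - conf_b[rule] for rule in shared}
    return {rule: gap for rule, gap in gaps.items() if abs(gap) >= min_gap}


differences = differential(STORE_A, STORE_B)
for (y, z), gap in differences.items():
    print(f"{y} -> {z}: confidence gap (A minus B) = {gap:+.2f}")
```

Rules that hold equally in both stores, such as detergent -> Coke with confidence 1.0 in each, drop out of the result, which is how this kind of comparison suppresses trivial rules.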
37
Thank You
7
Market Basket Analysis Measures
Consider the association rule Y 1048782 Z where Y and Z are two products Y Consider the association rule Y 1048782 Z where Y and Z are two products Y represents the antecedent en Z is called the consequentrepresents the antecedent en Z is called the consequent
Support Support of the rule the percentage of all baskets that contain both of the rule the percentage of all baskets that contain both product Y and Zproduct Y and Zsupport = P(Y Λ Z)support = P(Y Λ Z)
Confidence Confidence of the rule the percentage of all the baskets containing Y that of the rule the percentage of all the baskets containing Y that also contain Zalso contain ZHence confidence is a conditional probability ie P(Z|Y)Hence confidence is a conditional probability ie P(Z|Y)confidence = P(Y Λ Z)P(Y)confidence = P(Y Λ Z)P(Y)
Interest Interest of the rule measures the statistical dependence of the rule by of the rule measures the statistical dependence of the rule by relating the observed frequency of occurrence (P(Y Λ Z)) to the expected relating the observed frequency of occurrence (P(Y Λ Z)) to the expected frequency of co-occurrence under the assumption of conditional frequency of co-occurrence under the assumption of conditional independence of Y and Z (P(Y)P(Z))independence of Y and Z (P(Y)P(Z))interest = P(Y Λ Z)(P(Y)P(Z))interest = P(Y Λ Z)(P(Y)P(Z))
Association-rule discovery is the process of finding strong product Association-rule discovery is the process of finding strong product associations with aassociations with aminimum support andor confidence and an interest of at least oneminimum support andor confidence and an interest of at least one
8
Association Rules Apply Elsewhere
Besides retail ndash supermarkets etchellipBesides retail ndash supermarkets etchellip Purchases made using creditdebit Purchases made using creditdebit
cardscards Optional Telco Service purchasesOptional Telco Service purchases Banking servicesBanking services Unusual combinations of insurance Unusual combinations of insurance
claims can be a warning of fraudclaims can be a warning of fraud Medical patient historiesMedical patient histories
9
A certainty measure for A certainty measure for association rules of the form ldquoA association rules of the form ldquoA =gt Brdquo where A and B are sets of =gt Brdquo where A and B are sets of items is confidenceitems is confidence
Given a set of task Given a set of task
10
Typical Data Structure (Relational Database)
Lots of questions can be answeredLots of questions can be answered Avg of orderscustomerAvg of orderscustomer Avg unique itemsorderAvg unique itemsorder Avg of itemsorderAvg of itemsorder For a productFor a product
What of customers have purchasedWhat of customers have purchased Avg orderscustomer include itAvg orderscustomer include it Avg quantity of it purchasedorderAvg quantity of it purchasedorder
EtchellipEtchellip Visualization is extremely helpfulhellipnext Visualization is extremely helpfulhellipnext
slide slide
Transaction Data
11
Sales Order Characteristics
12
Sales Order Characteristics
Did the order use gift wrapDid the order use gift wrap Billing address same as Shipping addressBilling address same as Shipping address Did purchaser acceptdecline a cross-sellDid purchaser acceptdecline a cross-sell What is the most common item found on a What is the most common item found on a
one-item orderone-item order What is the most common item found on a What is the most common item found on a
multi-item ordermulti-item order What is the most common item for repeat What is the most common item for repeat
customer purchasescustomer purchases How has ordering of an item changed over How has ordering of an item changed over
timetime How does the ordering of an item vary How does the ordering of an item vary
geographicallygeographically
13
Pivoting for Cluster Algorithms
14
Association Rules
Wal-Mart customers who purchase Wal-Mart customers who purchase Barbie dolls have a 60 likelihood of Barbie dolls have a 60 likelihood of also purchasing one of three types of also purchasing one of three types of candy bars [candy bars [ForbesForbes Sept 8 1997] Sept 8 1997]
Customers who purchase maintenance Customers who purchase maintenance agreements are very likely to purchase agreements are very likely to purchase large appliances When a new hardware large appliances When a new hardware store opens one of the most commonly store opens one of the most commonly sold items is toilet bowl cleanerssold items is toilet bowl cleaners
15
Association Rules
Association rule typesAssociation rule types Actionable Rules ndash contain high-Actionable Rules ndash contain high-
quality actionable informationquality actionable information Trivial Rules ndash information already Trivial Rules ndash information already
well-known by those familiar with well-known by those familiar with the businessthe business
Inexplicable Rules ndash no explanation Inexplicable Rules ndash no explanation and do not suggest actionand do not suggest action
Trivial and Inexplicable Rules Trivial and Inexplicable Rules occur most oftenoccur most often
16
How Good is an Association Rule
CustomerCustomer Items PurchasedItems Purchased
11 Coke sodaCoke soda
22 Milk Coke window cleanerMilk Coke window cleaner
33 Coke detergentCoke detergent
44 Coke detergent sodaCoke detergent soda
55 Window cleaner sodaWindow cleaner soda
CokCokee
Window Window cleanercleaner
MilkMilk SodaSoda DetergentDetergent
CokeCoke 44 11 11 22 22
Window cleanerWindow cleaner 11 22 11 11 00
MilkMilk 11 11 11 00 00
SodaSoda 22 11 00 33 11
DetergentDetergent 22 00 00 11 22
POS Transactions
Co-occurrence ofProducts
17
How Good is an Association Rule
CokCokee
Window Window cleanercleaner
MilkMilk SodaSoda DetergentDetergent
44 11 11 22 22
Window cleanerWindow cleaner 11 22 11 11 00
MilkMilk 11 11 11 00 00
SodaSoda 22 11 00 33 11
DetergentDetergent 22 00 00 11 22
Simple patterns1 Coke and soda are more likely purchased together thanany other two items2 Detergent is never purchased with milk or window cleaner3 Milk is never purchased with soda or detergent
18
How Good is an Association Rule
What is the confidence for this ruleWhat is the confidence for this rule If a customer purchases soda then customer also purchases CokeIf a customer purchases soda then customer also purchases Coke 2 out of 3 soda purchases also include Coke so 672 out of 3 soda purchases also include Coke so 67
What about the confidence of this rule reversedWhat about the confidence of this rule reversed 2 out of 4 Coke purchases also include soda so 502 out of 4 Coke purchases also include soda so 50
Confidence Confidence = Ratio of the number of transactions with all the = Ratio of the number of transactions with all the items to the number of transactions with just the ldquoifrdquo itemsitems to the number of transactions with just the ldquoifrdquo items
Customer Items Purchased
1 Coke soda
2 Milk Coke window cleaner
3 Coke detergent
4 Coke detergent soda
5 Window cleaner soda
POS Transactions
19
How Good is an Association Rule
How much better than chance is a ruleHow much better than chance is a rule Lift (improvementa) tells us how much better a rule is at Lift (improvementa) tells us how much better a rule is at
predicting the result than just assuming the result in the first predicting the result than just assuming the result in the first placeplace
Lift Lift is the ratio of the records that support the entire rule to is the ratio of the records that support the entire rule to the number that would be expected assuming there was no the number that would be expected assuming there was no relationship between the productsrelationship between the products
Calculating lifthellipp 310hellipWhen lift gt 1 then the rule is better at Calculating lifthellipp 310hellipWhen lift gt 1 then the rule is better at predicting the result than guessingpredicting the result than guessing
When lift lt 1 the rule is doing worse than informed guessing When lift lt 1 the rule is doing worse than informed guessing and using the and using the Negative RuleNegative Rule produces a better rule than produces a better rule than guessingguessing
Co-occurrence can occur in 3 4 or more dimensionshellipCo-occurrence can occur in 3 4 or more dimensionshellip
20
Creating Association Rules
11 Choosing the right set Choosing the right set of itemsof items
22 Generating rules by Generating rules by deciphering the deciphering the counts in the co-counts in the co-occurrence matrixoccurrence matrix
33 Overcoming the Overcoming the practical limits practical limits imposed by thousands imposed by thousands or tens of thousands or tens of thousands of unique itemsof unique items
21
Overcoming Practical Limits for Association Rules
1. Generate co-occurrence matrix for single items… "if Coke then soda"
2. Generate co-occurrence matrix for two items… "if Coke and Milk then soda"
3. Generate co-occurrence matrix for three items… "if Coke and Milk and Window Cleaner then soda"
4. Etc…
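The progression from one-item to two-item to three-item "if" parts can be sketched by counting every combination of k items that appears in a transaction. This is a brute-force illustration on the deck's five example transactions. The helper name `itemset_counts` is an assumption, and practical tools usually prune rare combinations instead of counting them all.

```python
from collections import Counter
from itertools import combinations

# The five POS transactions from the deck's example table.
TRANSACTIONS = [
    {"Coke", "soda"},
    {"milk", "Coke", "window cleaner"},
    {"Coke", "detergent"},
    {"Coke", "detergent", "soda"},
    {"window cleaner", "soda"},
]

def itemset_counts(transactions, k):
    """Count how many transactions contain each k-item combination."""
    counts = Counter()
    for basket in transactions:
        for combo in combinations(sorted(basket), k):
            counts[frozenset(combo)] += 1
    return counts

# How many distinct combinations actually occur at each size.
for k in (1, 2, 3):
    print(k, len(itemset_counts(TRANSACTIONS, k)))

# A rule with a two-item "if" part: "if Coke and detergent then soda".
pairs = itemset_counts(TRANSACTIONS, 2)
triples = itemset_counts(TRANSACTIONS, 3)
if_part = frozenset({"Coke", "detergent"})
whole_rule = if_part | {"soda"}
print(triples[whole_rule] / pairs[if_part])
```

Each extra item in the "if" part needs counts one level higher (pairs, then triples, and so on), which is why the work grows quickly with the number of unique items.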
22
Final Thought on Association Rules: The Problem of Lots of Data
Fast food restaurant… could have 100 items on its menu
  How many combinations are there with 3 different menu items? 161,700
Supermarket… 10,000 or more unique items
  About 50 million 2-item combinations
  More than 100 billion 3-item combinations (about 167 billion for exactly 10,000 items)
Use of product hierarchies (groupings) helps address this common issue
Finally, know that the number of transactions in a given time period could also be huge (hence expensive to analyze)
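The combination counts on this slide follow from the binomial coefficient "n choose k". A short Python check, using the standard-library `math.comb` (available from Python 3.8), is sketched below. The helper name `n_combinations` is an assumption for illustration.

```python
from math import comb

def n_combinations(n_items, k):
    """Number of ways to pick k different items out of n_items."""
    return comb(n_items, k)

print(n_combinations(100, 3))      # fast food menu, 3 different items
print(n_combinations(10_000, 2))   # supermarket, 2-item combinations
print(n_combinations(10_000, 3))   # supermarket, 3-item combinations
```

For exactly 10,000 items this gives 49,995,000 pairs (about 50 million) and 166,616,670,000 triples, so the slide's figures are best read as rounded orders of magnitude.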
23
Business and other cases
33
General Observations
Banking case seems to provide well-defined and intelligible information of the form
  account_1 and account_2, etc., or
  activity_1 and activity_2, etc., possibly indexed by time
As such, the rules found provide a guide to action: offer a product or service (cross-sell)
34
In the retailing case of items purchased together, guidance is not so clear-cut due to the extensive number of rules
The soccer event exemplifies sequencing of events towards reaching a goal. Basketball-applied software was developed years ago. Web mining shares the same principles, without the passion usually associated with sports
35
Challenges
A major difficulty is that a large number of the rules found may be trivial for anyone familiar with the business
The computational complexity involved in calculating the results of market basket analysis is at least the square of the number of transaction item-lines (records of every item purchased). With data warehouses storing billions of transaction lines, this yields extremely high computational requirements
36
Solutions
Differential market basket analysis can find interesting results and can also eliminate the problem of a potentially high volume of trivial results
Special techniques involving filtering or aggregation of the transaction database are commonly used in analysis algorithms to increase performance and allow some level of interactivity, such as in business intelligence applications
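One way to picture the aggregation idea is to roll items up to a product group before counting co-occurrences, which reduces the number of distinct items to analyze. The sketch below is only an illustration: the `HIERARCHY` mapping and its group names are hypothetical, the helper names are assumptions, and the transactions are the deck's five-basket example.

```python
from collections import Counter
from itertools import combinations

# The five POS transactions from the deck's example table.
TRANSACTIONS = [
    {"Coke", "soda"},
    {"milk", "Coke", "window cleaner"},
    {"Coke", "detergent"},
    {"Coke", "detergent", "soda"},
    {"window cleaner", "soda"},
]

# Hypothetical product hierarchy: item -> product group.
HIERARCHY = {
    "Coke": "beverages",
    "soda": "beverages",
    "milk": "dairy",
    "detergent": "cleaning",
    "window cleaner": "cleaning",
}

def roll_up(transactions, hierarchy):
    """Replace each item by its group; unmapped items are kept as they are."""
    return [{hierarchy.get(item, item) for item in basket}
            for basket in transactions]

def pair_counts(transactions):
    """Count transactions containing each pair of distinct entries."""
    counts = Counter()
    for basket in transactions:
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return counts

grouped = roll_up(TRANSACTIONS, HIERARCHY)
print(pair_counts(grouped))
```

With this hypothetical grouping, five items collapse into three groups, so there are at most three group pairs to count instead of ten item pairs.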
37
Thank You
8
Association Rules Apply Elsewhere
Besides retail (supermarkets, etc.)…
  Purchases made using credit/debit cards
  Optional telco service purchases
  Banking services
  Unusual combinations of insurance claims can be a warning of fraud
  Medical patient histories
9
A certainty measure for association rules of the form "A => B", where A and B are sets of items, is confidence
Given a set of task
10
Typical Data Structure (Relational Database)
Lots of questions can be answered
  Avg. number of orders per customer
  Avg. number of unique items per order
  Avg. number of items per order
  For a product
    What percentage of customers have purchased it
    Avg. number of orders per customer that include it
    Avg. quantity of it purchased per order
  Etc…
Visualization is extremely helpful… next slide
Transaction Data
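As a rough sketch of how such questions could be answered from order-line records, the Python below computes a few of the averages. The `ORDER_LINES` rows, their column layout (customer, order, item, quantity), and the `summarize` helper are all hypothetical and chosen only for illustration. The sketch assumes order IDs are unique across customers.

```python
from collections import defaultdict

# Hypothetical order lines: (customer_id, order_id, item, quantity).
ORDER_LINES = [
    ("c1", "o1", "Coke", 2),
    ("c1", "o1", "soda", 1),
    ("c1", "o2", "milk", 1),
    ("c2", "o3", "Coke", 1),
    ("c2", "o3", "detergent", 1),
    ("c3", "o4", "soda", 3),
]

def summarize(order_lines, product):
    """Basic per-customer and per-order averages, plus reach of one product."""
    orders_by_customer = defaultdict(set)
    items_by_order = defaultdict(set)
    units_by_order = defaultdict(int)
    buyers = set()
    for customer, order, item, qty in order_lines:
        orders_by_customer[customer].add(order)
        items_by_order[order].add(item)
        units_by_order[order] += qty
        if item == product:
            buyers.add(customer)
    n_customers = len(orders_by_customer)
    n_orders = len(items_by_order)
    return {
        "avg_orders_per_customer": n_orders / n_customers,
        "avg_unique_items_per_order":
            sum(len(items) for items in items_by_order.values()) / n_orders,
        "avg_units_per_order": sum(units_by_order.values()) / n_orders,
        "pct_customers_buying_product": 100 * len(buyers) / n_customers,
    }

stats = summarize(ORDER_LINES, "Coke")
for name, value in stats.items():
    print(f"{name}: {value:.2f}")
```

In a relational database the same figures would typically come from grouped queries over the order-line table; the dictionary-based version here just keeps the example self-contained.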
11
Sales Order Characteristics
12
Sales Order Characteristics
Did the order use gift wrap?
Billing address same as shipping address?
Did purchaser accept/decline a cross-sell?
What is the most common item found on a one-item order?
What is the most common item found on a multi-item order?
What is the most common item for repeat customer purchases?
How has ordering of an item changed over time?
How does the ordering of an item vary geographically?
13
Pivoting for Cluster Algorithms
14
Association Rules
Wal-Mart customers who purchase Barbie dolls have a 60% likelihood of also purchasing one of three types of candy bars [Forbes, Sept 8, 1997]
Customers who purchase maintenance agreements are very likely to purchase large appliances
When a new hardware store opens, one of the most commonly sold items is toilet bowl cleaners
15
Association Rules
Association rule types
  Actionable Rules: contain high-quality, actionable information
  Trivial Rules: information already well known by those familiar with the business
  Inexplicable Rules: no explanation and do not suggest action
Trivial and Inexplicable Rules occur most often
16
How Good is an Association Rule
POS Transactions

Customer  Items Purchased
1         Coke, soda
2         Milk, Coke, window cleaner
3         Coke, detergent
4         Coke, detergent, soda
5         Window cleaner, soda

Co-occurrence of Products

                Coke  Window cleaner  Milk  Soda  Detergent
Coke            4     1               1     2     2
Window cleaner  1     2               1     1     0
Milk            1     1               1     0     0
Soda            2     1               0     3     1
Detergent       2     0               0     1     2
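The co-occurrence table can be rebuilt directly from the five POS transactions. The Python sketch below is one minimal way to do it. The function name `co_occurrence` and the nested-dictionary layout are assumptions made for illustration.

```python
# Item order matches the columns of the co-occurrence table.
ITEMS = ["Coke", "window cleaner", "milk", "soda", "detergent"]

# The five POS transactions from the deck's example table.
TRANSACTIONS = [
    {"Coke", "soda"},
    {"milk", "Coke", "window cleaner"},
    {"Coke", "detergent"},
    {"Coke", "detergent", "soda"},
    {"window cleaner", "soda"},
]

def co_occurrence(transactions, items):
    """matrix[a][b] = number of transactions containing both a and b.
    The diagonal matrix[a][a] is simply how often item a was purchased."""
    return {
        a: {b: sum(1 for t in transactions if a in t and b in t)
            for b in items}
        for a in items
    }

matrix = co_occurrence(TRANSACTIONS, ITEMS)
for a in ITEMS:
    print(f"{a:15}", *(matrix[a][b] for b in ITEMS))
```

The matrix is symmetric, so only the diagonal and one triangle carry distinct information.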
17
How Good is an Association Rule
                Coke  Window cleaner  Milk  Soda  Detergent
Coke            4     1               1     2     2
Window cleaner  1     2               1     1     0
Milk            1     1               1     0     0
Soda            2     1               0     3     1
Detergent       2     0               0     1     2

Simple patterns
1. Coke and soda are purchased together more often than most other pairs of items (2 of 5 transactions, tied with Coke and detergent)
2. Detergent is never purchased with milk or window cleaner
3. Milk is never purchased with soda or detergent
18
How Good is an Association Rule
What is the confidence for this rule?
  If a customer purchases soda, then the customer also purchases Coke
  2 out of 3 soda purchases also include Coke, so 67%
What about the confidence of this rule reversed?
  2 out of 4 Coke purchases also include soda, so 50%
Confidence = ratio of the number of transactions with all the items to the number of transactions with just the "if" items
POS Transactions

Customer  Items Purchased
1         Coke, soda
2         Milk, Coke, window cleaner
3         Coke, detergent
4         Coke, detergent, soda
5         Window cleaner, soda
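The two confidence figures on this slide can be reproduced with a few lines of Python. This is a minimal sketch of the stated ratio; the function name `confidence` and the use of sets for baskets are assumptions for illustration.

```python
# The five POS transactions from the table above.
TRANSACTIONS = [
    {"Coke", "soda"},
    {"milk", "Coke", "window cleaner"},
    {"Coke", "detergent"},
    {"Coke", "detergent", "soda"},
    {"window cleaner", "soda"},
]

def confidence(if_items, then_items, transactions):
    """Transactions with all the items divided by transactions
    with just the "if" items."""
    if_items, then_items = set(if_items), set(then_items)
    with_if = [t for t in transactions if if_items <= t]
    with_all = [t for t in with_if if then_items <= t]
    return len(with_all) / len(with_if)

print(round(confidence({"soda"}, {"Coke"}, TRANSACTIONS), 2))  # 0.67
print(confidence({"Coke"}, {"soda"}, TRANSACTIONS))            # 0.5
```

The same pair of items gives different confidence in each direction because the denominator changes: 3 soda baskets versus 4 Coke baskets.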
10
Typical Data Structure (Relational Database)
Lots of questions can be answeredLots of questions can be answered Avg of orderscustomerAvg of orderscustomer Avg unique itemsorderAvg unique itemsorder Avg of itemsorderAvg of itemsorder For a productFor a product
What of customers have purchasedWhat of customers have purchased Avg orderscustomer include itAvg orderscustomer include it Avg quantity of it purchasedorderAvg quantity of it purchasedorder
EtchellipEtchellip Visualization is extremely helpfulhellipnext Visualization is extremely helpfulhellipnext
slide slide
Transaction Data
11
Sales Order Characteristics
12
Sales Order Characteristics
Did the order use gift wrapDid the order use gift wrap Billing address same as Shipping addressBilling address same as Shipping address Did purchaser acceptdecline a cross-sellDid purchaser acceptdecline a cross-sell What is the most common item found on a What is the most common item found on a
one-item orderone-item order What is the most common item found on a What is the most common item found on a
multi-item ordermulti-item order What is the most common item for repeat What is the most common item for repeat
customer purchasescustomer purchases How has ordering of an item changed over How has ordering of an item changed over
timetime How does the ordering of an item vary How does the ordering of an item vary
geographicallygeographically
13
Pivoting for Cluster Algorithms
14
Association Rules
Wal-Mart customers who purchase Wal-Mart customers who purchase Barbie dolls have a 60 likelihood of Barbie dolls have a 60 likelihood of also purchasing one of three types of also purchasing one of three types of candy bars [candy bars [ForbesForbes Sept 8 1997] Sept 8 1997]
Customers who purchase maintenance Customers who purchase maintenance agreements are very likely to purchase agreements are very likely to purchase large appliances When a new hardware large appliances When a new hardware store opens one of the most commonly store opens one of the most commonly sold items is toilet bowl cleanerssold items is toilet bowl cleaners
15
Association Rules
Association rule typesAssociation rule types Actionable Rules ndash contain high-Actionable Rules ndash contain high-
quality actionable informationquality actionable information Trivial Rules ndash information already Trivial Rules ndash information already
well-known by those familiar with well-known by those familiar with the businessthe business
Inexplicable Rules ndash no explanation Inexplicable Rules ndash no explanation and do not suggest actionand do not suggest action
Trivial and Inexplicable Rules Trivial and Inexplicable Rules occur most oftenoccur most often
16
How Good is an Association Rule
CustomerCustomer Items PurchasedItems Purchased
11 Coke sodaCoke soda
22 Milk Coke window cleanerMilk Coke window cleaner
33 Coke detergentCoke detergent
44 Coke detergent sodaCoke detergent soda
55 Window cleaner sodaWindow cleaner soda
CokCokee
Window Window cleanercleaner
MilkMilk SodaSoda DetergentDetergent
CokeCoke 44 11 11 22 22
Window cleanerWindow cleaner 11 22 11 11 00
MilkMilk 11 11 11 00 00
SodaSoda 22 11 00 33 11
DetergentDetergent 22 00 00 11 22
POS Transactions
Co-occurrence ofProducts
17
How Good is an Association Rule
CokCokee
Window Window cleanercleaner
MilkMilk SodaSoda DetergentDetergent
44 11 11 22 22
Window cleanerWindow cleaner 11 22 11 11 00
MilkMilk 11 11 11 00 00
SodaSoda 22 11 00 33 11
DetergentDetergent 22 00 00 11 22
Simple patterns1 Coke and soda are more likely purchased together thanany other two items2 Detergent is never purchased with milk or window cleaner3 Milk is never purchased with soda or detergent
18
How Good is an Association Rule
What is the confidence for this ruleWhat is the confidence for this rule If a customer purchases soda then customer also purchases CokeIf a customer purchases soda then customer also purchases Coke 2 out of 3 soda purchases also include Coke so 672 out of 3 soda purchases also include Coke so 67
What about the confidence of this rule reversedWhat about the confidence of this rule reversed 2 out of 4 Coke purchases also include soda so 502 out of 4 Coke purchases also include soda so 50
Confidence Confidence = Ratio of the number of transactions with all the = Ratio of the number of transactions with all the items to the number of transactions with just the ldquoifrdquo itemsitems to the number of transactions with just the ldquoifrdquo items
Customer Items Purchased
1 Coke soda
2 Milk Coke window cleaner
3 Coke detergent
4 Coke detergent soda
5 Window cleaner soda
POS Transactions
19
How Good is an Association Rule
How much better than chance is a rule?
Lift (improvement) tells us how much better a rule is at predicting the result than just assuming the result in the first place
Lift is the ratio of the records that support the entire rule to the number that would be expected assuming there was no relationship between the products
Calculating lift … p. 310 …
When lift > 1, the rule is better at predicting the result than guessing
When lift < 1, the rule is doing worse than informed guessing, and using the Negative Rule produces a better rule than guessing
Co-occurrence can occur in 3, 4, or more dimensions…
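As an illustration of lift on the five-order example, the sketch below computes lift as the observed share of orders containing both sides of the rule divided by the share expected if the two sides were unrelated. The helper names (`support`, `lift`, `negative_lift`) are hypothetical, and the negative rule is handled only for a single "then" item.

```python
# The five POS transactions from the example, one set of items per customer.
transactions = [
    {"Coke", "soda"},
    {"milk", "Coke", "window cleaner"},
    {"Coke", "detergent"},
    {"Coke", "detergent", "soda"},
    {"window cleaner", "soda"},
]


def support(items, baskets):
    """Fraction of baskets that contain every item in `items`."""
    items = set(items)
    return sum(1 for b in baskets if items <= b) / len(baskets)


def lift(if_items, then_items, baskets):
    """Observed joint support divided by the support expected under independence."""
    both = support(set(if_items) | set(then_items), baskets)
    expected = support(if_items, baskets) * support(then_items, baskets)
    return both / expected


def negative_lift(if_items, then_item, baskets):
    """Lift of the negative rule: if `if_items`, then NOT `then_item`."""
    if_items = set(if_items)
    n = len(baskets)
    p_if = support(if_items, baskets)
    p_not_then = sum(1 for b in baskets if then_item not in b) / n
    p_both = sum(1 for b in baskets if if_items <= b and then_item not in b) / n
    return p_both / (p_if * p_not_then)


print(round(lift({"soda"}, {"Coke"}, transactions), 2))         # 0.83
print(round(negative_lift({"soda"}, "Coke", transactions), 2))  # 1.67
```

On this tiny data set "if soda then Coke" has lift below 1 even though its confidence is 67%, because Coke already appears in 4 of the 5 orders; the negative rule "if soda then not Coke" has lift above 1.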
20
Creating Association Rules
1. Choosing the right set of items
2. Generating rules by deciphering the counts in the co-occurrence matrix
3. Overcoming the practical limits imposed by thousands or tens of thousands of unique items
21
Overcoming Practical Limits for Association Rules
1. Generate co-occurrence matrix for single items… "if Coke then soda"
2. Generate co-occurrence matrix for two items… "if Coke and Milk then soda"
3. Generate co-occurrence matrix for three items… "if Coke and Milk and Window Cleaner then soda"
4. Etc…
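The level-by-level idea in the steps above can be sketched as counting how often each combination of k items appears in the same order, for k = 1, 2, 3, and so on. This is a brute-force illustration only; the function name `count_itemsets` is an assumption, and practical implementations usually discard infrequent combinations before moving to the next size.

```python
from collections import Counter
from itertools import combinations

# The five POS transactions from the example, one set of items per customer.
transactions = [
    {"Coke", "soda"},
    {"milk", "Coke", "window cleaner"},
    {"Coke", "detergent"},
    {"Coke", "detergent", "soda"},
    {"window cleaner", "soda"},
]


def count_itemsets(baskets, k):
    """Count the baskets containing each combination of k items (no pruning)."""
    counts = Counter()
    for basket in baskets:
        for combo in combinations(sorted(basket), k):
            counts[frozenset(combo)] += 1
    return counts


singles = count_itemsets(transactions, 1)
pairs = count_itemsets(transactions, 2)
triples = count_itemsets(transactions, 3)

print(singles[frozenset({"Coke"})])                       # 4
print(pairs[frozenset({"Coke", "soda"})])                 # 2
print(triples[frozenset({"Coke", "detergent", "soda"})])  # 1
```

A combination that never occurs simply has a count of zero, so the three-item table is much sparser than the two-item one even on five orders.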
22
Final Thought on Association Rules: The Problem of Lots of Data
Fast food restaurant… could have 100 items on its menu
How many combinations are there with 3 different menu items? 161,700
Supermarket… 10,000 or more unique items
About 50 million 2-item combinations
More than 100 billion 3-item combinations (about 167 billion)
Use of product hierarchies (groupings) helps address this common issue
Finally, know that the number of transactions in a given time period could also be huge (hence expensive to analyze)
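The combination counts on this slide are binomial coefficients ("n choose k"). A quick check with the Python standard library, assuming exactly 100 menu items and exactly 10,000 supermarket items:

```python
from math import comb

# Number of distinct k-item combinations that can be drawn from n unique items.
print(comb(100, 3))     # 161700        3-item combinations, 100 menu items
print(comb(10_000, 2))  # 49995000      2-item combinations, 10,000 items
print(comb(10_000, 3))  # 166616670000  3-item combinations, 10,000 items
```

Each extra item in the combination multiplies the count by roughly n/k, which is why the tables become impractical long before the item catalogue does.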
23
Business and other cases
33
General Observations
Banking case seems to provide well-defined and intelligible information of the form account_1 and account_2, etc., or activity_1 and activity_2, etc., possibly indexed by time
As such, the rules found provide a guide to action: to offer a product or service (cross-sell)
34
In the retailing case of items purchased together, guidance is not so clear-cut due to the extensive number of rules
The soccer event exemplifies the sequencing of events towards reaching a goal. Basketball-applied software was developed years ago. Web mining shares the same principles, without the passion usually associated with sports
35
Challenges
A major difficulty is that a large number of the rules found may be trivial for anyone familiar with the business
The computational complexity involved in calculating the results of market basket analysis is at least the square of the number of transaction item-lines (records of every item purchased). With data warehouses storing billions of transaction lines, this yields extremely high computational requirements
36
Solutions
Differential market basket analysis can find interesting results and can also eliminate the problem of a potentially high volume of trivial results
Special techniques involving filtering or aggregation of the transaction database are commonly used in analysis algorithms to increase performance and allow some level of interactivity, such as in business intelligence applications
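A rough sketch of what filtering and aggregation might look like on the five-order example. The category mapping `CATEGORY` is a hypothetical product hierarchy invented for illustration, and the function names and the minimum-count threshold are assumptions rather than anything prescribed by the slides.

```python
from collections import Counter
from itertools import combinations

# The five POS transactions from the example, one set of items per customer.
transactions = [
    {"Coke", "soda"},
    {"milk", "Coke", "window cleaner"},
    {"Coke", "detergent"},
    {"Coke", "detergent", "soda"},
    {"window cleaner", "soda"},
]

# Hypothetical product hierarchy: item -> category.
CATEGORY = {
    "Coke": "beverages",
    "soda": "beverages",
    "milk": "dairy",
    "window cleaner": "cleaning",
    "detergent": "cleaning",
}


def aggregate(baskets, hierarchy):
    """Aggregation: replace every item by its category."""
    return [{hierarchy[item] for item in basket} for basket in baskets]


def filter_rare_items(baskets, min_count):
    """Filtering: drop items that appear in fewer than `min_count` baskets."""
    item_counts = Counter(item for basket in baskets for item in basket)
    keep = {item for item, n in item_counts.items() if n >= min_count}
    return [basket & keep for basket in baskets]


def pair_counts(baskets):
    """Count the baskets containing each pair of items."""
    counts = Counter()
    for basket in baskets:
        counts.update(frozenset(p) for p in combinations(sorted(basket), 2))
    return counts


print(len(pair_counts(transactions)))                        # 7 distinct pairs
print(len(pair_counts(filter_rare_items(transactions, 2))))  # 5 after filtering
print(len(pair_counts(aggregate(transactions, CATEGORY))))   # 3 after aggregation
```

Both steps shrink the number of pairs that have to be counted; the trade-off is that rules are then stated about frequent items or whole categories rather than every individual product.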
37
Thank You
11
Sales Order Characteristics
Did the order use gift wrap?
Billing address same as shipping address?
Did purchaser accept/decline a cross-sell?
What is the most common item found on a one-item order?
What is the most common item found on a multi-item order?
What is the most common item for repeat customer purchases?
How has ordering of an item changed over time?
How does the ordering of an item vary geographically?
13
Pivoting for Cluster Algorithms
14
Association Rules
Wal-Mart customers who purchase Barbie dolls have a 60% likelihood of also purchasing one of three types of candy bars [Forbes, Sept 8, 1997]
Customers who purchase maintenance agreements are very likely to purchase large appliances
When a new hardware store opens, one of the most commonly sold items is toilet bowl cleaners
15
Association Rules
Association rule types
Actionable Rules – contain high-quality, actionable information
Trivial Rules – information already well-known by those familiar with the business
Inexplicable Rules – no explanation and do not suggest action
Trivial and Inexplicable Rules occur most often
16
How Good is an Association Rule
POS Transactions
Customer  Items Purchased
1         Coke, soda
2         Milk, Coke, window cleaner
3         Coke, detergent
4         Coke, detergent, soda
5         Window cleaner, soda

Co-occurrence of Products
                Coke  Window cleaner  Milk  Soda  Detergent
Coke            4     1               1     2     2
Window cleaner  1     2               1     1     0
Milk            1     1               1     0     0
Soda            2     1               0     3     1
Detergent       2     0               0     1     2
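The co-occurrence table can be rebuilt directly from the five orders. A minimal sketch, assuming each order is stored as a set of item names; the function name `cooccurrence` and the item ordering in `ITEMS` are choices made for illustration.

```python
# The five POS transactions from the slide, one set of items per customer.
transactions = [
    {"Coke", "soda"},
    {"milk", "Coke", "window cleaner"},
    {"Coke", "detergent"},
    {"Coke", "detergent", "soda"},
    {"window cleaner", "soda"},
]

# Row and column order used in the slide's table.
ITEMS = ["Coke", "window cleaner", "milk", "soda", "detergent"]


def cooccurrence(baskets, items):
    """Cell [r][c] = number of baskets containing both item r and item c.

    The diagonal therefore holds the number of baskets containing each item.
    """
    return [
        [sum(1 for b in baskets if row in b and col in b) for col in items]
        for row in items
    ]


matrix = cooccurrence(transactions, ITEMS)
for name, row in zip(ITEMS, matrix):
    print(f"{name:<15}", *row)
```

The matrix is symmetric, so only the diagonal and one triangle actually need to be counted.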
13
Pivoting for Cluster Algorithms
14
Association Rules
Wal-Mart customers who purchase Wal-Mart customers who purchase Barbie dolls have a 60 likelihood of Barbie dolls have a 60 likelihood of also purchasing one of three types of also purchasing one of three types of candy bars [candy bars [ForbesForbes Sept 8 1997] Sept 8 1997]
Customers who purchase maintenance Customers who purchase maintenance agreements are very likely to purchase agreements are very likely to purchase large appliances When a new hardware large appliances When a new hardware store opens one of the most commonly store opens one of the most commonly sold items is toilet bowl cleanerssold items is toilet bowl cleaners
15
Association Rules
Association rule typesAssociation rule types Actionable Rules ndash contain high-Actionable Rules ndash contain high-
quality actionable informationquality actionable information Trivial Rules ndash information already Trivial Rules ndash information already
well-known by those familiar with well-known by those familiar with the businessthe business
Inexplicable Rules ndash no explanation Inexplicable Rules ndash no explanation and do not suggest actionand do not suggest action
Trivial and Inexplicable Rules Trivial and Inexplicable Rules occur most oftenoccur most often
16
How Good is an Association Rule
CustomerCustomer Items PurchasedItems Purchased
11 Coke sodaCoke soda
22 Milk Coke window cleanerMilk Coke window cleaner
33 Coke detergentCoke detergent
44 Coke detergent sodaCoke detergent soda
55 Window cleaner sodaWindow cleaner soda
CokCokee
Window Window cleanercleaner
MilkMilk SodaSoda DetergentDetergent
CokeCoke 44 11 11 22 22
Window cleanerWindow cleaner 11 22 11 11 00
MilkMilk 11 11 11 00 00
SodaSoda 22 11 00 33 11
DetergentDetergent 22 00 00 11 22
POS Transactions
Co-occurrence ofProducts
17
How Good is an Association Rule
CokCokee
Window Window cleanercleaner
MilkMilk SodaSoda DetergentDetergent
44 11 11 22 22
Window cleanerWindow cleaner 11 22 11 11 00
MilkMilk 11 11 11 00 00
SodaSoda 22 11 00 33 11
DetergentDetergent 22 00 00 11 22
Simple patterns1 Coke and soda are more likely purchased together thanany other two items2 Detergent is never purchased with milk or window cleaner3 Milk is never purchased with soda or detergent
18
How Good is an Association Rule
What is the confidence for this ruleWhat is the confidence for this rule If a customer purchases soda then customer also purchases CokeIf a customer purchases soda then customer also purchases Coke 2 out of 3 soda purchases also include Coke so 672 out of 3 soda purchases also include Coke so 67
What about the confidence of this rule reversedWhat about the confidence of this rule reversed 2 out of 4 Coke purchases also include soda so 502 out of 4 Coke purchases also include soda so 50
Confidence Confidence = Ratio of the number of transactions with all the = Ratio of the number of transactions with all the items to the number of transactions with just the ldquoifrdquo itemsitems to the number of transactions with just the ldquoifrdquo items
Customer Items Purchased
1 Coke soda
2 Milk Coke window cleaner
3 Coke detergent
4 Coke detergent soda
5 Window cleaner soda
POS Transactions
19
How Good is an Association Rule
How much better than chance is a ruleHow much better than chance is a rule Lift (improvementa) tells us how much better a rule is at Lift (improvementa) tells us how much better a rule is at
predicting the result than just assuming the result in the first predicting the result than just assuming the result in the first placeplace
Lift Lift is the ratio of the records that support the entire rule to is the ratio of the records that support the entire rule to the number that would be expected assuming there was no the number that would be expected assuming there was no relationship between the productsrelationship between the products
Calculating lifthellipp 310hellipWhen lift gt 1 then the rule is better at Calculating lifthellipp 310hellipWhen lift gt 1 then the rule is better at predicting the result than guessingpredicting the result than guessing
When lift lt 1 the rule is doing worse than informed guessing When lift lt 1 the rule is doing worse than informed guessing and using the and using the Negative RuleNegative Rule produces a better rule than produces a better rule than guessingguessing
Co-occurrence can occur in 3 4 or more dimensionshellipCo-occurrence can occur in 3 4 or more dimensionshellip
20
Creating Association Rules
11 Choosing the right set Choosing the right set of itemsof items
22 Generating rules by Generating rules by deciphering the deciphering the counts in the co-counts in the co-occurrence matrixoccurrence matrix
33 Overcoming the Overcoming the practical limits practical limits imposed by thousands imposed by thousands or tens of thousands or tens of thousands of unique itemsof unique items
21
Overcoming Practical Limits for Association Rules
11 Generate co-occurrence matrix Generate co-occurrence matrix for single itemshelliprdquofor single itemshelliprdquoif Coke then if Coke then sodardquosodardquo
22 Generate co-occurrence matrix Generate co-occurrence matrix for two itemshelliprdquofor two itemshelliprdquoif Coke and Milk if Coke and Milk then sodardquothen sodardquo
33 Generate co-occurrence matrix Generate co-occurrence matrix for three itemshelliprdquofor three itemshelliprdquoif Coke and Milk if Coke and Milk and Windowand Window Cleanerrdquo then soda Cleanerrdquo then soda
44 EtchellipEtchellip
22
Final Thought on Association RulesThe Problem of Lots of Data
Fast Food Restauranthellipcould have 100 Fast Food Restauranthellipcould have 100 items on its menuitems on its menu How many combinations are there with 3 How many combinations are there with 3
different menu items 161700 different menu items 161700 Supermarkethellip10000 or more unique Supermarkethellip10000 or more unique
itemsitems 50 million 2-item combinations50 million 2-item combinations 100 billion 3-item combinations100 billion 3-item combinations
Use of product hierarchies (groupings) Use of product hierarchies (groupings) helps address this common issuehelps address this common issue
Finally know that the number of Finally know that the number of transactions in a given time-period could transactions in a given time-period could also be huge (hence expensive to analyze)also be huge (hence expensive to analyze)
23
Business and other cases
24
25
26
27
28
29
30
31
32
33
General Observations
The banking case seems to provide well-defined and intelligible information of the form "account_1 and account_2", etc., or "activity_1 and activity_2", etc., possibly indexed by time.
As such, the rules found provide a guide to action: offer a product or service (cross-sell).
34
In the retailing case of items purchased together, guidance is not so clear-cut because of the extensive number of rules.
The soccer case exemplifies the sequencing of events toward reaching a goal. Basketball-applied software was developed years ago. Web mining shares the same principles, without the passion usually associated with sports.
35
Challenges
A major difficulty is that a large number of the rules found may be trivial for anyone familiar with the business.
The computational complexity involved in calculating the results of market basket analysis is at least the square of the number of transaction item-lines (records of every item purchased). With data warehouses storing billions of transaction lines, this yields extremely high computational requirements.
36
Solutions
Differential market basket analysis can find interesting results and can also eliminate the problem of a potentially high volume of trivial results.
Special techniques involving filtering or aggregation of the transaction database are commonly used in analysis algorithms to increase performance and allow some level of interactivity, such as in business intelligence applications.
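A minimal Python sketch of the two techniques named above, aggregation and filtering. Everything specific here is an assumption for illustration: the CATEGORY mapping is a made-up product hierarchy, and the helper names aggregate and frequent_pairs are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical product hierarchy: item -> category (assumed for illustration).
CATEGORY = {
    "Coke": "beverages",
    "soda": "beverages",
    "milk": "dairy",
    "detergent": "cleaning",
    "window cleaner": "cleaning",
}

transactions = [
    {"Coke", "soda"},
    {"milk", "Coke", "window cleaner"},
    {"Coke", "detergent"},
    {"Coke", "detergent", "soda"},
    {"window cleaner", "soda"},
]

def aggregate(baskets, hierarchy):
    """Aggregation: replace each item by its category, shrinking the item set."""
    return [{hierarchy[item] for item in basket} for basket in baskets]

def frequent_pairs(baskets, min_count):
    """Filtering: keep only pairs seen in at least `min_count` baskets."""
    counts = Counter()
    for basket in baskets:
        counts.update(combinations(sorted(basket), 2))
    return {pair: n for pair, n in counts.items() if n >= min_count}

item_level = frequent_pairs(transactions, min_count=2)
category_level = frequent_pairs(aggregate(transactions, CATEGORY), min_count=2)

print(item_level)
print(category_level)
```

At the category level five items collapse to three, so there are fewer pairs to count and the surviving pair has more baskets behind it.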
37
Thank You
14
Association Rules
Wal-Mart customers who purchase Barbie dolls have a 60% likelihood of also purchasing one of three types of candy bars [Forbes, Sept 8, 1997].
Customers who purchase maintenance agreements are very likely to purchase large appliances.
When a new hardware store opens, one of the most commonly sold items is toilet bowl cleaners.
15
Association Rules
Association rule types
  Actionable Rules: contain high-quality actionable information
  Trivial Rules: information already well-known by those familiar with the business
  Inexplicable Rules: no explanation and do not suggest action
Trivial and Inexplicable Rules occur most often.
16
How Good is an Association Rule
POS Transactions

Customer  Items Purchased
1         Coke, soda
2         Milk, Coke, window cleaner
3         Coke, detergent
4         Coke, detergent, soda
5         Window cleaner, soda

Co-occurrence of Products

                Coke  Window cleaner  Milk  Soda  Detergent
Coke            4     1               1     2     2
Window cleaner  1     2               1     1     0
Milk            1     1               1     0     0
Soda            2     1               0     3     1
Detergent       2     0               0     1     2
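The co-occurrence table can be rebuilt from the five POS transactions with a short Python sketch. The function name cooccurrence_matrix and the nested-dict layout are assumptions chosen for readability, not a prescribed implementation.

```python
from itertools import product

ITEMS = ["Coke", "window cleaner", "milk", "soda", "detergent"]

transactions = [
    {"Coke", "soda"},
    {"milk", "Coke", "window cleaner"},
    {"Coke", "detergent"},
    {"Coke", "detergent", "soda"},
    {"window cleaner", "soda"},
]

def cooccurrence_matrix(baskets, items):
    """Return matrix[a][b] = number of baskets containing both a and b.

    The diagonal matrix[a][a] is the number of baskets containing a.
    """
    matrix = {row: {col: 0 for col in items} for row in items}
    for basket in baskets:
        # Every ordered pair, including (a, a), gets one count per basket.
        for row, col in product(basket, repeat=2):
            matrix[row][col] += 1
    return matrix

matrix = cooccurrence_matrix(transactions, ITEMS)
for row in ITEMS:
    print(f"{row:15}", [matrix[row][col] for col in ITEMS])
```

The matrix is symmetric, so only the diagonal and one triangle carry distinct information.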
17
How Good is an Association Rule
                Coke  Window cleaner  Milk  Soda  Detergent
Coke            4     1               1     2     2
Window cleaner  1     2               1     1     0
Milk            1     1               1     0     0
Soda            2     1               0     3     1
Detergent       2     0               0     1     2

Simple patterns
1. Coke and soda are purchased together more often than most other pairs (2 of the 5 baskets, tied with Coke and detergent)
2. Detergent is never purchased with milk or window cleaner
3. Milk is never purchased with soda or detergent
18
How Good is an Association Rule
What is the confidence for this rule: "If a customer purchases soda, then the customer also purchases Coke"?
  2 out of 3 soda purchases also include Coke, so 67%
What about the confidence of this rule reversed?
  2 out of 4 Coke purchases also include soda, so 50%
Confidence = ratio of the number of transactions with all the items to the number of transactions with just the "if" items

POS Transactions

Customer  Items Purchased
1         Coke, soda
2         Milk, Coke, window cleaner
3         Coke, detergent
4         Coke, detergent, soda
5         Window cleaner, soda
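The confidence definition above translates directly into Python. This is a sketch on the slide's five transactions; the function name confidence and its argument names are assumptions for illustration.

```python
transactions = [
    {"Coke", "soda"},
    {"milk", "Coke", "window cleaner"},
    {"Coke", "detergent"},
    {"Coke", "detergent", "soda"},
    {"window cleaner", "soda"},
]

def confidence(baskets, if_items, then_items):
    """Baskets with all the rule's items divided by baskets with the 'if' items."""
    if_items, then_items = set(if_items), set(then_items)
    with_if = [b for b in baskets if if_items <= b]
    if not with_if:
        raise ValueError("no basket contains the 'if' items")
    with_all = [b for b in with_if if then_items <= b]
    return len(with_all) / len(with_if)

soda_then_coke = confidence(transactions, {"soda"}, {"Coke"})
coke_then_soda = confidence(transactions, {"Coke"}, {"soda"})

print(f"soda -> Coke: {soda_then_coke:.0%}")
print(f"Coke -> soda: {coke_then_soda:.0%}")
```

Confidence is not symmetric: the same two baskets are divided by 3 soda baskets in one direction and by 4 Coke baskets in the other.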
19
How Good is an Association Rule
How much better than chance is a rule?
Lift (improvement) tells us how much better a rule is at predicting the result than just assuming the result in the first place.
Lift is the ratio of the number of records that support the entire rule to the number that would be expected assuming there was no relationship between the products.
Calculating lift ... p. 310 ... When lift > 1, the rule is better at predicting the result than guessing.
When lift < 1, the rule is doing worse than informed guessing, and using the Negative Rule produces a better rule than guessing.
Co-occurrence can occur in 3, 4, or more dimensions ...
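As a worked illustration on the five sample transactions, lift can be computed as the rule's support divided by the product of the supports of its two sides. The sketch below makes assumptions of its own: the helper names support, lift and negative_rule_lift are hypothetical, and the negative rule is taken to mean "if soda then NOT Coke".

```python
transactions = [
    {"Coke", "soda"},
    {"milk", "Coke", "window cleaner"},
    {"Coke", "detergent"},
    {"Coke", "detergent", "soda"},
    {"window cleaner", "soda"},
]

def support(baskets, items):
    """Fraction of baskets that contain every item in `items`."""
    items = set(items)
    return sum(1 for b in baskets if items <= b) / len(baskets)

def lift(baskets, if_items, then_items):
    """Observed joint support over the support expected under independence."""
    both = support(baskets, set(if_items) | set(then_items))
    return both / (support(baskets, if_items) * support(baskets, then_items))

def negative_rule_lift(baskets, if_items, then_items):
    """Lift of the negative rule: 'if' items present, 'then' items absent."""
    if_items, then_items = set(if_items), set(then_items)
    hits = sum(1 for b in baskets if if_items <= b and not then_items <= b)
    p_not_then = 1 - support(baskets, then_items)
    return (hits / len(baskets)) / (support(baskets, if_items) * p_not_then)

rule_lift = lift(transactions, {"soda"}, {"Coke"})
negative_lift = negative_rule_lift(transactions, {"soda"}, {"Coke"})

print(round(rule_lift, 3))      # (2/5) / ((3/5) * (4/5))
print(round(negative_lift, 3))  # (1/5) / ((3/5) * (1/5))
```

On this data the rule "if soda then Coke" has lift below 1 even though its confidence is 67%, because Coke already appears in 4 of the 5 baskets. The negative rule has lift above 1.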
20
Creating Association Rules
1. Choosing the right set of items
2. Generating rules by deciphering the counts in the co-occurrence matrix
3. Overcoming the practical limits imposed by thousands or tens of thousands of unique items
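Step 2 can be sketched in Python as reading one-item-to-one-item rules off the pair counts. This is a hedged illustration, not a full rule miner: the function name pair_rules and the 0.6 confidence threshold are assumptions chosen for the example.

```python
from collections import Counter
from itertools import combinations, permutations

transactions = [
    {"Coke", "soda"},
    {"milk", "Coke", "window cleaner"},
    {"Coke", "detergent"},
    {"Coke", "detergent", "soda"},
    {"window cleaner", "soda"},
]

def pair_rules(baskets, min_confidence):
    """Return (if_item, then_item, confidence) for rules meeting the threshold."""
    item_counts = Counter()
    pair_counts = Counter()
    for basket in baskets:
        item_counts.update(basket)
        pair_counts.update(combinations(sorted(basket), 2))
    rules = []
    for pair, n_both in pair_counts.items():
        # Each pair yields two candidate rules, one per direction.
        for if_item, then_item in permutations(pair):
            conf = n_both / item_counts[if_item]
            if conf >= min_confidence:
                rules.append((if_item, then_item, conf))
    # Highest confidence first, then alphabetical for a stable order.
    return sorted(rules, key=lambda r: (-r[2], r[0], r[1]))

rules = pair_rules(transactions, min_confidence=0.6)
for if_item, then_item, conf in rules:
    print(f"if {if_item} then {then_item}: {conf:.0%}")
```

The two milk rules reach 100% confidence on a single basket, so in practice a minimum support threshold would usually be applied as well.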
17
How Good is an Association Rule
CokCokee
Window Window cleanercleaner
MilkMilk SodaSoda DetergentDetergent
44 11 11 22 22
Window cleanerWindow cleaner 11 22 11 11 00
MilkMilk 11 11 11 00 00
SodaSoda 22 11 00 33 11
DetergentDetergent 22 00 00 11 22
Simple patterns1 Coke and soda are more likely purchased together thanany other two items2 Detergent is never purchased with milk or window cleaner3 Milk is never purchased with soda or detergent
18
How Good is an Association Rule
What is the confidence for this ruleWhat is the confidence for this rule If a customer purchases soda then customer also purchases CokeIf a customer purchases soda then customer also purchases Coke 2 out of 3 soda purchases also include Coke so 672 out of 3 soda purchases also include Coke so 67
What about the confidence of this rule reversedWhat about the confidence of this rule reversed 2 out of 4 Coke purchases also include soda so 502 out of 4 Coke purchases also include soda so 50
Confidence Confidence = Ratio of the number of transactions with all the = Ratio of the number of transactions with all the items to the number of transactions with just the ldquoifrdquo itemsitems to the number of transactions with just the ldquoifrdquo items
Customer Items Purchased
1 Coke soda
2 Milk Coke window cleaner
3 Coke detergent
4 Coke detergent soda
5 Window cleaner soda
POS Transactions
19
How Good is an Association Rule
How much better than chance is a ruleHow much better than chance is a rule Lift (improvementa) tells us how much better a rule is at Lift (improvementa) tells us how much better a rule is at
predicting the result than just assuming the result in the first predicting the result than just assuming the result in the first placeplace
Lift Lift is the ratio of the records that support the entire rule to is the ratio of the records that support the entire rule to the number that would be expected assuming there was no the number that would be expected assuming there was no relationship between the productsrelationship between the products
Calculating lifthellipp 310hellipWhen lift gt 1 then the rule is better at Calculating lifthellipp 310hellipWhen lift gt 1 then the rule is better at predicting the result than guessingpredicting the result than guessing
When lift lt 1 the rule is doing worse than informed guessing When lift lt 1 the rule is doing worse than informed guessing and using the and using the Negative RuleNegative Rule produces a better rule than produces a better rule than guessingguessing
Co-occurrence can occur in 3 4 or more dimensionshellipCo-occurrence can occur in 3 4 or more dimensionshellip
20
Creating Association Rules
11 Choosing the right set Choosing the right set of itemsof items
22 Generating rules by Generating rules by deciphering the deciphering the counts in the co-counts in the co-occurrence matrixoccurrence matrix
33 Overcoming the Overcoming the practical limits practical limits imposed by thousands imposed by thousands or tens of thousands or tens of thousands of unique itemsof unique items
21
Overcoming Practical Limits for Association Rules
1. Generate co-occurrence matrix for single items… "if Coke then soda"
2. Generate co-occurrence matrix for two items… "if Coke and Milk then soda"
3. Generate co-occurrence matrix for three items… "if Coke and Milk and Window Cleaner then soda"
4. Etc…
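The size-by-size procedure above can be sketched in Python as a level-wise count. The slides do not say how the candidate list is kept manageable between steps; dropping itemsets seen fewer than `min_count` times before building the next size (the idea behind Apriori-style mining) is an assumption added here, as are the function and parameter names.

```python
from itertools import combinations

# Five POS transactions from the deck's example.
transactions = [
    {"Coke", "soda"},
    {"milk", "Coke", "window cleaner"},
    {"Coke", "detergent"},
    {"Coke", "detergent", "soda"},
    {"window cleaner", "soda"},
]

def frequent_itemsets(baskets, min_count, max_size):
    """Count itemsets of size 1, 2, ... and keep those seen at least min_count times."""
    results = {}
    # Size 1: every distinct item is a candidate.
    candidates = {frozenset([item]) for basket in baskets for item in basket}
    for size in range(1, max_size + 1):
        counts = {c: sum(1 for b in baskets if c <= b) for c in candidates}
        kept = {c: n for c, n in counts.items() if n >= min_count}
        if not kept:
            break
        results.update(kept)
        # Next size: join surviving itemsets whose union has exactly one more item.
        candidates = {
            a | b
            for a, b in combinations(list(kept), 2)
            if len(a | b) == size + 1
        }
    return results

result = frequent_itemsets(transactions, min_count=2, max_size=3)
for itemset, count in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

With `min_count=2`, milk is dropped after the first pass, so no pair or triple containing milk is ever counted; this is how the threshold limits the work at larger sizes.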
22
Final Thought on Association Rules: The Problem of Lots of Data
Fast food restaurant… could have 100 items on its menu
How many combinations are there with 3 different menu items? 161,700
Supermarket… 10,000 or more unique items
About 50 million 2-item combinations
More than 100 billion 3-item combinations (about 167 billion)
Use of product hierarchies (groupings) helps address this common issue
Finally, know that the number of transactions in a given time period could also be huge (hence expensive to analyze)
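The combination counts on this slide are binomial coefficients ("n choose k"), which can be checked with Python's standard library. The variable names below are chosen for this illustration.

```python
from math import comb

# Fast food menu: 100 items, combinations of 3 different items.
menu_triples = comb(100, 3)

# Supermarket: 10,000 unique items, combinations of 2 and of 3 items.
store_pairs = comb(10_000, 2)
store_triples = comb(10_000, 3)

print(f"{menu_triples:,}")
print(f"{store_pairs:,}")
print(f"{store_triples:,}")
```

For 10,000 items the exact counts are 49,995,000 pairs and 166,616,670,000 triples.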
23
Business and other cases
33
General Observations
The banking case seems to provide well-defined and intelligible information of the form account_1 and account_2, etc., or activity_1 and activity_2, etc., possibly indexed by time.
As such, the rules found provide a guide to action: offer a product or service (cross-sell).
34
In the retailing case of items purchased together, guidance is not so clear-cut due to the extensive number of rules.
The soccer event exemplifies the sequencing of events towards reaching a goal. Basketball-applied software was developed years ago. Web mining shares the same principles, without the passion usually associated with sports.
35
Challenges
A major difficulty is that a large number of the rules found may be trivial for anyone familiar with the business.
The computational complexity involved in calculating the results of market basket analysis is at least the square of the number of transaction item-lines (records of every item purchased). With data warehouses storing billions of transaction lines, this yields extremely high computational requirements.
36
Solutions
Differential market basket analysis can find interesting results and can also eliminate the problem of a potentially high volume of trivial results.
Special techniques involving filtering or aggregation of the transaction database are commonly used in analysis algorithms to increase performance and allow some level of interactivity, such as in business intelligence applications.
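One way to picture aggregation is to roll items up into product groups before counting, which shrinks the number of distinct "items" the analysis has to handle. The sketch below applies this to the deck's five POS transactions. The `category_of` mapping is a hypothetical hierarchy invented for this illustration, as are the function and variable names.

```python
# Five POS transactions from the deck's example.
transactions = [
    {"Coke", "soda"},
    {"milk", "Coke", "window cleaner"},
    {"Coke", "detergent"},
    {"Coke", "detergent", "soda"},
    {"window cleaner", "soda"},
]

# Hypothetical product hierarchy (item -> category); not taken from the slides.
category_of = {
    "Coke": "beverages",
    "soda": "beverages",
    "milk": "dairy",
    "detergent": "cleaning",
    "window cleaner": "cleaning",
}

def roll_up(baskets, hierarchy):
    """Replace each item by its category; duplicates within a basket collapse."""
    return [{hierarchy[item] for item in basket} for basket in baskets]

category_baskets = roll_up(transactions, category_of)

distinct_items = {item for basket in transactions for item in basket}
distinct_categories = {cat for basket in category_baskets for cat in basket}
print(len(distinct_items), "items ->", len(distinct_categories), "categories")
```

Fewer distinct entries means far fewer pairs and triples to count, at the cost of rules that speak about categories rather than individual products.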
37
Thank You
18
How Good is an Association Rule
What is the confidence for this rule: if a customer purchases soda, then the customer also purchases Coke?
2 out of 3 soda purchases also include Coke, so 67%.
What about the confidence of this rule reversed?
2 out of 4 Coke purchases also include soda, so 50%.
Confidence = ratio of the number of transactions with all the items to the number of transactions with just the "if" items.
POS Transactions
Customer  Items Purchased
1         Coke, soda
2         Milk, Coke, window cleaner
3         Coke, detergent
4         Coke, detergent, soda
5         Window cleaner, soda
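The two confidence figures can be reproduced directly from the POS table. The sketch below is a minimal Python illustration; the `confidence` helper and the list-of-sets layout are assumptions for this example, not part of the slides.

```python
# The five POS transactions from the table, one set of items per customer.
transactions = [
    {"Coke", "soda"},
    {"milk", "Coke", "window cleaner"},
    {"Coke", "detergent"},
    {"Coke", "detergent", "soda"},
    {"window cleaner", "soda"},
]

def confidence(if_item, then_item, baskets):
    """Transactions with both items divided by transactions with the 'if' item."""
    with_if = [b for b in baskets if if_item in b]
    with_both = [b for b in with_if if then_item in b]
    return len(with_both) / len(with_if)

soda_to_coke = confidence("soda", "Coke", transactions)
coke_to_soda = confidence("Coke", "soda", transactions)

print(f"if soda then Coke: {soda_to_coke:.0%}")
print(f"if Coke then soda: {coke_to_soda:.0%}")
```

The asymmetry comes only from the denominator: both rules share the same 2 transactions containing Coke and soda, but soda appears in 3 transactions and Coke in 4.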
23
Business and other cases
24
25
26
27
28
29
30
31
32
33
General Observations
Banking case seems to provide Banking case seems to provide well defined and intelligible well defined and intelligible information of the forminformation of the form account_1 and account_2 etc or account_1 and account_2 etc or
activity_1 and activity_2 etc activity_1 and activity_2 etc possibly indexed by timepossibly indexed by time
As such rules found provide guide As such rules found provide guide to action to offer product or service to action to offer product or service (cross-sell)(cross-sell)
34
In retailing case of items In retailing case of items purchased together guidance is purchased together guidance is not so clear cut due to extensive not so clear cut due to extensive number of rulesnumber of rules
Soccer event exemplifies Soccer event exemplifies sequencing of events towards sequencing of events towards reaching goal Basketball-applied reaching goal Basketball-applied software has been developed years software has been developed years ago Web mining shares the same ago Web mining shares the same principles without passion usually principles without passion usually associated with sportsassociated with sports
35
Challenges
A major difficulty is that a large number of A major difficulty is that a large number of the rules found may be trivial for anyone the rules found may be trivial for anyone familiar with the business familiar with the business
The computational complexity involved in The computational complexity involved in calculating the results of market basket calculating the results of market basket analysis is at least the square of the number analysis is at least the square of the number of transaction item-lines (records of every of transaction item-lines (records of every item purchased) With data warehouses item purchased) With data warehouses storing billions of transaction lines this storing billions of transaction lines this yields extremely high computational yields extremely high computational requirements requirements
36
Solutions
Differential market basket analysis can find interesting results and can also eliminate the problem of a potentially high volume of trivial results
Special techniques involving filtering or aggregation of the transaction database are commonly used in analysis algorithms to increase performance and allow some level of interactivity, such as in business intelligence applications
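Both ideas can be sketched in a few lines. The example below assumes that "filtering" means dropping rarely purchased items before pairs are formed, and that "differential" analysis means comparing pair support between two groups of baskets (here, two hypothetical stores) and keeping only the pairs whose support differs noticeably. The data, the thresholds, and the function names `pair_support` and `differential` are illustrative assumptions.

```python
from collections import Counter
from itertools import combinations


def pair_support(baskets, min_item_count=1):
    """Support of each item pair, after filtering out infrequent items."""
    item_counts = Counter(item for b in baskets for item in set(b))
    frequent = {i for i, c in item_counts.items() if c >= min_item_count}
    pairs = Counter()
    for b in baskets:
        kept = sorted(set(b) & frequent)  # filtering step shrinks each basket
        pairs.update(combinations(kept, 2))
    n = len(baskets)
    return {p: c / n for p, c in pairs.items()}


def differential(support_a, support_b, min_gap):
    """Keep only pairs whose support differs by at least min_gap."""
    result = {}
    for pair in set(support_a) | set(support_b):
        gap = support_a.get(pair, 0.0) - support_b.get(pair, 0.0)
        if abs(gap) >= min_gap:
            result[pair] = gap
    return result


# Hypothetical baskets from two stores (illustrative data only).
store_a = [["bread", "milk"], ["bread", "milk"],
           ["beer", "chips"], ["bread", "milk", "chips"]]
store_b = [["bread", "milk"], ["bread", "milk"],
           ["beer", "chips"], ["beer", "chips"]]

diff = differential(pair_support(store_a, min_item_count=2),
                    pair_support(store_b, min_item_count=2),
                    min_gap=0.3)
print(diff)
```

In this toy data the bread-and-milk pair is common in both stores and drops out as unremarkable, while the beer-and-chips pair surfaces because it is frequent in only one store.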
37
Thank You