databases - data mining · 2010. 2. 8. · a pizza restaurant records the following sales for...

Databases - Data Mining

(GF Royle, N Spadaccini 2006-2010) Databases - Data Mining 1 / 25

This lecture

This lecture introduces data mining through market basket analysis.

Data Mining

An organization that stores data about its operations will rapidly accumulate a vast amount of data.

For example, a supermarket might enter check-out scanner data directly into a database, thus getting a record of all purchases made at that supermarket.

Alternatively, data might be collected for the express purpose of data mining. For example, the Busselton Project is a longitudinal study that has accumulated 30 years of health-related data about the people in Busselton. (See http://bsn.uwa.edu.au.)

Data Mining, or more generally Knowledge Discovery in Databases (KDD), refers to the general process of trying to extract interesting or useful patterns from a (usually huge) dataset.

Rules

One fundamental type of interesting pattern is an association between observations that might reflect some important underlying mechanism.

For example, the Busselton Project may find associations between health-related observations: perhaps a correlation between elevated blood pressure at age 30 and the development of Type 2 diabetes at age 50.

Data Mining Research

Research into data mining is one of the most active areas of current database research, with a number of different aspects:

KDD techniques: Research into the theoretical statistical techniques underlying KDD, such as regression, classification, clustering etc.

Scalability: Research into algorithms for these techniques that scale effectively as the data volume reaches many terabytes.

Integration: Research into integrating KDD tools into standard databases.

Market Basket Analysis

We will only consider one simple technique, called market basket analysis, for finding association rules.

A market basket is a collection of items associated with a single transaction.

The canonical example of market basket analysis is a supermarket customer who purchases all the items in their shopping basket.

By analysing the contents of the shopping basket one may be able to infer purchasing behaviours of the customer.

The aim of market basket analysis is to analyse millions of transactions to try to determine patterns in the items that are purchased together.

This information can then be used to guide specials, display layout, shop-a-docket vouchers, catalogues and so on.

Sample Data

transID  custID  item
111      201     pen
111      201     ink
111      201     milk
111      201     juice
112      105     pen
112      105     ink
112      105     milk
113      106     pen
113      106     milk
114      201     pen
114      201     ink
114      201     juice
114      201     water

(This dataset is from Chapter 26 of R & G.)

Terminology

An itemset is a set of one or more items: for example {pen} is an itemset, as is {milk, juice}.

The support of an itemset is the percentage of transactions in the database that contain all of the items in the itemset.

For example:

The itemset {pen} has support 100%: pens are purchased in all 4 transactions.

The itemset {pen, juice} has support 50%: pen and juice are purchased together in 2 of the 4 transactions.

The itemset {pen, ink} has support 75%: pen and ink are purchased together in 3 of the 4 transactions.
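The support figures above can be checked with a few lines of Python. This is only a minimal sketch: the dictionary layout and the function name support are illustrative assumptions, not part of the lecture, and custID is ignored because support is defined per transaction.

```python
# Sample data from the lecture: transID -> set of items purchased.
# (The dict layout and the name `support` are illustrative assumptions.)
transactions = {
    111: {"pen", "ink", "milk", "juice"},
    112: {"pen", "ink", "milk"},
    113: {"pen", "milk"},
    114: {"pen", "ink", "juice", "water"},
}


def support(itemset, transactions):
    """Return the fraction of transactions containing every item in itemset."""
    wanted = set(itemset)
    hits = sum(1 for items in transactions.values() if wanted <= items)
    return hits / len(transactions)


for itemset in ({"pen"}, {"pen", "juice"}, {"pen", "ink"}):
    print(sorted(itemset), f"{support(itemset, transactions):.0%}")
```

The subset test wanted <= items is what "contains all of the items in the itemset" means in code.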

Frequent Itemsets

The first step of mining for association rules is to identify frequent itemsets, that is, all itemsets that have support at least equal to some user-defined minimum support.

In this example, setting minimum support to 70% we would get

Itemset       Support
{pen}         100%
{ink}         75%
{milk}        75%
{pen, ink}    75%
{pen, milk}   75%
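For a dataset this small, the table can be reproduced by simply testing every non-empty combination of items. The sketch below does exactly that; the function name frequent_itemsets_brute_force is a made-up label for illustration, and the approach is exponential in the number of distinct items, so it is only suitable for toy data.

```python
from itertools import combinations

# Sample transactions from the lecture (one set of items per transID).
transactions = [
    {"pen", "ink", "milk", "juice"},   # transID 111
    {"pen", "ink", "milk"},            # transID 112
    {"pen", "milk"},                   # transID 113
    {"pen", "ink", "juice", "water"},  # transID 114
]


def frequent_itemsets_brute_force(transactions, min_support):
    """Check every non-empty combination of items against min_support."""
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            wanted = set(candidate)
            count = sum(1 for t in transactions if wanted <= t)
            s = count / len(transactions)
            if s >= min_support:
                result[frozenset(candidate)] = s
    return result


frequent = frequent_itemsets_brute_force(transactions, 0.70)
for itemset, s in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), f"{s:.0%}")
```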

Finding Frequent Itemsets

In a toy example like this, it is simple to just check every possible combination of the items, but this process does not scale well: the number of combinations doubles with every additional item.

However, it is easy to devise a straightforward algorithm based on the apriori property:

Every subset of a frequent itemset is also a frequent itemset.

The algorithm proceeds by first finding single-element frequent itemsets and then extending them, element by element, until they are no longer frequent.

Sample run

The first scan of our sample relation yields the three itemsets

{pen}, {ink}, {milk}

In the second step we augment each of these by an additional item that is itself a frequent item, and then check each of the itemsets

{pen, ink}, {milk, ink}, {pen, milk}

thereby determining that

{pen, ink}, {pen, milk}

are frequent itemsets.

Sample run cont.

Now the only possible 3-item itemset to check would be

{pen, ink, milk}

but this can be rejected immediately because it contains a subset

{milk, ink}

that is not itself frequent.

With a huge database, the scan to check the frequency of each candidate itemset dominates the time taken and hence eliminating an itemset without a scan is very useful.
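The sample run can be sketched as a level-wise search in Python. This is a simplified illustration of the idea under stated assumptions, not a tuned implementation: the function name apriori, the in-memory list of sets, and the extra pruned list (kept only to show which candidates were rejected without a scan) are all choices made for this example.

```python
from itertools import combinations

# Sample transactions from the lecture (one set of items per transID).
transactions = [
    {"pen", "ink", "milk", "juice"},   # transID 111
    {"pen", "ink", "milk"},            # transID 112
    {"pen", "milk"},                   # transID 113
    {"pen", "ink", "juice", "water"},  # transID 114
]


def apriori(transactions, min_support):
    """Level-wise search; returns (frequent itemsets, candidates pruned unscanned)."""
    n = len(transactions)

    def support(itemset):
        # One pass over the data: this is the expensive "scan".
        return sum(1 for t in transactions if itemset <= t) / n

    # First scan: frequent single-element itemsets.
    singles = {frozenset([item]) for t in transactions for item in t}
    level = {s: support(s) for s in singles}
    level = {s: v for s, v in level.items() if v >= min_support}

    frequent, pruned, k = {}, [], 1
    while level:
        frequent.update(level)
        # Extend: join pairs of frequent k-itemsets into (k+1)-item candidates.
        candidates = {a | b for a in level for b in level if len(a | b) == k + 1}
        next_level = {}
        for cand in candidates:
            # Apriori property: every k-item subset must already be frequent.
            if any(frozenset(sub) not in level for sub in combinations(cand, k)):
                pruned.append(cand)  # rejected without scanning the data
                continue
            s = support(cand)
            if s >= min_support:
                next_level[cand] = s
        level, k = next_level, k + 1
    return frequent, pruned


frequent, pruned = apriori(transactions, 0.70)
print(sorted(sorted(s) for s in frequent))
print([sorted(s) for s in pruned])
```

With minimum support 70%, the only pruned candidate is {pen, ink, milk}, which is dropped because its subset {milk, ink} was found not to be frequent at the previous level.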

Association rules

An association rule is an expression such as

{pen ⇒ ink}

indicating that the occurrences of “ink” in a transaction are associated with the occurrences of “pen”.

The overall aim of market basket analysis is to try to find association rules in the data. If an association rule extracted from the data represents a genuine pattern in shopper behaviour, then this can be used in a variety of ways.

Although the terminology is all about shopper behaviour, the concepts can easily be translated to more “significant” projects such as trying to associate behaviour or diet with disease or mortality.

Support of a rule

Suppose that X and Y are itemsets. Then the support of the association rule

X ⇒ Y

is the support of the itemset X ∪ Y.

Thus the support of the rule

{pen ⇒ ink}

is 75%.

Normally market basket analysis is only concerned with association rules involving frequent itemsets, because while there may be a very strong association between, say, lobster and champagne, these purchases will not form a large proportion of sales.

Confidence of a rule

The confidence of a rule X ⇒ Y is the proportion of transactions involving X that also involve Y.

In other words, if s(X) denotes the support of X, then the confidence is

s(X ∪ Y)/s(X)

Thus in our example,

{pen ⇒ ink}

has a confidence level of

75/100 = 75%,

whereas the rule

{ink ⇒ pen}

has confidence 100%.
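As a quick check of these two figures, the formula s(X ∪ Y)/s(X) can be evaluated directly on the sample data. The helper names support and confidence below are illustrative assumptions for this sketch.

```python
# Sample transactions from the lecture (one set of items per transID).
transactions = [
    {"pen", "ink", "milk", "juice"},   # transID 111
    {"pen", "ink", "milk"},            # transID 112
    {"pen", "milk"},                   # transID 113
    {"pen", "ink", "juice", "water"},  # transID 114
]


def support(itemset):
    """s(X): fraction of transactions containing every item of X."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def confidence(x, y):
    """Confidence of the rule X => Y, computed as s(X u Y) / s(X)."""
    return support(x | y) / support(x)


print(f"pen => ink: {confidence({'pen'}, {'ink'}):.0%}")
print(f"ink => pen: {confidence({'ink'}, {'pen'}):.0%}")
```

Both rules have the same support, because they involve the same itemset {pen, ink}; only the denominator differs, which is why confidence is not symmetric.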

Beer and Nappies

One of the most well-known marketing stories is the (possibly apocryphal) story of beer and nappies.

A Wal-Mart manager noticed one Friday that a lot of customers were buying both beer and nappies. Analysing past transaction data showed that while beer and nappies were not particularly associated during the week, there was a sudden upsurge in the association on Friday evenings.

Thinking about why there might be this association, the manager concluded that because nappies are heavy and bulky, the job of buying nappies was often left to fathers who picked them up after work on Fridays, and also stocked up on beer for the weekend.

Cross-selling

The manager responded to this information by putting the premium beer displays and specials right next to the nappy aisle.

The fathers who previously bought regular beer were now encouraged to buy the premium beer, and some of the fathers who hadn’t even thought about beer started to buy it.

This version of the story is paraphrased from http://www.information-drivers.com/market_basket_analysis.htm.

Interpreting association rules

The 100% confidence level indicates that the data shows that if shoppers buy ink, then they always buy a pen as well.

How should an association rule of this type be interpreted?

Clearly there is a high correlation between buying pens and buying ink. When faced with a high correlation, it is tempting, but incorrect, to assume that the rule indicates a causal relationship.

“Buying ink causes people to buy pens”

Example scenario

A pizza restaurant records the following sales for pizzas with extra toppings, in various combinations. The toppings are mushrooms (M), pepperoni (P) and extra cheese (C).

Menu item  Pizza sales  Extra toppings
1          100          M
2          150          P
3          200          C
4          400          M & P
5          300          M & C
6          200          P & C
7          100          M, P & C
8          550          None

Total      2000

Required analysis

Complete a market-basket analysis to answer the following questions

1 Find the frequent itemsets with minimum support 40%.

2 Find association rules with minimum confidence 50%.

3 What is the strongest inference we can make about consumer behaviour when choosing extra toppings?

Support - single item sets

1 Find the frequent itemsets with minimum support 40%.

s({M}) = (100 + 400 + 300 + 100) / 2000 = 45%

s({P}) = (150 + 400 + 200 + 100) / 2000 = 42.5%

s({C}) = (200 + 300 + 200 + 100) / 2000 = 40%

All three single-item sets meet the 40% minimum support.

Support - two item sets

1 Find the frequent itemsets with minimum support 40%.

s({M, P}) = (400 + 100) / 2000 = 25%

s({M, C}) = (300 + 100) / 2000 = 20%

s({P, C}) = (200 + 100) / 2000 = 15%

None of the two-item sets reaches the 40% minimum support.
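The single-item and two-item supports can be recomputed from the sales table, weighting each menu item by its sales count rather than listing 2000 individual transactions. The list layout and the name support are assumptions made for this sketch.

```python
from itertools import combinations

# Pizza sales from the example scenario: (extra toppings, pizzas sold).
sales = [
    ({"M"}, 100),
    ({"P"}, 150),
    ({"C"}, 200),
    ({"M", "P"}, 400),
    ({"M", "C"}, 300),
    ({"P", "C"}, 200),
    ({"M", "P", "C"}, 100),
    (set(), 550),  # no extra toppings
]
total = sum(n for _, n in sales)


def support(itemset):
    """Fraction of all pizzas sold whose toppings include every item in itemset."""
    wanted = set(itemset)
    return sum(n for toppings, n in sales if wanted <= toppings) / total


for size in (1, 2):
    for itemset in combinations("MPC", size):
        s = support(itemset)
        label = "frequent" if s >= 0.40 else "not frequent"
        print(set(itemset), f"{s:.1%}", label)
```

Note that a pizza with toppings M, P & C counts towards every one of the itemsets {M}, {P}, {C}, {M, P}, {M, C} and {P, C}, which is why menu item 7 appears in each sum.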

Rule confidence

2 Find association rules with minimum confidence 50%.

confidence(M → P) = s({M, P}) / s({M}) = 25 / 45 = 55.6%

confidence(P → M) = s({M, P}) / s({P}) = 25 / 42.5 = 58.8%

confidence(C → M) = s({M, C}) / s({C}) = 20 / 40 = 50%

All other associations are < 50%.
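The same calculation can be run over all six single-topping rules to confirm which ones pass the 50% threshold. This sketch works from raw sales counts instead of percentages (the ratio is the same, since the total of 2000 cancels); the names count, confidence, strong and best are illustrative.

```python
from itertools import permutations

# Pizza sales from the example scenario: (extra toppings, pizzas sold).
sales = [
    ({"M"}, 100),
    ({"P"}, 150),
    ({"C"}, 200),
    ({"M", "P"}, 400),
    ({"M", "C"}, 300),
    ({"P", "C"}, 200),
    ({"M", "P", "C"}, 100),
    (set(), 550),  # no extra toppings
]


def count(itemset):
    """Number of pizzas sold whose extra toppings include all of itemset."""
    return sum(n for toppings, n in sales if itemset <= toppings)


def confidence(x, y):
    """Confidence of the rule x -> y for single toppings x and y."""
    return count({x, y}) / count({x})


rules = {(x, y): confidence(x, y) for x, y in permutations("MPC", 2)}
strong = {rule: c for rule, c in rules.items() if c >= 0.50}
best = max(strong, key=strong.get)

for (x, y), c in sorted(strong.items(), key=lambda kv: -kv[1]):
    print(f"{x} -> {y}: {c:.1%}")
print("strongest rule:", " -> ".join(best))
```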

Inference

3 What is the strongest inference we can make about consumer behaviour when choosing extra toppings?

People who order a pizza with extra pepperoni are likely to order extra mushrooms as well (the rule P → M has the highest confidence, 58.8%).
