Mining Association Rules between Sets of Items in Large Databases

presented by Zhuang Wang

Outline

• Introduction

• Formal Model

• Apriori Algorithm

• Experiments

• Summary

Introduction

• Association rule: Association rules are used to discover elements that co-occur frequently within a dataset consisting of multiple independent selections of elements (such as purchasing transactions), and to derive rules that describe those co-occurrences.

• Applications:

- Questions such as "if a customer purchases product A, how likely is he to purchase product B?" and "What products will a customer buy if he buys products C and D?" are answered by association-finding algorithms. (market basket analysis)

Formal Model

• Let I = {I_1, I_2, ..., I_n} be a set of items. Let T be a database of transactions. Each transaction t in T is represented as a subset of I. Let X be a subset of I.

• Support and Confidence: By an association rule, we mean an implication of the form X ⇒ I_k, where X is a set of some items in I, and I_k is a single item in I that is not present in X.

- support: probability that a transaction contains X and I_k, P(X, I_k)

- confidence: conditional probability that a transaction having X also contains I_k, P(I_k | X)

Support and Confidence - Example

Transaction ID   Items Bought
1                A, B, C
2                A, C
3                A, D
4                B, E, F

• With minimum support 50% and minimum confidence 50%, we have (support, confidence):

– A ⇒ C (50%, 66.6%)

– C ⇒ A (50%, 100%)
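
The two figures in this example can be checked with a few lines of Python. This is only a minimal sketch: the helper names `support` and `confidence` are made up for illustration, and the transactions are held as in-memory sets, which is an assumption rather than anything the slides prescribe.

```python
# Transactions from the example table (transaction ID -> items bought).
transactions = {
    1: {"A", "B", "C"},
    2: {"A", "C"},
    3: {"A", "D"},
    4: {"B", "E", "F"},
}


def support(itemset, db):
    """Fraction of transactions in db that contain every item of itemset."""
    itemset = set(itemset)
    hits = sum(1 for items in db.values() if itemset <= items)
    return hits / len(db)


def confidence(antecedent, consequent, db):
    """Estimate P(consequent | antecedent) as support(X and I_k) / support(X)."""
    return support(set(antecedent) | {consequent}, db) / support(antecedent, db)


for x, item in [({"A"}, "C"), ({"C"}, "A")]:
    s = support(x | {item}, transactions)
    c = confidence(x, item, transactions)
    print(f"{sorted(x)} => {item}: support={s:.1%}, confidence={c:.1%}")
```

Both rules come out at 50% support. The first confidence is 2/3, which the script rounds to 66.7% where the slide truncates it to 66.6%.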

Apriori Algorithm

• Goal: find the itemsets that occur in at least a minimum fraction of the transactions (the minimum support).

• It uses a "bottom-up" approach: frequent itemsets (sets of items that meet the minimum support) are extended one item at a time (a step known as candidate generation), and groups of candidates are tested against the data.

• The algorithm terminates when no further successful extensions are found.

• Finally, rules are generated from each large (frequent) itemset, using only items from that itemset.
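
The bottom-up search described above can be sketched roughly as follows. This is an illustrative version, not the presenter's code: the function name `apriori`, the in-memory list of sets, and the way candidates are joined and pruned are all assumptions chosen for brevity rather than efficiency.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {frozenset: support count} for every frequent itemset.

    transactions: list of sets of items; min_support: fraction in (0, 1].
    """
    min_count = min_support * len(transactions)

    def count(candidates):
        # One scan of the data: count the transactions containing each candidate.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        return {c: n for c, n in counts.items() if n >= min_count}

    # Level 1: every single item is a candidate.
    items = {item for t in transactions for item in t}
    level = count({frozenset([item]) for item in items})
    frequent = dict(level)

    k = 2
    while level:
        # Candidate generation: join frequent (k-1)-itemsets into k-itemsets ...
        prev = set(level)
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # ... and keep only those whose (k-1)-subsets are all frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))
        }
        level = count(candidates)
        frequent.update(level)
        k += 1
    return frequent


baskets = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]
result = apriori(baskets, 0.5)
for itemset, n in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), n)
```

On the four market-basket transactions used earlier, with 50% minimum support, the search stops after level 2 and reports {A}, {B}, {C} and {A, C} as frequent.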

Find Frequent Itemsets - Example

Database D

TID   Items
100   1 3 4
200   2 3 5
300   1 2 3 5
400   2 5

C1 (after Scan D)

itemset   sup.
{1}       2
{2}       3
{3}       3
{4}       1
{5}       3

L1

itemset   sup.
{1}       2
{2}       3
{3}       3
{5}       3

C2

itemset
{1 2}
{1 3}
{1 5}
{2 3}
{2 5}
{3 5}

C2 (after Scan D)

itemset   sup
{1 2}     1
{1 3}     2
{1 5}     1
{2 3}     2
{2 5}     3
{3 5}     2

L2

itemset   sup
{1 3}     2
{2 3}     2
{2 5}     3
{3 5}     2

C3

itemset
{2 3 5}

L3 (after Scan D)

itemset   sup
{2 3 5}   2
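
To connect this example to the final step of the algorithm, here is a hedged sketch of how rules with a single-item consequent (as in the formal model) might be derived from the itemset {2 3 5}. The helper names and the 50% confidence threshold are assumptions for illustration; the slides do not give a threshold for this database.

```python
# Database D from the example (TID -> items).
D = {
    100: {1, 3, 4},
    200: {2, 3, 5},
    300: {1, 2, 3, 5},
    400: {2, 5},
}


def support_count(itemset):
    """Number of transactions in D containing every item of itemset."""
    return sum(1 for items in D.values() if itemset <= items)


def rules_from(itemset, min_confidence):
    """Yield (antecedent, consequent, confidence) with single-item consequents."""
    for consequent in sorted(itemset):
        antecedent = itemset - {consequent}
        conf = support_count(itemset) / support_count(antecedent)
        if conf >= min_confidence:
            yield sorted(antecedent), consequent, conf


# min_confidence=0.5 is an assumed threshold, not taken from the example.
for antecedent, consequent, conf in rules_from(frozenset({2, 3, 5}), 0.5):
    print(f"{antecedent} => {consequent}  confidence={conf:.1%}")
```

The confidences follow directly from the support counts in the tables: {3 5} ⇒ 2 and {2 3} ⇒ 5 each give 2/2, while {2 5} ⇒ 3 gives 2/3.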

Experiments

• We experimented with the rule mining algorithm using the sales data obtained from a large retailing company.

• There are a total of 46,873 customer transactions in this data. Each transaction contains the department numbers from which a customer bought an item in a visit.

• There are a total of 63 departments. The algorithm finds whether there is an association between departments in the customer purchasing behavior.

• The following rules were found for a minimum support of 1% and minimum confidence of 50%. Each rule is followed by (confidence %, support %).

• [Tires] ⇒ [Automotive Services] (98.80, 5.79)

• [Auto Accessories], [Tires] ⇒ [Automotive Services] (98.29, 1.47)

• [Auto Accessories] ⇒ [Automotive Services] (79.51, 11.81)

• [Automotive Services] ⇒ [Auto Accessories] (71.60, 11.81)

• [Home Laundry Appliances] ⇒ [Maintenance Agreement Sales] (66.55, 1.25)

• [Children's Hardlines] ⇒ [Infants and Children's wear] (66.15, 4.24)

• [Men's Furnishing] ⇒ [Men's Sportswear] (54.86, 5.21)

Summary

• Apriori, while historically significant, suffers from a number of inefficiencies or trade-offs, which have spawned other algorithms.

• Hash tree: candidate itemsets are stored in a hash tree, which holds itemsets at its leaves and hash tables at its internal nodes.

• Partitioning: Any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB

• Sampling: mine a subset of the given data; this needs a lower support threshold plus a method to determine completeness.
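
The partitioning property can be checked on a small example. The sketch below is a brute-force illustration under stated assumptions: the function name `frequent_itemsets`, the toy database, and the arbitrary split into two partitions are all chosen for demonstration, and enumerating every subset is only feasible for tiny inputs.

```python
from itertools import combinations
from math import ceil


def frequent_itemsets(transactions, min_support):
    """Brute-force set of itemsets contained in >= min_support of transactions."""
    if not transactions:
        return set()
    min_count = ceil(min_support * len(transactions))
    items = sorted({item for t in transactions for item in t})
    found = set()
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            if sum(1 for t in transactions if set(combo) <= t) >= min_count:
                found.add(frozenset(combo))
    return found


# Toy database; the split into two partitions is an arbitrary assumption.
db = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
partitions = [db[:2], db[2:]]
min_support = 0.5

global_frequent = frequent_itemsets(db, min_support)
local_frequent = set().union(*(frequent_itemsets(p, min_support) for p in partitions))

# Every globally frequent itemset shows up as locally frequent somewhere.
print(global_frequent <= local_frequent)
```

The union of the locally frequent itemsets is a superset of the globally frequent ones, so it can serve as a candidate set to be verified in one further pass over the full database.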

Reference

• R. Agrawal, T. Imielinski, A. Swami: “Mining Association Rules between Sets of Items in Large Databases”, Proc. of the ACM SIGMOD Int'l Conference on Management of Data, Washington D.C., May 1993, pp. 207-216.

• http://knight.cis.temple.edu/~vasilis/Courses/CIS664/

• http://en.wikipedia.org/wiki/Apriori_algorithm
