5 detailed


TRANSCRIPT

Slide 1/55

Mining Association Rules in Large Databases

Slide 2/55

    Association Rule Mining

Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction.

    Market-Basket transactions

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

    Example of Association Rules

{Diaper} → {Beer}
{Milk, Bread} → {Eggs, Coke}
{Beer, Bread} → {Milk}

Implication means co-occurrence, not causality!

Slide 3/55

    Definition: Frequent Itemset

Itemset
  A collection of one or more items
  Example: {Milk, Bread, Diaper}

k-itemset
  An itemset that contains k items

Support count (σ)
  Frequency of occurrence of an itemset
  E.g., σ({Milk, Bread, Diaper}) = 2

Support (s)
  Fraction of transactions that contain an itemset
  E.g., s({Milk, Bread, Diaper}) = 2/5

Frequent Itemset
  An itemset whose support is greater than or equal to a minsup threshold

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

I assume that itemsets are ordered lexicographically.
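The definitions above map directly to a few lines of code. A minimal sketch (not part of the slides) computing the support count and support of an itemset over the market-basket transactions listed above:

```python
# Minimal sketch: support count and support of an itemset
# over the five market-basket transactions from the slide.

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_count(itemset):
    """sigma(X): number of transactions that contain every item of X."""
    return sum(1 for t in transactions if set(itemset) <= t)

def support(itemset):
    """s(X): fraction of transactions that contain X."""
    return support_count(itemset) / len(transactions)

print(support_count({"Milk", "Bread", "Diaper"}))  # 2
print(support({"Milk", "Bread", "Diaper"}))        # 0.4 (= 2/5)
```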

Slide 4/55

    Definition: Association Rule

Let D be a database of transactions, e.g. the table below.

Let I be the set of items that appear in the database, e.g., I = {A, B, C, D, E, F}.

A rule is defined by X → Y, where X ⊆ I, Y ⊆ I, and X ∩ Y = ∅.
  e.g.: {B, C} → {E} is a rule

Transaction ID  Items Bought
2000            A, B, C
1000            A, C
4000            A, D
5000            B, E, F

Slide 5/55

    Definition: Association Rule

Association Rule
  An implication expression of the form X → Y, where X and Y are itemsets.
  Example: {Milk, Diaper} → {Beer}

Rule Evaluation Metrics
  Support (s): fraction of transactions that contain both X and Y
  Confidence (c): measures how often items in Y appear in transactions that contain X

Example: {Milk, Diaper} → {Beer}

  s = σ(Milk, Diaper, Beer) / |T| = 2/5 = 0.4
  c = σ(Milk, Diaper, Beer) / σ(Milk, Diaper) = 2/3 ≈ 0.67

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke
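A minimal sketch (an illustration, not slide material) that reproduces this computation:

```python
# Minimal sketch: support and confidence of {Milk, Diaper} -> {Beer}
# over the five market-basket transactions from the slide.

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def sigma(itemset):
    """Support count: number of transactions containing every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)

X, Y = {"Milk", "Diaper"}, {"Beer"}
support = sigma(X | Y) / len(transactions)   # 2/5 = 0.4
confidence = sigma(X | Y) / sigma(X)         # 2/3 ~= 0.67
print(support, confidence)
```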

Slide 6/55

Rule Measures: Support and Confidence

Find all the rules X → Y with minimum confidence and support:
  support, s: probability that a transaction contains X ∪ Y
  confidence, c: conditional probability that a transaction having X also contains Y

Transaction ID  Items Bought
2000            A, B, C
1000            A, C
4000            A, D
5000            B, E, F

With minimum support 50% and minimum confidence 50%, we have:
  A → C (support 50%, confidence 66.6%)
  C → A (support 50%, confidence 100%)

[Figure: Venn diagram of customers who buy beer, customers who buy diapers, and those who buy both.]

Slide 7/55

TID  Date      Items bought
100  10/10/99  {F, A, D, B}
200  15/10/99  {D, A, C, E, B}
300  19/10/99  {C, A, B, E}
400  20/10/99  {B, A, D}

Example

What is the support and confidence of the rule {B, D} → {A}?

  Support: percentage of tuples that contain {A, B, D} = 3/4 = 75%

  Confidence: (number of tuples that contain {A, B, D}) / (number of tuples that contain {B, D}) = 3/3 = 100%

Remember: conf(X → Y) = sup(X ∪ Y) / sup(X)

Slide 8/55

    Association Rule Mining Task

Given a set of transactions T, the goal of association rule mining is to find all rules having
  support ≥ minsup threshold
  confidence ≥ minconf threshold

Brute-force approach:
  List all possible association rules
  Compute the support and confidence for each rule
  Prune rules that fail the minsup and minconf thresholds
  => Computationally prohibitive!

Slide 9/55

    Mining Association Rules

    Example of Rules:

{Milk, Diaper} → {Beer} (s=0.4, c=0.67)
{Milk, Beer} → {Diaper} (s=0.4, c=1.0)
{Diaper, Beer} → {Milk} (s=0.4, c=0.67)
{Beer} → {Milk, Diaper} (s=0.4, c=0.67)
{Diaper} → {Milk, Beer} (s=0.4, c=0.5)
{Milk} → {Diaper, Beer} (s=0.4, c=0.5)

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

    Observations:

    All the above rules are binary partitions of the same itemset:

    {Milk, Diaper, Beer}

Rules originating from the same itemset have identical support but can have different confidence

    Thus, we may decouple the support and confidence requirements

Slide 10/55

    Mining Association Rules

    Two-step approach:

1. Frequent Itemset Generation
   Generate all itemsets whose support ≥ minsup

2. Rule Generation
   Generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset

Frequent itemset generation is still computationally expensive.

Slide 11/55

    Frequent Itemset Generation

[Figure: the lattice of all itemsets over {A, B, C, D, E}: null at the top; 1-itemsets A, B, C, D, E; 2-itemsets AB ... DE; 3-itemsets ABC ... CDE; 4-itemsets ABCD ... BCDE; and ABCDE at the bottom.]

Given d items, there are 2^d possible candidate itemsets.

Slide 12/55

Frequent Itemset Generation: Brute-force approach

  Each itemset in the lattice is a candidate frequent itemset
  Count the support of each candidate by scanning the database
  Match each transaction against every candidate
  Complexity ~ O(NMw) => expensive, since M = 2^d!

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

[Figure: N transactions (the rows of the table) matched against a list of M candidate itemsets; w denotes the maximum transaction width.]

Slide 13/55

Computational Complexity

Given d unique items:

  Total number of itemsets = 2^d

    Total number of possible association rules:

  R = Σ_{k=1}^{d-1} [ C(d, k) × Σ_{j=1}^{d-k} C(d-k, j) ] = 3^d - 2^(d+1) + 1

    If d=6, R = 602 rules
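A quick sketch (not from the slides) that checks the count for d = 6 both by the double sum and by the closed form:

```python
from math import comb

def num_rules(d):
    """Total number of association rules over d items, counted directly."""
    return sum(comb(d, k) * sum(comb(d - k, j) for j in range(1, d - k + 1))
               for k in range(1, d))

d = 6
print(num_rules(d))            # 602
print(3**d - 2**(d + 1) + 1)   # 602 (closed form)
```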

Slide 14/55

    Frequent Itemset Generation Strategies

Reduce the number of candidates (M)
  Complete search: M = 2^d
  Use pruning techniques to reduce M

Reduce the number of transactions (N)
  Reduce the size of N as the size of the itemset increases
  Used by DHP and vertical-based mining algorithms

Reduce the number of comparisons (NM)
  Use efficient data structures to store the candidates or transactions
  No need to match every candidate against every transaction

Slide 15/55

    Reducing Number of Candidates

Apriori principle: if an itemset is frequent, then all of its subsets must also be frequent.

The Apriori principle holds due to the following property of the support measure:

  ∀X, Y: (X ⊆ Y) ⇒ s(X) ≥ s(Y)

The support of an itemset never exceeds the support of its subsets. This is known as the anti-monotone property of support.

Slide 16/55

    Example

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

    s(Bread) > s(Bread, Beer)

    s(Milk) > s(Bread, Milk)

    s(Diaper, Beer) > s(Diaper, Beer, Coke)

Slide 17/55 (no text extracted)

Slide 18/55

    Illustrating Apriori Principle

Minimum Support = 3

Items (1-itemsets):

  Item    Count
  Bread   4
  Coke    2
  Milk    4
  Beer    3
  Diaper  4
  Eggs    1

Pairs (2-itemsets) (no need to generate candidates involving Coke or Eggs):

  Itemset          Count
  {Bread, Milk}    3
  {Bread, Beer}    2
  {Bread, Diaper}  3
  {Milk, Beer}     2
  {Milk, Diaper}   3
  {Beer, Diaper}   3

Triplets (3-itemsets):

  Itemset                Count
  {Bread, Milk, Diaper}  3

If every subset is considered: C(6,1) + C(6,2) + C(6,3) = 6 + 15 + 20 = 41 candidates
With support-based pruning: 6 + 6 + 1 = 13 candidates

Slide 19/55

The Apriori Algorithm (the general idea)

1. Find the frequent 1-itemsets and put them into L1 (k = 1)
2. Use Lk to generate a collection of candidate itemsets Ck+1 of size (k+1)
3. Scan the database to find which itemsets in Ck+1 are frequent and put them into Lk+1
4. If Lk+1 is not empty:
     k = k+1
     GOTO 2

    R. Agrawal, R. Srikant: "Fast Algorithms for Mining Association Rules",

    Proc. of the 20th Int'l Conference on Very Large Databases, Santiago, Chile, Sept. 1994.

Slide 20/55

    The Apriori Algorithm

Pseudo-code:
  Ck: candidate itemsets of size k
  Lk: frequent itemsets of size k

  L1 = {frequent items};
  for (k = 1; Lk != ∅; k++) do begin
      Ck+1 = candidates generated from Lk;   // join and prune steps
      for each transaction t in the database do
          increment the count of all candidates in Ck+1 that are contained in t
      Lk+1 = candidates in Ck+1 with min_support (frequent)
  end
  return ∪k Lk;

Important steps in candidate generation:
  Join step: Ck+1 is generated by joining Lk with itself
  Prune step: any k-itemset that is not frequent cannot be a subset of a frequent (k+1)-itemset
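A compact, runnable sketch of this loop in Python; it illustrates the pseudo-code above and is not the authors' implementation. Candidate generation joins frequent k-itemsets that share their first k-1 items and prunes candidates that have an infrequent subset:

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {frozenset(itemset): support_count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    min_count = min_support * len(transactions)

    def count(candidates):
        return {c: sum(1 for t in transactions if c <= t) for c in candidates}

    # L1: frequent 1-itemsets
    items = {i for t in transactions for i in t}
    counts = count({frozenset([i]) for i in items})
    L = {c: s for c, s in counts.items() if s >= min_count}
    frequent = dict(L)

    k = 1
    while L:
        # Join step: merge itemsets that share their first k-1 items (lexicographic order)
        prev = sorted(sorted(itemset) for itemset in L)
        candidates = set()
        for a, b in combinations(prev, 2):
            if a[:-1] == b[:-1]:
                cand = frozenset(a) | frozenset(b)
                # Prune step: every k-subset of the candidate must be frequent
                if all(frozenset(sub) in L for sub in combinations(cand, k)):
                    candidates.add(cand)
        counts = count(candidates)
        L = {c: s for c, s in counts.items() if s >= min_count}
        frequent.update(L)
        k += 1
    return frequent

# Example database from the next slide, min_sup = 50%:
D = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
for itemset, cnt in sorted(apriori(D, 0.5).items(),
                           key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), cnt)
```

Running it on that database reproduces L1 = {1}, {2}, {3}, {5}; L2 = {1 3}, {2 3}, {2 5}, {3 5}; and L3 = {2 3 5}.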

Slide 21/55

The Apriori Algorithm: Example (min_sup = 2, i.e., 50%)

Database D:

  TID  Items
  100  1 3 4
  200  2 3 5
  300  1 2 3 5
  400  2 5

Scan D → C1:

  itemset  sup.
  {1}      2
  {2}      3
  {3}      3
  {4}      1
  {5}      3

L1:

  itemset  sup.
  {1}      2
  {2}      3
  {3}      3
  {5}      3

C2 (generated from L1): {1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}

Scan D → C2 with counts:

  itemset  sup.
  {1 2}    1
  {1 3}    2
  {1 5}    1
  {2 3}    2
  {2 5}    3
  {3 5}    2

L2:

  itemset  sup.
  {1 3}    2
  {2 3}    2
  {2 5}    3
  {3 5}    2

C3 (generated from L2): {2 3 5}

Scan D → L3:

  itemset   sup.
  {2 3 5}   2

Slide 22/55

    How to Generate Candidates?

Suppose the items in Lk are listed in an order.

Step 1: self-joining Lk (in SQL)

  insert into Ck+1
  select p.item1, p.item2, ..., p.itemk, q.itemk
  from Lk p, Lk q
  where p.item1 = q.item1, ..., p.itemk-1 = q.itemk-1, p.itemk < q.itemk
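The same self-join can be written directly over sorted tuples; a tiny sketch (illustrative only), applied to the L2 of the previous example:

```python
from itertools import combinations

def self_join(Lk):
    """Join step: pair up k-itemsets that agree on their first k-1 items
    and differ in the last one (the SQL condition p.item_k < q.item_k)."""
    Lk = sorted(Lk)
    Ck1 = []
    for p, q in combinations(Lk, 2):
        if p[:-1] == q[:-1] and p[-1] < q[-1]:
            Ck1.append(p + (q[-1],))
    return Ck1

L2 = [(1, 3), (2, 3), (2, 5), (3, 5)]
print(self_join(L2))  # [(2, 3, 5)]
```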

[Figure (from the rule-generation slides): the lattice of rules derived from the frequent itemset {A, B, C, D}, from ABCD => {} at the top, through BCD => A, ACD => B, ABD => C, ABC => D and CD => AB, BD => AC, BC => AD, AD => BC, AC => BD, AB => CD, down to D => ABC, C => ABD, B => ACD, A => BCD. A low-confidence rule and the rules below it in the lattice are pruned.]

Slide 40/55

Rule Generation for Apriori Algorithm

A candidate rule is generated by merging two rules that share the same prefix in the rule consequent:

  join(CD => AB, BD => AC) would produce the candidate rule D => ABC

  Prune rule D => ABC if its subset AD => BC does not have high confidence
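A small sketch of this rule-generation strategy (an illustration, not the paper's pseudocode): it grows consequents level by level, extends only the high-confidence survivors, and uses the market-basket transactions from the earlier slides:

```python
from itertools import combinations

# The market-basket transactions used throughout the slides.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def sigma(itemset):
    """Support count of an itemset."""
    return sum(1 for t in transactions if itemset <= t)

def gen_rules(itemset, min_conf):
    """Generate rules X -> (itemset - X) with confidence >= min_conf,
    growing consequents level by level and extending only survivors."""
    itemset = frozenset(itemset)
    sup = sigma(itemset)
    rules, consequents = [], [frozenset([i]) for i in itemset]
    while consequents:
        survivors = []
        for Y in consequents:
            X = itemset - Y
            if not X:
                continue
            conf = sup / sigma(X)
            if conf >= min_conf:
                rules.append((set(X), set(Y), conf))
                survivors.append(Y)   # only high-confidence consequents grow
        # Merge surviving consequents that differ in a single item.
        consequents = {a | b for a, b in combinations(survivors, 2)
                       if len(a | b) == len(a) + 1}
    return rules

for X, Y, conf in gen_rules({"Milk", "Diaper", "Beer"}, min_conf=0.6):
    print(X, "->", Y, round(conf, 2))
```

With min_conf = 0.6 this reproduces the high-confidence rules from the {Milk, Diaper, Beer} example: {Milk, Diaper} → {Beer}, {Milk, Beer} → {Diaper}, {Diaper, Beer} → {Milk}, and {Beer} → {Milk, Diaper}.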


Slide 41/55

    Is Apriori Fast Enough?

Performance Bottlenecks

The core of the Apriori algorithm:
  Use frequent (k-1)-itemsets to generate candidate frequent k-itemsets
  Use database scans and pattern matching to collect counts for the candidate itemsets

The bottleneck of Apriori: candidate generation
  Huge candidate sets:
    10^4 frequent 1-itemsets will generate 10^7 candidate 2-itemsets
    To discover a frequent pattern of size 100, e.g., {a1, a2, ..., a100}, one needs to generate 2^100 ≈ 10^30 candidates
  Multiple scans of the database:
    Needs (n + 1) scans, where n is the length of the longest pattern


Slide 42/55

    FP-growth: Mining Frequent Patterns

    Without Candidate Generation

Compress a large database into a compact Frequent-Pattern tree (FP-tree) structure
  highly condensed, but complete for frequent pattern mining
  avoids costly database scans

Develop an efficient, FP-tree-based frequent pattern mining method
  A divide-and-conquer methodology: decompose mining tasks into smaller ones
  Avoid candidate generation: sub-database test only!


Slide 43/55

FP-tree Construction from a Transactional DB

min_support = 3

Item  Frequency
f     4
c     4
a     3
b     3
m     3
p     3

TID  Items bought               (Ordered) frequent items
100  {f, a, c, d, g, i, m, p}   {f, c, a, m, p}
200  {a, b, c, f, l, m, o}      {f, c, a, b, m}
300  {b, f, h, j, o}            {f, b}
400  {b, c, k, s, p}            {c, b, p}
500  {a, f, c, e, l, p, m, n}   {f, c, a, m, p}

Steps:

1. Scan the DB once and find the frequent 1-itemsets (single-item patterns)
2. Order the frequent items in descending order of their frequency
3. Scan the DB again and construct the FP-tree (a code sketch follows below)
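A minimal sketch of these three steps (an illustration of the idea, not the original FP-growth code). Node-links and the header table are omitted, and items with equal frequency may be ordered differently than in the slides' figures, which does not affect correctness:

```python
from collections import Counter

class FPNode:
    def __init__(self, item, parent=None):
        self.item, self.count, self.parent, self.children = item, 0, parent, {}

def build_fp_tree(transactions, min_support):
    # Step 1: scan once, count single items, keep the frequent ones
    counts = Counter(item for t in transactions for item in t)
    frequent = {i: c for i, c in counts.items() if c >= min_support}
    # Step 2: rank frequent items by descending frequency
    order = {i: rank for rank, (i, _) in
             enumerate(sorted(frequent.items(), key=lambda kv: -kv[1]))}
    # Step 3: scan again, insert each ordered frequent-item list into the tree
    root = FPNode(None)
    for t in transactions:
        items = sorted((i for i in t if i in frequent), key=order.get)
        node = root
        for item in items:
            node = node.children.setdefault(item, FPNode(item, node))
            node.count += 1
    return root

def show(node, depth=0):
    if node.item is not None:
        print("  " * depth + f"{node.item}:{node.count}")
    for child in node.children.values():
        show(child, depth + 1)

DB = [list("facdgimp"), list("abcflmo"), list("bfhjo"),
      list("bcksp"), list("afcelpmn")]
show(build_fp_tree(DB, min_support=3))
```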

Slide 44/55

    FP-tree Construction

[Figure: the FP-tree after inserting TID 100 ({f, c, a, m, p}): a single path root → f:1 → c:1 → a:1 → m:1 → p:1.]

Slide 45/55

    FP-tree Construction

[Figure: after also inserting TID 200 ({f, c, a, b, m}): the shared prefix is incremented to f:2 → c:2 → a:2, and under a:2 there are two branches, m:1 → p:1 and b:1 → m:1.]

Slide 46/55

    FP-tree Construction

[Figure: after also inserting TID 300 ({f, b}) and TID 400 ({c, b, p}): f becomes f:3 and gains a new child b:1, and a second branch c:1 → b:1 → p:1 is added directly under the root.]

Slide 47/55

    FP-tree Construction

[Figure: the complete FP-tree after all five transactions:
  root → f:4 → c:3 → a:3 → m:2 → p:2
                      a:3 → b:1 → m:1
         f:4 → b:1
  root → c:1 → b:1 → p:1
A header table (f:4, c:4, a:3, b:3, m:3, p:3) links each item, via node-links, to its occurrences in the tree.]

Slide 48/55

Benefits of the FP-tree Structure

Completeness
  never breaks a long pattern of any transaction
  preserves complete information for frequent pattern mining

Compactness
  reduces irrelevant information: infrequent items are gone
  frequency-descending ordering: more frequent items are more likely to be shared
  never larger than the original database (not counting node-links and counts)
  Example: for the Connect-4 DB, the compression ratio could be over 100


Slide 49/55

Mining Frequent Patterns Using the FP-tree

General idea (divide-and-conquer)
  Recursively grow frequent patterns using the FP-tree

Method
  For each item, construct its conditional pattern base, and then its conditional FP-tree
  Repeat the process on each newly created conditional FP-tree
  Until the resulting FP-tree is empty, or it contains only one path
  (a single path will generate all the combinations of its sub-paths, each of which is a frequent pattern)

Slide 50/55

Mining Frequent Patterns Using the FP-tree (cont'd)

Start with the last item in the order (i.e., p). Follow its node-links and traverse only the paths containing p. Accumulate all of the transformed prefix paths of that item to form its conditional pattern base.

  Conditional pattern base for p: fcam:2, cb:1

[Figure: the two FP-tree paths containing p: f:4 → c:3 → a:3 → m:2 → p:2 and c:1 → b:1 → p:1.]

Construct a new FP-tree from this conditional pattern base by merging the paths and keeping only the nodes that appear at least min_support times. This leaves a single branch, c:3. Thus we derive only one frequent pattern containing p (besides p itself): cp.

Slide 51/55

Mining Frequent Patterns Using the FP-tree (cont'd)

Move to the next least frequent item in the order, i.e., m. Follow its node-links and traverse only the paths containing m. Accumulate all of the transformed prefix paths of that item to form its conditional pattern base.

[Figure: the FP-tree paths containing m: f:4 → c:3 → a:3 → m:2 and f:4 → c:3 → a:3 → b:1 → m:1.]

m-conditional pattern base: fca:2, fcab:1

m-conditional FP-tree: the single path root → f:3 → c:3 → a:3 (b is dropped because its count, 1, is below min_support)

All frequent patterns that include m:

  m, fm, cm, am, fcm, fam, cam, fcam
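Because the m-conditional FP-tree is a single path (f:3, c:3, a:3), every subset of {f, c, a} combined with m is a frequent pattern; a tiny sketch (illustrative only):

```python
from itertools import chain, combinations

def patterns_from_single_path(path_items, suffix):
    """Single-path case: every sub-path combined with the suffix is frequent."""
    subsets = chain.from_iterable(combinations(path_items, r)
                                  for r in range(len(path_items) + 1))
    return [set(s) | {suffix} for s in subsets]

print(patterns_from_single_path(["f", "c", "a"], "m"))
# m, fm, cm, am, fcm, fam, cam, fcam (8 patterns)
```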

Slide 52/55

Properties of the FP-tree for Conditional Pattern Base Construction

Node-link property
  For any frequent item ai, all the possible frequent patterns that contain ai can be obtained by following ai's node-links, starting from ai's head in the FP-tree header table.

Prefix-path property
  To calculate the frequent patterns for a node ai in a path P, only the prefix sub-path of ai in P needs to be accumulated, and its frequency count should carry the same count as node ai.

Slide 53/55

Conditional Pattern Bases for the Example

Item  Conditional pattern base      Conditional FP-tree
f     empty                         empty
c     {(f:3)}                       {(f:3)}|c
a     {(fc:3)}                      {(f:3, c:3)}|a
b     {(fca:1), (f:1), (c:1)}       empty
m     {(fca:2), (fcab:1)}           {(f:3, c:3, a:3)}|m
p     {(fcam:2), (cb:1)}            {(c:3)}|p

Slide 54/55

    Why Is Frequent Pattern Growth Fast?

Performance studies show that
  FP-growth is an order of magnitude faster than Apriori, and is also faster than tree-projection

Reasoning
  No candidate generation, no candidate test
  Uses a compact data structure
  Eliminates repeated database scans
  The basic operations are counting and FP-tree building


Slide 55/55

FP-growth vs. Apriori: Scalability with the Support Threshold

[Chart: runtime (sec., 0 to 100) versus support threshold (%, 0 to 3) on data set T25I20D10K, with two series: D1 FP-growth runtime and D1 Apriori runtime. FP-growth scales much better than Apriori as the support threshold decreases.]
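For readers who want to try both algorithms on their own data, a short usage sketch. It assumes the third-party mlxtend library (not mentioned in the slides), whose frequent_patterns module provides apriori, fpgrowth, and association_rules:

```python
# pip install mlxtend pandas
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, fpgrowth, association_rules

dataset = [["Bread", "Milk"],
           ["Bread", "Diaper", "Beer", "Eggs"],
           ["Milk", "Diaper", "Beer", "Coke"],
           ["Bread", "Milk", "Diaper", "Beer"],
           ["Bread", "Milk", "Diaper", "Coke"]]

# One-hot encode the transactions into a boolean DataFrame
te = TransactionEncoder()
df = pd.DataFrame(te.fit(dataset).transform(dataset), columns=te.columns_)

# Frequent itemsets with support >= 0.6, by either algorithm
freq_apriori = apriori(df, min_support=0.6, use_colnames=True)
freq_fpgrowth = fpgrowth(df, min_support=0.6, use_colnames=True)

# Rules with confidence >= 0.7 from the frequent itemsets
rules = association_rules(freq_apriori, metric="confidence", min_threshold=0.7)
print(rules[["antecedents", "consequents", "support", "confidence"]])
```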