
EFFICIENT ITEMSET EXTRACTION USING IMINE INDEX

By

U.P.Pushpavalli

II Year ME(CSE)

OBJECTIVE

The main objective is to provide index support for frequent itemset mining.

To provide a compact and complete structure for item set extraction.

Implemented by FP-based and LCM-based algorithms.

A frequent itemset is an itemset whose support is ≥ minsup.

Support: for a rule of the form A=>B, support is the percentage of transactions in D that contain A ∪ B.

Confidence: for a rule of the form A=>B, confidence is the conditional probability that B is true when A is known to be true:

confidence = support(LHS ∪ RHS) / support(LHS)
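The two definitions above can be checked numerically. The following is a minimal sketch, not the author's implementation: the function names `support` and `confidence` are illustrative, and the data are the four transactions used in the Apriori example of this deck.

```python
# Toy database: the four transactions from the Apriori example (TIDs 10..40).
TRANSACTIONS = [
    frozenset("acd"),
    frozenset("bce"),
    frozenset("abce"),
    frozenset("be"),
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    items = frozenset(itemset)
    hits = sum(1 for t in transactions if items <= t)
    return hits / len(transactions)


def confidence(lhs, rhs, transactions):
    """support(LHS u RHS) / support(LHS); assumes LHS occurs at least once."""
    joint = support(frozenset(lhs) | frozenset(rhs), transactions)
    return joint / support(lhs, transactions)


print(support("be", TRANSACTIONS))          # 0.75
print(confidence("b", "e", TRANSACTIONS))   # 1.0
```

Here {b, e} occurs in 3 of 4 transactions, and every transaction containing b also contains e, so the rule b=>e has confidence 1.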

Existing - Apriori Algorithm

Uses database scans and pattern matching to collect counts for the candidate itemsets.

Any subset of a frequent itemset must be frequent.

Apriori - Example

Database D (Min_sup = 2)

TID   Items
10    a, c, d
20    b, c, e
30    a, b, c, e
40    b, e

Scan D: 1-candidates

Itemset   Sup
a         2
b         3
c         3
d         1
e         3

Freq 1-itemsets

Itemset   Sup
a         2
b         3
c         3
e         3

2-candidates

Itemset
ab
ac
ae
bc
be
ce

Scan D: counting

Itemset   Sup
ab        1
ac        2
ae        1
bc        2
be        3
ce        2

Freq 2-itemsets

Itemset   Sup
ac        2
bc        2
be        3
ce        2

3-candidates

Itemset
bce

Scan D: Freq 3-itemsets

Itemset   Sup
bce       2
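The level-wise procedure traced in the example can be sketched as follows. This is a simplified illustration and not an optimized Apriori: the function name `apriori` and the set-based join step are choices made here, and support is counted as an absolute number of transactions, as in the example.

```python
from itertools import combinations


def apriori(transactions, min_sup):
    """Return {itemset: absolute support} for every frequent itemset."""
    transactions = [frozenset(t) for t in transactions]

    # First scan: count the 1-candidates.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_sup}
    result = dict(frequent)

    k = 2
    while frequent:
        prev = set(frequent)
        # Join: unions of two frequent (k-1)-itemsets that have exactly k items.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))
        }
        # One more database scan to count the surviving candidates.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {s: c for s, c in counts.items() if c >= min_sup}
        result.update(frequent)
        k += 1
    return result


D = ["acd", "bce", "abce", "be"]
found = apriori(D, min_sup=2)
print(found[frozenset("bce")])   # 2
```

On database D with Min_sup = 2 this yields the four frequent 1-itemsets, the four frequent 2-itemsets, and the single frequent 3-itemset bce. Each pass of the `while` loop corresponds to one "Scan D" step.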

Bottleneck of Apriori:

Huge candidate sets

Multiple scans of the database

Mining Frequent Patterns - Without Candidate Generation

A large database is compressed into a compact Frequent-Pattern tree (FP-tree) structure

Highly condensed, but complete for frequent pattern mining

Avoids costly database scans

Divide-and-conquer methodology

Avoids candidate generation

FP-tree

min_support = 3

TID   Items bought                (ordered) frequent items
100   {f, a, c, d, g, i, m, p}    {f, c, a, m, p}
200   {a, b, c, f, l, m, o}       {f, c, a, b, m}
300   {b, f, h, j, o}             {f, b}
400   {b, c, k, s, p}             {c, b, p}
500   {a, f, c, e, l, p, m, n}    {f, c, a, m, p}

Header Table (the head of each entry points to the item's first node in the tree)

Item   Frequency
f      4
c      4
a      3
b      3
m      3
p      3

Tree (indentation shows parent-child links)

{}
    f:4
        c:3
            a:3
                m:2
                    p:2
                b:1
                    m:1
        b:1
    c:1
        b:1
            p:1
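The construction of this FP-tree can be sketched in a few lines. This is a minimal illustration under assumptions: the names `FPNode` and `build_fp_tree` are made up here, and the `item_order` argument exists only so that ties in frequency (f and c both occur 4 times) can be broken the same way as in the header table.

```python
from collections import Counter


class FPNode:
    def __init__(self, item, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}  # item -> FPNode


def build_fp_tree(transactions, min_support, item_order=None):
    # Scan 1: count in how many transactions each item occurs.
    freq = Counter(item for t in transactions for item in set(t))
    frequent = {i: c for i, c in freq.items() if c >= min_support}
    if item_order is None:
        # Default: descending frequency, ties broken alphabetically.
        item_order = sorted(frequent, key=lambda i: (-frequent[i], i))
    rank = {item: r for r, item in enumerate(item_order) if item in frequent}

    root = FPNode(None)
    header = {item: [] for item in rank}  # item -> its nodes in the tree
    # Scan 2: insert each transaction's frequent items in rank order.
    for t in transactions:
        node = root
        for item in sorted((i for i in set(t) if i in rank), key=rank.get):
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, parent=node)
                node.children[item] = child
                header[item].append(child)
            child.count += 1
            node = child
    return root, header, frequent


DB = ["facdgimp", "abcflmo", "bfhjo", "bcksp", "afcelpmn"]
root, header, frequent = build_fp_tree(DB, 3, item_order="fcabmp")
print(root.children["f"].count)                  # 4
print(sorted(n.count for n in header["p"]))      # [1, 2]
```

The two passes over `transactions` correspond to the two database scans that FP-tree construction needs. The `header` dictionary plays the role of the header table, keeping every node of an item rather than a linked chain.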

Drawbacks:

Requires two database scans

The tree must be rebuilt for every support threshold

High memory utilization

IMINE - PROPOSED SYSTEM

Covering index.

No constraints are enforced during the index creation phase.

Efficiently exploited by various item set extraction algorithms.

Physical organization supports efficient data access during item set extraction.

Supports item set extraction in large data sets.

Creating I-Tree based on the FP-tree data structure

Creating I-Btree based on the B+Tree structure

Extraction task – Reading selected I-Tree portions.

Data access methods: frequent-item based, support-based, and item-based projection

Designing IMine Physical organization to reduce I/O

Item set mining: implementing FP-based and LCM algorithms

Performance evaluation

System Flow Diagram

MODULES:

Implementation of I-Tree and I-Btree

IMine Data Access Methods

IMine Physical Organization

Item set mining using FP-based and LCM algorithms

Index Structure

Characterized by 2 components that provide 2 levels of indexing:

I-Tree (Itemset-Tree)
    Prefix-tree based on the FP-tree data structure.
    Scans the database once.

I-Btree (Item-Btree)
    Allows reading selected I-Tree portions during extraction.

IMine: I-Tree (figure). Each node has a parent pointer, a first child pointer, and a right brother pointer.

IMine: I-Btree (figure).
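The three node pointers and the item-level lookup can be illustrated with a small in-memory sketch. Everything here is an assumption made for illustration: the class `INode`, the helper names, the sample supports, and the plain dictionary `item_index`, which merely stands in for the disk-based I-Btree.

```python
class INode:
    """One I-Tree node: an item, its support, and the three pointers."""

    def __init__(self, item, support, parent=None):
        self.item = item
        self.support = support
        self.parent = parent          # parent pointer
        self.first_child = None       # first child pointer
        self.right_brother = None     # right brother pointer


def add_child(parent, item, support, item_index):
    """Create a node, append it to the parent's brother chain, index it."""
    node = INode(item, support, parent)
    if parent.first_child is None:
        parent.first_child = node
    else:
        last = parent.first_child
        while last.right_brother is not None:
            last = last.right_brother
        last.right_brother = node
    item_index.setdefault(item, []).append(node)
    return node


def children(node):
    """Iterate over a node's children via first child and right brothers."""
    child = node.first_child
    while child is not None:
        yield child
        child = child.right_brother


# Hypothetical sample tree; the supports are illustrative values only.
item_index = {}
root = INode("b", 10)
item_index["b"] = [root]
e = add_child(root, "e", 7, item_index)
h_low = add_child(root, "h", 2, item_index)
h = add_child(e, "h", 7, item_index)

print([c.item for c in children(root)])          # ['e', 'h']
print([n.support for n in item_index["h"]])      # [2, 7]
```

Storing only a first-child and a right-brother pointer keeps every node the same fixed size regardless of how many children it has, which is convenient when nodes are laid out in disk blocks.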

I-TREE

I-Tree layers:

Top layer
    Very frequently accessed during the mining process.
    Nodes with high support are stored.

Middle layer
    Quite frequently accessed during the mining process.

Bottom layer
    Rarely accessed during the mining process.
    Nodes with unitary support are stored.

Physical organization:

Minimize the cost of reading the data needed for the current extraction process.

Correlation types:

Intratransaction correlation
    I-Tree layers

Intertransaction correlation
    I-Tree path correlation

I/O analysis for index data access:

Through the I-Btree, block 3 is loaded into the buffer cache.

Following the node parent pointer, block 1 is loaded.

[p:3]→[d:5]→[h:7]→[e:7]→[b:10] is now in memory.

If the 2 blocks are still in the buffer cache, reading other prefix paths does not require additional disk reads.
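The effect of the buffer cache on disk reads can be simulated with a toy model. This is a sketch under loud assumptions: the `BufferCache` class, its LRU eviction policy, the `BLOCK_OF` placement of nodes in blocks 3 and 1, and the second path are all hypothetical and only mirror the example above.

```python
from collections import OrderedDict


class BufferCache:
    """Tiny LRU buffer cache that counts block loads from disk."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()   # block id -> True, oldest first
        self.disk_reads = 0

    def access(self, block_id):
        if block_id in self.blocks:
            self.blocks.move_to_end(block_id)   # cache hit, refresh recency
            return
        self.disk_reads += 1                    # cache miss, load from disk
        self.blocks[block_id] = True
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)     # evict least recently used


# Hypothetical placement of the example nodes in disk blocks.
BLOCK_OF = {"p:3": 3, "d:5": 3, "h:7": 1, "e:7": 1, "b:10": 1}


def read_path(path, cache):
    """Touch the block of every node on the path; return total disk reads."""
    for node in path:
        cache.access(BLOCK_OF[node])
    return cache.disk_reads


cache = BufferCache(capacity=2)
print(read_path(["p:3", "d:5", "h:7", "e:7", "b:10"], cache))   # 2
print(read_path(["d:5", "h:7", "e:7", "b:10"], cache))          # 2
```

With room for both blocks, the second path is served entirely from the cache. With `capacity=1` the same two reads would cost four block loads, which is why keeping correlated nodes in the same few blocks matters.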

IMine data access methods:

Frequent-item based projection
    Supports projection-based algorithms, e.g. FP-growth.

Support-based projection
    Supports level-based and array-based algorithms, e.g. Apriori and LCM v.2.

Item-based projection
    Loads all transactions containing the selected item.

Loading frequent-item based projected DB:

Ex: item p appears in 2 nodes, [p:3] and [p:2].

Starting from the I-Btree, the 2 prefix paths for p are read:

[p:3→d:5→h:7→e:7→b:10]

[p:2→i:2→h:3→e:3]
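The example for item p can be reproduced with a short sketch. It assumes that each prefix path is read by following parent pointers from the item's nodes up to the root; the `Node` class, the helper names, and the dictionary standing in for the I-Btree are illustrative only.

```python
class Node:
    def __init__(self, item, support, parent=None):
        self.item = item
        self.support = support
        self.parent = parent

    def label(self):
        return f"{self.item}:{self.support}"


def chain(pairs):
    """Build a root-to-leaf chain from (item, support) pairs; return the leaf."""
    node = None
    for item, support in pairs:
        node = Node(item, support, parent=node)
    return node


def prefix_path(node):
    """Labels from `node` up to its root, following parent pointers."""
    path = []
    while node is not None:
        path.append(node.label())
        node = node.parent
    return path


def frequent_item_projection(item, item_index):
    """One prefix path per I-Tree node in which `item` appears."""
    return [prefix_path(n) for n in item_index.get(item, [])]


# The two paths of the example, listed root first.
p3 = chain([("b", 10), ("e", 7), ("h", 7), ("d", 5), ("p", 3)])
p2 = chain([("e", 3), ("h", 3), ("i", 2), ("p", 2)])
item_index = {"p": [p3, p2]}   # stand-in for the I-Btree entry of item p

for path in frequent_item_projection("p", item_index):
    print("→".join(path))
```

Only the nodes on these two paths are touched, so the rest of the tree never has to be read for this projection.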

Loading support-based projected DB:

Given the I-Tree, reads the subpaths between the I-Tree roots and the first node with an infrequent item.

Reads a node subtree by means of a top-down depth-first I-Tree visit, exploiting both the node child and brother pointers.
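A top-down visit of this kind can be sketched as follows. The sketch makes several assumptions: roots are chained through the brother pointer, item supports are supplied as a dictionary, and the sample tree and supports are illustrative values, not data from the deck's evaluation.

```python
class INode:
    def __init__(self, item, support):
        self.item = item
        self.support = support
        self.first_child = None
        self.right_brother = None


def chain(pairs):
    """Build a single root-to-leaf path; return its root."""
    nodes = [INode(item, support) for item, support in pairs]
    for parent, child in zip(nodes, nodes[1:]):
        parent.first_child = child
    return nodes[0]


def support_projection(first_root, item_support, min_sup):
    """Depth-first visit that cuts each path at the first infrequent item."""
    paths = []

    def visit(node, prefix):
        path = prefix + [f"{node.item}:{node.support}"]
        extended = False
        child = node.first_child
        while child is not None:                  # walk the brother chain
            if item_support[child.item] >= min_sup:
                visit(child, path)
                extended = True
            child = child.right_brother
        if not extended:
            paths.append(path)                    # maximal frequent subpath

    root = first_root
    while root is not None:
        if item_support[root.item] >= min_sup:
            visit(root, [])
        root = root.right_brother
    return paths


# Hypothetical tree with two roots, linked as brothers.
root_b = chain([("b", 10), ("e", 7), ("h", 7), ("d", 5), ("p", 3)])
root_e = chain([("e", 3), ("h", 3), ("i", 2), ("p", 2)])
root_b.right_brother = root_e
ITEM_SUPPORT = {"b": 10, "e": 10, "h": 10, "d": 5, "i": 2, "p": 5}

print(support_projection(root_b, ITEM_SUPPORT, min_sup=6))
# [['b:10', 'e:7', 'h:7'], ['e:3', 'h:3']]
```

Raising `min_sup` shortens the subpaths that are read, so a higher threshold loads a smaller part of the index.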

Item Set Mining

Step 1: The needed index data is loaded.

Step 2: Item set extraction takes place on the loaded data.

I-MINE (figure)

I_BTree (figure)

IMINE - Execution Time (figure)

IMINE - Memory Usage (figure)

Software Specification

Operating system : Windows XP/Vista

Language : JDK 1.6.1 and above

Back End : SQL Server 2000

Conclusion

Provides a complete and compact representation of transactional data.

Supports different algorithmic approaches to item set extraction.

Performance is better than the existing FP-growth and LCM v.2 algorithms.

Future Enhancements

Compact structure suitable for different data distributions

Incremental update of the index

Thank You
