frequent itemset mining - page d'accueil / lirmm.fr / -...
TRANSCRIPT
ì
FrequentItemsetMining
NadjibLAZAAR
LIRMM-UMCOCONUTTeam
(PARTI)
IMAGINA18/19
Webpage:http://www.lirmm.fr/~lazaar/teaching.htmlEmail:[email protected]
�1
DataMining
ì DataMining(DM)orKnowledgeDiscoveryinDatabases(KDD)revolvesaroundtheinvestigationandcreationofknowledge,processes,algorithms,andthemechanismsforretrievingpotentialknowledgefromdatacollections.
�2
GameDataMining
ì Dataaboutplayersbehavior,serverperformance,systemfunctionality…
ì Howtoconvertthesedataintosomethingmeaningful?
ì Howtomovefromrawdatatoactionableinsights?
➔ Gamedataminingistheanswer
�3
FrequentItemsetMining:Motivations
FrequentItemsetMiningisamethodformarketbasketanalysis.
Itaimsatfindingregularitiesintheshoppingbehaviorofcustomersofsupermarkets,mail-ordercompanies,on-lineshopsetc.
ì Morespecifically:Findsetsofproductsthatarefrequentlyboughttogether.
ì Possibleapplicationsoffoundfrequentitemsets:ì Improvearrangementofproductsinshelves,onacatalog’spagesetc.ì Supportcross-selling(suggestionofotherproducts),productbundling.ì Frauddetection,technicaldependenceanalysis,faultlocalization…etc.
ì Oftenfoundpatternsareexpressedasassociationrules,forexample:ì Ifacustomerbuysbreadandwine,thenshe/hewillprobablyalsobuycheese.
�4
FrequentItemsetMining:Basicnotions
ì Items:
ì Itemset,transaction:
ì Transactionaldataset:
ì Languageofitemsets:
ì Coverofanitemset:
ì (absolute)Frequency:
�5
I = {i1, …, in}
P, T, ⊆ I
D = {T1, …, Tm}
ℒI = 2I
cover(P) = {i |Ti ∈ D ∧ P ⊆ Ti}
freq(P) = |cover(P) |
Absolute/relativefrequency
ì AbsoluteFrequency:
ì RelativeFrequency:
�6
freq(P) = |cover(P) |
freq(P) =1
|D ||cover(P) |
FrequentItemsetMining:Definition
ì Given:ì Asetofitemsì Atransactionaldatasetì Aminimumsupport
ì Theneed:ì ThesetofitemsetPs.t.:
�7
I = {i1, …, in}D = {T1, …, Tm}
θ
freq(P) ≥ θ
Example(1)
�8
I = {a, b, c, d, e}, D = {T1, …, T10}
cover(bc) = {2,7,9}
freq(bc) = 3
Example(1)
�9
Example(1)
�10
7 3
3 3
3
7 76
64 4 4 4
4
50 1 1
0
0
0 0 0
0 0 1 1 0
0
2
2
Example(1)
�11
7 3
3 3
3
7 76
64 4 4 4
4
50 1 1
0
0
0 0 0
0 0 1 1 0
0
2
2
Frequentitemset?
Example(1)
�12
7 3
3 3
3
7 76
64 4 4 4
4
5 1 1
1 1 2
2
Frequentitemsetwithminimumsupportθ=3?
SearchingforFrequentItemsets
ì Anaïvesearchthatconsistsofenumeratingandtestingthefrequencyofitemsetcandidatesinagivendatasetisusuallyinfeasible.
ì Why?
�13
Numberofitems(n) Searchspace(2n)
10 ≈103
20 ≈106
30 ≈109
100 ≈1030
128 ≈1068(atomsintheuniverse)
1000 ≈10301
Anti-monotonicityproperty
ì GivenatransactiondatabaseDoveritemsIandtwoitemsetsX,Y:
ì Thatis,
�14
X ⊆ Y ⇒ cover(Y ) ⊆ cover(X)
X ⊆ Y ⇒ freq(Y ) ≤ freq(X)
Example(2)
�15
7 3
3 3
3
7 76
64 4 4 4
4
50 1 1
0
0
0 0 0
0 0 1 1 0
0
2
2
ade
acde
cover(ade) = {1,4,8,10}, freq(ade) = 4
cover(acde) = {4,8}freq(acde) = 2
Aprioriproperty
ì GivenatransactiondatabaseDoveritemsI,aminsupθandtwoitemsetsX,Y:
ì Itfollows:
ì Contraposition:
�16
Allsubsetsofafrequentitemsetarefrequent!
Allsupersetsofaninfrequentitemsetareinfrequent!
X ⊆ Y ⇒ freq(Y ) ≤ freq(X)
X ⊆ Y ⇒ ( freq(Y ) ≥ θ ⇒ freq(X) ≥ θ)
X ⊆ Y ⇒ ( freq(X) < θ ⇒ freq(Y ) < θ)
Example(3)
�17
7 3
3 3
3
7 76
64 4 4 4
4
50 1 1
0
0
0 0 0
0 0 1 1 0
0
2
2acde
Allsubsetsofafrequentitemsetarefrequent!
cdeadeaceacd
ac ad ae cd ce de
a c d eθ = 2
Example(3)
�18
7 3
3 3
3
7 76
64 4 4 4
4
50 1 1
0
0
0 0 0
0 0 1 1 0
0
2
2
be
bce bdeabe
bcdeabdeabce
abcde
Allsupersetsofaninfrequentitemsetareinfrequent!
θ = 2
Partiallyorderedsets
ì Apartialorderisabinaryrelationoveraset:
�19
(reflexivity)
(anti-symmetry)
(transitivity)
ì Comparableitemsets:
ì Incomparableitemsets:
�20
AprioriAlgorithm[AgrawalandSrikant1994]
ì Determinethesupportoftheone-elementitemsets(i.e.singletons)anddiscardtheinfrequentitems.
ì Formcandidateitemsetswithtwoitems(bothitemsmustbefrequent),determinetheirsupport,anddiscardtheinfrequentitemsets.
ì Formcandidateitemsetswiththreeitems(allcontainedpairsmustbefrequent),determinetheirsupport,anddiscardtheinfrequentitemsets.
ì Andsoon!
�21
Basedoncandidategenerationandpruning
AprioriAlgorithm[AgrawalandSrikant1994]
�22
Aprioricandidatesgeneration
�23
Improvingcandidatesgeneration
ì Usingapriori-genfunction,anitemofk+1sizecanbegeneratedinajpossibleways:
ì Need:Generateitemsetcandidateatmostonce.
ì How:Assigntoeachitemsetauniqueparentitemset,fromwhichthisitemsetistobegenerated
�24
Improvingcandidatesgeneration
ì Assigninguniqueparentsturnstheposetlatticeintoatree:
�25
Canonicalformforitemsets
ì Anitemsetcanberepresentedasawordoveranalphabet
ì Q:howmanywordsof3itemscanwehave?Of4items?Ofkitems?
ì Anarbitraryorder(e.g.,lexicographyorder)onitemscangiveacanonicalform,auniquerepresentationofitemsetsbybreakingsymmetries.ì Lexonitems:
�26
RecursiveprocessingwithCanonicalforms
ì ForeachPofagivenlevel,generateallpossibleextensionofPbyoneitemsuchthat:
ì ForeachP’,processitrecursively.
�27
Example(4)
�28
7 3
3 3
3
7 76
64 4 4 4
4
50 1 1
0
0
0 0 0
0 0 1 1 0
0
2
2
Q:whatarethechildrenof:
✗
ItemsOrdering
ì Anyordercanbeused,thatis,theorderisarbitrary
ì Thesearchspacediffersconsiderablydependingontheorder
ì Thus,theefficiencyoftheFrequentItemsetMiningalgorithmscandifferconsiderablydependingontheitemorder
ì Advancedmethodsevenadapttheorderoftheitemsduringthesearch:usedifferent,but“compatible”ordersindifferentbranches
�29
ItemsOrdering(heuristics)
ì Frequentitemsetsconsistoffrequentitemsì Sorttheitemsw.r.t.theirfrequency.(decreasing/increasing)
ì Thesumoftransactionsizes,transactioncontainingagivenitem,whichcapturesimplicitlythefrequencyofpairs,tripletsetc.ì Sortitemsw.r.t.thesumofthesizesofthetransactionsthat
coverthem.
�30
ì
Tutorials
link:http://www.lirmm.fr/~lazaar/imagina/TD1.pdf
�31