presented by yaron gonen. outline introduction problems definition and motivation previous work the...
Post on 21-Dec-2015
214 views
TRANSCRIPT
![Page 1: Presented by Yaron Gonen. Outline Introduction Problems definition and motivation Previous work The CAMLS Algorithm Overview Main contributions Results](https://reader038.vdocuments.us/reader038/viewer/2022103123/56649d695503460f94a475db/html5/thumbnails/1.jpg)
Constraint Mining of Frequent Patterns in Long Sequences
Presented by Yaron Gonen
![Page 2: Presented by Yaron Gonen. Outline Introduction Problems definition and motivation Previous work The CAMLS Algorithm Overview Main contributions Results](https://reader038.vdocuments.us/reader038/viewer/2022103123/56649d695503460f94a475db/html5/thumbnails/2.jpg)
OutlineIntroduction
Problems definition and motivationPrevious work
The CAMLS AlgorithmOverviewMain contributionsResults
Future Work
![Page 3: Presented by Yaron Gonen. Outline Introduction Problems definition and motivation Previous work The CAMLS Algorithm Overview Main contributions Results](https://reader038.vdocuments.us/reader038/viewer/2022103123/56649d695503460f94a475db/html5/thumbnails/3.jpg)
Frequent Item-sets:The Market-Basket Model
A set of items, e.g., stuff sold in a supermarket
A set of baskets, (later called events or transactions) each of which is a small set of the items, e.g., the things one customer buys on one day.
![Page 4: Presented by Yaron Gonen. Outline Introduction Problems definition and motivation Previous work The CAMLS Algorithm Overview Main contributions Results](https://reader038.vdocuments.us/reader038/viewer/2022103123/56649d695503460f94a475db/html5/thumbnails/4.jpg)
SupportSupport for item-set I = the number of
baskets containing all items in I (Usually given as a percentage)
Given a support threshold minSup, sets of items that appear in > minSup baskets are called frequent item-sets
Simplest question: find sets of frequent item-sets
![Page 5: Presented by Yaron Gonen. Outline Introduction Problems definition and motivation Previous work The CAMLS Algorithm Overview Main contributions Results](https://reader038.vdocuments.us/reader038/viewer/2022103123/56649d695503460f94a475db/html5/thumbnails/5.jpg)
ExampleItems:
Minimum Support = 0.6 (2 baskets)
![Page 6: Presented by Yaron Gonen. Outline Introduction Problems definition and motivation Previous work The CAMLS Algorithm Overview Main contributions Results](https://reader038.vdocuments.us/reader038/viewer/2022103123/56649d695503460f94a475db/html5/thumbnails/6.jpg)
Application (1)Items: products at a supermarketBaskets: set of products a customer bought
at one time.Example: many people by beer and diapers
together. Place beer next to diapers to increase both
salesRun a sale on diapers and raise price of beer.
![Page 7: Presented by Yaron Gonen. Outline Introduction Problems definition and motivation Previous work The CAMLS Algorithm Overview Main contributions Results](https://reader038.vdocuments.us/reader038/viewer/2022103123/56649d695503460f94a475db/html5/thumbnails/7.jpg)
Application (2)(Counter-Intuitive)Items: species of plantsBaskets: each basket represent an attribute.
A basket contains items (plants) that have that attribute
Frequent sets may indicate similarity between plants
![Page 8: Presented by Yaron Gonen. Outline Introduction Problems definition and motivation Previous work The CAMLS Algorithm Overview Main contributions Results](https://reader038.vdocuments.us/reader038/viewer/2022103123/56649d695503460f94a475db/html5/thumbnails/8.jpg)
Scale of ProblemCostco sells more than 120k different items,
and has 57m members (from Wikipedia)Botany has identified about 350k extant
species of plants
![Page 9: Presented by Yaron Gonen. Outline Introduction Problems definition and motivation Previous work The CAMLS Algorithm Overview Main contributions Results](https://reader038.vdocuments.us/reader038/viewer/2022103123/56649d695503460f94a475db/html5/thumbnails/9.jpg)
The Naïve AlgorithmGenerate all possible itemsets.Check their support.
I
, ,,, ,
,,
,,
… , I2
![Page 10: Presented by Yaron Gonen. Outline Introduction Problems definition and motivation Previous work The CAMLS Algorithm Overview Main contributions Results](https://reader038.vdocuments.us/reader038/viewer/2022103123/56649d695503460f94a475db/html5/thumbnails/10.jpg)
The Apriori PropertyAll nonempty subsets of a frequent itemset
must also be frequent.
XX
XX
![Page 11: Presented by Yaron Gonen. Outline Introduction Problems definition and motivation Previous work The CAMLS Algorithm Overview Main contributions Results](https://reader038.vdocuments.us/reader038/viewer/2022103123/56649d695503460f94a475db/html5/thumbnails/11.jpg)
The Apriori AlgorithmFind
frequent 1-itemsets
Merge and prune to generate
candidate of next size
Has candidate
s?
End
Go though whole DB to
count support
yes
no
> min support?
Frequent itemset
Here’s where the apriori property is
used.
Largest itemset’s length times
going over the DB
![Page 12: Presented by Yaron Gonen. Outline Introduction Problems definition and motivation Previous work The CAMLS Algorithm Overview Main contributions Results](https://reader038.vdocuments.us/reader038/viewer/2022103123/56649d695503460f94a475db/html5/thumbnails/12.jpg)
Vertical FormatIndex on items.Calculating support is fast
1 2 3
2 12
123
3
![Page 13: Presented by Yaron Gonen. Outline Introduction Problems definition and motivation Previous work The CAMLS Algorithm Overview Main contributions Results](https://reader038.vdocuments.us/reader038/viewer/2022103123/56649d695503460f94a475db/html5/thumbnails/13.jpg)
Frequent Sequences:Taking it to the Next Level
A large set of sequences. Each of which is a time ordered list of events (baskets), e.g., all the stuff a single customer buys over time
2 weeks 5 days
![Page 14: Presented by Yaron Gonen. Outline Introduction Problems definition and motivation Previous work The CAMLS Algorithm Overview Main contributions Results](https://reader038.vdocuments.us/reader038/viewer/2022103123/56649d695503460f94a475db/html5/thumbnails/14.jpg)
SupportSubsequence: a sequence, that all its events
are subsets of another sequence, in the same order (but not necessarily consecutive)
Support for subsequence s = the number of sequences containing s (Usually given as a percentage)
Given a support threshold minSup, subsequences that appear in > minSup sequences are called frequent subsequences
Simplest question: find all frequent subsequence
![Page 15: Presented by Yaron Gonen. Outline Introduction Problems definition and motivation Previous work The CAMLS Algorithm Overview Main contributions Results](https://reader038.vdocuments.us/reader038/viewer/2022103123/56649d695503460f94a475db/html5/thumbnails/15.jpg)
NotationsItems are letters: a,b,…Events are parenthesized: (ab), (bdf),…
Except for events with single itemsSequences are surrounded by <…>Every sequence has an identifier sid
![Page 16: Presented by Yaron Gonen. Outline Introduction Problems definition and motivation Previous work The CAMLS Algorithm Overview Main contributions Results](https://reader038.vdocuments.us/reader038/viewer/2022103123/56649d695503460f94a475db/html5/thumbnails/16.jpg)
Example
sid sequence
1 <a(abc)(ac)d(cf)>
2 <(ad)c(bc)(ae)>
3 <(ef)(ab)(df)cb>
4 <eg(af)cbc>
Frequent Sequence
s
<a>:4
<(a)(a)>:3
<(a)(c)>:4
<(a)(bc)>:2
<(e)(a)(c)>:2
…
minSup = 0.5
![Page 17: Presented by Yaron Gonen. Outline Introduction Problems definition and motivation Previous work The CAMLS Algorithm Overview Main contributions Results](https://reader038.vdocuments.us/reader038/viewer/2022103123/56649d695503460f94a475db/html5/thumbnails/17.jpg)
MotivationCustomer shopping patternsStock market fluctuationWeblog click stream analysisSymptoms of a diseasesDNA sequence analysisWeather forecastMachine anti agingMany more…
![Page 18: Presented by Yaron Gonen. Outline Introduction Problems definition and motivation Previous work The CAMLS Algorithm Overview Main contributions Results](https://reader038.vdocuments.us/reader038/viewer/2022103123/56649d695503460f94a475db/html5/thumbnails/18.jpg)
Much Harder than Frequent Item-sets!2m*n possible candidates!
Where m is the number of items, and n in the number of transactions in the longest sequence
![Page 19: Presented by Yaron Gonen. Outline Introduction Problems definition and motivation Previous work The CAMLS Algorithm Overview Main contributions Results](https://reader038.vdocuments.us/reader038/viewer/2022103123/56649d695503460f94a475db/html5/thumbnails/19.jpg)
The Apriori PropertyIf a sequence is not frequent, then any
sequence that contains it cannot be frequent
![Page 20: Presented by Yaron Gonen. Outline Introduction Problems definition and motivation Previous work The CAMLS Algorithm Overview Main contributions Results](https://reader038.vdocuments.us/reader038/viewer/2022103123/56649d695503460f94a475db/html5/thumbnails/20.jpg)
ConstraintsProblems:
Too many frequent sequencesmost frequent sequences are not useful
Solution remove themConstraints are a way to define usefulness The trick do so while mining
![Page 21: Presented by Yaron Gonen. Outline Introduction Problems definition and motivation Previous work The CAMLS Algorithm Overview Main contributions Results](https://reader038.vdocuments.us/reader038/viewer/2022103123/56649d695503460f94a475db/html5/thumbnails/21.jpg)
Previous WorkGSP (Srikant and Agrawal, 1996)
Generation-and-test Apriori Based approachSPADE (Zaki, 2001)
Generation-and-test Apriori Based approachUses equivalence-class for memory
optimizationUses a vertical-format db
PrefixSpan (Pei, 2004)No candidate generationUsing a db-projection method
![Page 22: Presented by Yaron Gonen. Outline Introduction Problems definition and motivation Previous work The CAMLS Algorithm Overview Main contributions Results](https://reader038.vdocuments.us/reader038/viewer/2022103123/56649d695503460f94a475db/html5/thumbnails/22.jpg)
Why a New Algorithm?Huge set of candidate-sequences/projected
db generatedMultiple Scans of database neededInefficient for mining long sequential
patternsNo exploits of domain-specific propertiesWeak constraints support
![Page 23: Presented by Yaron Gonen. Outline Introduction Problems definition and motivation Previous work The CAMLS Algorithm Overview Main contributions Results](https://reader038.vdocuments.us/reader038/viewer/2022103123/56649d695503460f94a475db/html5/thumbnails/23.jpg)
The CAMLS AlgorithmConstraint-based Apriori algorithm for
Mining Long SequencesDesigned especially for efficient mining of
long sequencesOutperforms SPADE and PrefixSpan on both
synthetic and real data
![Page 24: Presented by Yaron Gonen. Outline Introduction Problems definition and motivation Previous work The CAMLS Algorithm Overview Main contributions Results](https://reader038.vdocuments.us/reader038/viewer/2022103123/56649d695503460f94a475db/html5/thumbnails/24.jpg)
The CAMLS AlgorithmMakes a logical distinction between two types
of constraints:Intra-Event: not time related (i.e. mutually
exclusive items)Inter-Event: addresses the temporal aspect
of the data (i.e. values that can or cannot appear one after the other)
![Page 25: Presented by Yaron Gonen. Outline Introduction Problems definition and motivation Previous work The CAMLS Algorithm Overview Main contributions Results](https://reader038.vdocuments.us/reader038/viewer/2022103123/56649d695503460f94a475db/html5/thumbnails/25.jpg)
Event-wise ConstraintsEvent must/must not contain a specific itemTwo items cannot occur on the same timemax_event_length: An event cannot contain
more than a fixed number of items
![Page 26: Presented by Yaron Gonen. Outline Introduction Problems definition and motivation Previous work The CAMLS Algorithm Overview Main contributions Results](https://reader038.vdocuments.us/reader038/viewer/2022103123/56649d695503460f94a475db/html5/thumbnails/26.jpg)
Sequence–wise Constraintsmax_sequence_length: a sequence cannot
contain more than a fixed number of eventsmax_gap: long time between events
dismisses the pattern
![Page 27: Presented by Yaron Gonen. Outline Introduction Problems definition and motivation Previous work The CAMLS Algorithm Overview Main contributions Results](https://reader038.vdocuments.us/reader038/viewer/2022103123/56649d695503460f94a475db/html5/thumbnails/27.jpg)
CAMLS Overview
Constraints (minSup, maxGap, …)
Input
Event-wise
Sequence-wise
Output
Frequent events + occurrence index
![Page 28: Presented by Yaron Gonen. Outline Introduction Problems definition and motivation Previous work The CAMLS Algorithm Overview Main contributions Results](https://reader038.vdocuments.us/reader038/viewer/2022103123/56649d695503460f94a475db/html5/thumbnails/28.jpg)
What Do We Get?The best of both worlds:
Much less candidates are being generated.Support check is fast.Worst case: works like SPADE.Tradeoff: Uses a bit more memory (for storing
the frequent item-sets).
![Page 29: Presented by Yaron Gonen. Outline Introduction Problems definition and motivation Previous work The CAMLS Algorithm Overview Main contributions Results](https://reader038.vdocuments.us/reader038/viewer/2022103123/56649d695503460f94a475db/html5/thumbnails/29.jpg)
Event-wise PhaseInput: sequence database and constraintsOutput: frequent events + occurrence
indexUse Apriori or FP-Growth to find frequent
itemsets (both with minor modifications)
![Page 30: Presented by Yaron Gonen. Outline Introduction Problems definition and motivation Previous work The CAMLS Algorithm Overview Main contributions Results](https://reader038.vdocuments.us/reader038/viewer/2022103123/56649d695503460f94a475db/html5/thumbnails/30.jpg)
Event-wise1. L1 = all frequent items
2. for k=2;Lk-1≠Φ;k++ do
1. generateCandidates(Lk-1)
2. Lk = pruneCandidates()
3. L = L Lk
3. end for
Example soon!
If two frequent (k-1) event have the same prefix merge them and form a new candidate
Prune, calculate support count and create occurrence index
![Page 31: Presented by Yaron Gonen. Outline Introduction Problems definition and motivation Previous work The CAMLS Algorithm Overview Main contributions Results](https://reader038.vdocuments.us/reader038/viewer/2022103123/56649d695503460f94a475db/html5/thumbnails/31.jpg)
Occurrence IndexA compact representation of all occurrences
of a sequenceStructure: list of sids, each associated with a
list of eids
Example on next slide!
eid1
sid1
sid2
sid3
eid2 eid3
eid4 eid5
eid6 eid7 eid8 eid9
sequence
![Page 32: Presented by Yaron Gonen. Outline Introduction Problems definition and motivation Previous work The CAMLS Algorithm Overview Main contributions Results](https://reader038.vdocuments.us/reader038/viewer/2022103123/56649d695503460f94a475db/html5/thumbnails/32.jpg)
Event-wise Example(Using Apriori)
event eid sid
(acd) 0 1
(bcd) 5 1
b 10 1
a 0 2
c 4 2
(bd) 8 2
(cde) 0 3
e 7 3
(acd) 11 3
minSup=2
All frequent items:a:3, b:2, c:3, d:3
candidates:(ab),(ac),(ad),(bc),…
Support count:(ac):2, (ad):2, (bd):2, (cd):2
candidates:(abc), (abd),(acd),…
Support count:(acd):2
No more candidates!
13
0
11
![Page 33: Presented by Yaron Gonen. Outline Introduction Problems definition and motivation Previous work The CAMLS Algorithm Overview Main contributions Results](https://reader038.vdocuments.us/reader038/viewer/2022103123/56649d695503460f94a475db/html5/thumbnails/33.jpg)
Sequence-wise PhaseInput: frequent events + occurrence
index, constraintsOutput: all frequent sequencesSimilar to GSP’s and SPADE’s candidate
generation phase – except using the frequent itemsets as seeds
![Page 34: Presented by Yaron Gonen. Outline Introduction Problems definition and motivation Previous work The CAMLS Algorithm Overview Main contributions Results](https://reader038.vdocuments.us/reader038/viewer/2022103123/56649d695503460f94a475db/html5/thumbnails/34.jpg)
Sequence-wise1. L1 = all frequent 1-sequences
2. for k=2;Lk-1≠Φ;k++ do
1. generateCandidates(Lk-1)
2. Lk = pruneAndSupCalc()
3. L = L Lk
3. end forElaboration on next two slide
![Page 35: Presented by Yaron Gonen. Outline Introduction Problems definition and motivation Previous work The CAMLS Algorithm Overview Main contributions Results](https://reader038.vdocuments.us/reader038/viewer/2022103123/56649d695503460f94a475db/html5/thumbnails/35.jpg)
Sequence-wise Candidate GenerationIf two frequent k-sequences s’ and s’’ share a
common k-1 prefix and s1 is a generator, we form a new candidate
s‘ = <s’1s’2…s’k>
s’’ = <s’’1s’’2…s’’k><s’1s’2…s’k-1> =
<s’’1s’’2…s’’k-1>
<s’1s’2…s’k s’’k
>
![Page 36: Presented by Yaron Gonen. Outline Introduction Problems definition and motivation Previous work The CAMLS Algorithm Overview Main contributions Results](https://reader038.vdocuments.us/reader038/viewer/2022103123/56649d695503460f94a475db/html5/thumbnails/36.jpg)
Sequence-wise Pruning1. Keep a radix-ordered list of pruned sequences in current
iteration2. In the same iteration its possible that a k-sequence will
contain another k-sequence in the same iteration.3. With a new candidate:
1. Check subsequence in pruned list: Very Fast!2. Test for frequency3. Add to pruned list if needed
![Page 37: Presented by Yaron Gonen. Outline Introduction Problems definition and motivation Previous work The CAMLS Algorithm Overview Main contributions Results](https://reader038.vdocuments.us/reader038/viewer/2022103123/56649d695503460f94a475db/html5/thumbnails/37.jpg)
Support CalculationA simple intersection operation between the
occurrence index of the forming sequencesWhen a new occurrence index is formed,
calculation is trivial
![Page 38: Presented by Yaron Gonen. Outline Introduction Problems definition and motivation Previous work The CAMLS Algorithm Overview Main contributions Results](https://reader038.vdocuments.us/reader038/viewer/2022103123/56649d695503460f94a475db/html5/thumbnails/38.jpg)
The maxGap ConstraintmaxGap is a special kind of constraint:
Data dependantApriori property not applicable
The occurrence index enables fast maxGap checkA frequent sequence that does not satisfy maxGap is
flagged as non-generator. Example: Assume <ab> is frequent but gap between a and b > maxgap But frequent sequences <ac> and <ab> and in <acb> all
maxgap constraints are ok! So <ab> is a non-Generator but kept in order not to prune
<acb>
![Page 39: Presented by Yaron Gonen. Outline Introduction Problems definition and motivation Previous work The CAMLS Algorithm Overview Main contributions Results](https://reader038.vdocuments.us/reader038/viewer/2022103123/56649d695503460f94a475db/html5/thumbnails/39.jpg)
Sequence-Wise Exampleevent
eid
sid
(acd)
0 1
(bcd)
5 1
b 10 1
a 0 2
c 4 2
(bd) 8 2
(cde)
0 3
e 7 3
(acd)
11 3
Original DB
Event-wise
minSup=2maxGap=5
g <(a)> : 3
g <(b)> : 2
g <(c)> : 3
g <(d)> : 3
g <(ac)> : 2
g <(ad)> : 2
g <(bd)> : 2
g <(cd)> : 2
g <(acd)> : 2
Candidate generation
<aa>
<ab>
…
<a(acd)>
<ba>
<bb>
…
<(acd) (acd)>
<aa> is added to pruned list.<a(ac)> is a super-sequence of <aa>, therefore it is pruned.<ab> does not pass maxGap, therefore it is not a generator.
<ab> : 2
g <ac> : 2
<ad> : 2
<a(bd)> : 2
g <cb> : 2
g <cd> : 2
g <c(bd)> : 2
<dc> : 2
<dd> : 2
<d(cd)> : 2
<acb>
<acd>
…
<acb>:2
No more candidates!
![Page 40: Presented by Yaron Gonen. Outline Introduction Problems definition and motivation Previous work The CAMLS Algorithm Overview Main contributions Results](https://reader038.vdocuments.us/reader038/viewer/2022103123/56649d695503460f94a475db/html5/thumbnails/40.jpg)
Evaluation (1):Machine Anti Aging
How can Sequence Mining Help?Data collected from machine is a sequenceDiscover typical behavior leading to failureMonitor machine and alert before failureDomain:
Light intensity for wavelengths (continuous)Pre-process
DiscretizationMeta features (maxDisc, maxWL, isBurned)
Synm stands for a synthetic database simulating the machine behavior with m meta-features
![Page 41: Presented by Yaron Gonen. Outline Introduction Problems definition and motivation Previous work The CAMLS Algorithm Overview Main contributions Results](https://reader038.vdocuments.us/reader038/viewer/2022103123/56649d695503460f94a475db/html5/thumbnails/41.jpg)
Evaluation (2)Real Stocks data values
Rn stands for stock data (10 different stocks) for n days
![Page 42: Presented by Yaron Gonen. Outline Introduction Problems definition and motivation Previous work The CAMLS Algorithm Overview Main contributions Results](https://reader038.vdocuments.us/reader038/viewer/2022103123/56649d695503460f94a475db/html5/thumbnails/42.jpg)
CAMLS Compared with PrefixSpan
![Page 43: Presented by Yaron Gonen. Outline Introduction Problems definition and motivation Previous work The CAMLS Algorithm Overview Main contributions Results](https://reader038.vdocuments.us/reader038/viewer/2022103123/56649d695503460f94a475db/html5/thumbnails/43.jpg)
CAMLS Compared with Spade and PrefixSpan
![Page 44: Presented by Yaron Gonen. Outline Introduction Problems definition and motivation Previous work The CAMLS Algorithm Overview Main contributions Results](https://reader038.vdocuments.us/reader038/viewer/2022103123/56649d695503460f94a475db/html5/thumbnails/44.jpg)
So, What’s CAMLS Contribution?Constraints distinction: easy implementationTwo phasesHandling on the MaxGap constraintOccurrence index data structureFast new pruning method
![Page 45: Presented by Yaron Gonen. Outline Introduction Problems definition and motivation Previous work The CAMLS Algorithm Overview Main contributions Results](https://reader038.vdocuments.us/reader038/viewer/2022103123/56649d695503460f94a475db/html5/thumbnails/45.jpg)
Future ResearchMain issue: closed sequencesMore constraints (aspiring regexp)
![Page 46: Presented by Yaron Gonen. Outline Introduction Problems definition and motivation Previous work The CAMLS Algorithm Overview Main contributions Results](https://reader038.vdocuments.us/reader038/viewer/2022103123/56649d695503460f94a475db/html5/thumbnails/46.jpg)
Thank You!