![Page 1: Data Mining, Data Warehousing and Knowledge Discovery Basic Algorithms and Concepts](https://reader036.vdocuments.us/reader036/viewer/2022062410/5681634a550346895dd3daea/html5/thumbnails/1.jpg)
Data Mining, Data Warehousing and Knowledge Discovery
Basic Algorithms and Concepts
Srinath Srinivasa, IIIT
![Page 2: Data Mining, Data Warehousing and Knowledge Discovery Basic Algorithms and Concepts](https://reader036.vdocuments.us/reader036/viewer/2022062410/5681634a550346895dd3daea/html5/thumbnails/2.jpg)
Overview
• Why Data Mining?
• Data Mining concepts
• Data Mining algorithms
  – Tabular data mining
  – Association, Classification and Clustering
  – Sequence data mining
  – Streaming data mining
• Data Warehousing concepts
![Page 3: Data Mining, Data Warehousing and Knowledge Discovery Basic Algorithms and Concepts](https://reader036.vdocuments.us/reader036/viewer/2022062410/5681634a550346895dd3daea/html5/thumbnails/3.jpg)
Why Data Mining
From a managerial perspective:
• Strategic decision making
• Wealth generation
• Analyzing trends
• Security
![Page 4: Data Mining, Data Warehousing and Knowledge Discovery Basic Algorithms and Concepts](https://reader036.vdocuments.us/reader036/viewer/2022062410/5681634a550346895dd3daea/html5/thumbnails/4.jpg)
Data Mining
• Look for hidden patterns and trends in data that are not immediately apparent from summarizing the data
• No query…
• …but an “interestingness criterion”
![Page 5: Data Mining, Data Warehousing and Knowledge Discovery Basic Algorithms and Concepts](https://reader036.vdocuments.us/reader036/viewer/2022062410/5681634a550346895dd3daea/html5/thumbnails/5.jpg)
Data Mining
Data + Interestingness criteria = Hidden patterns
![Page 6: Data Mining, Data Warehousing and Knowledge Discovery Basic Algorithms and Concepts](https://reader036.vdocuments.us/reader036/viewer/2022062410/5681634a550346895dd3daea/html5/thumbnails/6.jpg)
Data Mining
Type of data + Type of interestingness criteria = Type of patterns
![Page 8: Data Mining, Data Warehousing and Knowledge Discovery Basic Algorithms and Concepts](https://reader036.vdocuments.us/reader036/viewer/2022062410/5681634a550346895dd3daea/html5/thumbnails/8.jpg)
Type of Data
• Tabular (Ex: Transaction data)
  – Relational
  – Multi-dimensional
• Spatial (Ex: Remote sensing data)
• Temporal (Ex: Log information)
  – Streaming (Ex: multimedia, network traffic)
  – Spatio-temporal (Ex: GIS)
• Tree (Ex: XML data)
• Graphs (Ex: WWW, BioMolecular data)
• Sequence (Ex: DNA, activity logs)
• Text, Multimedia …
![Page 9: Data Mining, Data Warehousing and Knowledge Discovery Basic Algorithms and Concepts](https://reader036.vdocuments.us/reader036/viewer/2022062410/5681634a550346895dd3daea/html5/thumbnails/9.jpg)
Type of Interestingness
• Frequency
• Rarity
• Correlation
• Length of occurrence (for sequence and temporal data)
• Consistency
• Repeating / periodicity
• “Abnormal” behavior
• Other patterns of interestingness…
![Page 10: Data Mining, Data Warehousing and Knowledge Discovery Basic Algorithms and Concepts](https://reader036.vdocuments.us/reader036/viewer/2022062410/5681634a550346895dd3daea/html5/thumbnails/10.jpg)
Data Mining vs Statistical Inference
Statistics:
Conceptual Model (Hypothesis) → Statistical Reasoning → “Proof” (Validation of Hypothesis)
![Page 11: Data Mining, Data Warehousing and Knowledge Discovery Basic Algorithms and Concepts](https://reader036.vdocuments.us/reader036/viewer/2022062410/5681634a550346895dd3daea/html5/thumbnails/11.jpg)
Data Mining vs Statistical Inference
Data mining:
Data → Mining Algorithm (based on interestingness) → Pattern (model, rule, hypothesis) discovery
![Page 12: Data Mining, Data Warehousing and Knowledge Discovery Basic Algorithms and Concepts](https://reader036.vdocuments.us/reader036/viewer/2022062410/5681634a550346895dd3daea/html5/thumbnails/12.jpg)
Data Mining Concepts
Associations and Item-sets:
An association is a rule of the form: if X then Y. It is denoted as X → Y.
Example: If India wins in cricket, sales of sweets go up.
If both X → Y and Y → X hold, then X and Y are called an “interesting item-set”.
Example: People buying school uniforms in June also buy school bags (and people buying school bags in June also buy school uniforms).
![Page 13: Data Mining, Data Warehousing and Knowledge Discovery Basic Algorithms and Concepts](https://reader036.vdocuments.us/reader036/viewer/2022062410/5681634a550346895dd3daea/html5/thumbnails/13.jpg)
Data Mining Concepts
Support and Confidence:
The support for a rule R is the ratio of the number of occurrences of R to the total number of occurrences of all rules.
The confidence of a rule X → Y is the ratio of the number of occurrences of Y given X to the number of all occurrences given X.
![Page 14: Data Mining, Data Warehousing and Knowledge Discovery Basic Algorithms and Concepts](https://reader036.vdocuments.us/reader036/viewer/2022062410/5681634a550346895dd3daea/html5/thumbnails/14.jpg)
Data Mining Concepts
Support and Confidence:
[Table: 10 sample transactions over the items Bag, Books, Uniform, Crayons and Pencil]
Support for {Bag, Uniform} = 5/10 = 0.5
Confidence for Bag → Uniform = 5/8 = 0.625
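The two measures can be sketched in a few lines of Python. The transaction list below is hypothetical (it is not the slide's table), and the function names `support` and `confidence` are chosen here for illustration only.

```python
# Hypothetical transactions, each a set of items bought together.
TRANSACTIONS = [
    {"Bag", "Uniform", "Crayons"},
    {"Bag", "Uniform"},
    {"Bag", "Books"},
    {"Uniform", "Pencil"},
    {"Bag", "Uniform", "Pencil"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(lhs, rhs, transactions):
    """Of the transactions containing `lhs`, the fraction that also contain `rhs`."""
    lhs, rhs = set(lhs), set(rhs)
    with_lhs = sum(1 for t in transactions if lhs <= t)
    with_both = sum(1 for t in transactions if (lhs | rhs) <= t)
    return with_both / with_lhs

print(support({"Bag", "Uniform"}, TRANSACTIONS))       # 3 of 5 transactions
print(confidence({"Bag"}, {"Uniform"}, TRANSACTIONS))  # 3 of the 4 Bag transactions
```

Note that confidence is directional: swapping `lhs` and `rhs` changes the denominator.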
![Page 15: Data Mining, Data Warehousing and Knowledge Discovery Basic Algorithms and Concepts](https://reader036.vdocuments.us/reader036/viewer/2022062410/5681634a550346895dd3daea/html5/thumbnails/15.jpg)
Mining for Frequent Item-sets
The Apriori Algorithm:
Given minimum required support s as interestingness criterion:
1. Search for all individual elements (1-element item-sets) that have a minimum support of s
2. Repeat
   1. From the results of the previous search for i-element item-sets, search for all (i+1)-element item-sets that have a minimum support of s
   2. This becomes the set of all frequent (i+1)-element item-sets that are interesting
3. Until the item-set size reaches its maximum
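The level-wise search above can be sketched as follows. This is a minimal illustration, not an optimized implementation; the sample transactions are hypothetical and the function name `apriori` is an assumption. The pruning step (every smaller subset of a candidate must already be frequent) is the standard Apriori refinement and is not spelled out on the slide.

```python
from itertools import combinations

def apriori(transactions, minsup):
    """Return {frozenset: support} for every item-set with support >= minsup."""
    n = len(transactions)

    def sup(items):
        return sum(1 for t in transactions if items <= t) / n

    # Step 1: frequent 1-element item-sets.
    items = sorted({i for t in transactions for i in t})
    level = [frozenset([i]) for i in items if sup(frozenset([i])) >= minsup]

    frequent = {}
    while level:                      # Step 2: grow item-sets one element at a time
        for s in level:
            frequent[s] = sup(s)
        # Join frequent i-element sets into (i+1)-element candidates.
        candidates = {a | b for a in level for b in level
                      if len(a | b) == len(a) + 1}
        # Keep a candidate only if all its i-element subsets are frequent
        # and the candidate itself meets the minimum support.
        level = [c for c in candidates
                 if all(frozenset(sub) in frequent
                        for sub in combinations(c, len(c) - 1))
                 and sup(c) >= minsup]
    return frequent

# Hypothetical transactions for illustration.
sample = [
    {"Bag", "Uniform", "Crayons"},
    {"Bag", "Uniform"},
    {"Bag", "Books"},
    {"Uniform", "Pencil"},
    {"Bag", "Uniform", "Pencil"},
]
for itemset, s in apriori(sample, 0.4).items():
    print(sorted(itemset), s)
```

The loop stops as soon as a level produces no frequent item-sets, which happens no later than the largest transaction size.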
![Page 16: Data Mining, Data Warehousing and Knowledge Discovery Basic Algorithms and Concepts](https://reader036.vdocuments.us/reader036/viewer/2022062410/5681634a550346895dd3daea/html5/thumbnails/16.jpg)
Mining for Frequent Item-sets
The Apriori Algorithm: (Example)
Let minimum support = 0.3
Interesting 1-element item-sets: {Bag}, {Uniform}, {Crayons}, {Pencil}, {Books}
Interesting 2-element item-sets: {Bag,Uniform}, {Bag,Crayons}, {Bag,Pencil}, {Bag,Books}, {Uniform,Crayons}, {Uniform,Pencil}, {Pencil,Books}
![Page 17: Data Mining, Data Warehousing and Knowledge Discovery Basic Algorithms and Concepts](https://reader036.vdocuments.us/reader036/viewer/2022062410/5681634a550346895dd3daea/html5/thumbnails/17.jpg)
Mining for Frequent Item-sets
The Apriori Algorithm: (Example)
Let minimum support = 0.3
Interesting 3-element item-sets: {Bag,Uniform,Crayons}
![Page 18: Data Mining, Data Warehousing and Knowledge Discovery Basic Algorithms and Concepts](https://reader036.vdocuments.us/reader036/viewer/2022062410/5681634a550346895dd3daea/html5/thumbnails/18.jpg)
Mining for Association Rules
Association rules are of the form A → B, and are directional.
Association rule mining requires two thresholds: minsup and minconf
![Page 19: Data Mining, Data Warehousing and Knowledge Discovery Basic Algorithms and Concepts](https://reader036.vdocuments.us/reader036/viewer/2022062410/5681634a550346895dd3daea/html5/thumbnails/19.jpg)
Mining for Association Rules
Mining association rules using apriori
General Procedure:
1. Use apriori to generate frequent itemsets of different sizes
2. At each iteration divide each frequent itemset X into two parts LHS and RHS. This represents a rule of the form LHS → RHS
3. The confidence of such a rule is support(X)/support(LHS)
4. Discard all rules whose confidence is less than minconf.
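Steps 2 to 4 of the procedure can be sketched as below for a single frequent itemset. The transactions are hypothetical, and `rules_from_itemset` is a name chosen here for illustration; the sketch assumes the itemset occurs in at least one transaction, so no division by zero arises.

```python
from itertools import combinations

def rules_from_itemset(itemset, transactions, minconf):
    """Split `itemset` into every LHS/RHS pair and keep rules with confidence >= minconf."""
    def count(items):
        return sum(1 for t in transactions if items <= t)

    itemset = frozenset(itemset)
    whole = count(itemset)                      # occurrences of X
    rules = []
    for size in range(1, len(itemset)):         # every non-empty proper subset as LHS
        for lhs in combinations(sorted(itemset), size):
            lhs = frozenset(lhs)
            conf = whole / count(lhs)           # support(X) / support(LHS)
            if conf >= minconf:
                rules.append((lhs, itemset - lhs, conf))
    return rules

# Hypothetical transactions for illustration.
baskets = [
    {"Bag", "Uniform", "Crayons"},
    {"Bag", "Uniform"},
    {"Bag", "Books"},
    {"Uniform", "Pencil"},
    {"Bag", "Uniform", "Pencil"},
]
for lhs, rhs, conf in rules_from_itemset({"Uniform", "Pencil"}, baskets, 0.7):
    print(sorted(lhs), "->", sorted(rhs), conf)
```

Because the ratio of supports equals the ratio of raw counts, the sketch works with counts and avoids dividing twice by the number of transactions.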
![Page 20: Data Mining, Data Warehousing and Knowledge Discovery Basic Algorithms and Concepts](https://reader036.vdocuments.us/reader036/viewer/2022062410/5681634a550346895dd3daea/html5/thumbnails/20.jpg)
Mining for Association Rules
Mining association rules using apriori
Example:
The frequent itemset {Bag, Uniform, Crayons} has a support of 0.3.
This can be divided into the following rules:
{Bag} → {Uniform, Crayons}
{Bag, Uniform} → {Crayons}
{Bag, Crayons} → {Uniform}
{Uniform} → {Bag, Crayons}
{Uniform, Crayons} → {Bag}
{Crayons} → {Bag, Uniform}
![Page 21: Data Mining, Data Warehousing and Knowledge Discovery Basic Algorithms and Concepts](https://reader036.vdocuments.us/reader036/viewer/2022062410/5681634a550346895dd3daea/html5/thumbnails/21.jpg)
Mining for Association Rules
Mining association rules using apriori
The confidences for these rules are as follows:
{Bag} → {Uniform, Crayons}: 0.375
{Bag, Uniform} → {Crayons}: 0.6
{Bag, Crayons} → {Uniform}: 0.75
{Uniform} → {Bag, Crayons}: 0.428
{Uniform, Crayons} → {Bag}: 0.75
{Crayons} → {Bag, Uniform}: 0.75
If minconf is 0.7, then we have discovered the following rules…
![Page 22: Data Mining, Data Warehousing and Knowledge Discovery Basic Algorithms and Concepts](https://reader036.vdocuments.us/reader036/viewer/2022062410/5681634a550346895dd3daea/html5/thumbnails/22.jpg)
Mining for Association Rules
Mining association rules using apriori
People who buy a school bag and a set of crayons are likely to buy a school uniform.
People who buy a school uniform and a set of crayons are likely to buy a school bag.
People who buy just a set of crayons are likely to buy a school bag and a school uniform as well.
![Page 23: Data Mining, Data Warehousing and Knowledge Discovery Basic Algorithms and Concepts](https://reader036.vdocuments.us/reader036/viewer/2022062410/5681634a550346895dd3daea/html5/thumbnails/23.jpg)
Generalized Association Rules
Since customers can buy any number of items in one transaction, the transaction relation would be in the form of a list of individual purchases.

| Bill No. | Date | Item |
|---|---|---|
| 15563 | 23.10.2003 | Books |
| 15563 | 23.10.2003 | Crayons |
| 15564 | 23.10.2003 | Uniform |
| 15564 | 23.10.2003 | Crayons |
![Page 24: Data Mining, Data Warehousing and Knowledge Discovery Basic Algorithms and Concepts](https://reader036.vdocuments.us/reader036/viewer/2022062410/5681634a550346895dd3daea/html5/thumbnails/24.jpg)
Generalized Association Rules
A transaction for the purposes of data mining is obtained by performing a GROUP BY of the table over various fields.
![Page 25: Data Mining, Data Warehousing and Knowledge Discovery Basic Algorithms and Concepts](https://reader036.vdocuments.us/reader036/viewer/2022062410/5681634a550346895dd3daea/html5/thumbnails/25.jpg)
Generalized Association Rules
A GROUP BY over Bill No. would show frequent buying patterns across different customers. A GROUP BY over Date would show frequent buying patterns across different days.
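As a rough illustration of the GROUP BY step, the sketch below turns the purchase list into transactions in plain Python. The rows are the four shown in the table; the function name `group_transactions` and the tuple layout (bill number, date, item) are assumptions made for this example.

```python
from collections import defaultdict

# Rows of the purchase relation: (bill_no, date, item).
ROWS = [
    (15563, "23.10.2003", "Books"),
    (15563, "23.10.2003", "Crayons"),
    (15564, "23.10.2003", "Uniform"),
    (15564, "23.10.2003", "Crayons"),
]

def group_transactions(rows, key_index):
    """Group items (column 2) by the chosen key column, one item-set per key value."""
    groups = defaultdict(set)
    for row in rows:
        groups[row[key_index]].add(row[2])
    return dict(groups)

print(group_transactions(ROWS, 0))  # one transaction per bill (per customer visit)
print(group_transactions(ROWS, 1))  # one transaction per day
```

Grouping by bill number yields two transactions here, while grouping by date merges all four purchases into one.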
![Page 26: Data Mining, Data Warehousing and Knowledge Discovery Basic Algorithms and Concepts](https://reader036.vdocuments.us/reader036/viewer/2022062410/5681634a550346895dd3daea/html5/thumbnails/26.jpg)
Classification and Clustering
Given a set of data elements:
Classification maps each data element to one of a set of pre-determined classes based on the difference among data elements belonging to different classes
Clustering groups data elements into different groups based on the similarity between elements within a single group
![Page 27: Data Mining, Data Warehousing and Knowledge Discovery Basic Algorithms and Concepts](https://reader036.vdocuments.us/reader036/viewer/2022062410/5681634a550346895dd3daea/html5/thumbnails/27.jpg)
Classification Techniques
Decision Tree Identification

| Outlook | Temp | Play? |
|---|---|---|
| Sunny | 30 | Yes |
| Overcast | 15 | No |
| Sunny | 16 | Yes |
| Cloudy | 27 | Yes |
| Overcast | 25 | Yes |
| Overcast | 17 | No |
| Cloudy | 17 | No |
| Cloudy | 35 | Yes |

Classification problem: Weather → Play (Yes, No)
![Page 28: Data Mining, Data Warehousing and Knowledge Discovery Basic Algorithms and Concepts](https://reader036.vdocuments.us/reader036/viewer/2022062410/5681634a550346895dd3daea/html5/thumbnails/28.jpg)
Classification Techniques
Hunt’s method for decision tree identification:
Given N element types and m decision classes:
1. For i ← 1 to N do
   1. Add element i to the (i-1)-element item-sets from the previous iteration
   2. Identify the set of decision classes for each item-set
   3. If an item-set has only one decision class, then that item-set is done; remove that item-set from subsequent iterations
2. done
![Page 29: Data Mining, Data Warehousing and Knowledge Discovery Basic Algorithms and Concepts](https://reader036.vdocuments.us/reader036/viewer/2022062410/5681634a550346895dd3daea/html5/thumbnails/29.jpg)
Classification Techniques
Decision Tree Identification Example

| Outlook | Temp | Play? |
|---|---|---|
| Sunny | Warm | Yes |
| Overcast | Chilly | No |
| Sunny | Chilly | Yes |
| Cloudy | Pleasant | Yes |
| Overcast | Pleasant | Yes |
| Overcast | Chilly | No |
| Cloudy | Chilly | No |
| Cloudy | Warm | Yes |

Sunny → Yes
Cloudy → Yes/No
Overcast → Yes/No
![Page 31: Data Mining, Data Warehousing and Knowledge Discovery Basic Algorithms and Concepts](https://reader036.vdocuments.us/reader036/viewer/2022062410/5681634a550346895dd3daea/html5/thumbnails/31.jpg)
Classification Techniques
Decision Tree Identification Example
Cloudy, Warm → Yes
Cloudy, Chilly → No
Cloudy, Pleasant → Yes
![Page 32: Data Mining, Data Warehousing and Knowledge Discovery Basic Algorithms and Concepts](https://reader036.vdocuments.us/reader036/viewer/2022062410/5681634a550346895dd3daea/html5/thumbnails/32.jpg)
Classification Techniques
Decision Tree Identification Example
Overcast, Warm → (no matching rows)
Overcast, Chilly → No
Overcast, Pleasant → Yes
![Page 33: Data Mining, Data Warehousing and Knowledge Discovery Basic Algorithms and Concepts](https://reader036.vdocuments.us/reader036/viewer/2022062410/5681634a550346895dd3daea/html5/thumbnails/33.jpg)
Classification Techniques
Decision Tree Identification Example
Resulting decision tree:
Root: Yes/No
• Sunny: Yes
• Cloudy: Yes/No
  – Warm: Yes
  – Chilly: No
  – Pleasant: Yes
• Overcast: Yes/No
  – Chilly: No
  – Pleasant: Yes
![Page 34: Data Mining, Data Warehousing and Knowledge Discovery Basic Algorithms and Concepts](https://reader036.vdocuments.us/reader036/viewer/2022062410/5681634a550346895dd3daea/html5/thumbnails/34.jpg)
Classification Techniques
Decision Tree Identification Example
• Top down technique for decision tree identification
• Decision tree created is sensitive to the order in which items are considered
• If an N-item-set does not result in a clear decision, classification classes have to be modeled by rough sets.
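One reading of the top-down procedure, sketched in Python on the Outlook/Temp table from the example. The function name `hunt` and the row layout (attributes first, decision last) are assumptions for illustration; item-sets that remain undecided after all attributes are used are simply returned, not modeled by rough sets.

```python
# Rows from the example table: (Outlook, Temp, Play?).
WEATHER = [
    ("Sunny", "Warm", "Yes"), ("Overcast", "Chilly", "No"),
    ("Sunny", "Chilly", "Yes"), ("Cloudy", "Pleasant", "Yes"),
    ("Overcast", "Pleasant", "Yes"), ("Overcast", "Chilly", "No"),
    ("Cloudy", "Chilly", "No"), ("Cloudy", "Warm", "Yes"),
]

def hunt(rows, n_attrs):
    """Return (decided, undecided): item-set -> class, and item-sets left ambiguous."""
    decided = {}
    active = [()]                       # undecided item-sets; start with the empty one
    for i in range(n_attrs):            # consider attributes in column order
        next_active = []
        for prefix in active:
            matching = [r for r in rows if r[:i] == prefix]
            for value in sorted({r[i] for r in matching}):
                itemset = prefix + (value,)
                classes = {r[-1] for r in matching if r[i] == value}
                if len(classes) == 1:   # one decision class: this branch is done
                    decided[itemset] = classes.pop()
                else:                   # still ambiguous: refine in the next iteration
                    next_active.append(itemset)
        active = next_active
    return decided, active

decided, undecided = hunt(WEATHER, 2)
for itemset, label in decided.items():
    print(itemset, "->", label)
```

Swapping the attribute columns changes the tree that results, which is the order sensitivity noted above.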
![Page 35: Data Mining, Data Warehousing and Knowledge Discovery Basic Algorithms and Concepts](https://reader036.vdocuments.us/reader036/viewer/2022062410/5681634a550346895dd3daea/html5/thumbnails/35.jpg)
Other Classification Algorithms
Quinlan’s depth-first strategy builds the decision tree in a depth-first fashion, by considering all possible tests that give a decision and selecting the test that gives the best information gain. It hence eliminates tests that are inconclusive.
SLIQ (Supervised Learning in Quest), developed in the QUEST project of IBM, uses a top-down breadth-first strategy to build a decision tree. At each level in the tree, an entropy value is calculated for each node, and the nodes having the lowest entropy values are selected and expanded.
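Both strategies rest on entropy-based measures. The sketch below computes Shannon entropy and the information gain of a single attribute test on the Outlook/Temp example table; it illustrates the measure only and is not an implementation of Quinlan's algorithm or of SLIQ. The function names are assumptions.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, attr_index):
    """Entropy of the decisions minus the weighted entropy after splitting on one attribute."""
    labels = [r[-1] for r in rows]
    remainder = 0.0
    for value in {r[attr_index] for r in rows}:
        subset = [r[-1] for r in rows if r[attr_index] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy(labels) - remainder

# Rows from the example table: (Outlook, Temp, Play?).
TABLE = [
    ("Sunny", "Warm", "Yes"), ("Overcast", "Chilly", "No"),
    ("Sunny", "Chilly", "Yes"), ("Cloudy", "Pleasant", "Yes"),
    ("Overcast", "Pleasant", "Yes"), ("Overcast", "Chilly", "No"),
    ("Cloudy", "Chilly", "No"), ("Cloudy", "Warm", "Yes"),
]
print("gain(Outlook) =", information_gain(TABLE, 0))
print("gain(Temp)    =", information_gain(TABLE, 1))
```

On this table a test on Temp yields the larger gain, so a gain-driven builder would test Temp first.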
![Page 36: Data Mining, Data Warehousing and Knowledge Discovery Basic Algorithms and Concepts](https://reader036.vdocuments.us/reader036/viewer/2022062410/5681634a550346895dd3daea/html5/thumbnails/36.jpg)
Clustering Techniques
Clustering partitions the data set into clusters or equivalence classes.
Similarity among members of a class is greater than similarity among members across classes.
Similarity measures: Euclidean distance or other application-specific measures.
![Page 37: Data Mining, Data Warehousing and Knowledge Discovery Basic Algorithms and Concepts](https://reader036.vdocuments.us/reader036/viewer/2022062410/5681634a550346895dd3daea/html5/thumbnails/37.jpg)
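Euclidean Distance for Tables
[Figure: a three-dimensional grid with axes Temp (Warm, Pleasant, Chilly), Outlook (Sunny, Cloudy, Overcast) and decision (Play, Don’t Play); example points (Cloudy, Pleasant, Play) and (Overcast, Chilly, Don’t Play)]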
![Page 38: Data Mining, Data Warehousing and Knowledge Discovery Basic Algorithms and Concepts](https://reader036.vdocuments.us/reader036/viewer/2022062410/5681634a550346895dd3daea/html5/thumbnails/38.jpg)
Clustering Techniques
General Strategy:
1. Draw a graph connecting items which are close to one another with edges.
2. Partition the graph into maximally connected subcomponents.
   1. Construct an MST for the graph
   2. Merge items that are connected by the minimum weight of the MST into a cluster
![Page 39: Data Mining, Data Warehousing and Knowledge Discovery Basic Algorithms and Concepts](https://reader036.vdocuments.us/reader036/viewer/2022062410/5681634a550346895dd3daea/html5/thumbnails/39.jpg)
Clustering Techniques
Clustering types:
Hierarchical clustering: Clusters are formed at different levels by merging clusters at a lower level
Partitional clustering: Clusters are formed at only one level
![Page 40: Data Mining, Data Warehousing and Knowledge Discovery Basic Algorithms and Concepts](https://reader036.vdocuments.us/reader036/viewer/2022062410/5681634a550346895dd3daea/html5/thumbnails/40.jpg)
Clustering Techniques
Nearest Neighbour Clustering Algorithm:
Given n elements x1, x2, … xn, and threshold t:
1. j ← 1, k ← 1, Clusters = {}
2. Repeat
   1. Find the nearest neighbour of xj
   2. Let the nearest neighbour be in cluster m
   3. If distance to nearest neighbour > t, then create a new cluster and k ← k+1; else assign xj to cluster m
   4. j ← j+1
3. until j > n
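A minimal sketch of the loop above, assuming points are numeric tuples compared by Euclidean distance and that the nearest neighbour is searched among the elements already assigned to a cluster (the first element therefore starts the first cluster). The function name is hypothetical.

```python
from math import dist

def nearest_neighbour_clustering(points, t):
    """Assign each point to its nearest neighbour's cluster, or open a new cluster."""
    clusters = []                               # each cluster is a list of points
    for x in points:
        best_cluster, best_d = None, float("inf")
        for cluster in clusters:                # nearest already-clustered element
            for y in cluster:
                d = dist(x, y)
                if d < best_d:
                    best_cluster, best_d = cluster, d
        if best_cluster is None or best_d > t:  # too far from everything: new cluster
            clusters.append([x])
        else:
            best_cluster.append(x)
    return clusters

print(nearest_neighbour_clustering([(0, 0), (0, 1), (5, 5), (5, 6), (0, 2)], 1.5))
```

The result depends on the order in which the elements arrive, since each point only sees the points processed before it.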
![Page 41: Data Mining, Data Warehousing and Knowledge Discovery Basic Algorithms and Concepts](https://reader036.vdocuments.us/reader036/viewer/2022062410/5681634a550346895dd3daea/html5/thumbnails/41.jpg)
Clustering Techniques
Iterative partitional clustering:
Given n elements x1, x2, … xn, and k clusters, each with a center.
1. Assign each element to its closest cluster center
2. After all assignments have been made, compute the cluster centroids for each of the clusters
3. Repeat the above two steps with the new centroids until the algorithm converges
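The three steps correspond to what is commonly called k-means. A small sketch follows, assuming numeric tuples, Euclidean distance, caller-supplied initial centers, and convergence defined as "the centers stop changing"; the `max_iter` cap and the rule of keeping an empty cluster's old center are choices made here, not part of the slide.

```python
from math import dist

def kmeans(points, centers, max_iter=100):
    """Alternate assignment and centroid update until the centers no longer move."""
    clusters = [[] for _ in centers]
    for _ in range(max_iter):
        # Step 1: assign every point to its closest center.
        clusters = [[] for _ in centers]
        for p in points:
            idx = min(range(len(centers)), key=lambda i: dist(p, centers[i]))
            clusters[idx].append(p)
        # Step 2: recompute each centroid (an empty cluster keeps its old center).
        new_centers = [
            tuple(sum(coord) / len(cl) for coord in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
        # Step 3: stop once the centers are unchanged.
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters

centers, clusters = kmeans([(0, 0), (0, 2), (10, 0), (10, 2)], [(0, 0), (10, 0)])
print(centers)
print(clusters)
```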
![Page 42: Data Mining, Data Warehousing and Knowledge Discovery Basic Algorithms and Concepts](https://reader036.vdocuments.us/reader036/viewer/2022062410/5681634a550346895dd3daea/html5/thumbnails/42.jpg)
Mining Sequence Data
Characteristics of Sequence Data:
• Collection of data elements which are ordered sequences
• In a sequence, each item has an index associated with it
• A k-sequence is a sequence of length k. Support for a j-sequence is the number of m-sequences (m ≥ j) which contain it as a subsequence
• Sequence data: transaction logs, DNA sequences, patient ailment history, …
![Page 43: Data Mining, Data Warehousing and Knowledge Discovery Basic Algorithms and Concepts](https://reader036.vdocuments.us/reader036/viewer/2022062410/5681634a550346895dd3daea/html5/thumbnails/43.jpg)
Mining Sequence Data
Some Definitions:
• A sequence is a list of itemsets of finite length.
• Example:
  – {pen,pencil,ink}{pencil,ink}{ink,eraser}{ruler,pencil}
  – … the purchases of a single customer over time…
• The order of items within an itemset does not matter, but the order of itemsets matters
• A subsequence is a sequence with some itemsets deleted
![Page 44: Data Mining, Data Warehousing and Knowledge Discovery Basic Algorithms and Concepts](https://reader036.vdocuments.us/reader036/viewer/2022062410/5681634a550346895dd3daea/html5/thumbnails/44.jpg)
Mining Sequence Data
Some Definitions:
• A sequence S’ = {a1, a2, …, am} is said to be contained within another sequence S, if S contains a subsequence {b1, b2, …, bm} such that a1 ⊆ b1, a2 ⊆ b2, …, am ⊆ bm.
• Hence, {pen}{pencil}{ruler,pencil} is contained in {pen,pencil,ink}{pencil,ink}{ink,eraser}{ruler,pencil}
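The containment test can be sketched with a single left-to-right scan, representing a sequence as a list of Python sets. Matching each itemset of the smaller sequence at the earliest possible position is sufficient here; the function name `contains` is an assumption.

```python
def contains(seq, sub):
    """True if `sub` is contained in `seq`: its itemsets match, in order,
    itemsets of `seq` that are supersets of them."""
    i = 0                                   # index of the next itemset of `sub` to match
    for itemset in seq:
        if i < len(sub) and sub[i] <= itemset:
            i += 1
    return i == len(sub)

S = [{"pen", "pencil", "ink"}, {"pencil", "ink"}, {"ink", "eraser"}, {"ruler", "pencil"}]
print(contains(S, [{"pen"}, {"pencil"}, {"ruler", "pencil"}]))  # the slide's example
print(contains(S, [{"eraser"}, {"pen"}]))                       # wrong order
```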
![Page 45: Data Mining, Data Warehousing and Knowledge Discovery Basic Algorithms and Concepts](https://reader036.vdocuments.us/reader036/viewer/2022062410/5681634a550346895dd3daea/html5/thumbnails/45.jpg)
Mining Sequence Data
Apriori Algorithm for Sequences:
1. L1 ← Set of all interesting 1-sequences
2. k ← 1
3. while Lk is not empty do
   1. Generate all candidate (k+1)-sequences
   2. Lk+1 ← Set of all interesting (k+1)-sequences
   3. k ← k+1
4. done
![Page 46: Data Mining, Data Warehousing and Knowledge Discovery Basic Algorithms and Concepts](https://reader036.vdocuments.us/reader036/viewer/2022062410/5681634a550346895dd3daea/html5/thumbnails/46.jpg)
Mining Sequence Data
Generating Candidate Sequences:
Given L1, L2, … Lk, candidate sequences of Lk+1 are generated as follows:
For each sequence s in Lk, concatenate s with all new 1-sequences found while generating Lk-1
![Page 47: Data Mining, Data Warehousing and Knowledge Discovery Basic Algorithms and Concepts](https://reader036.vdocuments.us/reader036/viewer/2022062410/5681634a550346895dd3daea/html5/thumbnails/47.jpg)
Mining Sequence Data
Example: minsup = 0.5
Input sequences:
a b c d e
b d a e
a e b d
b e
e a b d a
a a a a
b a a a
c b d b
a b b a b
a b d e
Interesting 1-sequences: a, b, d, e
Candidate 2-sequences:
aa, ab, ad, ae
ba, bb, bd, be
da, db, dd, de
ea, eb, ed, ee
![Page 48: Data Mining, Data Warehousing and Knowledge Discovery Basic Algorithms and Concepts](https://reader036.vdocuments.us/reader036/viewer/2022062410/5681634a550346895dd3daea/html5/thumbnails/48.jpg)
Mining Sequence Data
Example: minsup = 0.5
Interesting 2-sequences: ab, bd
Candidate 3-sequences:
aba, abb, abd, abe,
aab, bab, dab, eab,
bda, bdb, bdd, bde,
bbd, dbd, ebd.
Interesting 3-sequences = {}
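The example can be reproduced with a short sketch. It treats each input as a string of single items, counts a pattern as supported by a sequence when it occurs as a (not necessarily contiguous) subsequence, and builds candidates by appending or prepending an interesting 1-sequence, which matches the candidate lists in the example. The function names are assumptions.

```python
def is_subsequence(pattern, seq):
    """True if the items of `pattern` occur in `seq` in order, gaps allowed."""
    it = iter(seq)
    return all(ch in it for ch in pattern)   # `in` advances the iterator past each match

def frequent_sequences(data, minsup):
    """Return [L1, L2, ...]: the interesting k-sequences for each length k."""
    n = len(data)

    def sup(pattern):
        return sum(is_subsequence(pattern, s) for s in data) / n

    items = sorted({ch for s in data for ch in s})
    l1 = [i for i in items if sup(i) >= minsup]
    levels = [l1]
    while levels[-1]:
        # Extend every interesting k-sequence by one item at either end.
        candidates = ({s + i for s in levels[-1] for i in l1}
                      | {i + s for s in levels[-1] for i in l1})
        levels.append(sorted(c for c in candidates if sup(c) >= minsup))
    return levels[:-1]                        # drop the final empty level

DATA = ["abcde", "bdae", "aebd", "be", "eabda",
        "aaaa", "baaa", "cbdb", "abbab", "abde"]
print(frequent_sequences(DATA, 0.5))
```

On the ten input sequences of the example this yields the interesting 1-sequences a, b, d, e and the interesting 2-sequences ab and bd, with no interesting 3-sequences.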
![Page 49: Data Mining, Data Warehousing and Knowledge Discovery Basic Algorithms and Concepts](https://reader036.vdocuments.us/reader036/viewer/2022062410/5681634a550346895dd3daea/html5/thumbnails/49.jpg)
Mining Sequence Data
Language Inference:
Given a set of sequences, consider each sequence as the behavioural trace of a machine, and infer the machine that can display the given sequence as behavior.
[Figure: input set of sequences (aabb, ababcac, abbac, …) → output state machine]
![Page 50: Data Mining, Data Warehousing and Knowledge Discovery Basic Algorithms and Concepts](https://reader036.vdocuments.us/reader036/viewer/2022062410/5681634a550346895dd3daea/html5/thumbnails/50.jpg)
Mining Sequence Data
• Inferring the syntax of a language given its sentences
• Applications: discerning behavioural patterns, emergent properties discovery, collaboration modeling, …
• State machine discovery is the reverse of state machine construction
• Discovery is “maximalist” in nature…
![Page 51: Data Mining, Data Warehousing and Knowledge Discovery Basic Algorithms and Concepts](https://reader036.vdocuments.us/reader036/viewer/2022062410/5681634a550346895dd3daea/html5/thumbnails/51.jpg)
Mining Sequence Data
“Maximal” nature of language inference:
Input sequences: abc, aabc, aabbc, abbc
[Figure: the “most general” state machine (transitions labelled a, b, c) and the “most specific” state machine for these sequences]
![Page 52: Data Mining, Data Warehousing and Knowledge Discovery Basic Algorithms and Concepts](https://reader036.vdocuments.us/reader036/viewer/2022062410/5681634a550346895dd3daea/html5/thumbnails/52.jpg)
Mining Sequence Data
“Shortest-run Generalization” (Srinivasa and Spiliopoulou 2000)
Given a set of n sequences:
1. Create a state machine for the first sequence
2. for j ← 2 to n do
   1. Create a state machine for the jth sequence
   2. Merge this sequence into the earlier sequence as follows:
      1. Merge all halt states in the new state machine to the halt state in the existing state machine
      2. If two or more paths to the halt state share the same suffix, merge the suffixes together into a single path
3. Done
![Page 53: Data Mining, Data Warehousing and Knowledge Discovery Basic Algorithms and Concepts](https://reader036.vdocuments.us/reader036/viewer/2022062410/5681634a550346895dd3daea/html5/thumbnails/53.jpg)
Mining Sequence Data
“Shortest-run Generalization” (Srinivasa and Spiliopoulou 2000)
Input sequences: aabcb, aac, aabc
[Figure: the state machine after each sequence is merged in, with shared suffixes combined into a single path]
![Page 54: Data Mining, Data Warehousing and Knowledge Discovery Basic Algorithms and Concepts](https://reader036.vdocuments.us/reader036/viewer/2022062410/5681634a550346895dd3daea/html5/thumbnails/54.jpg)
Mining Streaming Data
Characteristics of streaming data:
• Large data sequence
• No storage
• Often an infinite sequence
• Examples: Stock market quotes, streaming audio/video, network traffic
![Page 55: Data Mining, Data Warehousing and Knowledge Discovery Basic Algorithms and Concepts](https://reader036.vdocuments.us/reader036/viewer/2022062410/5681634a550346895dd3daea/html5/thumbnails/55.jpg)
Mining Streaming Data
Running mean:
Let n = number of items read so far,
avg = running average calculated so far.
On reading the next number num:
avg ← (n*avg + num) / (n+1)
n ← n+1
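The update rule translates directly into a generator that keeps only `n` and `avg` in memory, which is the point for streams that cannot be stored. The function name is chosen here for illustration.

```python
def running_mean(stream):
    """Yield the mean of all numbers seen so far, one value per input number."""
    n, avg = 0, 0.0
    for num in stream:
        avg = (n * avg + num) / (n + 1)   # the update rule from the slide
        n += 1
        yield avg

for value in running_mean([2, 4, 6]):
    print(value)
```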
![Page 56: Data Mining, Data Warehousing and Knowledge Discovery Basic Algorithms and Concepts](https://reader036.vdocuments.us/reader036/viewer/2022062410/5681634a550346895dd3daea/html5/thumbnails/56.jpg)
Mining Streaming Data
Running variance:
var = (num - avg)^2
= num^2 - 2*num*avg + avg^2
Let
A = sum of num^2 over all numbers read so far
B = sum of 2*num*avg over all numbers read so far
C = sum of avg^2 over all numbers read so far
avg = average of numbers read so far
n = number of numbers read so far
![Page 57: Data Mining, Data Warehousing and Knowledge Discovery Basic Algorithms and Concepts](https://reader036.vdocuments.us/reader036/viewer/2022062410/5681634a550346895dd3daea/html5/thumbnails/57.jpg)
Mining Streaming Data
Running variance:
On reading next number num:
avg ← (avg*n + num) / (n+1)
n ← n+1
A ← A + num^2
B ← B + 2*avg*num
C ← C + avg^2
var = (A - B + C) / n
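The scheme above uses the running average at the time each number is read, so it approximates the variance. As a point of comparison, the sketch below uses a different, well-known single-pass method (Welford's algorithm), which gives the exact population variance of the numbers seen so far with constant memory. It is an alternative to the slide's A/B/C bookkeeping, not a transcription of it; the class name is an assumption.

```python
class RunningVariance:
    """Welford's single-pass algorithm for the mean and population variance."""

    def __init__(self):
        self.n = 0        # numbers read so far
        self.mean = 0.0   # running mean
        self.m2 = 0.0     # running sum of squared deviations from the mean

    def update(self, num):
        self.n += 1
        delta = num - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (num - self.mean)   # uses both the old and the new mean

    @property
    def variance(self):
        return self.m2 / self.n if self.n else 0.0

rv = RunningVariance()
for num in [2, 4, 4, 4, 5, 5, 7, 9]:
    rv.update(num)
print(rv.mean, rv.variance)
```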
![Page 58: Data Mining, Data Warehousing and Knowledge Discovery Basic Algorithms and Concepts](https://reader036.vdocuments.us/reader036/viewer/2022062410/5681634a550346895dd3daea/html5/thumbnails/58.jpg)
Mining Streaming Data
-Consistency: (Srinivasa and Spiliopoulou, CoopIS 1999)
Let streaming data be in the form of “frames”, where each frame comprises one or more data elements.
Support for data element k within a frame is defined as (#occurrences of k)/(#elements in frame)
-Consistency for data element k is the “sustained” support for k over all frames read so far, with a “leakage” of (1- )
![Page 59: Data Mining, Data Warehousing and Knowledge Discovery Basic Algorithms and Concepts](https://reader036.vdocuments.us/reader036/viewer/2022062410/5681634a550346895dd3daea/html5/thumbnails/59.jpg)
Mining Streaming Data
-Consistency: (Srinivasa and Spiliopoulou, CoopIS 1999)
*sup(k)
(1-)
levelt(k) = (1-)*levelt-1(k) + *sup(k)
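The level update can be sketched as an exponentially weighted accumulation of per-frame support. The parameter's symbol did not survive in the text above, so it is called `alpha` here purely as a placeholder name; the function name and the list-based frame representation are likewise assumptions.

```python
def update_level(level, frame, k, alpha):
    """One step of level_t(k) = (1 - alpha) * level_{t-1}(k) + alpha * sup(k).

    `alpha` is a placeholder name for the slide's parameter.
    """
    sup = frame.count(k) / len(frame)      # support of k within this frame
    return (1 - alpha) * level + alpha * sup

level = 0.0
for frame in [["a", "b", "a", "c"], ["a", "a"]]:
    level = update_level(level, frame, "a", 0.5)
    print(level)
```

An element that keeps appearing in frame after frame sustains a high level, while one that stops appearing decays geometrically.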
![Page 60: Data Mining, Data Warehousing and Knowledge Discovery Basic Algorithms and Concepts](https://reader036.vdocuments.us/reader036/viewer/2022062410/5681634a550346895dd3daea/html5/thumbnails/60.jpg)
Data Warehousing
• A platform for online analytical processing (OLAP)
• Warehouses collect transactional data from several transactional databases and organize them in a fashion amenable to analysis
• Also called “data marts”
• A critical component of the decision support system (DSS) of enterprises
• Some typical DW queries:
  – Which item sells best in each region that has retail outlets?
  – Which advertising strategy is best for South India?
  – Which (age_group/occupation) in South India likes fast food, and which (age_group/occupation) likes to cook?
![Page 61: Data Mining, Data Warehousing and Knowledge Discovery Basic Algorithms and Concepts](https://reader036.vdocuments.us/reader036/viewer/2022062410/5681634a550346895dd3daea/html5/thumbnails/61.jpg)
Data Warehousing
[Figure: OLTP systems (Order Processing, Inventory, Sales) → Data Cleaning → Data Warehouse (OLAP)]
![Page 62: Data Mining, Data Warehousing and Knowledge Discovery Basic Algorithms and Concepts](https://reader036.vdocuments.us/reader036/viewer/2022062410/5681634a550346895dd3daea/html5/thumbnails/62.jpg)
OLTP vs OLAP

| Transactional Data (OLTP) | Analysis Data (OLAP) |
|---|---|
| Small or medium size databases | Very large databases |
| Transient data | Archival data |
| Frequent insertions and updates | Infrequent updates |
| Small query shadow | Very large query shadow |
| Normalization important to handle updates | De-normalization important to handle queries |
![Page 63: Data Mining, Data Warehousing and Knowledge Discovery Basic Algorithms and Concepts](https://reader036.vdocuments.us/reader036/viewer/2022062410/5681634a550346895dd3daea/html5/thumbnails/63.jpg)
Data Cleaning
• Performs logical transformation of transactional data to suit the data warehouse
• Model of operations → model of enterprise
• Usually a semi-automatic process
![Page 64: Data Mining, Data Warehousing and Knowledge Discovery Basic Algorithms and Concepts](https://reader036.vdocuments.us/reader036/viewer/2022062410/5681634a550346895dd3daea/html5/thumbnails/64.jpg)
Data Cleaning
[Figure: transactional tables Orders (Order_id, Price, Cust_id), Inventory (Prod_id, Price, Price_chng) and Sales (Cust_id, Cust_prof, Tot_sales) feeding a Data Warehouse with dimensions Customers, Products, Orders, Inventory, Price, Time]
![Page 65: Data Mining, Data Warehousing and Knowledge Discovery Basic Algorithms and Concepts](https://reader036.vdocuments.us/reader036/viewer/2022062410/5681634a550346895dd3daea/html5/thumbnails/65.jpg)
Multi-dimensional Data Model
[Figure: a multi-dimensional data cube with dimensions Time (Jan’01, Jun’01, Jan’02, Jun’02), Price, Customers, Products and Orders]
![Page 66: Data Mining, Data Warehousing and Knowledge Discovery Basic Algorithms and Concepts](https://reader036.vdocuments.us/reader036/viewer/2022062410/5681634a550346895dd3daea/html5/thumbnails/66.jpg)
Some MDBMS Operations
• Roll-up
  – Collapse dimensions (aggregate to a coarser level)
• Drill-down
  – Add dimensions (expand to finer detail)
• Vector-distance operations (ex: clustering)
• Vector space browsing
![Page 67: Data Mining, Data Warehousing and Knowledge Discovery Basic Algorithms and Concepts](https://reader036.vdocuments.us/reader036/viewer/2022062410/5681634a550346895dd3daea/html5/thumbnails/67.jpg)
Star Schema
[Figure: a central fact table linked to four surrounding dimension tables, each labelled DimTbl_1]
![Page 68: Data Mining, Data Warehousing and Knowledge Discovery Basic Algorithms and Concepts](https://reader036.vdocuments.us/reader036/viewer/2022062410/5681634a550346895dd3daea/html5/thumbnails/68.jpg)
WWW Based References
• http://www.kdnuggets.com/
• http://www.megaputer.com/
• http://www.almaden.ibm.com/cs/quest/index.html
• http://fas.sfu.ca/cs/research/groups/DB/sections/publication/kdd/kdd.html
• http://www.cs.su.oz.au/~thierry/ckdd.html
• http://www.dwinfocenter.org/
• http://datawarehouse.itoolbox.com/
• http://www.knowledgestorm.com/
• http://www.bitpipe.com/
• http://www.dw-institute.com/
• http://www.datawarehousing.com/
![Page 69: Data Mining, Data Warehousing and Knowledge Discovery Basic Algorithms and Concepts](https://reader036.vdocuments.us/reader036/viewer/2022062410/5681634a550346895dd3daea/html5/thumbnails/69.jpg)
References
• R. Agrawal, R. Srikant. Fast Algorithms for Mining Association Rules. Proc. of the 20th Int'l Conference on Very Large Databases, Santiago, Chile, Sept. 1994.
• R. Agrawal, R. Srikant. Mining Sequential Patterns. Proc. of the Int'l Conference on Data Engineering (ICDE), Taipei, Taiwan, March 1995.
• R. Agrawal, A. Arning, T. Bollinger, M. Mehta, J. Shafer, R. Srikant. The Quest Data Mining System. Proc. of the 2nd Int'l Conference on Knowledge Discovery in Databases and Data Mining, Portland, Oregon, August 1996.
• Surajit Chaudhuri, Umesh Dayal. An Overview of Data Warehousing and OLAP Technology. ACM SIGMOD Record, 26(1), March 1997.
• Jennifer Widom. Research Problems in Data Warehousing. Proc. of Int'l Conf. on Information and Knowledge Management, 1995.
![Page 70: Data Mining, Data Warehousing and Knowledge Discovery Basic Algorithms and Concepts](https://reader036.vdocuments.us/reader036/viewer/2022062410/5681634a550346895dd3daea/html5/thumbnails/70.jpg)
References
• A. Shoshani. OLAP and Statistical Databases: Similarities and Differences. Proc. of ACM PODS 1997.
• Panos Vassiliadis, Timos Sellis. A Survey on Logical Models for OLAP Databases. ACM SIGMOD Record.
• M. Gyssens, Laks V.S. Lakshmanan. A Foundation for Multi-Dimensional Databases. Proc. of VLDB 1997, Athens, Greece.
• Srinath Srinivasa, Myra Spiliopoulou. Modeling Interactions Based on Consistent Patterns. Proc. of CoopIS 1999, Edinburgh, UK.
• Srinath Srinivasa, Myra Spiliopoulou. Discerning Behavioral Patterns By Mining Transaction Logs. Proc. of ACM SAC 2000, Como, Italy.