![Page 1: BITMAPS & Starjoins. Indexing datacubes Objective: speed queries up. Traditional databases (OLTP): B-Trees Time and space logarithmic to the amount of](https://reader036.vdocuments.us/reader036/viewer/2022062515/56649f505503460f94c738e9/html5/thumbnails/1.jpg)
BITMAPS & Starjoins
![Page 2: BITMAPS & Starjoins. Indexing datacubes Objective: speed queries up. Traditional databases (OLTP): B-Trees Time and space logarithmic to the amount of](https://reader036.vdocuments.us/reader036/viewer/2022062515/56649f505503460f94c738e9/html5/thumbnails/2.jpg)
Indexing datacubes
Objective: speed queries up.
Traditional databases (OLTP): B-Trees
• Lookup time logarithmic in the number of indexed keys (space grows linearly with them).
• Dynamic, stable and exhibit good performance under updates. (But OLAP is not about updates….)
Bitmaps:
• Space efficient
• Difficult to update (but we don’t care in DW).
• Can effectively prune searches before looking at data.
![Page 3: BITMAPS & Starjoins. Indexing datacubes Objective: speed queries up. Traditional databases (OLTP): B-Trees Time and space logarithmic to the amount of](https://reader036.vdocuments.us/reader036/viewer/2022062515/56649f505503460f94c738e9/html5/thumbnails/3.jpg)
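Bitmaps
R = (…, A, …, M)

| R (A) | B8 | B7 | B6 | B5 | B4 | B3 | B2 | B1 | B0 |
|---|---|---|---|---|---|---|---|---|---|
| 3 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 8 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 7 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 6 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |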
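A minimal sketch of how such a value-list bitmap index could be built for column A. Storing each bitmap as a Python integer (bit r set when row r holds the value) and the function name `build_value_list_index` are assumptions made for illustration, not something the slides prescribe.

```python
def build_value_list_index(column, cardinality):
    """Return one bitmap per attribute value; bit r of bitmaps[v] is 1 when row r holds v."""
    bitmaps = [0] * cardinality
    for row, value in enumerate(column):
        bitmaps[value] |= 1 << row
    return bitmaps


def rows_of(bitmap):
    """List the row numbers whose bit is set in the bitmap."""
    rows, row = [], 0
    while bitmap:
        if bitmap & 1:
            rows.append(row)
        bitmap >>= 1
        row += 1
    return rows


# Column A from the table above, values in the range 0..8.
A = [3, 2, 1, 2, 8, 2, 2, 0, 7, 5, 6, 4]
index = build_value_list_index(A, 9)
rows_with_2 = rows_of(index[2])
```

Each row sets exactly one bit across the nine bitmaps, which is why the index is compact for low-cardinality attributes.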
![Page 4: BITMAPS & Starjoins. Indexing datacubes Objective: speed queries up. Traditional databases (OLTP): B-Trees Time and space logarithmic to the amount of](https://reader036.vdocuments.us/reader036/viewer/2022062515/56649f505503460f94c738e9/html5/thumbnails/4.jpg)
Query optimization
Consider a high-selectivity-factor query with predicates on two attributes.
Query optimizer: builds plans
(P1) Full relation scan (filter as you go).
(P2) Index scan on the predicate with the lower selectivity factor, followed by a scan of the temporary relation to filter out non-qualifying tuples using the other predicate. (Works well if the data is clustered on the first index key.)
(P3) Index scan for each predicate (separately), followed by a merge of the RID lists.
![Page 5: BITMAPS & Starjoins. Indexing datacubes Objective: speed queries up. Traditional databases (OLTP): B-Trees Time and space logarithmic to the amount of](https://reader036.vdocuments.us/reader036/viewer/2022062515/56649f505503460f94c738e9/html5/thumbnails/5.jpg)
Query optimization (continued)
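[Figure: plan (P2) uses the index on predicate 1 to fetch tuples t1 … tn from the blocks of data, then applies predicate 2 to them to produce the answer. Plan (P3) uses the indexes on predicate 1 and predicate 2 to produce tuple list 1 and tuple list 2, which are combined into a merged list.]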
![Page 6: BITMAPS & Starjoins. Indexing datacubes Objective: speed queries up. Traditional databases (OLTP): B-Trees Time and space logarithmic to the amount of](https://reader036.vdocuments.us/reader036/viewer/2022062515/56649f505503460f94c738e9/html5/thumbnails/6.jpg)
Query optimization (continued)
When using bitmap indexes (P3) can be an easy winner!
CPU operations in bitmaps (AND, OR, XOR, etc.) are more efficient than regular RID merges: just apply the binary operations to the bitmaps
(In B-trees, you would have to scan the two lists and select tuples in both -- merge operation--)
Of course, you can build B-trees on the compound key, but you would need one for every compound predicate (an exponential number of trees…).
![Page 7: BITMAPS & Starjoins. Indexing datacubes Objective: speed queries up. Traditional databases (OLTP): B-Trees Time and space logarithmic to the amount of](https://reader036.vdocuments.us/reader036/viewer/2022062515/56649f505503460f94c738e9/html5/thumbnails/7.jpg)
Bitmaps and predicates
A = a1 AND B = b2
(Bitmap for a1) AND (Bitmap for b2) = Bitmap for a1 and b2
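The conjunction above can be sketched in a few lines. The two columns and their values are made-up sample data, and the helper names `bitmap_of` and `rows_of` are hypothetical; bitmaps are again assumed to be Python integers with one bit per row.

```python
def bitmap_of(column, value):
    """Bitmap with bit r set when row r of the column equals the value."""
    bits = 0
    for row, v in enumerate(column):
        if v == value:
            bits |= 1 << row
    return bits


def rows_of(bitmap):
    """List the row numbers whose bit is set."""
    rows, row = [], 0
    while bitmap:
        if bitmap & 1:
            rows.append(row)
        bitmap >>= 1
        row += 1
    return rows


# Hypothetical sample columns of a relation with attributes A and B.
A = ["a1", "a2", "a1", "a3", "a1", "a2"]
B = ["b2", "b2", "b1", "b2", "b2", "b1"]

# A = a1 AND B = b2 becomes a single bitwise AND of the two bitmaps.
both = bitmap_of(A, "a1") & bitmap_of(B, "b2")
matching_rows = rows_of(both)
```

The same pattern covers OR and XOR predicates by swapping the bitwise operator.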
![Page 8: BITMAPS & Starjoins. Indexing datacubes Objective: speed queries up. Traditional databases (OLTP): B-Trees Time and space logarithmic to the amount of](https://reader036.vdocuments.us/reader036/viewer/2022062515/56649f505503460f94c738e9/html5/thumbnails/8.jpg)
Tradeoffs
Dimension cardinality small → dense bitmaps
Dimension cardinality large → sparse bitmaps
Compression (decompression)
![Page 9: BITMAPS & Starjoins. Indexing datacubes Objective: speed queries up. Traditional databases (OLTP): B-Trees Time and space logarithmic to the amount of](https://reader036.vdocuments.us/reader036/viewer/2022062515/56649f505503460f94c738e9/html5/thumbnails/9.jpg)
Query strategy for Star joins
Maintain join indexes between fact table and dimension tables
[Figure: a fact table with Product, Type and Location columns, linked to its dimension tables. Each dimension value has a bitmap over the fact-table rows: one bitmap per product, one per type (type a … type k), and one per location.]
![Page 10: BITMAPS & Starjoins. Indexing datacubes Objective: speed queries up. Traditional databases (OLTP): B-Trees Time and space logarithmic to the amount of](https://reader036.vdocuments.us/reader036/viewer/2022062515/56649f505503460f94c738e9/html5/thumbnails/10.jpg)
Strategy example
Aggregate all sales for products of location …, … or …
(Bitmap for …) OR (Bitmap for …) OR (Bitmap for …) = Bitmap for predicate
![Page 11: BITMAPS & Starjoins. Indexing datacubes Objective: speed queries up. Traditional databases (OLTP): B-Trees Time and space logarithmic to the amount of](https://reader036.vdocuments.us/reader036/viewer/2022062515/56649f505503460f94c738e9/html5/thumbnails/11.jpg)
Star-Joins
Select F.S, D1.A1, D2.A2, …, Dn.An
from F, D1, D2, …, Dn
where F.A1 = D1.A1 and F.A2 = D2.A2 … and F.An = Dn.An
and D1.B1 = ‘c1’ and D2.B2 = ‘p2’ …
Likely strategy:
For each Di, find the values of Ai such that Di.Bi = ‘xi’ (unless you have a bitmap index for Bi). Use the bitmap index on those Ai values to form a bitmap for the related rows of F (OR-ing the bitmaps).
At this stage you have n such bitmaps; the result is found by AND-ing them.
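The strategy above can be sketched as follows, assuming a tiny in-memory fact table stored column by column and dimension tables stored as dictionaries. The table contents, attribute names ("type", "region") and the function names are all hypothetical.

```python
def build_bitmaps(column):
    """Join index for one foreign-key column of F: key value -> bitmap of fact rows."""
    bitmaps = {}
    for row, key in enumerate(column):
        bitmaps[key] = bitmaps.get(key, 0) | (1 << row)
    return bitmaps


def star_join(fact, dims, predicates):
    """Return the fact rows that satisfy every dimension predicate."""
    n_rows = len(next(iter(fact.values())))
    result = (1 << n_rows) - 1  # start with all rows qualifying
    for name, (attr, wanted) in predicates.items():
        # Step 1: find the dimension keys Ai whose attribute Bi matches.
        keys = [k for k, rec in dims[name].items() if rec[attr] == wanted]
        # Step 2: OR the fact-table bitmaps of those keys.
        index = build_bitmaps(fact[name])
        dim_bitmap = 0
        for k in keys:
            dim_bitmap |= index.get(k, 0)
        # Step 3: AND across dimensions.
        result &= dim_bitmap
    return [r for r in range(n_rows) if (result >> r) & 1]


# Hypothetical data: foreign keys in F and two small dimension tables.
fact = {"product": [1, 2, 3, 1, 2, 3], "location": [10, 10, 20, 20, 30, 30]}
dims = {
    "product": {1: {"type": "a"}, 2: {"type": "k"}, 3: {"type": "a"}},
    "location": {10: {"region": "east"}, 20: {"region": "west"}, 30: {"region": "east"}},
}
rows = star_join(fact, dims, {"product": ("type", "a"), "location": ("region", "east")})
```

In a real warehouse the join indexes would be built once and stored; rebuilding them inside `star_join` only keeps the sketch short.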
![Page 12: BITMAPS & Starjoins. Indexing datacubes Objective: speed queries up. Traditional databases (OLTP): B-Trees Time and space logarithmic to the amount of](https://reader036.vdocuments.us/reader036/viewer/2022062515/56649f505503460f94c738e9/html5/thumbnails/12.jpg)
Example
Selectivity/predicate = 0.01 (predicates on the dimension tables)
n predicates (statistically independent)
Total selectivity = $10^{-2n}$
Fact table = $10^8$ rows, n = 3, tuples in answer = $10^8 / 10^6$ = 100 rows. In the worst case each answer tuple sits in a different block, i.e. 100 blocks. Still better than reading all the blocks in the relation (e.g., assuming 100 tuples/block, this would be $10^6$ blocks!)
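The arithmetic of this example can be checked with a short calculation. Exact fractions are used to avoid floating-point rounding; the variable names are illustrative only.

```python
from fractions import Fraction

selectivity = Fraction(1, 100)   # per-predicate selectivity (0.01)
n_predicates = 3
fact_rows = 10**8
tuples_per_block = 100

# Independent predicates multiply: 0.01^3 = 10^-6.
total_selectivity = selectivity ** n_predicates
answer_rows = int(fact_rows * total_selectivity)

# Worst case: every answer tuple lives in a different block.
worst_case_blocks = answer_rows
full_scan_blocks = fact_rows // tuples_per_block
```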
![Page 13: BITMAPS & Starjoins. Indexing datacubes Objective: speed queries up. Traditional databases (OLTP): B-Trees Time and space logarithmic to the amount of](https://reader036.vdocuments.us/reader036/viewer/2022062515/56649f505503460f94c738e9/html5/thumbnails/13.jpg)
Design Space of Bitmap Indexes
The basic bitmap design is called Value-list index. The focus there is on the columns. If we change the focus to the rows, the index becomes a set of attribute values (integers) in each tuple (row), that can be represented in a particular way.
5 0 0 0 1 0 0 0 0 0
We can encode this row in many ways...
![Page 14: BITMAPS & Starjoins. Indexing datacubes Objective: speed queries up. Traditional databases (OLTP): B-Trees Time and space logarithmic to the amount of](https://reader036.vdocuments.us/reader036/viewer/2022062515/56649f505503460f94c738e9/html5/thumbnails/14.jpg)
Attribute value decomposition
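C = attribute cardinality
Consider a value of the attribute, v, and a sequence of numbers $\langle b_{n-1}, b_{n-2}, \ldots, b_1 \rangle$. Also, define $b_n = \lceil C / \prod_{i=1}^{n-1} b_i \rceil$; then v can be decomposed into a sequence of n digits $\langle v_n, v_{n-1}, \ldots, v_1 \rangle$ as follows:

$$
\begin{aligned}
v &= V_1 \\
  &= V_2 b_1 + v_1 \\
  &= V_3 (b_2 b_1) + v_2 b_1 + v_1 \\
  &\;\;\vdots \\
  &= v_n \Big(\prod_{j=1}^{n-1} b_j\Big) + \cdots + v_i \Big(\prod_{j=1}^{i-1} b_j\Big) + \cdots + v_2 b_1 + v_1
\end{aligned}
$$

where $v_i = V_i \bmod b_i$ and $V_i = \lfloor V_{i-1} / b_{i-1} \rfloor$.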
![Page 15: BITMAPS & Starjoins. Indexing datacubes Objective: speed queries up. Traditional databases (OLTP): B-Trees Time and space logarithmic to the amount of](https://reader036.vdocuments.us/reader036/viewer/2022062515/56649f505503460f94c738e9/html5/thumbnails/15.jpg)
Number systems

<10,10,10> (decimal system!)
576 = 5 x 10 x 10 + 7 x 10 + 6
576/100 = 5 | 76
76/10 = 7 | 6
6

How do you write 576 in:

<2,2,2,2,2,2,2,2,2>
576 = 1 x $2^9$ + 0 x $2^8$ + 0 x $2^7$ + 1 x $2^6$ + 0 x $2^5$ + 0 x $2^4$ + 0 x $2^3$ + 0 x $2^2$ + 0 x $2^1$ + 0 x $2^0$
576/$2^9$ = 1 | 64, 64/$2^8$ = 0 | 64, 64/$2^7$ = 0 | 64, 64/$2^6$ = 1 | 0,
0/$2^5$ = 0 | 0, 0/$2^4$ = 0 | 0, 0/$2^3$ = 0 | 0, 0/$2^2$ = 0 | 0, 0/$2^1$ = 0 | 0, 0/$2^0$ = 0 | 0

<7,7,5,3>
576/(7x7x5x3) = 576/735 = 0 | 576
576/(7x5x3) = 576/105 = 5 | 51
576 = 5 x (7x5x3) + 51
51/(5x3) = 51/15 = 3 | 6
576 = 5 x (7x5x3) + 3 x (5x3) + 6
6/3 = 2 | 0
576 = 5 x (7x5x3) + 3 x (5x3) + 2 x (3) + 0
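The worked examples above follow one procedure, sketched below. As an assumption for illustration, the function takes the complete base sequence (including the leading base) most significant first; the names `decompose` and `recompose` are hypothetical.

```python
def decompose(v, bases):
    """Split v into mixed-radix digits; bases and digits are most significant first."""
    digits = []
    for b in reversed(bases):      # peel off v1 first, then v2, ...
        digits.append(v % b)       # v_i = V_i mod b_i
        v //= b                    # V_{i+1} = floor(V_i / b_i)
    if v != 0:
        raise ValueError("value does not fit in the given bases")
    return digits[::-1]


def recompose(digits, bases):
    """Inverse of decompose (Horner evaluation of the mixed-radix digits)."""
    value = 0
    for d, b in zip(digits, bases):
        value = value * b + d
    return value


decimal_digits = decompose(576, [10, 10, 10])
binary_digits = decompose(576, [2] * 10)
mixed_digits = decompose(576, [7, 7, 5, 3])
```

For the base sequence <7,7,5,3> the digits are 5, 3, 2 and 0, matching the last line of the hand calculation.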
![Page 16: BITMAPS & Starjoins. Indexing datacubes Objective: speed queries up. Traditional databases (OLTP): B-Trees Time and space logarithmic to the amount of](https://reader036.vdocuments.us/reader036/viewer/2022062515/56649f505503460f94c738e9/html5/thumbnails/16.jpg)
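Bitmaps
R = (…, A, …, M), value-list index

| R (A) | B8 | B7 | B6 | B5 | B4 | B3 | B2 | B1 | B0 |
|---|---|---|---|---|---|---|---|---|---|
| 3 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 8 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 7 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 6 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |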
![Page 17: BITMAPS & Starjoins. Indexing datacubes Objective: speed queries up. Traditional databases (OLTP): B-Trees Time and space logarithmic to the amount of](https://reader036.vdocuments.us/reader036/viewer/2022062515/56649f505503460f94c738e9/html5/thumbnails/17.jpg)
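Example
sequence <3,3>, value-list index (equality)
(subscript: component; superscript: digit value)

| R (A) | $B_2^2$ | $B_2^1$ | $B_2^0$ | $B_1^2$ | $B_1^1$ | $B_1^0$ |
|---|---|---|---|---|---|---|
| 3 (1x3+0) | 0 | 1 | 0 | 0 | 0 | 1 |
| 2 | 0 | 0 | 1 | 1 | 0 | 0 |
| 1 | 0 | 0 | 1 | 0 | 1 | 0 |
| 2 | 0 | 0 | 1 | 1 | 0 | 0 |
| 8 | 1 | 0 | 0 | 1 | 0 | 0 |
| 2 | 0 | 0 | 1 | 1 | 0 | 0 |
| 2 | 0 | 0 | 1 | 1 | 0 | 0 |
| 0 | 0 | 0 | 1 | 0 | 0 | 1 |
| 7 | 1 | 0 | 0 | 0 | 1 | 0 |
| 5 | 0 | 1 | 0 | 1 | 0 | 0 |
| 6 | 1 | 0 | 0 | 0 | 0 | 1 |
| 4 | 0 | 1 | 0 | 0 | 1 | 0 |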
![Page 18: BITMAPS & Starjoins. Indexing datacubes Objective: speed queries up. Traditional databases (OLTP): B-Trees Time and space logarithmic to the amount of](https://reader036.vdocuments.us/reader036/viewer/2022062515/56649f505503460f94c738e9/html5/thumbnails/18.jpg)
Encoding scheme
Equality encoding: all bits set to 0 except the one that corresponds to the value
Range encoding: the vi rightmost bits set to 0, the remaining bits set to 1
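A small sketch of the two encodings for a single digit v in base b. Bits are listed from the highest position down to position 0, following the column order used in the tables of this deck; the function names are hypothetical.

```python
def equality_bits(v, b):
    """Equality encoding: only the bit at position v is 1."""
    return [1 if j == v else 0 for j in range(b - 1, -1, -1)]


def range_bits(v, b):
    """Range encoding: the v rightmost bits are 0, the remaining bits are 1."""
    return [1 if j >= v else 0 for j in range(b - 1, -1, -1)]


eq_5 = equality_bits(5, 9)   # value 5, single component, base 9
rg_3 = range_bits(3, 9)      # value 3, single component, base 9
```

Under range encoding the highest bit is 1 for every value, so that bitmap carries no information and can be left out of the stored index.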
![Page 19: BITMAPS & Starjoins. Indexing datacubes Objective: speed queries up. Traditional databases (OLTP): B-Trees Time and space logarithmic to the amount of](https://reader036.vdocuments.us/reader036/viewer/2022062515/56649f505503460f94c738e9/html5/thumbnails/19.jpg)
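Range encoding
single component, base-9

| R (A) | B8 | B7 | B6 | B5 | B4 | B3 | B2 | B1 | B0 |
|---|---|---|---|---|---|---|---|---|---|
| 3 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 |
| 2 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 |
| 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 |
| 8 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 7 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
| 6 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 |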
![Page 20: BITMAPS & Starjoins. Indexing datacubes Objective: speed queries up. Traditional databases (OLTP): B-Trees Time and space logarithmic to the amount of](https://reader036.vdocuments.us/reader036/viewer/2022062515/56649f505503460f94c738e9/html5/thumbnails/20.jpg)
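Example (revisited)
sequence <3,3>, value-list index (equality)
(subscript: component; superscript: digit value)

| R (A) | $B_2^2$ | $B_2^1$ | $B_2^0$ | $B_1^2$ | $B_1^1$ | $B_1^0$ |
|---|---|---|---|---|---|---|
| 3 (1x3+0) | 0 | 1 | 0 | 0 | 0 | 1 |
| 2 | 0 | 0 | 1 | 1 | 0 | 0 |
| 1 | 0 | 0 | 1 | 0 | 1 | 0 |
| 2 | 0 | 0 | 1 | 1 | 0 | 0 |
| 8 | 1 | 0 | 0 | 1 | 0 | 0 |
| 2 | 0 | 0 | 1 | 1 | 0 | 0 |
| 2 | 0 | 0 | 1 | 1 | 0 | 0 |
| 0 | 0 | 0 | 1 | 0 | 0 | 1 |
| 7 | 1 | 0 | 0 | 0 | 1 | 0 |
| 5 | 0 | 1 | 0 | 1 | 0 | 0 |
| 6 | 1 | 0 | 0 | 0 | 0 | 1 |
| 4 | 0 | 1 | 0 | 0 | 1 | 0 |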
![Page 21: BITMAPS & Starjoins. Indexing datacubes Objective: speed queries up. Traditional databases (OLTP): B-Trees Time and space logarithmic to the amount of](https://reader036.vdocuments.us/reader036/viewer/2022062515/56649f505503460f94c738e9/html5/thumbnails/21.jpg)
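Example
sequence <3,3>, range-encoded index
(subscript: component; superscript: digit value; the all-ones highest bitmap of each component is omitted)

| R (A) | $B_2^1$ | $B_2^0$ | $B_1^1$ | $B_1^0$ |
|---|---|---|---|---|
| 3 | 1 | 0 | 1 | 1 |
| 2 | 1 | 1 | 0 | 0 |
| 1 | 1 | 1 | 1 | 0 |
| 2 | 1 | 1 | 0 | 0 |
| 8 | 0 | 0 | 0 | 0 |
| 2 | 1 | 1 | 0 | 0 |
| 2 | 1 | 1 | 0 | 0 |
| 0 | 1 | 1 | 1 | 1 |
| 7 | 0 | 0 | 1 | 0 |
| 5 | 1 | 0 | 0 | 0 |
| 6 | 0 | 0 | 1 | 1 |
| 4 | 1 | 0 | 1 | 0 |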
![Page 22: BITMAPS & Starjoins. Indexing datacubes Objective: speed queries up. Traditional databases (OLTP): B-Trees Time and space logarithmic to the amount of](https://reader036.vdocuments.us/reader036/viewer/2022062515/56649f505503460f94c738e9/html5/thumbnails/22.jpg)
Design Space
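[Figure: the design space of bitmap indexes. One axis is the attribute value decomposition, from a single component of base b (Value-list index), through <b2, b1>, …, up to $\log_2 C$ components <b, b, …, b> (Bit-Sliced index). The other axis is the encoding scheme: equality or range.]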
![Page 23: BITMAPS & Starjoins. Indexing datacubes Objective: speed queries up. Traditional databases (OLTP): B-Trees Time and space logarithmic to the amount of](https://reader036.vdocuments.us/reader036/viewer/2022062515/56649f505503460f94c738e9/html5/thumbnails/23.jpg)
RangeEval
Evaluates each range predicate by computing two bitmaps: the $B_{EQ}$ bitmap and either $B_{GT}$ or $B_{LT}$
RangeEval-Opt uses only <=
A < v is the same as A <= v-1
A > v is the same as Not( A <= v)
A >= v is the same as Not (A <= v-1)
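The three rewrites above can be sketched on a single-component range-encoded index, where the stored bitmap for v already answers A <= v. This is only an illustration of the rewrites, not the RangeEval-Opt algorithm itself; it assumes integer values 0..C-1 with no nulls, and all names are hypothetical.

```python
def build_range_index(column, cardinality):
    """index[v] has bit r set when row r satisfies A <= v."""
    return [sum(1 << r for r, a in enumerate(column) if a <= v)
            for v in range(cardinality)]


def le(index, n_rows, v):
    """Bitmap for A <= v, handling v outside the value domain."""
    if v < 0:
        return 0
    if v >= len(index):
        return (1 << n_rows) - 1
    return index[v]


def evaluate(index, n_rows, op, v):
    """Answer any comparison using only <= bitmaps and NOT."""
    all_rows = (1 << n_rows) - 1
    if op == "<=":
        return le(index, n_rows, v)
    if op == "<":
        return le(index, n_rows, v - 1)
    if op == ">":
        return all_rows & ~le(index, n_rows, v)
    if op == ">=":
        return all_rows & ~le(index, n_rows, v - 1)
    raise ValueError(f"unsupported operator: {op}")


def rows_of(bitmap, n_rows):
    return [r for r in range(n_rows) if (bitmap >> r) & 1]


A = [3, 2, 1, 2, 8, 2, 2, 0, 7, 5, 6, 4]
index = build_range_index(A, 9)
greater_than_5 = rows_of(evaluate(index, len(A), ">", 5), len(A))
less_than_2 = rows_of(evaluate(index, len(A), "<", 2), len(A))
```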
![Page 24: BITMAPS & Starjoins. Indexing datacubes Objective: speed queries up. Traditional databases (OLTP): B-Trees Time and space logarithmic to the amount of](https://reader036.vdocuments.us/reader036/viewer/2022062515/56649f505503460f94c738e9/html5/thumbnails/24.jpg)
RangeEval-OPT
![Page 25: BITMAPS & Starjoins. Indexing datacubes Objective: speed queries up. Traditional databases (OLTP): B-Trees Time and space logarithmic to the amount of](https://reader036.vdocuments.us/reader036/viewer/2022062515/56649f505503460f94c738e9/html5/thumbnails/25.jpg)
![Page 26: BITMAPS & Starjoins. Indexing datacubes Objective: speed queries up. Traditional databases (OLTP): B-Trees Time and space logarithmic to the amount of](https://reader036.vdocuments.us/reader036/viewer/2022062515/56649f505503460f94c738e9/html5/thumbnails/26.jpg)
Classification vs. Prediction
• Classification:
– predicts categorical class labels
– classifies data (constructs a model) based on the training set and the values (class labels) in a classifying attribute, and uses it in classifying new data
• Prediction:
– models continuous-valued functions, i.e., predicts unknown or missing values
• Typical Applications
– credit approval
– target marketing
– medical diagnosis
– treatment effectiveness analysis
![Page 27: BITMAPS & Starjoins. Indexing datacubes Objective: speed queries up. Traditional databases (OLTP): B-Trees Time and space logarithmic to the amount of](https://reader036.vdocuments.us/reader036/viewer/2022062515/56649f505503460f94c738e9/html5/thumbnails/27.jpg)
• Pros:
– Fast.
– Rules easy to interpret.
– High dimensional data
• Cons:
– No correlations
– Axis-parallel cuts.
• Supervised learning (classification)
– Supervision: The training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
– New data is classified based on the training set
• Unsupervised learning (clustering)
– The class labels of the training data are unknown
– Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data
• Decision tree
– A flow-chart-like tree structure
– Internal node denotes a test on an attribute
– Branch represents an outcome of the test
– Leaf nodes represent class labels or class distribution
• Decision tree generation consists of two phases
– Tree construction
  • At start, all the training examples are at the root
  • Partition examples recursively based on selected attributes
– Tree pruning
  • Identify and remove branches that reflect noise or outliers
• Use of decision tree: Classifying an unknown sample
– Test the attribute values of the sample against the decision tree
![Page 28: BITMAPS & Starjoins. Indexing datacubes Objective: speed queries up. Traditional databases (OLTP): B-Trees Time and space logarithmic to the amount of](https://reader036.vdocuments.us/reader036/viewer/2022062515/56649f505503460f94c738e9/html5/thumbnails/28.jpg)
Algorithm for Decision Tree Induction
• Basic algorithm (a greedy algorithm)
– Tree is constructed in a top-down recursive divide-and-conquer manner
– At start, all the training examples are at the root
– Attributes are categorical (if continuous-valued, they are discretized in advance)
– Examples are partitioned recursively based on selected attributes
– Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
• Conditions for stopping partitioning
– All samples for a given node belong to the same class
– There are no remaining attributes for further partitioning: majority voting is employed for classifying the leaf
– There are no samples left
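A minimal sketch of the greedy procedure above, assuming categorical attributes, information gain as the selection measure, and base-2 logarithms. The toy weather data, its attribute names and the dictionary layout of the tree are assumptions made for illustration. Because the sketch only branches on values that occur at a node, the "no samples left" case does not arise here.

```python
import math
from collections import Counter


def entropy(labels):
    """Expected information (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(c / total * math.log2(c / total) for c in Counter(labels).values())


def information_gain(rows, labels, attribute):
    """Entropy reduction obtained by partitioning on one categorical attribute."""
    remainder = 0.0
    for value in set(row[attribute] for row in rows):
        subset = [lab for row, lab in zip(rows, labels) if row[attribute] == value]
        remainder += len(subset) / len(labels) * entropy(subset)
    return entropy(labels) - remainder


def build_tree(rows, labels, attributes):
    if len(set(labels)) == 1:                 # all samples in one class
        return labels[0]
    if not attributes:                        # no attributes left: majority vote
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: information_gain(rows, labels, a))
    remaining = [a for a in attributes if a != best]
    branches = {}
    for value in sorted(set(row[best] for row in rows)):
        keep = [i for i, row in enumerate(rows) if row[best] == value]
        branches[value] = build_tree([rows[i] for i in keep],
                                     [labels[i] for i in keep], remaining)
    return {"attribute": best, "branches": branches}


def classify(tree, row):
    """Walk the tree, testing the sample's attribute values, until a leaf."""
    while isinstance(tree, dict):
        tree = tree["branches"][row[tree["attribute"]]]
    return tree


# Hypothetical toy training set.
rows = [
    {"outlook": "sunny", "windy": "no"}, {"outlook": "sunny", "windy": "yes"},
    {"outlook": "rain", "windy": "no"}, {"outlook": "rain", "windy": "yes"},
    {"outlook": "overcast", "windy": "no"}, {"outlook": "overcast", "windy": "yes"},
]
labels = ["play", "play", "play", "stay", "play", "play"]
tree = build_tree(rows, labels, ["outlook", "windy"])
prediction = classify(tree, {"outlook": "rain", "windy": "yes"})
```

Pruning is left out of the sketch; it would be a separate pass over the finished tree.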
![Page 29: BITMAPS & Starjoins. Indexing datacubes Objective: speed queries up. Traditional databases (OLTP): B-Trees Time and space logarithmic to the amount of](https://reader036.vdocuments.us/reader036/viewer/2022062515/56649f505503460f94c738e9/html5/thumbnails/29.jpg)
Decision tree algorithms
• Building phase:
– Recursively split nodes using the best splitting attribute and value for the node
• Pruning phase:
– Smaller (yet imperfect) tree achieves better prediction accuracy.
– Prune leaf nodes recursively to avoid over-fitting.
DATA TYPES
• Numerically ordered: values are ordered and can be represented on the real line. (E.g., salary.)
• Categorical: takes values from a finite set without any natural ordering. (E.g., color.)
• Ordinal: takes values from a finite set whose values possess a clear ordering, but the distances between them are unknown. (E.g., preference scale: good, fair, bad.)
![Page 30: BITMAPS & Starjoins. Indexing datacubes Objective: speed queries up. Traditional databases (OLTP): B-Trees Time and space logarithmic to the amount of](https://reader036.vdocuments.us/reader036/viewer/2022062515/56649f505503460f94c738e9/html5/thumbnails/30.jpg)
Some probability...
S = cases
freq(Ci, S) = # cases in S that belong to Ci
Gain: entropic measure
Prob(“this case belongs to Ci”) = freq(Ci, S)/|S|
Information conveyed: $-\log(\mathrm{freq}(C_i, S)/|S|)$
Entropy = expected information:

$$\mathrm{info}(S) = -\sum_i \frac{\mathrm{freq}(C_i, S)}{|S|} \log\frac{\mathrm{freq}(C_i, S)}{|S|}$$

GAIN
Test X:

$$\mathrm{info}_X(T) = \sum_i \frac{|T_i|}{|T|}\,\mathrm{info}(T_i)$$

$$\mathrm{gain}(X) = \mathrm{info}(T) - \mathrm{info}_X(T)$$
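As a worked illustration of these definitions, take a hypothetical set T of 6 cases, 4 in class H and 2 in class L, and a hypothetical test X that splits it into $T_1$ (3 H, 0 L) and $T_2$ (1 H, 2 L). Base-2 logarithms are assumed, so the values are in bits.

$$
\begin{aligned}
\mathrm{info}(T) &= -\tfrac{4}{6}\log_2\tfrac{4}{6} - \tfrac{2}{6}\log_2\tfrac{2}{6} \approx 0.918 \\
\mathrm{info}(T_1) &= 0, \qquad \mathrm{info}(T_2) = -\tfrac{1}{3}\log_2\tfrac{1}{3} - \tfrac{2}{3}\log_2\tfrac{2}{3} \approx 0.918 \\
\mathrm{info}_X(T) &= \tfrac{3}{6}\cdot 0 + \tfrac{3}{6}\cdot 0.918 \approx 0.459 \\
\mathrm{gain}(X) &= 0.918 - 0.459 \approx 0.459
\end{aligned}
$$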
![Page 31: BITMAPS & Starjoins. Indexing datacubes Objective: speed queries up. Traditional databases (OLTP): B-Trees Time and space logarithmic to the amount of](https://reader036.vdocuments.us/reader036/viewer/2022062515/56649f505503460f94c738e9/html5/thumbnails/31.jpg)
PROBLEM: What is the best predictor to segment on: windy or the outlook?
![Page 32: BITMAPS & Starjoins. Indexing datacubes Objective: speed queries up. Traditional databases (OLTP): B-Trees Time and space logarithmic to the amount of](https://reader036.vdocuments.us/reader036/viewer/2022062515/56649f505503460f94c738e9/html5/thumbnails/32.jpg)
![Page 33: BITMAPS & Starjoins. Indexing datacubes Objective: speed queries up. Traditional databases (OLTP): B-Trees Time and space logarithmic to the amount of](https://reader036.vdocuments.us/reader036/viewer/2022062515/56649f505503460f94c738e9/html5/thumbnails/33.jpg)
![Page 34: BITMAPS & Starjoins. Indexing datacubes Objective: speed queries up. Traditional databases (OLTP): B-Trees Time and space logarithmic to the amount of](https://reader036.vdocuments.us/reader036/viewer/2022062515/56649f505503460f94c738e9/html5/thumbnails/34.jpg)
Problem with Gain
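Strong bias towards tests with many outcomes.
Example: Z = Name
|Ti| = 1 (each name unique)
$\mathrm{info}_Z(T) = \sum_i \frac{1}{|T|}\,\mathrm{info}(T_i) = 0$, because each $T_i$ holds a single case and so $\mathrm{info}(T_i) = -1 \cdot \log 1 = 0$
Maximal gain!! (but a useless division: overfitting)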
![Page 35: BITMAPS & Starjoins. Indexing datacubes Objective: speed queries up. Traditional databases (OLTP): B-Trees Time and space logarithmic to the amount of](https://reader036.vdocuments.us/reader036/viewer/2022062515/56649f505503460f94c738e9/html5/thumbnails/35.jpg)
Split
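$$\text{split-info}(X) = -\sum_i \frac{|T_i|}{|T|}\log\frac{|T_i|}{|T|}$$

gain-ratio(X) = gain(X) / split-info(X)
Gain <= log(k) (k = number of classes)
Split <= log(n) (n = number of outcomes of the test)
For a test with many outcomes, such as Name, split-info is large, so the ratio is small.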
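The effect of dividing by split-info can be seen on a small made-up example: a Name-like test with one outcome per case against a two-outcome test. Base-2 logarithms, the six sample cases and the function names are assumptions made for illustration.

```python
import math
from collections import Counter


def info(labels):
    """Entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())


def gain_and_ratio(outcomes, labels):
    """outcomes[i] is the outcome of the test on case i; labels[i] is its class."""
    n = len(labels)
    subsets = {}
    for outcome, label in zip(outcomes, labels):
        subsets.setdefault(outcome, []).append(label)
    info_x = sum(len(t) / n * info(t) for t in subsets.values())
    split_info = -sum(len(t) / n * math.log2(len(t) / n) for t in subsets.values())
    gain = info(labels) - info_x
    ratio = gain / split_info if split_info > 0 else 0.0
    return gain, split_info, ratio


labels = ["H", "H", "H", "H", "L", "L"]
names = ["n0", "n1", "n2", "n3", "n4", "n5"]        # unique per case
binary = ["yes", "yes", "yes", "no", "no", "no"]    # two outcomes

name_gain, name_split, name_ratio = gain_and_ratio(names, labels)
bin_gain, bin_split, bin_ratio = gain_and_ratio(binary, labels)
```

On this data the Name-like test has the larger gain, but the two-outcome test has the larger gain ratio.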
![Page 36: BITMAPS & Starjoins. Indexing datacubes Objective: speed queries up. Traditional databases (OLTP): B-Trees Time and space logarithmic to the amount of](https://reader036.vdocuments.us/reader036/viewer/2022062515/56649f505503460f94c738e9/html5/thumbnails/36.jpg)
• The generated tree may overfit the training data
– Too many branches, some may reflect anomalies due to noise or outliers
– Result is poor accuracy for unseen samples
• Two approaches to avoid overfitting
– Prepruning: Halt tree construction early: do not split a node if this would result in the goodness measure falling below a threshold
  • Difficult to choose an appropriate threshold
– Postpruning: Remove branches from a “fully grown” tree to get a sequence of progressively pruned trees
  • Use a set of data different from the training data to decide which is the “best pruned tree”
• Approaches to Determine the Final Tree Size
– Separate training (2/3) and testing (1/3) sets
– Use cross validation, e.g., 10-fold cross validation
– Use all the data for training, but apply a statistical test (e.g., chi-square) to estimate whether expanding or pruning a node may improve the entire distribution
– Use minimum description length (MDL) principle: halting growth of the tree when the encoding is minimized
![Page 37: BITMAPS & Starjoins. Indexing datacubes Objective: speed queries up. Traditional databases (OLTP): B-Trees Time and space logarithmic to the amount of](https://reader036.vdocuments.us/reader036/viewer/2022062515/56649f505503460f94c738e9/html5/thumbnails/37.jpg)
Gini Index (IBM IntelligentMiner)
• If a data set T contains examples from n classes, the gini index, gini(T), is defined as

$$\mathrm{gini}(T) = 1 - \sum_{j=1}^{n} p_j^2$$

where $p_j$ is the relative frequency of class j in T.
• If a data set T of size N is split into two subsets T1 and T2 with sizes N1 and N2 respectively, the gini index of the split data, $\mathrm{gini}_{split}(T)$, is defined as

$$\mathrm{gini}_{split}(T) = \frac{N_1}{N}\,\mathrm{gini}(T_1) + \frac{N_2}{N}\,\mathrm{gini}(T_2)$$

• The attribute that provides the smallest $\mathrm{gini}_{split}(T)$ is chosen to split the node (need to enumerate all possible splitting points for each attribute).
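Both formulas translate directly into code. The sketch below works on plain lists of class labels; the sample labels and the function names are made up for illustration.

```python
from collections import Counter


def gini(labels):
    """gini(T) = 1 - sum of squared relative class frequencies."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def gini_split(left, right):
    """Size-weighted gini index of a binary split of T into T1 and T2."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


# Hypothetical node with 4 cases of class H and 2 of class L.
whole = ["H", "H", "H", "H", "L", "L"]
g_whole = gini(whole)

# One candidate split of that node into two subsets.
g_candidate = gini_split(["H", "H", "H"], ["H", "L", "L"])
```

A pure subset has gini 0, so a split that isolates one class drives the weighted sum down.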
![Page 38: BITMAPS & Starjoins. Indexing datacubes Objective: speed queries up. Traditional databases (OLTP): B-Trees Time and space logarithmic to the amount of](https://reader036.vdocuments.us/reader036/viewer/2022062515/56649f505503460f94c738e9/html5/thumbnails/38.jpg)
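Training set

| Tuple | Age | Car Type | Risk |
|---|---|---|---|
| 0 | 23 | Family | H |
| 1 | 17 | Sports | H |
| 2 | 43 | Sports | H |
| 3 | 68 | Family | L |
| 4 | 32 | Truck | L |
| 5 | 20 | Family | H |

Attribute lists

Age:

| Age | Risk | Tuple |
|---|---|---|
| 17 | H | 1 |
| 20 | H | 5 |
| 23 | H | 0 |
| 32 | L | 4 |
| 43 | H | 2 |
| 68 | L | 3 |

Car:

| Car Type | Risk | Tuple |
|---|---|---|
| Family | H | 0 |
| Sports | H | 1 |
| Sports | H | 2 |
| Family | L | 3 |
| Truck | L | 4 |
| Family | H | 5 |

Problem: What is the best way to determine risk? Is it Age or Car?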
![Page 39: BITMAPS & Starjoins. Indexing datacubes Objective: speed queries up. Traditional databases (OLTP): B-Trees Time and space logarithmic to the amount of](https://reader036.vdocuments.us/reader036/viewer/2022062515/56649f505503460f94c738e9/html5/thumbnails/39.jpg)
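Splits

Attribute lists before the split:

| Age | Risk | Tuple |
|---|---|---|
| 17 | H | 1 |
| 20 | H | 5 |
| 23 | H | 0 |
| 32 | L | 4 |
| 43 | H | 2 |
| 68 | L | 3 |

| Car Type | Risk | Tuple |
|---|---|---|
| Family | H | 0 |
| Sports | H | 1 |
| Sports | H | 2 |
| Family | L | 3 |
| Truck | L | 4 |
| Family | H | 5 |

Split: Age < 27.5

Group1 (Age < 27.5):

| Age | Risk | Tuple |
|---|---|---|
| 17 | H | 1 |
| 20 | H | 5 |
| 23 | H | 0 |

| Car Type | Risk | Tuple |
|---|---|---|
| Family | H | 0 |
| Sports | H | 1 |
| Family | H | 5 |

Group2 (Age >= 27.5):

| Age | Risk | Tuple |
|---|---|---|
| 32 | L | 4 |
| 43 | H | 2 |
| 68 | L | 3 |

| Car Type | Risk | Tuple |
|---|---|---|
| Sports | H | 2 |
| Family | L | 3 |
| Truck | L | 4 |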
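The partition into Group1 and Group2 can be sketched as follows: the Age list decides which tuple ids go to Group1, and those ids are then used to split the Car Type list. Tuple ids follow the training set (Age 43 is tuple 2, Age 68 is tuple 3); the function name is hypothetical.

```python
# Attribute lists as (value, risk, tuple id), kept in their original order.
age_list = [(17, "H", 1), (20, "H", 5), (23, "H", 0),
            (32, "L", 4), (43, "H", 2), (68, "L", 3)]
car_list = [("Family", "H", 0), ("Sports", "H", 1), ("Sports", "H", 2),
            ("Family", "L", 3), ("Truck", "L", 4), ("Family", "H", 5)]


def split_lists(age_list, car_list, threshold):
    """Split both attribute lists on the test Age < threshold."""
    # The splitting attribute determines the tuple ids of Group1.
    group1_ids = {tid for age, _risk, tid in age_list if age < threshold}

    def partition(entries):
        g1 = [e for e in entries if e[2] in group1_ids]
        g2 = [e for e in entries if e[2] not in group1_ids]
        return g1, g2

    return partition(age_list), partition(car_list)


(age1, age2), (car1, car2) = split_lists(age_list, car_list, 27.5)
```

Each list keeps its original order within a group, so the Age list stays sorted and no re-sorting is needed after the split.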
![Page 40: BITMAPS & Starjoins. Indexing datacubes Objective: speed queries up. Traditional databases (OLTP): B-Trees Time and space logarithmic to the amount of](https://reader036.vdocuments.us/reader036/viewer/2022062515/56649f505503460f94c738e9/html5/thumbnails/40.jpg)
Histograms
For continuous attributes
Associated with node: ($C_{above}$, $C_{below}$)
$C_{above}$: to process; $C_{below}$: already processed
![Page 41: BITMAPS & Starjoins. Indexing datacubes Objective: speed queries up. Traditional databases (OLTP): B-Trees Time and space logarithmic to the amount of](https://reader036.vdocuments.us/reader036/viewer/2022062515/56649f505503460f94c738e9/html5/thumbnails/41.jpg)
![Page 42: BITMAPS & Starjoins. Indexing datacubes Objective: speed queries up. Traditional databases (OLTP): B-Trees Time and space logarithmic to the amount of](https://reader036.vdocuments.us/reader036/viewer/2022062515/56649f505503460f94c738e9/html5/thumbnails/42.jpg)
![Page 43: BITMAPS & Starjoins. Indexing datacubes Objective: speed queries up. Traditional databases (OLTP): B-Trees Time and space logarithmic to the amount of](https://reader036.vdocuments.us/reader036/viewer/2022062515/56649f505503460f94c738e9/html5/thumbnails/43.jpg)
![Page 44: BITMAPS & Starjoins. Indexing datacubes Objective: speed queries up. Traditional databases (OLTP): B-Trees Time and space logarithmic to the amount of](https://reader036.vdocuments.us/reader036/viewer/2022062515/56649f505503460f94c738e9/html5/thumbnails/44.jpg)
![Page 45: BITMAPS & Starjoins. Indexing datacubes Objective: speed queries up. Traditional databases (OLTP): B-Trees Time and space logarithmic to the amount of](https://reader036.vdocuments.us/reader036/viewer/2022062515/56649f505503460f94c738e9/html5/thumbnails/45.jpg)
![Page 46: BITMAPS & Starjoins. Indexing datacubes Objective: speed queries up. Traditional databases (OLTP): B-Trees Time and space logarithmic to the amount of](https://reader036.vdocuments.us/reader036/viewer/2022062515/56649f505503460f94c738e9/html5/thumbnails/46.jpg)
![Page 47: BITMAPS & Starjoins. Indexing datacubes Objective: speed queries up. Traditional databases (OLTP): B-Trees Time and space logarithmic to the amount of](https://reader036.vdocuments.us/reader036/viewer/2022062515/56649f505503460f94c738e9/html5/thumbnails/47.jpg)
![Page 48: BITMAPS & Starjoins. Indexing datacubes Objective: speed queries up. Traditional databases (OLTP): B-Trees Time and space logarithmic to the amount of](https://reader036.vdocuments.us/reader036/viewer/2022062515/56649f505503460f94c738e9/html5/thumbnails/48.jpg)
![Page 49: BITMAPS & Starjoins. Indexing datacubes Objective: speed queries up. Traditional databases (OLTP): B-Trees Time and space logarithmic to the amount of](https://reader036.vdocuments.us/reader036/viewer/2022062515/56649f505503460f94c738e9/html5/thumbnails/49.jpg)
![Page 50: BITMAPS & Starjoins. Indexing datacubes Objective: speed queries up. Traditional databases (OLTP): B-Trees Time and space logarithmic to the amount of](https://reader036.vdocuments.us/reader036/viewer/2022062515/56649f505503460f94c738e9/html5/thumbnails/50.jpg)
![Page 51: BITMAPS & Starjoins. Indexing datacubes Objective: speed queries up. Traditional databases (OLTP): B-Trees Time and space logarithmic to the amount of](https://reader036.vdocuments.us/reader036/viewer/2022062515/56649f505503460f94c738e9/html5/thumbnails/51.jpg)
ANSWER
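The winner is Age <= 18.5

Attribute lists at the root:

| Age | Risk | Tuple |
|---|---|---|
| 17 | H | 1 |
| 20 | H | 5 |
| 23 | H | 0 |
| 32 | L | 4 |
| 43 | H | 2 |
| 68 | L | 3 |

| Car Type | Risk | Tuple |
|---|---|---|
| Family | H | 0 |
| Sports | H | 1 |
| Sports | H | 2 |
| Family | L | 3 |
| Truck | L | 4 |
| Family | H | 5 |

Y branch (Age <= 18.5): leaf H

N branch (Age > 18.5), remaining attribute lists:

| Age | Risk | Tuple |
|---|---|---|
| 20 | H | 5 |
| 23 | H | 0 |
| 32 | L | 4 |
| 43 | H | 2 |
| 68 | L | 3 |

| Car Type | Risk | Tuple |
|---|---|---|
| Family | H | 0 |
| Sports | H | 2 |
| Family | L | 3 |
| Truck | L | 4 |
| Family | H | 5 |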
![Page 52: BITMAPS & Starjoins. Indexing datacubes Objective: speed queries up. Traditional databases (OLTP): B-Trees Time and space logarithmic to the amount of](https://reader036.vdocuments.us/reader036/viewer/2022062515/56649f505503460f94c738e9/html5/thumbnails/52.jpg)
Summary
• Classification is an extensively studied problem (mainly in statistics, machine learning & neural networks)
• Classification is probably one of the most widely used data mining techniques with a lot of extensions
• Scalability is still an important issue for database applications: thus combining classification with database techniques should be a promising topic
• Research directions: classification of non-relational data, e.g., text, spatial, multimedia, etc.
![Page 53: BITMAPS & Starjoins. Indexing datacubes Objective: speed queries up. Traditional databases (OLTP): B-Trees Time and space logarithmic to the amount of](https://reader036.vdocuments.us/reader036/viewer/2022062515/56649f505503460f94c738e9/html5/thumbnails/53.jpg)
Association rules: Apriori paper, “student plays basketball” example
![Page 54: BITMAPS & Starjoins. Indexing datacubes Objective: speed queries up. Traditional databases (OLTP): B-Trees Time and space logarithmic to the amount of](https://reader036.vdocuments.us/reader036/viewer/2022062515/56649f505503460f94c738e9/html5/thumbnails/54.jpg)
Chapter 6: Mining Association Rules in Large Databases
• Association rule mining
• Mining single-dimensional Boolean association rules from transactional databases
• Mining multilevel association rules from transactional databases
• Mining multidimensional association rules from transactional databases and data warehouse
• From association mining to correlation analysis
• Summary
![Page 55: BITMAPS & Starjoins. Indexing datacubes Objective: speed queries up. Traditional databases (OLTP): B-Trees Time and space logarithmic to the amount of](https://reader036.vdocuments.us/reader036/viewer/2022062515/56649f505503460f94c738e9/html5/thumbnails/55.jpg)
Association Rules
• Market basket data: your “supermarket” basket contains {bread, milk, beer, diapers…}
• Find rules that correlate the presence of one set of items X with another Y.
– Ex: X = diapers, Y = beer, X ⇒ Y with confidence 98%
– Maybe constrained: e.g., consider only female customers.
![Page 56: BITMAPS & Starjoins. Indexing datacubes Objective: speed queries up. Traditional databases (OLTP): B-Trees Time and space logarithmic to the amount of](https://reader036.vdocuments.us/reader036/viewer/2022062515/56649f505503460f94c738e9/html5/thumbnails/56.jpg)
Applications
• Market basket analysis: tell me how I can improve my sales by attaching promotions to “best seller” itemsets.
• Marketing: “people who bought this book also bought…”
• Fraud detection: a claim for immunizations always comes with a claim for a doctor’s visit on the same day.
• Shelf planning: given the “best sellers,” how do I organize my shelves?
![Page 57: BITMAPS & Starjoins. Indexing datacubes Objective: speed queries up. Traditional databases (OLTP): B-Trees Time and space logarithmic to the amount of](https://reader036.vdocuments.us/reader036/viewer/2022062515/56649f505503460f94c738e9/html5/thumbnails/57.jpg)
Association Rule: Basic Concepts
• Given: (1) database of transactions, (2) each transaction is a list of items (purchased by a customer in a visit)
• Find: all rules that correlate the presence of one set of items with that of another set of items
– E.g., 98% of people who purchase tires and auto accessories also get automotive services done
![Page 58: BITMAPS & Starjoins. Indexing datacubes Objective: speed queries up. Traditional databases (OLTP): B-Trees Time and space logarithmic to the amount of](https://reader036.vdocuments.us/reader036/viewer/2022062515/56649f505503460f94c738e9/html5/thumbnails/58.jpg)
Association Rule Mining: A Road Map
• Boolean vs. quantitative associations (based on the types of values handled)
– buys(x, “SQLServer”) ^ buys(x, “DMBook”) ⇒ buys(x, “DBMiner”) [0.2%, 60%]
– age(x, “30..39”) ^ income(x, “42..48K”) ⇒ buys(x, “PC”) [1%, 75%]
• Single dimension vs. multiple dimensional associations (see examples above)
![Page 59: BITMAPS & Starjoins. Indexing datacubes Objective: speed queries up. Traditional databases (OLTP): B-Trees Time and space logarithmic to the amount of](https://reader036.vdocuments.us/reader036/viewer/2022062515/56649f505503460f94c738e9/html5/thumbnails/59.jpg)
Road-map (continuation)
• Single level vs. multiple-level analysis
– What brands of beers are associated with what brands of diapers?
• Various extensions
– Correlation, causality analysis
  • Association does not necessarily imply correlation or causality
  • Causality: Does Beer ⇒ Diapers or Diapers ⇒ Beer? (I.e., did the customer buy the diapers because he bought the beer, or was it the other way around?)
  • Correlation: 90% buy coffee, 25% buy tea, 20% buy both; the support of both (0.20) is less than the expected support = 0.9*0.25 = 0.225
– Maxpatterns and closed itemsets
– Constraints enforced
  • E.g., small sales (sum < 100) trigger big buys (sum > 1,000)?
![Page 60: BITMAPS & Starjoins. Indexing datacubes Objective: speed queries up. Traditional databases (OLTP): B-Trees Time and space logarithmic to the amount of](https://reader036.vdocuments.us/reader036/viewer/2022062515/56649f505503460f94c738e9/html5/thumbnails/60.jpg)
![Page 61: BITMAPS & Starjoins. Indexing datacubes Objective: speed queries up. Traditional databases (OLTP): B-Trees Time and space logarithmic to the amount of](https://reader036.vdocuments.us/reader036/viewer/2022062515/56649f505503460f94c738e9/html5/thumbnails/61.jpg)
Mining Association Rules—An Example
Min. support 50%, min. confidence 50%

| Transaction ID | Items Bought |
|---|---|
| 2000 | A, B, C |
| 1000 | A, C |
| 4000 | A, D |
| 5000 | B, E, F |

| Frequent Itemset | Support |
|---|---|
| {A} | 75% |
| {B} | 50% |
| {C} | 50% |
| {A, C} | 50% |

For rule A => C:
support = support({A, C}) = 50%
confidence = support({A, C}) / support({A}) = 66.7%

The Apriori principle: any subset of a frequent itemset must be frequent.
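The support and confidence figures for this example can be checked with a few lines of Python. This is only a minimal sketch: the function names `support` and `confidence` are illustrative choices, not part of the slides.

```python
# Transactions from the example (transaction ID -> items bought).
transactions = {
    2000: {"A", "B", "C"},
    1000: {"A", "C"},
    4000: {"A", "D"},
    5000: {"B", "E", "F"},
}

def support(itemset, db):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for items in db.values() if itemset <= items)
    return hits / len(db)

def confidence(lhs, rhs, db):
    """support(lhs and rhs together) / support(lhs) for the rule lhs => rhs."""
    return support(set(lhs) | set(rhs), db) / support(lhs, db)

print(f"support(A, C)    = {support({'A', 'C'}, transactions):.1%}")
print(f"confidence(A=>C) = {confidence({'A'}, {'C'}, transactions):.1%}")
```

Both values are computed by a full pass over the transactions, which is fine for four rows but is exactly the cost that the algorithms on the following slides try to limit.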
![Page 62: BITMAPS & Starjoins. Indexing datacubes Objective: speed queries up. Traditional databases (OLTP): B-Trees Time and space logarithmic to the amount of](https://reader036.vdocuments.us/reader036/viewer/2022062515/56649f505503460f94c738e9/html5/thumbnails/62.jpg)
Mining Frequent Itemsets: the Key Step
• Find the frequent itemsets: the sets of items that have minimum support
– A subset of a frequent itemset must also be a frequent itemset
  • i.e., if {A, B} is a frequent itemset, both {A} and {B} must be frequent itemsets
– Iteratively find frequent itemsets with cardinality from 1 to k (k-itemsets)
• Use the frequent itemsets to generate association rules.
![Page 63: BITMAPS & Starjoins. Indexing datacubes Objective: speed queries up. Traditional databases (OLTP): B-Trees Time and space logarithmic to the amount of](https://reader036.vdocuments.us/reader036/viewer/2022062515/56649f505503460f94c738e9/html5/thumbnails/63.jpg)
Problem decomposition
Two phases:
• Generate all itemsets whose support is above a threshold. Call them large (or hot) itemsets. (Any other itemset is small.)
How? Generate all combinations? (Exponential! HARD.)
• For a given large itemset Y = I1 I2 … Ik, k >= 2:
Generate (at most k) rules X => Ij, where X = Y - {Ij}
confidence = support(Y) / support(X)
So, set a confidence threshold c and keep only the rules that reach it. (EASY.)
![Page 64: BITMAPS & Starjoins. Indexing datacubes Objective: speed queries up. Traditional databases (OLTP): B-Trees Time and space logarithmic to the amount of](https://reader036.vdocuments.us/reader036/viewer/2022062515/56649f505503460f94c738e9/html5/thumbnails/64.jpg)
Examples
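| Tid | Items |
|---|---|
| 1 | {a, b, c} |
| 2 | {a, b, d} |
| 3 | {a, c} |
| 4 | {b, e, f} |

Minimum support 50%: the large itemsets with two items are {a, b} and {a, c}.
Rules:
a => b with support 50% and confidence 66.7%
a => c with support 50% and confidence 66.7%
c => a with support 50% and confidence 100%
b => a with support 50% and confidence 66.7% (b occurs in transactions 1, 2 and 4, so support({b}) = 75%)
Assume s = 50% and c = 80%: only c => a is kept.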
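The rule-generation phase can be sketched in Python as follows. The helper names (`support`, `rules_from_itemset`) are hypothetical, and the sketch only produces rules with a single item on the right-hand side (X => Ij with X = Y - {Ij}), as in the decomposition described earlier.

```python
# Toy database from the example (Tid -> items).
db = {
    1: {"a", "b", "c"},
    2: {"a", "b", "d"},
    3: {"a", "c"},
    4: {"b", "e", "f"},
}

def support(itemset, db):
    """Fraction of transactions containing every item of `itemset`."""
    return sum(1 for t in db.values() if set(itemset) <= t) / len(db)

def rules_from_itemset(y, db, min_conf):
    """Yield (X, Ij, support, confidence) for each rule X => Ij, X = Y - {Ij}."""
    y = frozenset(y)
    sup_y = support(y, db)
    for item in sorted(y):
        x = y - {item}
        conf = sup_y / support(x, db)
        if conf >= min_conf:
            yield (tuple(sorted(x)), item, sup_y, conf)

large_itemsets = [{"a", "b"}, {"a", "c"}]  # itemsets with support >= 50% in `db`
kept = [r for y in large_itemsets for r in rules_from_itemset(y, db, min_conf=0.8)]
for x, item, sup, conf in kept:
    print(f"{','.join(x)} => {item}  support={sup:.0%}  confidence={conf:.1%}")
```

Calling `rules_from_itemset` with `min_conf=0.0` lists all four candidate rules with their confidences, which makes it easy to see which ones a given threshold removes.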
![Page 65: BITMAPS & Starjoins. Indexing datacubes Objective: speed queries up. Traditional databases (OLTP): B-Trees Time and space logarithmic to the amount of](https://reader036.vdocuments.us/reader036/viewer/2022062515/56649f505503460f94c738e9/html5/thumbnails/65.jpg)
The Apriori Algorithm
• Join Step: Ck is generated by joining Lk-1 with itself
• Prune Step: Any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset
• Pseudo-code:
Ck: candidate itemsets of size k
Lk: frequent itemsets of size k

L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that are contained in t
    Lk+1 = candidates in Ck+1 with min_support
end
return ∪k Lk;
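The level-wise loop of the pseudo-code can be sketched in Python as below. This is an illustrative sketch, not the original implementation: candidates are generated by taking unions of pairs of frequent k-itemsets (a simple stand-in for the join step), the minimum support is given as an absolute count, and the function name `apriori` and the small test database are assumptions made for the example.

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Return {frozenset(itemset): count} for all itemsets with count >= min_count."""
    transactions = [frozenset(t) for t in transactions]
    # L1: count single items and keep the frequent ones.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    level = {s: c for s, c in counts.items() if c >= min_count}
    frequent = dict(level)
    k = 1
    while level:
        # Join: unions of two frequent k-itemsets that have exactly k+1 items.
        candidates = {a | b for a in level for b in level if len(a | b) == k + 1}
        # Prune: drop candidates that have an infrequent k-subset.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, k))
        }
        # One database scan to count the surviving candidates.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_count}
        frequent.update(level)
        k += 1
    return frequent

# A small four-transaction database over items 1..5, minimum support count 2.
db = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
result = apriori(db, min_count=2)
for itemset, count in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

Each pass of the `while` loop corresponds to one value of k in the pseudo-code and performs exactly one scan of the database.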
![Page 66: BITMAPS & Starjoins. Indexing datacubes Objective: speed queries up. Traditional databases (OLTP): B-Trees Time and space logarithmic to the amount of](https://reader036.vdocuments.us/reader036/viewer/2022062515/56649f505503460f94c738e9/html5/thumbnails/66.jpg)
The Apriori Algorithm — Example
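Database D:

| TID | Items |
|---|---|
| 100 | 1 3 4 |
| 200 | 2 3 5 |
| 300 | 1 2 3 5 |
| 400 | 2 5 |

Scan D to count C1:

| itemset | sup. |
|---|---|
| {1} | 2 |
| {2} | 3 |
| {3} | 3 |
| {4} | 1 |
| {5} | 3 |

L1:

| itemset | sup. |
|---|---|
| {1} | 2 |
| {2} | 3 |
| {3} | 3 |
| {5} | 3 |

C2 (generated from L1): {1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}

Scan D to count C2:

| itemset | sup |
|---|---|
| {1 2} | 1 |
| {1 3} | 2 |
| {1 5} | 1 |
| {2 3} | 2 |
| {2 5} | 3 |
| {3 5} | 2 |

L2:

| itemset | sup |
|---|---|
| {1 3} | 2 |
| {2 3} | 2 |
| {2 5} | 3 |
| {3 5} | 2 |

C3 (generated from L2): {2 3 5}

Scan D to get L3:

| itemset | sup |
|---|---|
| {2 3 5} | 2 |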
![Page 67: BITMAPS & Starjoins. Indexing datacubes Objective: speed queries up. Traditional databases (OLTP): B-Trees Time and space logarithmic to the amount of](https://reader036.vdocuments.us/reader036/viewer/2022062515/56649f505503460f94c738e9/html5/thumbnails/67.jpg)
How to Generate Candidates?
• Suppose the items in Lk-1 are listed in an order
• Step 1: self-joining Lk-1
    insert into Ck
    select p.item1, p.item2, …, p.itemk-1, q.itemk-1
    from Lk-1 p, Lk-1 q
    where p.item1=q.item1, …, p.itemk-2=q.itemk-2, p.itemk-1 < q.itemk-1
• Step 2: pruning
    forall itemsets c in Ck do
        forall (k-1)-subsets s of c do
            if (s is not in Lk-1) then delete c from Ck
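The two steps above can be sketched in Python, assuming each itemset is stored as a sorted tuple so that "the first k-2 items" is simply a tuple prefix. The function name `apriori_gen` and the sample input are illustrative assumptions.

```python
from itertools import combinations

def apriori_gen(prev_level):
    """Build Ck from L(k-1); every itemset is a sorted tuple of items."""
    prev = sorted(prev_level)
    prev_set = set(prev)
    k_minus_1 = len(prev[0]) if prev else 0
    # Step 1: self-join on the first k-2 items, with p's last item < q's last item.
    joined = [
        p + (q[-1],)
        for p in prev
        for q in prev
        if p[:-1] == q[:-1] and p[-1] < q[-1]
    ]
    # Step 2: prune candidates that contain a (k-1)-subset missing from L(k-1).
    return [
        c for c in joined
        if all(s in prev_set for s in combinations(c, k_minus_1))
    ]

# Sample input: four frequent 2-itemsets.
l2 = [(1, 3), (2, 3), (2, 5), (3, 5)]
print(apriori_gen(l2))
```

Because the join only pairs itemsets that share their first k-2 items, each candidate is produced once, and far fewer candidates reach the pruning step than with an unrestricted pairwise union.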
![Page 68: BITMAPS & Starjoins. Indexing datacubes Objective: speed queries up. Traditional databases (OLTP): B-Trees Time and space logarithmic to the amount of](https://reader036.vdocuments.us/reader036/viewer/2022062515/56649f505503460f94c738e9/html5/thumbnails/68.jpg)
Candidate generation (example)
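C2:

| itemset | sup |
|---|---|
| {1 2} | 1 |
| {1 3} | 2 |
| {1 5} | 1 |
| {2 3} | 2 |
| {2 5} | 3 |
| {3 5} | 2 |

L2:

| itemset | sup |
|---|---|
| {1 3} | 2 |
| {2 3} | 2 |
| {2 5} | 3 |
| {3 5} | 2 |

L2 joined with L2: {1 2 3}, {1 3 5}, {2 3 5}

C3 (after pruning): {2 3 5}

{1 2 3} and {1 3 5} are pruned since {1 2} and {1 5} do not have enough support.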
![Page 69: BITMAPS & Starjoins. Indexing datacubes Objective: speed queries up. Traditional databases (OLTP): B-Trees Time and space logarithmic to the amount of](https://reader036.vdocuments.us/reader036/viewer/2022062515/56649f505503460f94c738e9/html5/thumbnails/69.jpg)
Is Apriori Fast Enough? — Performance Bottlenecks
• The core of the Apriori algorithm:
– Use frequent (k - 1)-itemsets to generate candidate frequent k-itemsets
– Use database scan and pattern matching to collect counts for the candidate itemsets
• The bottleneck of Apriori: candidate generation
– Huge candidate sets:
  • 10^4 frequent 1-itemsets will generate about 10^7 candidate 2-itemsets
  • To discover a frequent pattern of size 100, e.g., {a1, a2, …, a100}, one needs to generate 2^100 ≈ 10^30 candidates.
– Multiple scans of database:
  • Needs (n + 1) scans, where n is the length of the longest pattern
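The orders of magnitude quoted above can be reproduced with exact integer arithmetic. A small sketch, assuming the 2-itemset candidates are all unordered pairs of frequent items and the size-100 case counts all non-empty subsets of the pattern:

```python
import math

n_items = 10**4
pairs = math.comb(n_items, 2)   # candidate 2-itemsets from 10^4 frequent items
subsets = 2**100 - 1            # non-empty subsets of a 100-item pattern

print(f"{pairs:,} candidate 2-itemsets (about {pairs:.1e})")
print(f"about {subsets:.3e} sub-patterns of a 100-item pattern")
```

The pair count is just under 5 * 10^7, so "10^7" on the slide is an order-of-magnitude figure, and 2^100 is roughly 1.27 * 10^30.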
![Page 70: BITMAPS & Starjoins. Indexing datacubes Objective: speed queries up. Traditional databases (OLTP): B-Trees Time and space logarithmic to the amount of](https://reader036.vdocuments.us/reader036/viewer/2022062515/56649f505503460f94c738e9/html5/thumbnails/70.jpg)
Mining Frequent Patterns Without Candidate Generation
• Compress a large database into a compact, Frequent-Pattern tree (FP-tree) structure
– highly condensed, but complete for frequent pattern mining
– avoid costly database scans
• Develop an efficient, FP-tree-based frequent pattern mining method
– A divide-and-conquer methodology: decompose mining tasks into smaller ones
– Avoid candidate generation: sub-database test only!
![Page 71: BITMAPS & Starjoins. Indexing datacubes Objective: speed queries up. Traditional databases (OLTP): B-Trees Time and space logarithmic to the amount of](https://reader036.vdocuments.us/reader036/viewer/2022062515/56649f505503460f94c738e9/html5/thumbnails/71.jpg)
Construct FP-tree from a Transaction DB
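min_support = 0.5

| TID | Items bought | (Ordered) frequent items |
|---|---|---|
| 100 | {f, a, c, d, g, i, m, p} | {f, c, a, m, p} |
| 200 | {a, b, c, f, l, m, o} | {f, c, a, b, m} |
| 300 | {b, f, h, j, o} | {f, b} |
| 400 | {b, c, k, s, p} | {c, b, p} |
| 500 | {a, f, c, e, l, p, m, n} | {f, c, a, m, p} |

Steps:
1. Scan DB once, find frequent 1-itemsets (single item patterns)
2. Order frequent items in frequency descending order
3. Scan DB again, construct FP-tree

Header Table (the head column of each entry points to the first tree node for that item):

| Item | frequency |
|---|---|
| f | 4 |
| c | 4 |
| a | 3 |
| b | 3 |
| m | 3 |
| p | 3 |

FP-tree (each node is item:count; indentation shows parent and child):

{}
– f:4
  – c:3
    – a:3
      – m:2
        – p:2
      – b:1
        – m:1
  – b:1
– c:1
  – b:1
    – p:1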
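The three construction steps can be sketched in Python as follows. This is a simplified illustration: it builds only the prefix tree with counts and omits the header-table node links. The names `Node`, `build_fp_tree`, and `paths` are hypothetical, and the `tie_order` argument is an assumption added here so that items with equal frequency are ordered f, c, a, b, m, p as in the header table (the order among ties is otherwise arbitrary).

```python
from collections import Counter

class Node:
    def __init__(self, item):
        self.item = item
        self.count = 0
        self.children = {}

def build_fp_tree(transactions, min_count, tie_order=""):
    """Two scans: count items, then insert each transaction's ordered frequent items."""
    # Scan 1: item frequencies.
    freq = Counter(item for t in transactions for item in t)
    frequent = {i for i, c in freq.items() if c >= min_count}

    def rank(item):
        # Descending frequency; ties follow `tie_order`, then alphabetical order.
        pos = tie_order.index(item) if item in tie_order else len(tie_order)
        return (-freq[item], pos, item)

    # Scan 2: insert the ordered frequent items of every transaction.
    root = Node(None)
    for t in transactions:
        node = root
        for item in sorted(frequent & set(t), key=rank):
            node = node.children.setdefault(item, Node(item))
            node.count += 1
    return root, freq

def paths(node, prefix=()):
    """List every root-to-node path with the count stored at its last node."""
    out = []
    for child in node.children.values():
        here = prefix + (child.item,)
        out.append((here, child.count))
        out.extend(paths(child, here))
    return out

# The five transactions of the example, one character per item.
db = ["facdgimp", "abcflmo", "bfhjo", "bcksp", "afcelpmn"]
# min_support = 0.5 of 5 transactions, i.e. at least 3 occurrences.
root, freq = build_fp_tree(db, min_count=3, tie_order="fcabmp")
for path, count in paths(root):
    print(" > ".join(path), count)
```

Transactions that share a prefix of ordered frequent items share tree nodes, which is where the compression comes from: the five transactions here are stored in 11 nodes.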
![Page 73: BITMAPS & Starjoins. Indexing datacubes Objective: speed queries up. Traditional databases (OLTP): B-Trees Time and space logarithmic to the amount of](https://reader036.vdocuments.us/reader036/viewer/2022062515/56649f505503460f94c738e9/html5/thumbnails/73.jpg)
Chapter 6: Mining Association Rules in Large Databases
• Association rule mining
• Mining single-dimensional Boolean association rules from transactional databases
• Mining multilevel association rules from transactional databases
• Mining multidimensional association rules from transactional databases and data warehouse
• From association mining to correlation analysis
• Constraint-based association mining
• Summary
![Page 75: BITMAPS & Starjoins. Indexing datacubes Objective: speed queries up. Traditional databases (OLTP): B-Trees Time and space logarithmic to the amount of](https://reader036.vdocuments.us/reader036/viewer/2022062515/56649f505503460f94c738e9/html5/thumbnails/75.jpg)
Interestingness Measurements
• Objective measures
Two popular measurements: support and confidence
• Subjective measures (Silberschatz & Tuzhilin, KDD95)
A rule (pattern) is interesting if
– it is unexpected (surprising to the user); and/or
– actionable (the user can do something with it)
![Page 76: BITMAPS & Starjoins. Indexing datacubes Objective: speed queries up. Traditional databases (OLTP): B-Trees Time and space logarithmic to the amount of](https://reader036.vdocuments.us/reader036/viewer/2022062515/56649f505503460f94c738e9/html5/thumbnails/76.jpg)
Criticism to Support and Confidence
• Example 1: (Aggarwal & Yu, PODS98)
– Among 5000 students
  • 3000 play basketball
  • 3750 eat cereal
  • 2000 both play basketball and eat cereal
– play basketball => eat cereal [40%, 66.7%] is misleading because the overall percentage of students eating cereal is 75%, which is higher than 66.7%.
– play basketball => not eat cereal [20%, 33.3%] is far more accurate, although with lower support and confidence

| | basketball | not basketball | sum(row) |
|---|---|---|---|
| cereal | 2000 | 1750 | 3750 |
| not cereal | 1000 | 250 | 1250 |
| sum(col.) | 3000 | 2000 | 5000 |
![Page 77: BITMAPS & Starjoins. Indexing datacubes Objective: speed queries up. Traditional databases (OLTP): B-Trees Time and space logarithmic to the amount of](https://reader036.vdocuments.us/reader036/viewer/2022062515/56649f505503460f94c738e9/html5/thumbnails/77.jpg)
Criticism to Support and Confidence (Cont.)
• We need a measure of dependent or correlated events:

$$\mathrm{corr}_{A,B} = \frac{P(AB)}{P(A)\,P(B)} = \frac{P(B \mid A)}{P(B)}$$

• If Corr < 1, A is negatively correlated with B (discourages B)
• If Corr > 1, A and B are positively correlated
• P(AB) = P(A)P(B) if the itemsets are independent (Corr = 1)
• P(B|A)/P(B) is also called the lift of rule A => B (we want a lift greater than 1!)
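As a quick check, the correlation (lift) measure can be applied to the basketball and cereal counts of Example 1. A minimal sketch; the function name `lift` and its count-based signature are illustrative choices:

```python
def lift(n_both, n_a, n_b, n_total):
    """corr(A,B) = P(AB) / (P(A) P(B)), computed from raw counts."""
    p_ab = n_both / n_total
    return p_ab / ((n_a / n_total) * (n_b / n_total))

# Counts from the basketball/cereal example (5000 students).
print(round(lift(2000, 3000, 3750, 5000), 3))   # basketball => cereal
print(round(lift(1000, 3000, 1250, 5000), 3))   # basketball => not cereal
```

The first value is below 1 (about 0.889) and the second is above 1 (about 1.333), which matches the criticism: playing basketball discourages eating cereal even though the rule basketball => cereal has high support and confidence.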
![Page 79: BITMAPS & Starjoins. Indexing datacubes Objective: speed queries up. Traditional databases (OLTP): B-Trees Time and space logarithmic to the amount of](https://reader036.vdocuments.us/reader036/viewer/2022062515/56649f505503460f94c738e9/html5/thumbnails/79.jpg)
Why Is the Big Pie Still There?
• More on constraint-based mining of associations
– Boolean vs. quantitative associations
  • Association on discrete vs. continuous data
– From association to correlation and causal structure analysis
  • Association does not necessarily imply correlation or causal relationships
– From intra-transaction association to inter-transaction associations
  • E.g., break the barriers of transactions (Lu, et al. TOIS’99).
– From association analysis to classification and clustering analysis
  • E.g., clustering association rules
![Page 80: BITMAPS & Starjoins. Indexing datacubes Objective: speed queries up. Traditional databases (OLTP): B-Trees Time and space logarithmic to the amount of](https://reader036.vdocuments.us/reader036/viewer/2022062515/56649f505503460f94c738e9/html5/thumbnails/80.jpg)
Summary
• Association rule mining
– probably the most significant contribution from the database community in KDD
– A large number of papers have been published
• Many interesting issues have been explored
• An interesting research direction
– Association analysis in other types of data: spatial data, multimedia data, time series data, etc.
![Page 81: BITMAPS & Starjoins. Indexing datacubes Objective: speed queries up. Traditional databases (OLTP): B-Trees Time and space logarithmic to the amount of](https://reader036.vdocuments.us/reader036/viewer/2022062515/56649f505503460f94c738e9/html5/thumbnails/81.jpg)
Some products and free software available for association rule mining
• Business Miner http://www.businessobjects.com
• Clementine http://www.isl.co.uk/clem.html
• Darwin http://www.oracle.com/ip/analyze/warehouse/datamining/
• Data Surveyor http://www.ddi.nl/
• DBMiner http://db.cs.sfu.ca/DBMiner
• Delta Miner http://www.bissantz.de
• Decision Series http://www.neovista.com
• IDIS http://wwwdatamining.com
• Intelligent Miner http://www.software.ibm.com/data/intelli-mine
• MineSet http://www.sgi.com/software/mineset/
• MLC++ http://www.sgi.com/Technology/mlc/
• MSBN http://www.research.microsoft.com/research./dtg/msbn
• SuperQuery http://www.azmy.com
• Weka http://www.cs.waikato.ac.nz/ml/weka
• Apriori: http://fuzzy.cs.uni-magdeburg.de/~borgelt/apriori/apriori.html
![Page 82: BITMAPS & Starjoins. Indexing datacubes Objective: speed queries up. Traditional databases (OLTP): B-Trees Time and space logarithmic to the amount of](https://reader036.vdocuments.us/reader036/viewer/2022062515/56649f505503460f94c738e9/html5/thumbnails/82.jpg)
K-means clustering
![Page 83: BITMAPS & Starjoins. Indexing datacubes Objective: speed queries up. Traditional databases (OLTP): B-Trees Time and space logarithmic to the amount of](https://reader036.vdocuments.us/reader036/viewer/2022062515/56649f505503460f94c738e9/html5/thumbnails/83.jpg)
BIRCH uses summary information (bonus question)
![Page 84: BITMAPS & Starjoins. Indexing datacubes Objective: speed queries up. Traditional databases (OLTP): B-Trees Time and space logarithmic to the amount of](https://reader036.vdocuments.us/reader036/viewer/2022062515/56649f505503460f94c738e9/html5/thumbnails/84.jpg)
STUDY QUESTIONS
Some sample questions on the data mining part. You may practice by yourself; there is no need to hand them in.
1. Given the transaction table:

| TID | List of items |
|---|---|
| T1 | 1, 2, 5 |
| T2 | 2, 4 |
| T3 | 2, 3 |
| T4 | 1, 2, 4 |
| T5 | 1, 3 |
| T6 | 2, 3 |
| T7 | 1, 3 |
| T8 | 1, 2, 3, 5 |
| T9 | 1, 2, 3 |

1) If min_sup = 2/9, apply the Apriori algorithm to get all the frequent itemsets; show the steps.
2) If min_con = 50%, show all the association rules generated from L3 (the large itemsets that contain 3 items).
![Page 85: BITMAPS & Starjoins. Indexing datacubes Objective: speed queries up. Traditional databases (OLTP): B-Trees Time and space logarithmic to the amount of](https://reader036.vdocuments.us/reader036/viewer/2022062515/56649f505503460f94c738e9/html5/thumbnails/85.jpg)
STUDY QUESTIONS
2. Assume we have the following association rules with min_sup = s and min_con = c: A=>B (s1, c1), B=>C (s2, c2), C=>A (s3, c3)
Show the probabilities P(A), P(B), P(C), P(AB), P(BC), P(AC), P(B|A), P(C|B), P(C|A).
Show the conditions under which we can get A=>C.
![Page 86: BITMAPS & Starjoins. Indexing datacubes Objective: speed queries up. Traditional databases (OLTP): B-Trees Time and space logarithmic to the amount of](https://reader036.vdocuments.us/reader036/viewer/2022062515/56649f505503460f94c738e9/html5/thumbnails/86.jpg)
STUDY QUESTIONS
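3. Given the following table:

| Outlook | Temp | Humidity | Windy | Class |
|---|---|---|---|---|
| sunny | 75 | 70 | Y | Play |
| sunny | 80 | 90 | Y | Don't |
| sunny | 85 | 85 | N | Don't |
| sunny | 72 | 95 | N | Don't |
| sunny | 69 | 70 | N | Play |
| overcast | 72 | 90 | Y | Play |
| overcast | 83 | 78 | N | Play |
| overcast | 64 | 65 | Y | Play |
| overcast | 81 | 75 | N | Play |
| rain | 71 | 80 | Y | Don't |
| rain | 65 | 70 | Y | Don't |
| rain | 75 | 80 | Y | Play |
| rain | 68 | 80 | N | Play |
| rain | 70 | 96 | N | Play |

Apply the SPRINT algorithm to build a decision tree. (The splitting measure is the Gini index.)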
![Page 87: BITMAPS & Starjoins. Indexing datacubes Objective: speed queries up. Traditional databases (OLTP): B-Trees Time and space logarithmic to the amount of](https://reader036.vdocuments.us/reader036/viewer/2022062515/56649f505503460f94c738e9/html5/thumbnails/87.jpg)
STUDY QUESTIONS
4. Apply k-means to cluster the following 8 points into 3 clusters. The distance function is Euclidean distance. Assume that initially we assign A1, B1, and C1 as the centers of the three clusters, respectively. The 8 points are: A1(2,10), A2(2,5), A3(8,4), B1(5,8), B2(7,5), B3(6,4), C1(1,2), C2(4,9).
Show:
- the three cluster centers after the first round of execution.
- the final three clusters.
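For checking a hand computation, the k-means iteration described in this question can be sketched in Python. This is a minimal sketch, not a reference solution: the function names (`assign`, `update`, `kmeans`) are illustrative, ties in distance go to the lower-numbered cluster, and the code assumes no cluster ever becomes empty (which holds for these points).

```python
import math

def assign(points, centers):
    """Map each point name to the index of its nearest center (Euclidean distance)."""
    return {
        name: min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
        for name, p in points.items()
    }

def update(points, labels, k):
    """Recompute each center as the mean of the points assigned to it."""
    centers = []
    for i in range(k):
        members = [points[n] for n, lab in labels.items() if lab == i]
        centers.append((sum(x for x, _ in members) / len(members),
                        sum(y for _, y in members) / len(members)))
    return centers

def kmeans(points, centers):
    """Alternate assignment and update until the assignment stops changing."""
    labels, history = None, []
    while True:
        new_labels = assign(points, centers)
        if new_labels == labels:
            return labels, history
        labels = new_labels
        centers = update(points, labels, len(centers))
        history.append(centers)

points = {"A1": (2, 10), "A2": (2, 5), "A3": (8, 4), "B1": (5, 8),
          "B2": (7, 5), "B3": (6, 4), "C1": (1, 2), "C2": (4, 9)}
start = [points["A1"], points["B1"], points["C1"]]
labels, history = kmeans(points, start)
print("centers after round 1:", history[0])
clusters = [sorted(n for n, lab in labels.items() if lab == i) for i in range(3)]
print("final clusters:", clusters)
```

`history` holds the centers after every round, so intermediate rounds can be compared against a manual trace as well.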