Approximate Frequency Counts over Data Streams
Gurmeet Singh Manku, Rajeev Motwani
Stanford University
VLDB 2002
Introduction
Data arrive as a continuous “stream”
Differs from a traditional stored DB
- The sheer volume of a stream over its lifetime is huge
- Queries require timely answers
Frequent itemset mining on offline databases vs data streams
Often, level-wise algorithms are used to mine offline databases
- At least 2 database scans are needed
- Ex: Apriori algorithm
Level-wise algorithms cannot be applied to mine data streams
- Cannot go through the data stream multiple times
Challenges of streaming
Single pass
Limited Memory
Enumeration of itemsets
Purpose
Present algorithms for computing frequency counts that exceed a user-specified threshold
- Simple
- Low memory footprint
- Output is approximate, but the error is guaranteed not to exceed a user-specified error parameter
- Can be deployed for singleton items and can handle variable-sized sets of items
Main contributions of the paper:
- Proposed 2 algorithms to find the frequent items that appear in a data stream of items
- Extended the algorithms to find frequent itemsets
Notations
Some notations:
- Let N denote the current length of the stream
- Let s ∈ (0,1) denote the support threshold
- Let ε ∈ (0,1) denote the error tolerance
- ε << s
Approximation guarantees
All itemsets whose true frequency exceeds sN are reported
No itemset whose true frequency is less than (s - ε)N is output
Estimated frequencies are less than the true frequencies by at most εN
Example
s = 0.1%
As a rule of thumb, ε should be one-tenth or one-twentieth of s; here ε = 0.01%
Property 1: all elements with frequency exceeding 0.1% are output
Property 2: no element with frequency below 0.09% is output
Elements with frequency between 0.09% and 0.1% may or may not be output
Property 3: estimated frequencies are less than the true frequencies by at most 0.01% of N
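To make the three properties concrete, here is a small sketch that turns s and ε into absolute counts for a given stream length. The stream length of 1,000,000 and the helper name `thresholds` are assumptions chosen for illustration; exact fractions are used only to avoid floating-point rounding.

```python
from fractions import Fraction

def thresholds(n, s, eps):
    """Return (s*N, (s - eps)*N, eps*N) as exact values."""
    return s * n, (s - eps) * n, eps * n

s = Fraction(1, 1000)      # support threshold, 0.1%
eps = Fraction(1, 10000)   # error tolerance, 0.01%

# Hypothetical stream length, chosen only for the example.
must_report, never_below, max_undercount = thresholds(1_000_000, s, eps)
print(must_report, never_below, max_undercount)
```

With these values, items seen more than 1000 times must be reported, items seen fewer than 900 times must not be, and any reported count is low by at most 100.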
Problem definition
An algorithm maintains an ε-deficient synopsis if its output satisfies the aforementioned properties
Devise algorithms that support an ε-deficient synopsis using as little main memory as possible
The algorithms for frequent items
Each transaction contains only 1 item
Two algorithms proposed:
- Sticky Sampling Algorithm
- Lossy Counting Algorithm
Features:
- Sampling is used
- Frequencies found are approximate, but the error is guaranteed not to exceed a user-specified tolerance level
- For Lossy Counting, all frequent items are reported
Sticky Sampling Algorithm
Create counters by sampling
[Figure: a stream of elements; sampled elements create counters]
Sticky Sampling Algorithm
User input:
- Support threshold s
- Error tolerance ε
- Probability of failure δ
Counts are kept in a data structure S
Each entry in S is of the form (e, f), where:
- e: item
- f: frequency of e since the entry was inserted in S
Output entries in S where f ≥ (s - ε)N
Sticky Sampling Algorithm
r : sampling rate
Sampling an element with rate r means selecting the element with probability 1/r
Sticky Sampling Algorithm
Initially, S is empty and r = 1.
For each incoming element e:
  if (e exists in S)
    increment corresponding f
  else {
    sample element with rate r
    if (sampled)
      add entry (e,1) to S
    else
      ignore
  }
Sampling rate
Let t = (1/ε) log(s⁻¹ δ⁻¹) (δ = probability of failure)
The first 2t elements are sampled at rate = 1
The next 2t at rate = 2
The next 4t at rate = 4
and so on…
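The schedule above can be sketched as a pair of small functions. This is only an illustration: the function names are hypothetical, the logarithm is assumed to be the natural log, and positions are assumed to be counted from 0.

```python
import math

def window_size(s, eps, delta):
    """t = (1/eps) * log(1 / (s * delta)); natural log is an assumption."""
    return (1 / eps) * math.log(1 / (s * delta))

def sampling_rate(position, t):
    """Rate for the element at 0-based `position`, assuming t > 0.

    Rate 1 covers the first 2t elements, rate 2 the next 2t,
    rate 4 the next 4t, and so on (the rate doubles each time).
    """
    r = 1
    while position >= 2 * r * t:
        r *= 2
    return r

# With a toy t = 10 the rate changes at positions 20, 40, 80, ...
print([sampling_rate(p, 10) for p in (0, 19, 20, 39, 40, 79, 80)])
```

Because each window is as long as everything before it, the rate grows only logarithmically with the stream length.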
Sticky Sampling Algorithm
Whenever the sampling rate r changes:
for each entry (e,f) in S
  repeat {
    toss an unbiased coin
    if (toss is not successful) {
      diminish f by one
      if (f == 0) {
        delete entry from S
        break
      }
    }
  } until toss is successful
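Putting the sampling step, the rate schedule, and the coin-toss adjustment together gives roughly the following. This is a minimal sketch rather than the authors' implementation: the class and method names are hypothetical, the logarithm is assumed to be natural, and an "unsuccessful" toss is modeled as a seeded random draw below 0.5.

```python
import math
import random

class StickySampling:
    def __init__(self, s, eps, delta, seed=None):
        self.s, self.eps = s, eps
        self.t = (1 / eps) * math.log(1 / (s * delta))  # natural log assumed
        self.r = 1           # current sampling rate
        self.n = 0           # elements seen so far (N)
        self.S = {}          # item e -> counter f
        self.rng = random.Random(seed)

    def _adjust_counters(self):
        # For each entry, diminish f once per unsuccessful toss,
        # stopping at the first successful toss or when f reaches 0.
        for e in list(self.S):
            while self.rng.random() < 0.5:   # unsuccessful toss
                self.S[e] -= 1
                if self.S[e] == 0:
                    del self.S[e]
                    break

    def add(self, e):
        # Double the rate at positions 2t, 4t, 8t, ...
        while self.n >= 2 * self.r * self.t:
            self.r *= 2
            self._adjust_counters()
        self.n += 1
        if e in self.S:
            self.S[e] += 1
        elif self.rng.random() < 1 / self.r:  # sample with probability 1/r
            self.S[e] = 1

    def output(self):
        cutoff = (self.s - self.eps) * self.n
        return {e: f for e, f in self.S.items() if f >= cutoff}

ss = StickySampling(s=0.1, eps=0.01, delta=0.1, seed=1)
for e in ["a"] * 300 + ["b"] * 150 + [f"x{i}" for i in range(50)]:
    ss.add(e)
print(ss.output())
```

In this toy run the stream is shorter than 2t, so the rate stays at 1 and the counts are exact; on longer streams the counters can only undercount.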
Lossy Counting
Data stream is conceptually divided into buckets of w = ⌈1/ε⌉ transactions
Buckets are labeled with bucket ids, starting from 1
Current bucket id is bcurrent, whose value is ⌈N/w⌉
fe: true frequency of an element e in the stream seen so far
Each entry in data structure D is of the form (e, f, ∆)
- e: item
- f: frequency of e
- ∆: the maximum possible error in f
Lossy Counting
∆ is the maximum # of times e could have occurred in the first bcurrent – 1 buckets (this value is exactly bcurrent – 1)
Once an entry is inserted into D, its ∆ value is unchanged
Lossy Counting
Initially D is empty
Receive element e
  if (e exists in D)
    increment its frequency (f) by 1
  else
    create a new entry (e, 1, bcurrent – 1)
At a bucket boundary, prune D by the following rule: (e, f, ∆) is deleted if f + ∆ ≤ bcurrent
When the user requests a list of items with threshold s, output those entries in D where f ≥ (s – ε)N
Lossy Counting
function prune(D, b)
  for each entry (e, f, ∆) in D do
    if f + ∆ ≤ b then
      remove the entry from D
    endif
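The insert, prune, and output steps of Lossy Counting can be sketched as a short class. The class and method names are hypothetical, the bucket width is assumed to be ⌈1/ε⌉, and a plain dictionary stands in for D.

```python
import math

class LossyCounting:
    def __init__(self, eps):
        self.eps = eps
        self.w = math.ceil(1 / eps)  # bucket width (assumed ceil(1/eps))
        self.n = 0                   # elements seen so far (N)
        self.D = {}                  # item e -> [f, delta]

    def add(self, e):
        self.n += 1
        b_current = math.ceil(self.n / self.w)
        if e in self.D:
            self.D[e][0] += 1
        else:
            self.D[e] = [1, b_current - 1]
        # At a bucket boundary, delete entries with f + delta <= b_current.
        if self.n % self.w == 0:
            self.D = {k: v for k, v in self.D.items()
                      if v[0] + v[1] > b_current}

    def output(self, s):
        cutoff = (s - self.eps) * self.n
        return {e: f for e, (f, _) in self.D.items() if f >= cutoff}

lc = LossyCounting(eps=0.25)       # bucket width 4, toy value
for e in "abacaada":
    lc.add(e)
print(lc.D)             # {'a': [5, 0]}
print(lc.output(0.5))   # {'a': 5}
```

In the toy stream, `b`, `c`, and `d` each occur once and are pruned at the bucket boundaries, while `a` survives with its exact count.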
Lossy Counting
[Figure: D is empty at the first window; the window's frequency counts are added to D]
At a window boundary, remove entries for which f + ∆ ≤ bcurrent
Lossy Counting
[Figure: the frequency counts of the next window are added to D]
At a window boundary, remove entries for which f + ∆ ≤ bcurrent
Lossy Counting
Lossy Counting guarantees that:
- When deletion occurs, bcurrent ≤ εN
- Whenever an entry (e, f, ∆) is deleted, fe ≤ bcurrent (fe: actual frequency count of e)
- Hence, if entry (e, f, ∆) is deleted, fe ≤ εN
- Finally, f ≤ fe ≤ f + εN
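The chain of guarantees can be spelled out in a few lines. This is a sketch of the reasoning rather than a full proof; it assumes the bucket width is w = ⌈1/ε⌉, that deletions happen only at bucket boundaries, and it writes ∆ as \Delta.

```latex
% At a bucket boundary N = b_{current} \cdot w, and w \ge 1/\varepsilon.
\begin{align*}
\text{at a boundary:}\quad
  & b_{current} = \frac{N}{w} \le \varepsilon N \\
\text{at any time:}\quad
  & \Delta \le b_{current} - 1 < \frac{N}{w} \le \varepsilon N \\
\text{for every entry in } D:\quad
  & f \le f_e \le f + \Delta \le f + \varepsilon N
\end{align*}
```

The middle inequality f_e ≤ f + ∆ holds because the occurrences missed before the entry was inserted number at most ∆.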
Sticky Sampling vs Lossy Counting
Sticky Sampling is non-deterministic, while Lossy Counting is deterministic
Experimental results show that Lossy Counting requires fewer entries than Sticky Sampling
Sticky Sampling vs Lossy Counting
Lossy counting is superior by a large factor
Sticky sampling performs worse because of its tendency to remember every unique element that gets sampled
Lossy counting is good at pruning low frequency elements quickly
The more complex case: finding frequent itemsets
The Lossy Counting algorithm is extended to find frequent itemsets
Each transaction in the data stream contains a set of items
Finding frequent itemsets
[Figure: the input stream]
Finding frequent itemsets
Input: stream of transactions, each transaction is a set of items from I
N: length of the stream
User specifies two parameters: support s, error ε
Challenge:
- handling variable-sized transactions
- avoiding explicit enumeration of all subsets of any transaction
Finding frequent itemsets
Data structure D: set of entries of the form (set, f, ∆)
- set: subset of items
Transactions are divided into buckets
- w = ⌈1/ε⌉: # of transactions in each bucket
- bcurrent: current bucket id
Finding frequent itemsets
Transactions are not processed one by one.
Main memory is filled with as many transactions as possible.
Processing is done on a batch of transactions.
β : # of buckets in main memory in the current batch being processed.
Finding frequent itemsets
D’s operations:
UPDATE_SET updates and deletes entries in D
- For each entry (set, f, ∆), count the occurrences of set in the batch and update the entry
- If the updated entry satisfies f + ∆ ≤ bcurrent, remove it from D
NEW_SET inserts new entries into D
- If a set set has frequency f ≥ β in the batch and set doesn’t occur in D, create a new entry (set, f, bcurrent - β)
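The two operations can be sketched on a plain dictionary as follows. This is deliberately naive and is not how the paper implements them: it enumerates every subset up to a hypothetical `max_size`, the function name is made up, `b_current` is assumed to be the id of the last bucket in the batch, and NEW_SET is assumed to consider only itemsets that had no entry before the batch.

```python
from collections import Counter
from itertools import combinations

def process_batch(D, batch, beta, b_current, max_size=2):
    """Apply UPDATE_SET then NEW_SET in place; D maps itemset -> (f, delta)."""
    # Naive counting of all subsets with at most max_size items.
    counts = Counter()
    for txn in batch:
        items = sorted(set(txn))
        for k in range(1, max_size + 1):
            counts.update(combinations(items, k))

    known = set(D)  # itemsets that had an entry before this batch

    # UPDATE_SET: add batch occurrences, then delete if f + delta <= b_current.
    for itemset in known:
        f, delta = D[itemset]
        f += counts[itemset]
        if f + delta <= b_current:
            del D[itemset]
        else:
            D[itemset] = (f, delta)

    # NEW_SET: insert unseen itemsets with batch frequency >= beta.
    for itemset, f in counts.items():
        if itemset not in known and f >= beta:
            D[itemset] = (f, b_current - beta)
    return D

D = {}
process_batch(D, [[1, 2], [1, 2], [1, 3], [2]], beta=2, b_current=2)
print(D)   # {(1,): (3, 0), (2,): (3, 0), (1, 2): (2, 0)}
```

The example uses a toy bucket width of 2 transactions, so a batch of 4 transactions spans β = 2 buckets.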
Finding frequent itemsets
If fset ≥ εN, then set has an entry in D
If (set, f, ∆) ∈ D, then the true frequency fset satisfies the inequality f ≤ fset ≤ f + ∆
When the user requests a list of itemsets with threshold s, output those entries in D where f ≥ (s - ε)N
β needs to be a large number, because any subset of I that occurs β + 1 times or more contributes an entry to D.
Buffer: repeatedly reads in a batch of buckets of transactions into available main memory
Trie: maintains the data structure D
SetGen: generates subsets of item-id’s along with their frequency counts in the current batch
- Not all possible subsets need to be generated
- If a subset S is not inserted into D after application of both UPDATE_SET and NEW_SET, then no supersets of S should be considered
Three modules
BUFFER: repeatedly reads in a batch of transactions into available main memory
TRIE: maintains the data structure D
SUBSET-GEN: operates on the current batch of transactions
TRIE and SUBSET-GEN together implement UPDATE_SET, NEW_SET
Module 1 - Buffer
Read a batch of transactions
Transactions are laid out one after the other in a big array
A bitmap is used to remember transaction boundaries
After reading in a batch, BUFFER sorts each transaction by its item-id’s
[Figure: Window 1 to Window 6, in main memory]
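The array-plus-bitmap layout can be illustrated with two small helpers. This is a sketch under assumptions: the function names are hypothetical, a list of booleans stands in for the bitmap, a set bit marks the first item of a transaction, and transactions are assumed to be non-empty.

```python
def load_batch(transactions):
    """Lay out sorted transactions back to back; starts[i] marks a boundary."""
    items, starts = [], []
    for txn in transactions:
        ordered = sorted(txn)            # sort each transaction by item-id
        items.extend(ordered)
        starts.extend(i == 0 for i in range(len(ordered)))
    return items, starts

def iter_transactions(items, starts):
    """Recover the individual transactions from the flat array and bitmap."""
    current = []
    for item, is_start in zip(items, starts):
        if is_start and current:
            yield current
            current = []
        current.append(item)
    if current:
        yield current

items, starts = load_batch([[3, 1, 2], [5], [4, 2]])
print(items)                                   # [1, 2, 3, 5, 2, 4]
print(list(iter_transactions(items, starts)))  # [[1, 2, 3], [5], [2, 4]]
```

Keeping one flat array avoids a separate allocation per transaction, which matters when the batch fills most of main memory.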
Module 2 - TRIE
[Figure: a forest with node values 50, 40, 30, 31, 29, 32, 45, 42 and its array form 50 40 30 31 29 45 32 42; sets with frequency counts]
Module 2 – TRIE cont…
Nodes are labeled {item-id, f, ∆, level}
Children of any node are ordered by their item-id’s
Root nodes are also ordered by their item-id’s
A node represents an itemset consisting of item-id’s in that node and all its ancestors
TRIE is maintained as an array of entries of the form {item-id, f, ∆, level} (pre-order of the trees). Equivalent to a lexicographic ordering of the subsets it encodes.
No pointers, levels compactly encode the underlying tree structure.
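To see how the level field alone encodes the tree, the following sketch walks a pre-order array and rebuilds the itemset each entry stands for. The function name and the sample array are made up for illustration, and roots are assumed to sit at level 0.

```python
def itemsets(trie):
    """Yield (itemset, f, delta) for each pre-order entry (item_id, f, delta, level)."""
    path = []
    for item_id, f, delta, level in trie:
        # Keep the ancestors above this level, then append this node.
        path = path[:level] + [item_id]
        yield tuple(path), f, delta

# Hypothetical array: two roots (1 and 2); root 1 has children 2 and 3.
trie = [
    (1, 5, 0, 0),
    (2, 3, 0, 1),
    (3, 2, 1, 2),
    (3, 4, 0, 1),
    (2, 6, 0, 0),
]
decoded = [s for s, _, _ in itemsets(trie)]
print(decoded)   # [(1,), (1, 2), (1, 2, 3), (1, 3), (2,)]
```

The decoded itemsets come out in lexicographic order, which is what lets a sorted stream of subsets be merged against the array in a single pass.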
Module 3 - SetGen
[Figure: SetGen reads BUFFER and produces frequency counts of subsets in lexicographic order]
SetGen uses the following pruning rule: if a subset S does not make its way into TRIE after application of both UPDATE_SET and NEW_SET, then no supersets of S should be considered
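The effect of the pruning rule can be shown with a simple depth-first generator. This is an illustrative sketch, not the SetGen implementation from the paper: the function name is hypothetical, and the callback `qualifies` is a stand-in for "makes its way into TRIE after UPDATE_SET and NEW_SET".

```python
def frequent_subsets(batch, qualifies):
    """Count subsets in lexicographic order, never extending a rejected subset."""
    txns = [frozenset(t) for t in batch]
    items = sorted(set().union(*txns))
    result = {}

    def extend(prefix, start, supporting):
        for i in range(start, len(items)):
            # Transactions that contain the prefix plus items[i].
            sup = [t for t in supporting if items[i] in t]
            subset = prefix + (items[i],)
            if sup and qualifies(subset, len(sup)):
                result[subset] = len(sup)
                extend(subset, i + 1, sup)   # only qualifying subsets grow

    extend((), 0, txns)
    return result

batch = [[1, 2, 3], [1, 2], [1, 3], [2, 3], [1, 2, 3]]
found = frequent_subsets(batch, lambda subset, count: count >= 3)
print(found)
```

Here {1, 2, 3} occurs only twice and is rejected; raising the cutoff to 4 rejects {1, 2} already, so {1, 2, 3} is never even counted.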
Overall Algorithm
[Figure: SUBSET-GEN reads BUFFER; its output is combined with TRIE to produce the new TRIE]
Conclusion
Sticky Sampling and Lossy Counting are 2 approximate algorithms that can find frequent items
Both algorithms produce frequency counts within a user-specified error tolerance level, though Sticky Sampling is non-deterministic
Lossy Counting can be extended to find frequent itemsets
Thank you very much…