Download - Duplicate Detection in Click Streams(2005) SubtitleAhmed Metwally Divyakant Agrawal Amr El Abbadi
![Page 1: Duplicate Detection in Click Streams(2005) SubtitleAhmed Metwally Divyakant Agrawal Amr El Abbadi](https://reader036.vdocuments.us/reader036/viewer/2022062423/568148b5550346895db5cbd6/html5/thumbnails/1.jpg)
Duplicate Detection in Click Streams(2005)SubtitleAhmed Metwally Divyakant Agrawal Amr El Abbadi
Tian Wang
![Page 2: Duplicate Detection in Click Streams(2005) SubtitleAhmed Metwally Divyakant Agrawal Amr El Abbadi](https://reader036.vdocuments.us/reader036/viewer/2022062423/568148b5550346895db5cbd6/html5/thumbnails/2.jpg)
Goal of the paper•Detect Duplicate click.
Approach•Sliding window, Landmark window, Jumping windows.•Bloom Filters Algorithm.
![Page 3: Duplicate Detection in Click Streams(2005) SubtitleAhmed Metwally Divyakant Agrawal Amr El Abbadi](https://reader036.vdocuments.us/reader036/viewer/2022062423/568148b5550346895db5cbd6/html5/thumbnails/3.jpg)
Sliding Windows•Query latest N click in the stream.•Queue Structure
1 2 3 4 5 6
1 2 3 4 5 6
![Page 4: Duplicate Detection in Click Streams(2005) SubtitleAhmed Metwally Divyakant Agrawal Amr El Abbadi](https://reader036.vdocuments.us/reader036/viewer/2022062423/568148b5550346895db5cbd6/html5/thumbnails/4.jpg)
Landmark windows•Handle disjoint portion of data streams.•Landmarks: time or number.•Space saving vs. Cross Windows
8:00 9:00 10:00 11:00 12:00 13:00
1,2,3,4 5,6,7,8 9,10,11 11,12,13 13,14,15 16,17,18
50clicks 50clicks 50clicks 50clicks 50clicks 50clicks
![Page 5: Duplicate Detection in Click Streams(2005) SubtitleAhmed Metwally Divyakant Agrawal Amr El Abbadi](https://reader036.vdocuments.us/reader036/viewer/2022062423/568148b5550346895db5cbd6/html5/thumbnails/5.jpg)
Jumping windows•Compromise between landmark windows and sliding windows.•Maintain n sub-windows.•Populate the latest sub-window and delete the eldest sub-window.
![Page 6: Duplicate Detection in Click Streams(2005) SubtitleAhmed Metwally Divyakant Agrawal Amr El Abbadi](https://reader036.vdocuments.us/reader036/viewer/2022062423/568148b5550346895db5cbd6/html5/thumbnails/6.jpg)
Algorithm for checking the duplicate clicks
Basic approach•Scan the entire window and compare.•Window of size N requires O(N2) comparisons.
Basic approach with index•Reduce the search cost to O(log(N)) per element.•Increase the element insertion cost to O(log(N)).
Goal of new algorithm•Detect all duplicate.•Less space.•Quick processing.•Few false positive error.
![Page 7: Duplicate Detection in Click Streams(2005) SubtitleAhmed Metwally Divyakant Agrawal Amr El Abbadi](https://reader036.vdocuments.us/reader036/viewer/2022062423/568148b5550346895db5cbd6/html5/thumbnails/7.jpg)
Bit Vector Algorithm
•Assume unique click with 32bits.•Keep a bit vector of length 2^32 bits.
0…..0(32) ------- 1….1(32)•takes O(1) steps and space to insert a new element into the bit vector, or to check it for duplication.•Disadvantage: a larger click with 512 bit.
![Page 8: Duplicate Detection in Click Streams(2005) SubtitleAhmed Metwally Divyakant Agrawal Amr El Abbadi](https://reader036.vdocuments.us/reader036/viewer/2022062423/568148b5550346895db5cbd6/html5/thumbnails/8.jpg)
Bit Vector Algorithm(modification)
•Keep partial information p bits, 1 ≤ p ≤ b, b is the length of a vector•The nth bit vector has bits for all the combinations of bits (n−1) . . . (n+p−2). There are b−p+1 bit vectors utilized.
•Bits used:
![Page 9: Duplicate Detection in Click Streams(2005) SubtitleAhmed Metwally Divyakant Agrawal Amr El Abbadi](https://reader036.vdocuments.us/reader036/viewer/2022062423/568148b5550346895db5cbd6/html5/thumbnails/9.jpg)
Potential Problems
Probabilistic analysis
•Any two elements picked at random have a probability of 2^−p that they will collide in any bit vector of length p.•a, and b have the same values in bits (n−1) to (n+p−2), then there is a probability of 1/2 that the values of the bits starting at n to (n+p−1) will be equal.
Goal•Achieve better result.•Facilitate the probabilistic analysis.
![Page 10: Duplicate Detection in Click Streams(2005) SubtitleAhmed Metwally Divyakant Agrawal Amr El Abbadi](https://reader036.vdocuments.us/reader036/viewer/2022062423/568148b5550346895db5cbd6/html5/thumbnails/10.jpg)
Bloom Filter
General idea•Two set X Y, check every element belong to Y.•O(|X|) operations, and O(|Y|) space
Data Structure
•Array of M cells.•Element y using d independent hash function to addresses y1, y2, . . . , yd, which are set to 1, such that 0 ≤ yi ≤ M − 1, i.∀•X use same hash manner.•If all set to 1, there is a good probability that x Y.∈
![Page 11: Duplicate Detection in Click Streams(2005) SubtitleAhmed Metwally Divyakant Agrawal Amr El Abbadi](https://reader036.vdocuments.us/reader036/viewer/2022062423/568148b5550346895db5cbd6/html5/thumbnails/11.jpg)
Bloom Filter for Duplicate Detection
•test every new element on the Bloom Filter structure of the previously observed elements, and then insert it into the Bloom Filter structure.•The element is not counted as a duplicate if at least 1 bit was switched from 0 to 1, and is considered to be a duplicate otherwise.
![Page 12: Duplicate Detection in Click Streams(2005) SubtitleAhmed Metwally Divyakant Agrawal Amr El Abbadi](https://reader036.vdocuments.us/reader036/viewer/2022062423/568148b5550346895db5cbd6/html5/thumbnails/12.jpg)
Bloom Filter for Sliding Windows
•Counting Bloom Filters.•Delete expired element.•Use an integer in a cell represents the number of elements which hash to this cell.
![Page 13: Duplicate Detection in Click Streams(2005) SubtitleAhmed Metwally Divyakant Agrawal Amr El Abbadi](https://reader036.vdocuments.us/reader036/viewer/2022062423/568148b5550346895db5cbd6/html5/thumbnails/13.jpg)
Experiment
•Landmark and jumping windows•Tradeoff between space, time and error rate.•The numbers of hash function used.