building wavelet histograms on large data in...
TRANSCRIPT
![Page 1: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/1.jpg)
Building Wavelet Histogramson Large Data in MapReduce
Jeffrey Jestes1 Ke Yi2 Feifei Li1
1School of Computing 2Department of Computer ScienceUniversity of Utah Hong Kong University of Science and Technology
November 16, 2011
![Page 2: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/2.jpg)
Introduction
Record ID User ID Object ID . . .1 1 12872 . . .2 8 19832 . . .3 4 231 . . .... ... ... ...
For large data we often wish to obtain a concise summary.
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 3: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/3.jpg)
Introduction
Record ID User ID Object ID . . .1 1 12872 . . .2 8 19832 . . .3 4 231 . . .... ... ... ...
For large data we often wish to obtain a concise summary.
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 4: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/4.jpg)
Introduction
Record ID User ID Object ID . . .1 1 12872 . . .2 8 19832 . . .3 4 231 . . .... ... ... ...
For large data we often wish to obtain a concise summary.
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 5: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/5.jpg)
Outline
1 Introduction and MotivationHistogramsMapReduce and Hadoop
2 Exact Top-k Wavelet CoefficientsNaive SolutionHadoop Wavelet Top-k : Our Efficient Exact Solution
3 Approximate Top-k Wavelet CoefficientsLinearly Combinable Sketch MethodOur First Sampling Based ApproachAn Improved Sampling ApproachTwo-Level Sampling
4 Experiments
5 ConclusionsHadoop Wavelet Top-k in Hadoop
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 6: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/6.jpg)
Introduction: Histograms
A widely accepted and utilized summarization tool is the Histogram.
Let A be an attribute of dataset R.
Values of A are drawn from finite domain [u] = {1, · · · , u}.Define for each x ∈ {1, · · · , u}, v(x) = {count(R.A = x)}.Define frequency vector v of R.A as v = (v(1), . . . , v(u)).A histogram over R.A is any compact (lossy) representation of v.
Record ID User ID Object ID . . .1 1 12872 . . .2 8 19832 . . .3 4 231 . . .... ... ... ...
x 1 2 3 4 5 6 7 8v(x) 3 5 10 8 2 2 10 14
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 7: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/7.jpg)
Introduction: Histograms
A widely accepted and utilized summarization tool is the Histogram.Let A be an attribute of dataset R.
Values of A are drawn from finite domain [u] = {1, · · · , u}.Define for each x ∈ {1, · · · , u}, v(x) = {count(R.A = x)}.Define frequency vector v of R.A as v = (v(1), . . . , v(u)).A histogram over R.A is any compact (lossy) representation of v.
Record ID User ID Object ID . . .1 1 12872 . . .2 8 19832 . . .3 4 231 . . .... ... ... ...
x 1 2 3 4 5 6 7 8v(x) 3 5 10 8 2 2 10 14
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 8: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/8.jpg)
Introduction: Histograms
A widely accepted and utilized summarization tool is the Histogram.Let A be an attribute of dataset R.
Values of A are drawn from finite domain [u] = {1, · · · , u}.
Define for each x ∈ {1, · · · , u}, v(x) = {count(R.A = x)}.Define frequency vector v of R.A as v = (v(1), . . . , v(u)).A histogram over R.A is any compact (lossy) representation of v.
Record ID User ID Object ID . . .1 1 12872 . . .2 8 19832 . . .3 4 231 . . .... ... ... ...
x 1 2 3 4 5 6 7 8v(x) 3 5 10 8 2 2 10 14
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 9: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/9.jpg)
Introduction: Histograms
A widely accepted and utilized summarization tool is the Histogram.Let A be an attribute of dataset R.
Values of A are drawn from finite domain [u] = {1, · · · , u}.Define for each x ∈ {1, · · · , u}, v(x) = {count(R.A = x)}.
Define frequency vector v of R.A as v = (v(1), . . . , v(u)).A histogram over R.A is any compact (lossy) representation of v.
Record ID User ID Object ID . . .1 1 12872 . . .2 8 19832 . . .3 4 231 . . .... ... ... ...
x 1 2 3 4 5 6 7 8v(x) 3 5 10 8 2 2 10 14
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 10: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/10.jpg)
Introduction: Histograms
A widely accepted and utilized summarization tool is the Histogram.Let A be an attribute of dataset R.
Values of A are drawn from finite domain [u] = {1, · · · , u}.Define for each x ∈ {1, · · · , u}, v(x) = {count(R.A = x)}.Define frequency vector v of R.A as v = (v(1), . . . , v(u)).
A histogram over R.A is any compact (lossy) representation of v.
Record ID User ID Object ID . . .1 1 12872 . . .2 8 19832 . . .3 4 231 . . .... ... ... ...
x 1 2 3 4 5 6 7 8v(x) 3 5 10 8 2 2 10 14
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 11: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/11.jpg)
Introduction: Histograms
A widely accepted and utilized summarization tool is the Histogram.Let A be an attribute of dataset R.
Values of A are drawn from finite domain [u] = {1, · · · , u}.Define for each x ∈ {1, · · · , u}, v(x) = {count(R.A = x)}.Define frequency vector v of R.A as v = (v(1), . . . , v(u)).A histogram over R.A is any compact (lossy) representation of v.
Record ID User ID Object ID . . .1 1 12872 . . .2 8 19832 . . .3 4 231 . . .... ... ... ...
x 1 2 3 4 5 6 7 8v(x) 3 5 10 8 2 2 10 14
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 12: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/12.jpg)
Introduction: Histograms
A widely accepted and utilized summarization tool is the Histogram.Let A be an attribute of dataset R.
Values of A are drawn from finite domain [u] = {1, · · · , u}.Define for each x ∈ {1, · · · , u}, v(x) = {count(R.A = x)}.Define frequency vector v of R.A as v = (v(1), . . . , v(u)).A histogram over R.A is any compact (lossy) representation of v.
Record ID User ID Object ID . . .1 1 12872 . . .2 8 19832 . . .3 4 231 . . .... ... ... ...
x 1 2 3 4 5 6 7 8v(x) 3 5 10 8 2 2 10 14
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 13: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/13.jpg)
Introduction: Wavelet Histograms
A common choice for a histogram is the Haar wavelet histogram.
We obtain the Haar wavelet coefficients wi recursively as follows:
[GKMS01] A.C. Gilbert, et al. Surfing wavelets on streams: One-pass summaries for approximate aggregate queries. In VLDB, 2001.
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 14: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/14.jpg)
Introduction: Wavelet Histograms
A common choice for a histogram is the Haar wavelet histogram.
We obtain the Haar wavelet coefficients wi recursively as follows:
[GKMS01] A.C. Gilbert, et al. Surfing wavelets on streams: One-pass summaries for approximate aggregate queries. In VLDB, 2001.
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 15: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/15.jpg)
Introduction: Wavelet Histograms
A common choice for a histogram is the Haar wavelet histogram.
We obtain the Haar wavelet coefficients wi recursively as follows:
3 5 10 8 2 2 10 14 ` = 3
x 1 2 3 4 5 6 7 8v(x) 3 5 10 8 2 2 10 14
v(1) v(2) v(3) v(4) v(5) v(6) v(7) v(8)
Original data signal at level ` = log2 u.
[GKMS01] A.C. Gilbert, et al. Surfing wavelets on streams: One-pass summaries for approximate aggregate queries. In VLDB, 2001.
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 16: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/16.jpg)
Introduction: Wavelet Histograms
A common choice for a histogram is the Haar wavelet histogram.
We obtain the Haar wavelet coefficients wi recursively as follows:
3 5 10 8 2 2 10 14 ` = 3
x 1 2 3 4 5 6 7 8v(x) 3 5 10 8 2 2 10 14
v(1) v(2) v(3) v(4) v(5) v(6) v(7) v(8)
c4 c5 c6 c7 ` = 21 4 a4 a5 a6 a72 120 2-1 9
Compute detail coefficients ci andaverage coefficients ai for level ` = 2.
[GKMS01] A.C. Gilbert, et al. Surfing wavelets on streams: One-pass summaries for approximate aggregate queries. In VLDB, 2001.
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 17: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/17.jpg)
Introduction: Wavelet Histograms
A common choice for a histogram is the Haar wavelet histogram.
We obtain the Haar wavelet coefficients wi recursively as follows:
3 5 10 8 2 2 10 14 ` = 3
x 1 2 3 4 5 6 7 8v(x) 3 5 10 8 2 2 10 14
v(1) v(2) v(3) v(4) v(5) v(6) v(7) v(8)
c4 c5 c6 c7 ` = 21 4 a4 a5 a6 a72 120 2-1 9
Compute detail coefficients ci andaverage coefficients ai for level ` = 2.
[GKMS01] A.C. Gilbert, et al. Surfing wavelets on streams: One-pass summaries for approximate aggregate queries. In VLDB, 2001.
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 18: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/18.jpg)
Introduction: Wavelet Histograms
A common choice for a histogram is the Haar wavelet histogram.
We obtain the Haar wavelet coefficients wi recursively as follows:
3 5 10 8 2 2 10 14 ` = 3
x 1 2 3 4 5 6 7 8v(x) 3 5 10 8 2 2 10 14
v(1) v(2) v(3) v(4) v(5) v(6) v(7) v(8)
c4 c5 c6 c7 ` = 21 4 a4 a5 a6 a72 120 2-1 9
a4 = (v(2) + v(1))/2
Compute detail coefficients ci andaverage coefficients ai for level ` = 2.
[GKMS01] A.C. Gilbert, et al. Surfing wavelets on streams: One-pass summaries for approximate aggregate queries. In VLDB, 2001.
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 19: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/19.jpg)
Introduction: Wavelet Histograms
A common choice for a histogram is the Haar wavelet histogram.
We obtain the Haar wavelet coefficients wi recursively as follows:
3 5 10 8 2 2 10 14 ` = 3
x 1 2 3 4 5 6 7 8v(x) 3 5 10 8 2 2 10 14
v(1) v(2) v(3) v(4) v(5) v(6) v(7) v(8)
c4 c5 c6 c7 ` = 21 4 a4 a5 a6 a72 120 2-1 9
c4 = (v(2)− v(1))/2
Compute detail coefficients ci andaverage coefficients ai for level ` = 2.
[GKMS01] A.C. Gilbert, et al. Surfing wavelets on streams: One-pass summaries for approximate aggregate queries. In VLDB, 2001.
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 20: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/20.jpg)
Introduction: Wavelet Histograms
A common choice for a histogram is the Haar wavelet histogram.
We obtain the Haar wavelet coefficients wi recursively as follows:
3 5 10 8 2 2 10 14 ` = 3
x 1 2 3 4 5 6 7 8v(x) 3 5 10 8 2 2 10 14
v(1) v(2) v(3) v(4) v(5) v(6) v(7) v(8)
c4 c5 c6 c7 ` = 21 4 a4 a5 a6 a72 120 2-1 9
c2 a22.5 6.5 c3 a35 7 ` = 1
Compute detail coefficients ci andaverage coefficients ai for level ` = 1.
[GKMS01] A.C. Gilbert, et al. Surfing wavelets on streams: One-pass summaries for approximate aggregate queries. In VLDB, 2001.
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 21: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/21.jpg)
Introduction: Wavelet Histograms
A common choice for a histogram is the Haar wavelet histogram.
We obtain the Haar wavelet coefficients wi recursively as follows:
3 5 10 8 2 2 10 14 ` = 3
x 1 2 3 4 5 6 7 8v(x) 3 5 10 8 2 2 10 14
v(1) v(2) v(3) v(4) v(5) v(6) v(7) v(8)
c4 c5 c6 c7 ` = 21 4 a4 a5 a6 a72 120 2-1 9
c2 a22.5 6.5 c3 a35 7 ` = 1
Compute detail coefficients ci andaverage coefficients ai for level ` = 1.
[GKMS01] A.C. Gilbert, et al. Surfing wavelets on streams: One-pass summaries for approximate aggregate queries. In VLDB, 2001.
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 22: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/22.jpg)
Introduction: Wavelet Histograms
A common choice for a histogram is the Haar wavelet histogram.
We obtain the Haar wavelet coefficients wi recursively as follows:
3 5 10 8 2 2 10 14 ` = 3
x 1 2 3 4 5 6 7 8v(x) 3 5 10 8 2 2 10 14
v(1) v(2) v(3) v(4) v(5) v(6) v(7) v(8)
c4 c5 c6 c7 ` = 21 4 a4 a5 a6 a72 120 2-1 9
c2 a22.5 6.5 c3 a35 7 ` = 1
a2 = (a5 + a4)/2
Compute detail coefficients ci andaverage coefficients ai for level ` = 1.
[GKMS01] A.C. Gilbert, et al. Surfing wavelets on streams: One-pass summaries for approximate aggregate queries. In VLDB, 2001.
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 23: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/23.jpg)
Introduction: Wavelet Histograms
A common choice for a histogram is the Haar wavelet histogram.
We obtain the Haar wavelet coefficients wi recursively as follows:
3 5 10 8 2 2 10 14 ` = 3
x 1 2 3 4 5 6 7 8v(x) 3 5 10 8 2 2 10 14
v(1) v(2) v(3) v(4) v(5) v(6) v(7) v(8)
c4 c5 c6 c7 ` = 21 4 a4 a5 a6 a72 120 2-1 9
c2 a22.5 6.5 c3 a35 7 ` = 1
c2 = (a5 − a4)/2
Compute detail coefficients ci andaverage coefficients ai for level ` = 1.
[GKMS01] A.C. Gilbert, et al. Surfing wavelets on streams: One-pass summaries for approximate aggregate queries. In VLDB, 2001.
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 24: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/24.jpg)
Introduction: Wavelet Histograms
A common choice for a histogram is the Haar wavelet histogram.
We obtain the Haar wavelet coefficients wi recursively as follows:
3 5 10 8 2 2 10 14 ` = 3
x 1 2 3 4 5 6 7 8v(x) 3 5 10 8 2 2 10 14
v(1) v(2) v(3) v(4) v(5) v(6) v(7) v(8)
c4 c5 c6 c7 ` = 21 4 a4 a5 a6 a72 120 2-1 9
c2 a22.5 6.5 c3 a35 7 ` = 1
c1 a10.3 6.8
a1 6.8 total average
` = 0
Compute detail coefficients ci andaverage coefficients ai for level ` = 0.
[GKMS01] A.C. Gilbert, et al. Surfing wavelets on streams: One-pass summaries for approximate aggregate queries. In VLDB, 2001.
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 25: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/25.jpg)
Introduction: Wavelet Histograms
A common choice for a histogram is the Haar wavelet histogram.
We obtain the Haar wavelet coefficients wi recursively as follows:
3 5 10 8 2 2 10 14 ` = 3
x 1 2 3 4 5 6 7 8v(x) 3 5 10 8 2 2 10 14
v(1) v(2) v(3) v(4) v(5) v(6) v(7) v(8)
c4 c5 c6 c7 ` = 21 4 a4 a5 a6 a72 120 2-1 9
c2 a22.5 6.5 c3 a35 7 ` = 1
c1 a10.3 6.8
a1 6.8 total average
` = 0
Compute detail coefficients ci andaverage coefficients ai for level ` = 0.
[GKMS01] A.C. Gilbert, et al. Surfing wavelets on streams: One-pass summaries for approximate aggregate queries. In VLDB, 2001.
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 26: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/26.jpg)
Introduction: Wavelet Histograms
A common choice for a histogram is the Haar wavelet histogram.
We obtain the Haar wavelet coefficients wi recursively as follows:
3 5 10 8 2 2 10 14 ` = 3
x 1 2 3 4 5 6 7 8v(x) 3 5 10 8 2 2 10 14
v(1) v(2) v(3) v(4) v(5) v(6) v(7) v(8)
c4 c5 c6 c7 ` = 21 4 a4 a5 a6 a72 120 2-1 9
c2 a22.5 6.5 c3 a35 7 ` = 1
c1 a10.3 6.8
a1 6.8 total average
` = 0
a1 = (a3 + a2)/2
Compute detail coefficients ci andaverage coefficients ai for level ` = 0.
[GKMS01] A.C. Gilbert, et al. Surfing wavelets on streams: One-pass summaries for approximate aggregate queries. In VLDB, 2001.
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 27: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/27.jpg)
Introduction: Wavelet Histograms
A common choice for a histogram is the Haar wavelet histogram.
We obtain the Haar wavelet coefficients wi recursively as follows:
3 5 10 8 2 2 10 14 ` = 3
x 1 2 3 4 5 6 7 8v(x) 3 5 10 8 2 2 10 14
v(1) v(2) v(3) v(4) v(5) v(6) v(7) v(8)
c4 c5 c6 c7 ` = 21 4 a4 a5 a6 a72 120 2-1 9
c2 a22.5 6.5 c3 a35 7 ` = 1
c1 a10.3 6.8
a1 6.8 total average
` = 0
c1 = (a3 − a2)/2
Compute detail coefficients ci andaverage coefficients ai for level ` = 0.
[GKMS01] A.C. Gilbert, et al. Surfing wavelets on streams: One-pass summaries for approximate aggregate queries. In VLDB, 2001.
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 28: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/28.jpg)
Introduction: Wavelet Histograms
A common choice for a histogram is the Haar wavelet histogram.
We obtain the Haar wavelet coefficients wi recursively as follows:
3 5 10 8 2 2 10 14 ` = 3
x 1 2 3 4 5 6 7 8v(x) 3 5 10 8 2 2 10 14
v(1) v(2) v(3) v(4) v(5) v(6) v(7) v(8)
c4 c5 c6 c7 ` = 21 4 a4 a5 a6 a72 120 2-1 9
c2 a22.5 6.5 c3 a35 7 ` = 1
c1 a10.3 6.8
a1 6.8 total average
` = 0
The wavelet coefficients wi
are [ a1, c1, . . . , cu−1 ].
[GKMS01] A.C. Gilbert, et al. Surfing wavelets on streams: One-pass summaries for approximate aggregate queries. In VLDB, 2001.
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 29: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/29.jpg)
Introduction: Wavelet Histograms
A common choice for a histogram is the Haar wavelet histogram.
We obtain the Haar wavelet coefficients wi recursively as follows:
3 5 10 8 2 2 10 14 ` = 3
x 1 2 3 4 5 6 7 8v(x) 3 5 10 8 2 2 10 14
v(1) v(2) v(3) v(4) v(5) v(6) v(7) v(8)
c4 c5 c6 c7 ` = 21 4 a4 a5 a6 a72 120 2-1 9
c2 a22.5 6.5 c3 a35 7 ` = 1
c1 a10.3 6.8
a1 6.8 total average
` = 0
The wavelet coefficients wi
are [ a1, c1, . . . , cu−1 ].
[GKMS01] A.C. Gilbert, et al. Surfing wavelets on streams: One-pass summaries for approximate aggregate queries. In VLDB, 2001.
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 30: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/30.jpg)
Introduction: Wavelet Histograms
A common choice for a histogram is the Haar wavelet histogram.
We obtain the Haar wavelet coefficients wi recursively as follows:
3 5 10 8 2 2 10 14 ` = 3
x 1 2 3 4 5 6 7 8v(x) 3 5 10 8 2 2 10 14
v(1) v(2) v(3) v(4) v(5) v(6) v(7) v(8)
c4 c5 c6 c7 ` = 21 4 a4 a5 a6 a72 120 2-1 9
c2 a22.5 6.5 c3 a35 7 ` = 1
c1 a10.3 6.8
a1 6.8 total average
` = 0
w1
w2
w3 w4
w5 w6 w7 w8
The wavelet coefficients wi
are [ a1, c1, . . . , cu−1 ].
[GKMS01] A.C. Gilbert, et al. Surfing wavelets on streams: One-pass summaries for approximate aggregate queries. In VLDB, 2001.
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 31: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/31.jpg)
Introduction: Wavelet Histograms
A common choice for a histogram is the Haar wavelet histogram.
We obtain the Haar wavelet coefficients wi recursively as follows:
3 5 10 8 2 2 10 14 ` = 3
x 1 2 3 4 5 6 7 8v(x) 3 5 10 8 2 2 10 14
v(1) v(2) v(3) v(4) v(5) v(6) v(7) v(8)
w5 w6 w7 w8a4 a5 a6 a72 120 2-1 9
w3 a22.5 6.5 w4 a35 7
w2 a10.3 6.8
w1 6.8 total average
` = 2
` = 1
` = 0
1 4
One scales wi by√
u/2` to preserve energy, i.e.‖v‖22 = ∑u
i=1 v(i)2 = ∑ui=1 w 2
i
[GKMS01] A.C. Gilbert, et al. Surfing wavelets on streams: One-pass summaries for approximate aggregate queries. In VLDB, 2001.
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 32: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/32.jpg)
Introduction: Wavelet Histograms
A common choice for a histogram is the Haar wavelet histogram.
We obtain the Haar wavelet coefficients wi recursively as follows:
3 5 10 8 2 2 10 14 ` = 3
x 1 2 3 4 5 6 7 8v(x) 3 5 10 8 2 2 10 14
v(1) v(2) v(3) v(4) v(5) v(6) v(7) v(8)
w5 w6 w7 w8a4 a5 a6 a72 120 2-1 9
w3 a22.5 6.5 w4 a35 7
w2 a10.3 6.8
w1 6.8 total average
scale by:√
u/2`
` = 2
` = 1
` = 0
1 4
One scales wi by√
u/2` to preserve energy, i.e.‖v‖22 = ∑u
i=1 v(i)2 = ∑ui=1 w 2
i
energy: ∑ui=1 v(i)2 = ∑u
i=1 w 2i
[GKMS01] A.C. Gilbert, et al. Surfing wavelets on streams: One-pass summaries for approximate aggregate queries. In VLDB, 2001.
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 33: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/33.jpg)
Introduction: Wavelet Histograms
A common choice for a histogram is the Haar wavelet histogram.
We obtain the Haar wavelet coefficients wi recursively as follows:
3 5 10 8 2 2 10 14 ` = 3
x 1 2 3 4 5 6 7 8v(x) 3 5 10 8 2 2 10 14
v(1) v(2) v(3) v(4) v(5) v(6) v(7) v(8)
scale by:√
u/2`
One scales wi by√
u/2` to preserve energy, i.e.‖v‖22 = ∑u
i=1 v(i)2 = ∑ui=1 w 2
i
` = 3
w5 w6 w7 w8a4 a5 a6 a72.8 120 2-1.4 9
w3 a25 6.5 w4 a310 7
w2 a10.8 6.8
w1 19 total average
` = 2
` = 1
` = 0
1.4 4
energy: ∑ui=1 v(i)2 = ∑u
i=1 w 2i
[GKMS01] A.C. Gilbert, et al. Surfing wavelets on streams: One-pass summaries for approximate aggregate queries. In VLDB, 2001.
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 34: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/34.jpg)
Introduction: Wavelet Histograms
A common choice for a histogram is the Haar wavelet histogram.
We obtain the Haar wavelet coefficients wi recursively as follows:
3 5 10 8 2 2 10 14 ` = 3
x 1 2 3 4 5 6 7 8v(x) 3 5 10 8 2 2 10 14
v(1) v(2) v(3) v(4) v(5) v(6) v(7) v(8)
scale by:√
u/2`
` = 3
w5 w6 w7 w8a4 a5 a6 a72.8 120 2-1.4 9
w3 a25 6.5 w4 a310 7
w2 a10.8 6.8
w1 19 total average
` = 2
` = 1
` = 0
1.4 4
energy: ∑ui=1 v(i)2 = ∑u
i=1 w 2i
Select top-k wi in the absolute value to obtainbest k-term representation minimizing error inenergy, i.e. minimize ∑u
i=1 v(i)2 − ∑ui=1 w 2
i
[GKMS01] A.C. Gilbert, et al. Surfing wavelets on streams: One-pass summaries for approximate aggregate queries. In VLDB, 2001.
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 35: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/35.jpg)
Introduction: Wavelet Histograms
A common choice for a histogram is the Haar wavelet histogram.
We obtain the Haar wavelet coefficients wi recursively as follows:
3 5 10 8 2 2 10 14 ` = 3
x 1 2 3 4 5 6 7 8v(x) 3 5 10 8 2 2 10 14
v(1) v(2) v(3) v(4) v(5) v(6) v(7) v(8)
scale by:√
u/2`
` = 3
w5 w6 w7 w8a4 a5 a6 a72.8 120 2-1.4 9
w3 a25 6.5 w4 a310 7
w2 a10.8 6.8
w1 19 total average
` = 2
` = 1
` = 0
1.4 4
energy: ∑ui=1 v(i)2 = ∑u
i=1 w 2i
Select top-k wi in the absolute value to obtainbest k-term representation minimizing error inenergy, i.e. minimize ∑u
i=1 v(i)2 − ∑ui=1 w 2
i
k = 3
[GKMS01] A.C. Gilbert, et al. Surfing wavelets on streams: One-pass summaries for approximate aggregate queries. In VLDB, 2001.
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 36: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/36.jpg)
Introduction: Wavelet Histograms
A common choice for a histogram is the Haar wavelet histogram.
We obtain the Haar wavelet coefficients wi recursively as follows:
x 1 2 3 4 5 6 7 8v(x) 3 5 10 8 2 2 10 14
scale by:√
u/2`
energy: ∑ui=1 v(i)2 = ∑u
i=1 w 2i
k = 3
w5 w6 w7 w8a4 a5 a6 a70 ?0 ?0 ?
w3 a25 ? w4 a310 ?
w2 a10 ?
w1 19 total average
0 ? ` = 2
` = 1
` = 0
We maintain the best k-term wi .Other wi are treated as 0.
? ? ? ? ? ? ? ? ` = 3
v(1) v(2) v(3) v(4) v(5) v(6) v(7) v(8)
[GKMS01] A.C. Gilbert, et al. Surfing wavelets on streams: One-pass summaries for approximate aggregate queries. In VLDB, 2001.
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 37: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/37.jpg)
Introduction: Wavelet Histograms
A common choice for a histogram is the Haar wavelet histogram.
We obtain the Haar wavelet coefficients wi recursively as follows:
x 1 2 3 4 5 6 7 8v(x) 3 5 10 8 2 2 10 14
scale by:√
u/2`
energy: ∑ui=1 v(i)2 = ∑u
i=1 w 2i
k = 3
w5 w6 w7 w8a4 a5 a6 a70 ?0 ?0 ?
w3 a25 ? w4 a310 ?
w2 a10 ?
w1 19 total average
0 ? ` = 2
` = 1
` = 0
The error in energy is∑ui=1 v(i)2 − ∑u
i=1 w 2i = 502− 489.5 = 12.5.
? ? ? ? ? ? ? ? ` = 3
v(1) v(2) v(3) v(4) v(5) v(6) v(7) v(8)
[GKMS01] A.C. Gilbert, et al. Surfing wavelets on streams: One-pass summaries for approximate aggregate queries. In VLDB, 2001.
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 38: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/38.jpg)
Introduction: Wavelet Histograms
A common choice for a histogram is the Haar wavelet histogram.
We obtain the Haar wavelet coefficients wi recursively as follows:
x 1 2 3 4 5 6 7 8v(x) 3 5 10 8 2 2 10 14
scale by:√
u/2`
energy: ∑ui=1 v(i)2 = ∑u
i=1 w 2i
k = 3
w5 w6 w7 w8a4 a5 a6 a70 ?0 ?0 ?
w3 a25 ? w4 a310 ?
w2 a10 ?
w1 19 total average
0 ? ` = 2
` = 1
` = 0
? ? ? ? ? ? ? ? ` = 3
v(1) v(2) v(3) v(4) v(5) v(6) v(7) v(8)
To reconstruct the original signal we computethe average and difference coefficients in re-verse, i.e. top to bottom.
[GKMS01] A.C. Gilbert, et al. Surfing wavelets on streams: One-pass summaries for approximate aggregate queries. In VLDB, 2001.
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 39: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/39.jpg)
Introduction: Wavelet Histograms
A common choice for a histogram is the Haar wavelet histogram.
We obtain the Haar wavelet coefficients wi recursively as follows:
x 1 2 3 4 5 6 7 8v(x) 3 5 10 8 2 2 10 14
4.3 4.3 9.3 9.3 1.8 1.8 12 12 ` = 3
v(1) v(2) v(3) v(4) v(5) v(6) v(7) v(8)
` = 3
w5 w6 w7 w8a4 a5 a6 a70 120 1.80 9.3
w3 a22.5 6.8 w4 a35 6.8
w2 a10 6.8
` = 2
` = 1
` = 0
0 4.3
w1 6.8 total average
The reconstructed signal is a reasonably closeapproximation.
[GKMS01] A.C. Gilbert, et al. Surfing wavelets on streams: One-pass summaries for approximate aggregate queries. In VLDB, 2001.
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 40: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/40.jpg)
Introduction: Wavelet Histograms
A common choice for a histogram is the Haar wavelet histogram.
We obtain the Haar wavelet coefficients wi recursively as follows:
3 5 10 8 2 2 10 14 ` = 3
x 1 2 3 4 5 6 7 8v(x) 3 5 10 8 2 2 10 14
v(1) v(2) v(3) v(4) v(5) v(6) v(7) v(8)
w5 w6 w7 w8a4 a5 a6 a72 120 2-1 9
w3 a22.5 6.5 w4 a35 7
w2 a10.3 6.8
w1 6.8 total average
` = 2
` = 1
` = 0
1 4
All wi can be calculated in O(u log u) time:1. We maintain O(log u) partial wis at a time.2. Compute affected wi and contribution from
each v(x) in O(log u) time.2. Process v(x)s in sorted order. [GKMS01]
[GKMS01] A.C. Gilbert, et al. Surfing wavelets on streams: One-pass summaries for approximate aggregate queries. In VLDB, 2001.
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 41: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/41.jpg)
Introduction: Wavelet Histograms
A common choice for a histogram is the Haar wavelet histogram.
We obtain the Haar wavelet coefficients wi recursively as follows:
3 5 10 8 2 2 10 14 ` = 3
x 1 2 3 4 5 6 7 8v(x) 3 5 10 8 2 2 10 14
v(1) v(2) v(3) v(4) v(5) v(6) v(7) v(8)
w5 w6 w7 w8a4 a5 a6 a72 120 2-1 9
w3 a22.5 6.5 w4 a35 7
w2 a10.3 6.8
w1 6.8 total average
` = 2
` = 1
` = 0
1 4
All wi can be calculated in O(u log u) time:1. We maintain O(log u) partial wis at a time.2. Compute affected wi and contribution from
each v(x) in O(log u) time.2. Process v(x)s in sorted order. [GKMS01]
Affectedwi
[GKMS01] A.C. Gilbert, et al. Surfing wavelets on streams: One-pass summaries for approximate aggregate queries. In VLDB, 2001.
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 42: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/42.jpg)
Introduction: Wavelet Histograms
A common choice for a histogram is the Haar wavelet histogram.
We obtain the Haar wavelet coefficients wi recursively as follows:
3 5 10 8 2 2 10 14 ` = 3
x 1 2 3 4 5 6 7 8v(x) 3 5 10 8 2 2 10 14
v(1) v(2) v(3) v(4) v(5) v(6) v(7) v(8)
w5 w6 w7 w8a4 a5 a6 a72 120 2-1 9
w3 a22.5 6.5 w4 a35 7
w2 a10.3 6.8
w1 6.8 total average
` = 2
` = 1
` = 0
1 4
All wi can be calculated in O(u log u) time:1. We maintain O(log u) partial wis at a time.2. Compute affected wi and contribution from
each v(x) in O(log u) time.2. Process v(x)s in sorted order. [GKMS01]
Affectedwi
[GKMS01] A.C. Gilbert, et al. Surfing wavelets on streams: One-pass summaries for approximate aggregate queries. In VLDB, 2001.
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 43: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/43.jpg)
Introduction: Wavelet Histograms
A common choice for a histogram is the Haar wavelet histogram.
We obtain the Haar wavelet coefficients wi recursively as follows:
3 5 10 8 2 2 10 14 ` = 3
x 1 2 3 4 5 6 7 8v(x) 3 5 10 8 2 2 10 14
v(1) v(2) v(3) v(4) v(5) v(6) v(7) v(8)
w5 w6 w7 w8a4 a5 a6 a72 120 2-1 9
w3 a22.5 6.5 w4 a35 7
w2 a10.3 6.8
w1 6.8 total average
` = 2
` = 1
` = 0
1 4
All wi can be calculated in O(u log u) time:1. We maintain O(log u) partial wis at a time.2. Compute affected wi and contribution from
each v(x) in O(log u) time.2. Process v(x)s in sorted order. [GKMS01]
Affectedwi
[GKMS01] A.C. Gilbert, et al. Surfing wavelets on streams: One-pass summaries for approximate aggregate queries. In VLDB, 2001.
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 44: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/44.jpg)
Introduction: Wavelet Histograms
A common choice for a histogram is the Haar wavelet histogram.
We obtain the Haar wavelet coefficients wi recursively as follows:
3 5 10 8 2 2 10 14 ` = 3
x 1 2 3 4 5 6 7 8v(x) 3 5 10 8 2 2 10 14
v(1) v(2) v(3) v(4) v(5) v(6) v(7) v(8)
w5 w6 w7 w8a4 a5 a6 a72 120 2-1 9
w3 a22.5 6.5 w4 a35 7
w2 a10.3 6.8
w1 6.8 total average
` = 2
` = 1
` = 0
1 4
All wi can be calculated in O(u log u) time:1. We maintain O(log u) partial wis at a time.2. Compute affected wi and contribution from
each v(x) in O(log u) time.2. Process v(x)s in sorted order. [GKMS01]
Affectedwi
[GKMS01] A.C. Gilbert, et al. Surfing wavelets on streams: One-pass summaries for approximate aggregate queries. In VLDB, 2001.
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 45: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/45.jpg)
Introduction: Wavelet Histograms
A common choice for a histogram is the Haar wavelet histogram.
We obtain the Haar wavelet coefficients wi recursively as follows:
3 5 10 8 2 2 10 14 ` = 3
x 1 2 3 4 5 6 7 8v(x) 3 5 10 8 2 2 10 14
v(1) v(2) v(3) v(4) v(5) v(6) v(7) v(8)
w5 w6 w7 w8a4 a5 a6 a72 120 2-1 9
w3 a22.5 6.5 w4 a35 7
w2 a10.3 6.8
w1 6.8 total average
` = 2
` = 1
` = 0
1 4
All wi can be calculated in O(u log u) time:1. We maintain O(log u) partial wis at a time.2. Compute affected wi and contribution from
each v(x) in O(log u) time.2. Process v(x)s in sorted order. [GKMS01]
Affectedwi
[GKMS01] A.C. Gilbert, et al. Surfing wavelets on streams: One-pass summaries for approximate aggregate queries. In VLDB, 2001.
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 46: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/46.jpg)
Introduction: Wavelet Histograms
A common choice for a histogram is the Haar wavelet histogram.
We obtain the Haar wavelet coefficients wi recursively as follows:
3 5 10 8 2 2 10 14 ` = 3
x 1 2 3 4 5 6 7 8v(x) 3 5 10 8 2 2 10 14
v(1) v(2) v(3) v(4) v(5) v(6) v(7) v(8)
w5 w6 w7 w8a4 a5 a6 a72 120 2-1 9
w3 a22.5 6.5 w4 a35 7
w2 a10.3 6.8
w1 6.8 total average
` = 2
` = 1
` = 0
1 4
All wi can be calculated in O(u log u) time:1. We maintain O(log u) partial wis at a time.2. Compute affected wi and contribution from
each v(x) in O(log u) time.2. Process v(x)s in sorted order. [GKMS01]
Affectedwi
[GKMS01] A.C. Gilbert, et al. Surfing wavelets on streams: One-pass summaries for approximate aggregate queries. In VLDB, 2001.
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 47: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/47.jpg)
Introduction: Wavelet Histograms
A common choice for a histogram is the Haar wavelet histogram.
We obtain the Haar wavelet coefficients wi recursively as follows:
3 5 10 8 2 2 10 14 ` = 3
x 1 2 3 4 5 6 7 8v(x) 3 5 10 8 2 2 10 14
v(1) v(2) v(3) v(4) v(5) v(6) v(7) v(8)
w5 w6 w7 w8a4 a5 a6 a72 120 2-1 9
w3 a22.5 6.5 w4 a35 7
w2 a10.3 6.8
w1 6.8 total average
` = 2
` = 1
` = 0
1 4
All wi can be calculated in O(u log u) time:1. We maintain O(log u) partial wis at a time.2. Compute affected wi and contribution from
each v(x) in O(log u) time.2. Process v(x)s in sorted order. [GKMS01]
Affectedwi
[GKMS01] A.C. Gilbert, et al. Surfing wavelets on streams: One-pass summaries for approximate aggregate queries. In VLDB, 2001.
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 48: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/48.jpg)
Introduction: Wavelet Histograms
A common choice for a histogram is the Haar wavelet histogram.
We obtain the Haar wavelet coefficients wi recursively as follows:
3 5 10 8 2 2 10 14 ` = 3
x 1 2 3 4 5 6 7 8v(x) 3 5 10 8 2 2 10 14
v(1) v(2) v(3) v(4) v(5) v(6) v(7) v(8)
w5 w6 w7 w8a4 a5 a6 a72 120 2-1 9
w3 a22.5 6.5 w4 a35 7
w2 a10.3 6.8
w1 6.8 total average
` = 2
` = 1
` = 0
1 4
All wi can be calculated in O(u log u) time:1. We maintain O(log u) partial wis at a time.2. Compute affected wi and contribution from
each v(x) in O(log u) time.2. Process v(x)s in sorted order. [GKMS01]
Affectedwi
[GKMS01] A.C. Gilbert, et al. Surfing wavelets on streams: One-pass summaries for approximate aggregate queries. In VLDB, 2001.
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 49: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/49.jpg)
Introduction: Histograms
We may also compute wi with the wavelet basis vectors ψi .
wi = v · ψi for i = 1, . . . , u
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 50: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/50.jpg)
Outline
1 Introduction and MotivationHistogramsMapReduce and Hadoop
2 Exact Top-k Wavelet CoefficientsNaive SolutionHadoop Wavelet Top-k : Our Efficient Exact Solution
3 Approximate Top-k Wavelet CoefficientsLinearly Combinable Sketch MethodOur First Sampling Based ApproachAn Improved Sampling ApproachTwo-Level Sampling
4 Experiments
5 ConclusionsHadoop Wavelet Top-k in Hadoop
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 51: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/51.jpg)
Introduction: MapReduce and Hadoop
Record ID User ID Object ID . . .1 1 12872 . . .2 8 19832 . . .3 4 231 . . .... ... ... ...
R
Traditionally data is stored in a centralized setting.
Now stored data has sky rocketed, and is increasingly distributed.We study computing the top-k coefficients efficiently on distributeddata in MapReduce using Hadoop to illustrate our ideas.
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 52: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/52.jpg)
Introduction: MapReduce and Hadoop
Record ID User ID Object ID . . .1 1 12872 . . .2 8 19832 . . .3 4 231 . . .... ... ... ...
R1
R2 R3
R4
R = R1 ∪ R2 ∪ R3 ∪ R4
R
Traditionally data is stored in a centralized setting.Now stored data has sky rocketed, and is increasingly distributed.
We study computing the top-k coefficients efficiently on distributeddata in MapReduce using Hadoop to illustrate our ideas.
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 53: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/53.jpg)
Introduction: MapReduce and Hadoop
Record ID User ID Object ID . . .1 1 12872 . . .2 8 19832 . . .3 4 231 . . .... ... ... ...
R1
R2 R3
R4
R = R1 ∪ R2 ∪ R3 ∪ R4
R
Traditionally data is stored in a centralized setting.Now stored data has sky rocketed, and is increasingly distributed.We study computing the top-k coefficients efficiently on distributeddata in MapReduce using Hadoop to illustrate our ideas.
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 54: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/54.jpg)
Background: Hadoop Distributed File System (HDFS)
Hadoop requires a Distributed File System (DFS), we utilize theHadoop Distributed File System (HDFS).
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 55: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/55.jpg)
Background: Hadoop Distributed File System (HDFS)
Hadoop requires a Distributed File System (DFS), we utilize theHadoop Distributed File System (HDFS).
Record ID User ID Object ID . . .1 1 12872 . . .2 8 19832 . . .3 4 231 . . .... ... ... ...
NameNode
DataNodes
R
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 56: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/56.jpg)
Background: Hadoop Distributed File System (HDFS)
Hadoop requires a Distributed File System (DFS), we utilize theHadoop Distributed File System (HDFS).
NameNode
DataNodes
64MB 64MB 64MB 64MB
R’s DataChunks (Splits)
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 57: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/57.jpg)
Background: Hadoop Distributed File System (HDFS)
Hadoop requires a Distributed File System (DFS), we utilize theHadoop Distributed File System (HDFS).
NameNode
DataNodes
64MB 64MB 64MB 64MB
R’s DataChunks (Splits)
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 58: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/58.jpg)
Background: Hadoop Core
Hadoop Core consists of one master JobTracker and severalTaskTrackers.
We assume one TaskTracker per physical machine.
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 59: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/59.jpg)
Background: Hadoop Core
Hadoop Core consists of one master JobTracker and severalTaskTrackers.
We assume one TaskTracker per physical machine.
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 60: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/60.jpg)
Background: Hadoop Core
Hadoop Core consists of one master JobTracker and severalTaskTrackers.
We assume one TaskTracker per physical machine.JobTracker
TaskTrackers
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 61: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/61.jpg)
Background: Hadoop Core
Hadoop Core consists of one master JobTracker and severalTaskTrackers.
We assume one TaskTracker per physical machine.JobTracker
TaskTrackers
Job SchedulingMapper Task SchedulingReduce Task Scheduling
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 62: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/62.jpg)
Background: Hadoop Core
Hadoop Core consists of one master JobTracker and severalTaskTrackers.
We assume one TaskTracker per physical machine.JobTracker
TaskTrackers
Job SchedulingMapper Task SchedulingReduce Task Scheduling
MappersReducers
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 63: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/63.jpg)
Background: Hadoop Cluster
In a Hadoop cluster one machine typically runs both the NameNodeand JobTracker tasks and is called the master.
The other machines run DataNode and TaskTracker tasks and arecalled slaves.
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 64: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/64.jpg)
Background: Hadoop Cluster
In a Hadoop cluster one machine typically runs both the NameNodeand JobTracker tasks and is called the master.
The other machines run DataNode and TaskTracker tasks and arecalled slaves.
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 65: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/65.jpg)
Background: Hadoop Cluster
In a Hadoop cluster one machine typically runs both the NameNodeand JobTracker tasks and is called the master.
The other machines run DataNode and TaskTracker tasks and arecalled slaves.
DataNodes + TaskTrackers
NameNode + JobTracker
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 66: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/66.jpg)
Background: Hadoop Cluster
In a Hadoop cluster one machine typically runs both the NameNodeand JobTracker tasks and is called the master.
The other machines run DataNode and TaskTracker tasks and arecalled slaves.
DataNodes + TaskTrackers
NameNode + JobTracker
Master
Slaves
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 67: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/67.jpg)
Background: MapReduce Job Overview
split 1split 2split 3split 4
JobTracker
Mapper
Mapper
Mapper
Mapper
Map Phase
Distributed Cache
Job Configuration
Next we look at an overview of a typical MapReduce Job.
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 68: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/68.jpg)
Background: MapReduce Job Overview
split 1split 2split 3split 4
JobTracker
Mapper
Mapper
Mapper
Mapper
Map Phase
Distributed Cache
Job Configuration
Job specific variables are first placed in the Job Configuration whichis sent to each Mapper Task by the JobTracker.
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 69: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/69.jpg)
Background: MapReduce Job Overview
split 1split 2split 3split 4
JobTracker
Mapper
Mapper
Mapper
Mapper
Map Phase
Distributed Cache
Job Configuration
Large data such as files or libraries are then put in the DistributedCache which is copied to each TaskTracker by the JobTracker.
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 70: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/70.jpg)
Background: MapReduce Job Overview
split 1split 2split 3split 4
(k1, v1)
(k1, v1)
(k1, v1)
(k1, v1)
JobTracker
Mapper
Mapper
Mapper
Mapper
Map Phase
Distributed Cache
Job Configuration
The JobTracker next assigns each InputSplit to a Mapper task on aTaskTracker, we assume m Mappers and m InputSplits.
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 71: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/71.jpg)
Background: MapReduce Job Overview
split 1split 2split 3split 4
(k1, v1)
(k1, v1)
(k1, v1)
(k1, v1)
JobTracker
Mapper
Mapper
Mapper
Mapper
Map Phase
(k2, v2)
(k2, v2)
(k2, v2)
(k2, v2)
p1p2
p1p2
p1p2
p1p2
Distributed Cache
Job Configuration
Each Mapper maps a (k1, v1) pair to an intermediate (k2, v2) pairand partitions by k2, i.e. hash(k2) = pi for i ∈ [1, r ], r = |reducers|.
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 72: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/72.jpg)
Background: MapReduce Job Overview
split 1split 2split 3split 4
(k1, v1)
(k1, v1)
(k1, v1)
(k1, v1)
JobTracker
Mapper
Mapper
Mapper
Mapper
Map Phase
(k2, v2)
(k2, v2)
(k2, v2)
(k2, v2)
p1p2
p1p2
p1p2
p1p2
(k2, list(v2))
Combiner
(k2, list(v2))
Combiner
(k2, list(v2))
Combiner
(k2, list(v2))
Combiner
Distributed Cache
Job Configuration
An optional Combiner is executed over (k2, list(v2)).
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 73: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/73.jpg)
Background: MapReduce Job Overview
split 1split 2split 3split 4
(k1, v1)
(k1, v1)
(k1, v1)
(k1, v1)
JobTracker
Mapper
Mapper
Mapper
Mapper
Map Phase
(k2, v2)
(k2, v2)
(k2, v2)
(k2, v2)
p1p2
p1p2
p1p2
p1p2
(k2, list(v2))
Combiner
(k2, list(v2))
Combiner
(k2, list(v2))
Combiner
(k2, list(v2))
Combiner
(k2, v2)
(k2, v2)
(k2, v2)
(k2, v2)
p1p2
p1p2
p1p2
p1p2
Distributed Cache
Job Configuration
The Combiner aggregates v2 for a k2 and a (k2, v2) is written to apartition on disk.
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 74: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/74.jpg)
Background: MapReduce Job Overview
split 1split 2split 3split 4
(k1, v1)
(k1, v1)
(k1, v1)
(k1, v1)
JobTracker
Mapper
Mapper
Mapper
Mapper
Map Phase
(k2, v2)
(k2, v2)
(k2, v2)
(k2, v2)
p1p2
p1p2
p1p2
p1p2
Reducer
Reducer
Shuffle/Sort Phase
(k2, list(v2))
Combiner
(k2, list(v2))
Combiner
(k2, list(v2))
Combiner
(k2, list(v2))
Combiner
(k2, v2)
(k2, v2)
(k2, v2)
(k2, v2)
p1p2
p1p2
p1p2
p1p2
Distributed Cache
Job Configuration
The JobTracker assigns two TaskTrackers to run the Reducers, eachReducer copies and sorts it’s inputs from corresponding partitions.
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 75: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/75.jpg)
Background: MapReduce Job Overview
split 1split 2split 3split 4
(k1, v1)
(k1, v1)
(k1, v1)
(k1, v1)
JobTracker
Mapper
Mapper
Mapper
Mapper
Map Phase
(k2, v2)
(k2, v2)
(k2, v2)
(k2, v2)
p1p2
p1p2
p1p2
p1p2
Reducer
Reducer
Shuffle/Sort Phase
(k2, list(v2))
Combiner
(k2, list(v2))
Combiner
(k2, list(v2))
Combiner
(k2, list(v2))
Combiner
(k2, v2)
(k2, v2)
(k2, v2)
(k2, v2)
p1p2
p1p2
p1p2
p1p2
Combiners reduce communication overhead!Distributed Cache
Job Configuration
The JobTracker assigns two TaskTrackers to run the Reducers, eachReducer copies and sorts it’s inputs from corresponding partitions.
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 76: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/76.jpg)
Background: MapReduce Job Overview
split 1split 2split 3split 4
(k1, v1)
(k1, v1)
(k1, v1)
(k1, v1)
JobTracker
Mapper
Mapper
Mapper
Mapper
Map Phase
(k2, v2)
(k2, v2)
(k2, v2)
(k2, v2)
p1p2
p1p2
p1p2
p1p2
Reducer
Reducer
Shuffle/Sort Phase
(k3, v3)
(k3, v3)
o1
o2
Reduce Phase
(k2, list(v2))
Combiner
(k2, list(v2))
Combiner
(k2, list(v2))
Combiner
(k2, list(v2))
Combiner
(k2, v2)
(k2, v2)
(k2, v2)
(k2, v2)
p1p2
p1p2
p1p2
p1p2
Distributed Cache
Job Configuration
Each Reducer reduces a (k2, list(v2)) to a single (k3, v3) and writesthe results to a DFS file, oi for i ∈ [1, r ].
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 77: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/77.jpg)
Outline
1 Introduction and MotivationHistogramsMapReduce and Hadoop
2 Exact Top-k Wavelet CoefficientsNaive SolutionHadoop Wavelet Top-k : Our Efficient Exact Solution
3 Approximate Top-k Wavelet CoefficientsLinearly Combinable Sketch MethodOur First Sampling Based ApproachAn Improved Sampling ApproachTwo-Level Sampling
4 Experiments
5 ConclusionsHadoop Wavelet Top-k in Hadoop
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 78: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/78.jpg)
Exact Top-k Wavelet Coefficients: Naive Solution
split 1split 2split 3split 4
(x , null)
(x , null)
(x , null)
(x , null)
JobTracker
Mapper
Mapper
Mapper
Mapper
Each of the m Mappers reads the input key x from its input split.
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 79: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/79.jpg)
Exact Top-k Wavelet Coefficients: Naive Solution
split 1split 2split 3split 4
(x , null)
(x , null)
(x , null)
(x , null)
JobTracker
Mapper
Mapper
Mapper
Mapper
(x , 1)
(x , 1)
(x , 1)
(x , 1)
Combiner
Combiner
Combiner
Combiner
Each Mapper emits (x , 1) for combining by the Combiner.
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 80: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/80.jpg)
Exact Top-k Wavelet Coefficients: Naive Solution
split 1split 2split 3split 4
(x , null)
(x , null)
(x , null)
(x , null)
JobTracker
Mapper
Mapper
Mapper
Mapper
(x , 1)
(x , 1)
(x , 1)
(x , 1)
Combiner
Combiner
Combiner
Combiner
p1
p1
p1
p1
(x , vj(x))
(x , vj(x))
(x , vj(x))
(x , vj(x))
Each Combiner emits (x , vj(x)), where vj(x) is the local frequencyof x .
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 81: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/81.jpg)
Exact Top-k Wavelet Coefficients: Naive Solution
split 1split 2split 3split 4
(x , null)
(x , null)
(x , null)
(x , null)
JobTracker
Mapper
Mapper
Mapper
Mapper
(x , 1)
(x , 1)
(x , 1)
(x , 1)
Combiner
Combiner
Combiner
Combiner
p1
p1
p1
p1
(x , vj(x))
(x , vj(x))
(x , vj(x))
(x , vj(x))
Reducer
Centralized Wavelet Top-k
(x , v(x))
The Reducer utilizes a Centralized Wavelet Top-k algorithm,supplying the (x , v(x)) in a streaming fashion.
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 82: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/82.jpg)
Exact Top-k Wavelet Coefficients: Naive Solution
split 1split 2split 3split 4
(x , null)
(x , null)
(x , null)
(x , null)
JobTracker
Mapper
Mapper
Mapper
Mapper
(x , 1)
(x , 1)
(x , 1)
(x , 1)
Combiner
Combiner
Combiner
Combiner
p1
p1
p1
p1
(x , vj(x))
(x , vj(x))
(x , vj(x))
(x , vj(x))
Reducer
Centralized Wavelet Top-k
close()
At the end of the Reduce phase, the Reducer’s close() method isinvoked. The Reducer then requests the top-k |wi |.
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 83: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/83.jpg)
Exact Top-k Wavelet Coefficients: Naive Solution
split 1split 2split 3split 4
(x , null)
(x , null)
(x , null)
(x , null)
JobTracker
Mapper
Mapper
Mapper
Mapper
(x , 1)
(x , 1)
(x , 1)
(x , 1)
Combiner
Combiner
Combiner
Combiner
p1
p1
p1
p1
(x , vj(x))
(x , vj(x))
(x , vj(x))
(x , vj(x))
Reducer
Centralized Wavelet Top-k
top-k |wi |
The centralized algorithm computes the top-k |wi | and returns theseto the Reducer.
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 84: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/84.jpg)
Exact Top-k Wavelet Coefficients: Naive Solution
split 1split 2split 3split 4
(x , null)
(x , null)
(x , null)
(x , null)
JobTracker
Mapper
Mapper
Mapper
Mapper
(x , 1)
(x , 1)
(x , 1)
(x , 1)
Combiner
Combiner
Combiner
Combiner
p1
p1
p1
p1
(x , vj(x))
(x , vj(x))
(x , vj(x))
(x , vj(x))
Reducer
Centralized Wavelet Top-k
top-k |wi |
o1top-k |wi |
Finally, the Reducer writes the top-k |wi | to its HDFS output file o1.
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 85: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/85.jpg)
Exact Top-k Wavelet Coefficients: Naive Solution
split 1split 2split 3split 4
(x , null)
(x , null)
(x , null)
(x , null)
JobTracker
Mapper
Mapper
Mapper
Mapper
(x , 1)
(x , 1)
(x , 1)
(x , 1)
Combiner
Combiner
Combiner
Combiner
p1
p1
p1
p1
(x , vj(x))
(x , vj(x))
(x , vj(x))
(x , vj(x))
Reducer
Centralized Wavelet Top-k
top-k |wi |
o1top-k |wi |
Too Expensive!!!O(n) Communication!!!
n = Total records in input file.
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 86: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/86.jpg)
Outline
1 Introduction and MotivationHistogramsMapReduce and Hadoop
2 Exact Top-k Wavelet CoefficientsNaive SolutionHadoop Wavelet Top-k : Our Efficient Exact Solution
3 Approximate Top-k Wavelet CoefficientsLinearly Combinable Sketch MethodOur First Sampling Based ApproachAn Improved Sampling ApproachTwo-Level Sampling
4 Experiments
5 ConclusionsHadoop Wavelet Top-k in Hadoop
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 87: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/87.jpg)
Exact Top-k Wavelet Coefficients: Our Solution
We can try to model the problem as a distributed top-k:wi = v · ψi = (
∑mj=1 vj) · ψi =
∑mj=1 wi,j .
Previous solutions assume local score si,j ≥ 0 and want the largestsi =
∑mj=1 si,j .
We have wi,j < 0 and wi,j ≥ 0 and want the largest |wi |.
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 88: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/88.jpg)
Exact Top-k Wavelet Coefficients: Our Solution
We can try to model the problem as a distributed top-k:wi = v · ψi = (
∑mj=1 vj) · ψi =
∑mj=1 wi,j .
Previous solutions assume local score si,j ≥ 0 and want the largestsi =
∑mj=1 si,j .
We have wi,j < 0 and wi,j ≥ 0 and want the largest |wi |.
split 1w1,1
w2,1
w3,1...
wu,1
split 2w1,2
w2,2
w3,2...
wu,2
split 3w1,3
w2,3
w3,3...
wu,3
split 4w1,4
w2,4
w3,4...
wu,4
Coordinator
m = 4
wi ,j is the local value of wi in split j .
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 89: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/89.jpg)
Exact Top-k Wavelet Coefficients: Our Solution
We can try to model the problem as a distributed top-k:wi = v · ψi = (
∑mj=1 vj) · ψi =
∑mj=1 wi,j .
Previous solutions assume local score si,j ≥ 0 and want the largestsi =
∑mj=1 si,j .
We have wi,j < 0 and wi,j ≥ 0 and want the largest |wi |.
split 1w1,1
w2,1
w3,1...
wu,1
split 2w1,2
w2,2
w3,2...
wu,2
split 3w1,3
w2,3
w3,3...
wu,3
split 4w1,4
w2,4
w3,4...
wu,4
Coordinator
m = 4
wi ,1wi ,2 wi ,3
wi ,4
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 90: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/90.jpg)
Exact Top-k Wavelet Coefficients: Our Solution
We can try to model the problem as a distributed top-k:wi = v · ψi = (
∑mj=1 vj) · ψi =
∑mj=1 wi,j .
Previous solutions assume local score si,j ≥ 0 and want the largestsi =
∑mj=1 si,j .
We have wi,j < 0 and wi,j ≥ 0 and want the largest |wi |.
split 1w1,1
w2,1
w3,1...
wu,1
split 2w1,2
w2,2
w3,2...
wu,2
split 3w1,3
w2,3
w3,3...
wu,3
split 4w1,4
w2,4
w3,4...
wu,4
Coordinator
m = 4
wi ,1wi ,2 wi ,3
wi ,4
wi =m∑
j=1wi ,j
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 91: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/91.jpg)
Exact Top-k Wavelet Coefficients: Our Solution
We can try to model the problem as a distributed top-k:wi = v · ψi = (
∑mj=1 vj) · ψi =
∑mj=1 wi,j .
Previous solutions assume local score si,j ≥ 0 and want the largestsi =
∑mj=1 si,j .
We have wi,j < 0 and wi,j ≥ 0 and want the largest |wi |.
split 1w1,1
w2,1
w3,1...
wu,1
split 2w1,2
w2,2
w3,2...
wu,2
split 3w1,3
w2,3
w3,3...
wu,3
split 4w1,4
w2,4
w3,4...
wu,4
Coordinator
m = 4
wi ,1wi ,2 wi ,3
wi ,4
wi =m∑
j=1wi ,j
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 92: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/92.jpg)
Exact Top-k Wavelet Coefficients: Our Solution
We can try to model the problem as a distributed top-k:wi = v · ψi = (
∑mj=1 vj) · ψi =
∑mj=1 wi,j .
Previous solutions assume local score si,j ≥ 0 and want the largestsi =
∑mj=1 si,j .
We have wi,j < 0 and wi,j ≥ 0 and want the largest |wi |.
split 1w1,1
w2,1
w3,1...
wu,1
split 2w1,2
w2,2
w3,2...
wu,2
split 3w1,3
w2,3
w3,3...
wu,3
split 4w1,4
w2,4
w3,4...
wu,4
Coordinator
m = 4
wi ,1wi ,2 wi ,3
wi ,4
wi =m∑
j=1wi ,j
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 93: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/93.jpg)
Exact Top-k Wavelet Coefficients: Our Solution
node 1id x s1(x)e1,1 5 20e1,2 2 7e1,3 1 6e1,4 4 -2e1,5 6 -15e1,6 3 -30
node 2id x s2(x)e2,1 5 12e2,2 4 7e2,3 1 2e2,4 2 -5e2,5 3 -14e2,6 6 -20
node 3id x s3(x)e3,1 1 10e3,2 3 6e3,3 4 5e3,4 2 -3e3,5 5 -6e3,6 6 -10
Rid x sj(x)e1,1 5 20
Rx s(x) Fx τ+(x) τ−(x) τ (x)1 10 011 42 -40 0
k = 1
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 94: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/94.jpg)
Exact Top-k Wavelet Coefficients: Our Solution
node 1id x s1(x)e1,1 5 20e1,2 2 7e1,3 1 6e1,4 4 -2e1,5 6 -15e1,6 3 -30
node 2id x s2(x)e2,1 5 12e2,2 4 7e2,3 1 2e2,4 2 -5e2,5 3 -14e2,6 6 -20
node 3id x s3(x)e3,1 1 10e3,2 3 6e3,3 4 5e3,4 2 -3e3,5 5 -6e3,6 6 -10
Rid x sj(x)e1,1 5 20
Rx s(x) Fx τ+(x) τ−(x) τ (x)1 10 011 42 -40 0
• An item x has a local score si(x) at node i ∀i ∈ [1 . . .m], where ifx does not appear si(x) = 0
k = 1
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 95: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/95.jpg)
Exact Top-k Wavelet Coefficients: Our Solution
node 1id x s1(x)e1,1 5 20e1,2 2 7e1,3 1 6e1,4 4 -2e1,5 6 -15e1,6 3 -30
node 2id x s2(x)e2,1 5 12e2,2 4 7e2,3 1 2e2,4 2 -5e2,5 3 -14e2,6 6 -20
node 3id x s3(x)e3,1 1 10e3,2 3 6e3,3 4 5e3,4 2 -3e3,5 5 -6e3,6 6 -10
Rx s(x) Fx τ+(x) τ−(x) τ (x)1 10 011 42 -40 0
Rid x sj(x)e1,1 5 20e1,6 3 -30e2,1 5 12e2,6 6 -20e3,1 1 10e3,6 6 -10
• Each node sends:the top-k most positive scored itemsthe top-k most negative scored items.
k = 1
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 96: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/96.jpg)
Exact Top-k Wavelet Coefficients: Our Solution
node 1id x s1(x)e1,1 5 20e1,2 2 7e1,3 1 6e1,4 4 -2e1,5 6 -15e1,6 3 -30
node 2id x s2(x)e2,1 5 12e2,2 4 7e2,3 1 2e2,4 2 -5e2,5 3 -14e2,6 6 -20
node 3id x s3(x)e3,1 1 10e3,2 3 6e3,3 4 5e3,4 2 -3e3,5 5 -6e3,6 6 -10
Rid x sj(x)e1,1 5 20e1,6 3 -30e2,1 5 12e2,6 6 -20e3,1 1 10e3,6 6 -10
Rx s(x) Fx τ+(x) τ−(x) τ (x)1 10 001 42 -40 03 -30 100 -8 -60 85 32 110 42 22 226 -30 011 -10 -60 10
• The coordinator computes useful bounds for each received item.
k = 1
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 97: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/97.jpg)
Exact Top-k Wavelet Coefficients: Our Solution
node 1id x s1(x)e1,1 5 20e1,2 2 7e1,3 1 6e1,4 4 -2e1,5 6 -15e1,6 3 -30
node 2id x s2(x)e2,1 5 12e2,2 4 7e2,3 1 2e2,4 2 -5e2,5 3 -14e2,6 6 -20
node 3id x s3(x)e3,1 1 10e3,2 3 6e3,3 4 5e3,4 2 -3e3,5 5 -6e3,6 6 -10
Rid x sj(x)e1,1 5 20e1,6 3 -30e2,1 5 12e2,6 6 -20e3,1 1 10e3,6 6 -10
Rx s(x) Fx τ+(x) τ−(x) τ (x)1 10 001 42 -40 03 -30 100 -8 -60 85 32 110 42 22 226 -30 011 -10 -60 10
• s(x) denotes the partial score sum for x
k = 1
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 98: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/98.jpg)
Exact Top-k Wavelet Coefficients: Our Solution
node 1id x s1(x)e1,1 5 20e1,2 2 7e1,3 1 6e1,4 4 -2e1,5 6 -15e1,6 3 -30
node 2id x s2(x)e2,1 5 12e2,2 4 7e2,3 1 2e2,4 2 -5e2,5 3 -14e2,6 6 -20
node 3id x s3(x)e3,1 1 10e3,2 3 6e3,3 4 5e3,4 2 -3e3,5 5 -6e3,6 6 -10
Rid x sj(x)e1,1 5 20e1,6 3 -30e2,1 5 12e2,6 6 -20e3,1 1 10e3,6 6 -10
Rx s(x) Fx τ+(x) τ−(x) τ (x)1 10 001 42 -40 03 -30 100 -8 -60 85 32 110 42 22 226 -30 011 -10 -60 10
• s(x) denotes the partial score sum for x
k = 1
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 99: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/99.jpg)
Exact Top-k Wavelet Coefficients: Our Solution
node 1id x s1(x)e1,1 5 20e1,2 2 7e1,3 1 6e1,4 4 -2e1,5 6 -15e1,6 3 -30
node 2id x s2(x)e2,1 5 12e2,2 4 7e2,3 1 2e2,4 2 -5e2,5 3 -14e2,6 6 -20
node 3id x s3(x)e3,1 1 10e3,2 3 6e3,3 4 5e3,4 2 -3e3,5 5 -6e3,6 6 -10
Rid x sj(x)e1,1 5 20e1,6 3 -30e2,1 5 12e2,6 6 -20e3,1 1 10e3,6 6 -10
Rx s(x) Fx τ+(x) τ−(x) τ (x)1 10 001 42 -40 03 -30 100 -8 -60 85 32 110 42 22 226 -30 011 -10 -60 10
• Fx is a receipt indication bit vector,if si(x) is received Fx(i) = 1,else Fx(i) = 0.
k = 1
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 100: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/100.jpg)
Exact Top-k Wavelet Coefficients: Our Solution
node 1id x s1(x)e1,1 5 20e1,2 2 7e1,3 1 6e1,4 4 -2e1,5 6 -15e1,6 3 -30
node 2id x s2(x)e2,1 5 12e2,2 4 7e2,3 1 2e2,4 2 -5e2,5 3 -14e2,6 6 -20
node 3id x s3(x)e3,1 1 10e3,2 3 6e3,3 4 5e3,4 2 -3e3,5 5 -6e3,6 6 -10
Rid x sj(x)e1,1 5 20e1,6 3 -30e2,1 5 12e2,6 6 -20e3,1 1 10e3,6 6 -10
Rx s(x) Fx τ+(x) τ−(x) τ (x)1 10 001 42 -40 03 -30 100 -8 -60 85 32 110 42 22 226 -30 011 -10 -60 10
• Fx is a receipt indication bit vector,if si(x) is received Fx(i) = 1,else Fx(i) = 0.
k = 1
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 101: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/101.jpg)
Exact Top-k Wavelet Coefficients: Our Solution
node 1id x s1(x)e1,1 5 20e1,2 2 7e1,3 1 6e1,4 4 -2e1,5 6 -15e1,6 3 -30
node 2id x s2(x)e2,1 5 12e2,2 4 7e2,3 1 2e2,4 2 -5e2,5 3 -14e2,6 6 -20
node 3id x s3(x)e3,1 1 10e3,2 3 6e3,3 4 5e3,4 2 -3e3,5 5 -6e3,6 6 -10
Rid x sj(x)e1,1 5 20e1,6 3 -30e2,1 5 12e2,6 6 -20e3,1 1 10e3,6 6 -10
Rx s(x) Fx τ+(x) τ−(x) τ (x)1 10 001 42 -40 03 -30 100 -8 -60 85 32 110 42 22 226 -30 011 -10 -60 10
• τ+(x) is an upper bound on the total score s(x),if si(x) received then τ+(x) = τ+(x) + si(x)else τ+(x) = τ+(x) + k ’th most positive from node i
k = 1
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 102: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/102.jpg)
Exact Top-k Wavelet Coefficients: Our Solution
node 1id x s1(x)e1,1 5 20e1,2 2 7e1,3 1 6e1,4 4 -2e1,5 6 -15e1,6 3 -30
node 2id x s2(x)e2,1 5 12e2,2 4 7e2,3 1 2e2,4 2 -5e2,5 3 -14e2,6 6 -20
node 3id x s3(x)e3,1 1 10e3,2 3 6e3,3 4 5e3,4 2 -3e3,5 5 -6e3,6 6 -10
Rid x sj(x)e1,1 5 20e1,6 3 -30e2,1 5 12e2,6 6 -20e3,1 1 10e3,6 6 -10
Rx s(x) Fx τ+(x) τ−(x) τ (x)1 10 001 42 -40 03 -30 100 -8 -60 85 32 110 42 22 226 -30 011 -10 -60 10
• τ+(x) is an upper bound on the total score s(x),if si(x) received then τ+(x) = τ+(x) + si(x)else τ+(x) = τ+(x) + k ’th most positive from node i
k = 1
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 103: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/103.jpg)
Exact Top-k Wavelet Coefficients: Our Solution
node 1id x s1(x)e1,1 5 20e1,2 2 7e1,3 1 6e1,4 4 -2e1,5 6 -15e1,6 3 -30
node 2id x s2(x)e2,1 5 12e2,2 4 7e2,3 1 2e2,4 2 -5e2,5 3 -14e2,6 6 -20
node 3id x s3(x)e3,1 1 10e3,2 3 6e3,3 4 5e3,4 2 -3e3,5 5 -6e3,6 6 -10
Rid x sj(x)e1,1 5 20e1,6 3 -30e2,1 5 12e2,6 6 -20e3,1 1 10e3,6 6 -10
Rx s(x) Fx τ+(x) τ−(x) τ (x)1 10 001 42 -40 03 -30 100 -8 -60 85 32 110 42 22 226 -30 011 -10 -60 10
• τ−(x) is a lower bound on the total score sum s(x),if si(x) received then τ−(x) = τ−(x) + si(x)else τ−(x) = τ−(x) + k ’th most negative score from node i
k = 1
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 104: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/104.jpg)
Exact Top-k Wavelet Coefficients: Our Solution
node 1id x s1(x)e1,1 5 20e1,2 2 7e1,3 1 6e1,4 4 -2e1,5 6 -15e1,6 3 -30
node 2id x s2(x)e2,1 5 12e2,2 4 7e2,3 1 2e2,4 2 -5e2,5 3 -14e2,6 6 -20
node 3id x s3(x)e3,1 1 10e3,2 3 6e3,3 4 5e3,4 2 -3e3,5 5 -6e3,6 6 -10
Rid x sj(x)e1,1 5 20e1,6 3 -30e2,1 5 12e2,6 6 -20e3,1 1 10e3,6 6 -10
Rx s(x) Fx τ+(x) τ−(x) τ (x)1 10 001 42 -40 03 -30 100 -8 -60 85 32 110 42 22 226 -30 011 -10 -60 10
• τ−(x) is a lower bound on the total score sum s(x),if si(x) received then τ−(x) = τ−(x) + si(x)else τ−(x) = τ−(x) + k ’th most negative score from node i
k = 1
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 105: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/105.jpg)
Exact Top-k Wavelet Coefficients: Our Solution
node 1id x s1(x)e1,1 5 20e1,2 2 7e1,3 1 6e1,4 4 -2e1,5 6 -15e1,6 3 -30
node 2id x s2(x)e2,1 5 12e2,2 4 7e2,3 1 2e2,4 2 -5e2,5 3 -14e2,6 6 -20
node 3id x s3(x)e3,1 1 10e3,2 3 6e3,3 4 5e3,4 2 -3e3,5 5 -6e3,6 6 -10
Rid x sj(x)e1,1 5 20e1,6 3 -30e2,1 5 12e2,6 6 -20e3,1 1 10e3,6 6 -10
Rx s(x) Fx τ+(x) τ−(x) τ (x)1 10 001 42 -40 03 -30 100 -8 -60 85 32 110 42 22 226 -30 011 -10 -60 10
• τ (x) is a lower bound on |s(x)| computed as,τ (x) = 0 if sign(τ+(x)) 6= sign(τ−(x))τ (x) = min(|τ+(x)|, |τ−(x)|) otherwise.
k = 1
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 106: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/106.jpg)
Exact Top-k Wavelet Coefficients: Our Solution
node 1id x s1(x)e1,1 5 20e1,2 2 7e1,3 1 6e1,4 4 -2e1,5 6 -15e1,6 3 -30
node 2id x s2(x)e2,1 5 12e2,2 4 7e2,3 1 2e2,4 2 -5e2,5 3 -14e2,6 6 -20
node 3id x s3(x)e3,1 1 10e3,2 3 6e3,3 4 5e3,4 2 -3e3,5 5 -6e3,6 6 -10
Rid x sj(x)e1,1 5 20e1,6 3 -30e2,1 5 12e2,6 6 -20e3,1 1 10e3,6 6 -10
Rx s(x) Fx τ+(x) τ−(x) τ (x)1 10 001 42 -40 03 -30 100 -8 -60 85 32 110 42 22 226 -30 011 -10 -60 10
• τ (x) is a lower bound on |s(x)| computed as,τ (x) = 0 if sign(τ+(x)) 6= sign(τ−(x))τ (x) = min(|τ+(x)|, |τ−(x)|) otherwise.
k = 1
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 107: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/107.jpg)
Exact Top-k Wavelet Coefficients: Our Solution
node 1id x s1(x)e1,1 5 20e1,2 2 7e1,3 1 6e1,4 4 -2e1,5 6 -15e1,6 3 -30
node 2id x s2(x)e2,1 5 12e2,2 4 7e2,3 1 2e2,4 2 -5e2,5 3 -14e2,6 6 -20
node 3id x s3(x)e3,1 1 10e3,2 3 6e3,3 4 5e3,4 2 -3e3,5 5 -6e3,6 6 -10
Rid x sj(x)e1,1 5 20e1,6 3 -30e2,1 5 12e2,6 6 -20e3,1 1 10e3,6 6 -10
Rx s(x) Fx τ+(x) τ−(x) τ (x)1 10 001 42 -40 03 -30 100 -8 -60 85 32 110 42 22 226 -30 011 -10 -60 10
T1 = 22, T1/m = 22/3
•We select the item with the kth largest τ (x).τ (x) is a lower bound T1 on the top-k |s(x)| for unseen item x .
k = 1
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 108: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/108.jpg)
Exact Top-k Wavelet Coefficients: Our Solution
node 1id x s1(x)e1,1 5 20e1,2 2 7e1,3 1 6e1,4 4 -2e1,5 6 -15e1,6 3 -30
node 2id x s2(x)e2,1 5 12e2,2 4 7e2,3 1 2e2,4 2 -5e2,5 3 -14e2,6 6 -20
node 3id x s3(x)e3,1 1 10e3,2 3 6e3,3 4 5e3,4 2 -3e3,5 5 -6e3,6 6 -10
Rid x sj(x)e1,1 5 20e1,6 3 -30e2,1 5 12e2,6 6 -20e3,1 1 10e3,6 6 -10
Rx s(x) Fx τ+(x) τ−(x) τ (x)1 10 001 42 -40 03 -30 100 -8 -60 85 32 110 42 22 226 -30 011 -10 -60 10
T1 = 22, T1/m = 22/3
• Any unseen item x must have at least:one si(x) > T1/m orone si(x) < −T1/mTo get into the top-k .
k = 1
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 109: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/109.jpg)
Exact Top-k Wavelet Coefficients: Our Solution
node 1id x s1(x)e1,1 5 20e1,2 2 7e1,3 1 6e1,4 4 -2e1,5 6 -15e1,6 3 -30
node 2id x s2(x)e2,1 5 12e2,2 4 7e2,3 1 2e2,4 2 -5e2,5 3 -14e2,6 6 -20
node 3id x s3(x)e3,1 1 10e3,2 3 6e3,3 4 5e3,4 2 -3e3,5 5 -6e3,6 6 -10
Rid x sj(x)e1,1 5 20e1,6 3 -30e2,1 5 12e2,6 6 -20e3,1 1 10e3,6 6 -10
Rx s(x) Fx τ+(x) τ−(x) τ (x)1 10 001 42 -40 03 -30 100 -8 -60 85 32 110 42 22 226 -30 011 -10 -60 10
T1 = 22, T1/m = 22/3
Round 1 End
√
√
√
√
√
√
k = 1
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 110: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/110.jpg)
Exact Top-k Wavelet Coefficients: Our Solution
node 1id x s1(x)e1,1 5 20e1,2 2 7e1,3 1 6e1,4 4 -2e1,5 6 -15e1,6 3 -30
node 2id x s2(x)e2,1 5 12e2,2 4 7e2,3 1 2e2,4 2 -5e2,5 3 -14e2,6 6 -20
node 3id x s3(x)e3,1 1 10e3,2 3 6e3,3 4 5e3,4 2 -3e3,5 5 -6e3,6 6 -10
Rid x sj(x)e1,1 5 20e1,6 3 -30e2,1 5 12e2,6 6 -20e3,1 1 10e3,6 6 -10
Rx s(x) Fx τ+(x) τ−(x) τ (x)1 10 001 42 -40 03 -30 100 -8 -60 85 32 110 42 22 226 -30 011 -10 -60 10
T1 = 22, T1/m = 22/3
√
√
√
√
√
√
• T1/m sent to each site.
k = 1
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 111: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/111.jpg)
Exact Top-k Wavelet Coefficients: Our Solution
node 1id x s1(x)e1,1 5 20e1,2 2 7e1,3 1 6e1,4 4 -2e1,5 6 -15e1,6 3 -30
node 2id x s2(x)e2,1 5 12e2,2 4 7e2,3 1 2e2,4 2 -5e2,5 3 -14e2,6 6 -20
node 3id x s3(x)e3,1 1 10e3,2 3 6e3,3 4 5e3,4 2 -3e3,5 5 -6e3,6 6 -10
Rid x sj(x)e1,1 5 20e1,6 3 -30e2,1 5 12e2,6 6 -20e3,1 1 10e3,6 6 -10
Rx s(x) Fx τ+(x) τ−(x) τ (x)1 10 001 42 -40 03 -30 100 -8 -60 85 32 110 42 22 226 -30 011 -10 -60 10
T1 = 22, T1/m = 22/3
√
√
√
√
√
√
• Each site finds items withsi(x) > T1/m orsi(x) < T1/m.
k = 1
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 112: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/112.jpg)
Exact Top-k Wavelet Coefficients: Our Solution
node 1id x s1(x)e1,1 5 20e1,2 2 7e1,3 1 6e1,4 4 -2e1,5 6 -15e1,6 3 -30
node 2id x s2(x)e2,1 5 12e2,2 4 7e2,3 1 2e2,4 2 -5e2,5 3 -14e2,6 6 -20
node 3id x s3(x)e3,1 1 10e3,2 3 6e3,3 4 5e3,4 2 -3e3,5 5 -6e3,6 6 -10
Rid x sj(x)e1,1 5 20e1,6 3 -30e2,1 5 12e2,6 6 -20e3,1 1 10e3,6 6 -10
Rx s(x) Fx τ+(x) τ−(x) τ (x)1 10 001 42 -40 03 -30 100 -8 -60 85 32 110 42 22 226 -30 011 -10 -60 10
T1 = 22, T1/m = 22/3
√
√
√
√
√
√
• Items with |si(x)| > T1/m are sentto coordinator.
k = 1
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 113: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/113.jpg)
Exact Top-k Wavelet Coefficients: Our Solution
node 1id x s1(x)e1,1 5 20e1,2 2 7e1,3 1 6e1,4 4 -2e1,5 6 -15e1,6 3 -30
node 2id x s2(x)e2,1 5 12e2,2 4 7e2,3 1 2e2,4 2 -5e2,5 3 -14e2,6 6 -20
node 3id x s3(x)e3,1 1 10e3,2 3 6e3,3 4 5e3,4 2 -3e3,5 5 -6e3,6 6 -10
Rx s(x) Fx τ+(x) τ−(x) τ (x)1 10 001 42 -40 03 -30 100 -8 -60 85 32 110 42 22 226 -30 011 -10 -60 10
T1 = 22, T1/m = 22/3
√
√
√
√
√
√
Rid x sj(x)e1,1 5 20e1,5 6 -15e1,6 3 -30e2,1 5 12e2,5 3 -14e2,6 6 -20e3,1 1 10e3,6 6 -10
• Items with |si(x)| > T1/m are sentto coordinator.
k = 1
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 114: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/114.jpg)
Exact Top-k Wavelet Coefficients: Our Solution
node 1id x s1(x)e1,1 5 20e1,2 2 7e1,3 1 6e1,4 4 -2e1,5 6 -15e1,6 3 -30
node 2id x s2(x)e2,1 5 12e2,2 4 7e2,3 1 2e2,4 2 -5e2,5 3 -14e2,6 6 -20
node 3id x s3(x)e3,1 1 10e3,2 3 6e3,3 4 5e3,4 2 -3e3,5 5 -6e3,6 6 -10
T1 = 22, T1/m = 22/3
√
√
√
√
√
√
Rid x sj(x)e1,1 5 20e1,5 6 -15e1,6 3 -30e2,1 5 12e2,5 3 -14e2,6 6 -20e3,1 1 10e3,6 6 -10
Rx s(x) Fx τ+(x) τ−(x) τ (x) τ ′(x)1 10 001 24.6 -4.6 0 24.63 -44 110 -36.6 -51.3 36.6 51.35 32 110 39.3 24.6 24.6 39.36 -45 111 -45 -45 45 45
• The coordinator updates the bounds foreach item it has ever received.
k = 1
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 115: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/115.jpg)
Exact Top-k Wavelet Coefficients: Our Solution
node 1id x s1(x)e1,1 5 20e1,2 2 7e1,3 1 6e1,4 4 -2e1,5 6 -15e1,6 3 -30
node 2id x s2(x)e2,1 5 12e2,2 4 7e2,3 1 2e2,4 2 -5e2,5 3 -14e2,6 6 -20
node 3id x s3(x)e3,1 1 10e3,2 3 6e3,3 4 5e3,4 2 -3e3,5 5 -6e3,6 6 -10
T1 = 22, T1/m = 22/3
√
√
√
√
√
√
Rid x sj(x)e1,1 5 20e1,5 6 -15e1,6 3 -30e2,1 5 12e2,5 3 -14e2,6 6 -20e3,1 1 10e3,6 6 -10
Rx s(x) Fx τ+(x) τ−(x) τ (x) τ ′(x)1 10 001 24.6 -4.6 0 24.63 -44 110 -36.6 -51.3 36.6 51.35 32 110 39.3 24.6 24.6 39.36 -45 111 -45 -45 45 45
• The coordinator updates the bounds foreach item it has ever received.
• Partial score sum s(5) = 20 + 12
k = 1
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 116: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/116.jpg)
Exact Top-k Wavelet Coefficients: Our Solution
node 1id x s1(x)e1,1 5 20e1,2 2 7e1,3 1 6e1,4 4 -2e1,5 6 -15e1,6 3 -30
node 2id x s2(x)e2,1 5 12e2,2 4 7e2,3 1 2e2,4 2 -5e2,5 3 -14e2,6 6 -20
node 3id x s3(x)e3,1 1 10e3,2 3 6e3,3 4 5e3,4 2 -3e3,5 5 -6e3,6 6 -10
T1 = 22, T1/m = 22/3
√
√
√
√
√
√
Rid x sj(x)e1,1 5 20e1,5 6 -15e1,6 3 -30e2,1 5 12e2,5 3 -14e2,6 6 -20e3,1 1 10e3,6 6 -10
Rx s(x) Fx τ+(x) τ−(x) τ (x) τ ′(x)1 10 001 24.6 -4.6 0 24.63 -44 110 -36.6 -51.3 36.6 51.35 32 110 39.3 24.6 24.6 39.36 -45 111 -45 -45 45 45
• The coordinator updates the bounds foreach item it has ever received.
• Receipt vector F5 = [110]
k = 1
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 117: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/117.jpg)
Exact Top-k Wavelet Coefficients: Our Solution
node 1id x s1(x)e1,1 5 20e1,2 2 7e1,3 1 6e1,4 4 -2e1,5 6 -15e1,6 3 -30
node 2id x s2(x)e2,1 5 12e2,2 4 7e2,3 1 2e2,4 2 -5e2,5 3 -14e2,6 6 -20
node 3id x s3(x)e3,1 1 10e3,2 3 6e3,3 4 5e3,4 2 -3e3,5 5 -6e3,6 6 -10
T1 = 22, T1/m = 22/3
√
√
√
√
√
√
Rid x sj(x)e1,1 5 20e1,5 6 -15e1,6 3 -30e2,1 5 12e2,5 3 -14e2,6 6 -20e3,1 1 10e3,6 6 -10
Rx s(x) Fx τ+(x) τ−(x) τ (x) τ ′(x)1 10 001 24.6 -4.6 0 24.63 -44 110 -36.6 -51.3 36.6 51.35 32 110 39.3 24.6 24.6 39.36 -45 111 -45 -45 45 45
• The coordinator updates the bounds foreach item it has ever received.
• τ+(x) is now tighter,if si(x) received then τ+(x) = τ+(x) + si(x)else τ+(x) = τ+(x) + T1/m
k = 1
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 118: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/118.jpg)
Exact Top-k Wavelet Coefficients: Our Solution
node 1id x s1(x)e1,1 5 20e1,2 2 7e1,3 1 6e1,4 4 -2e1,5 6 -15e1,6 3 -30
node 2id x s2(x)e2,1 5 12e2,2 4 7e2,3 1 2e2,4 2 -5e2,5 3 -14e2,6 6 -20
node 3id x s3(x)e3,1 1 10e3,2 3 6e3,3 4 5e3,4 2 -3e3,5 5 -6e3,6 6 -10
T1 = 22, T1/m = 22/3
√
√
√
√
√
√
Rid x sj(x)e1,1 5 20e1,5 6 -15e1,6 3 -30e2,1 5 12e2,5 3 -14e2,6 6 -20e3,1 1 10e3,6 6 -10
Rx s(x) Fx τ+(x) τ−(x) τ (x) τ ′(x)1 10 001 24.6 -4.6 0 24.63 -44 110 -36.6 -51.3 36.6 51.35 32 110 39.3 24.6 24.6 39.36 -45 111 -45 -45 45 45
• The coordinator updates the bounds foreach item it has ever received.
• τ−(x) is also tighter,if si(x) received then τ−(x) = τ−(x) + si(x)else τ−(x) = τ−(x)− T1/m
k = 1
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 119: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/119.jpg)
Exact Top-k Wavelet Coefficients: Our Solution
node 1id x s1(x)e1,1 5 20e1,2 2 7e1,3 1 6e1,4 4 -2e1,5 6 -15e1,6 3 -30
node 2id x s2(x)e2,1 5 12e2,2 4 7e2,3 1 2e2,4 2 -5e2,5 3 -14e2,6 6 -20
node 3id x s3(x)e3,1 1 10e3,2 3 6e3,3 4 5e3,4 2 -3e3,5 5 -6e3,6 6 -10
T1 = 22, T1/m = 22/3
√
√
√
√
√
√
Rid x sj(x)e1,1 5 20e1,5 6 -15e1,6 3 -30e2,1 5 12e2,5 3 -14e2,6 6 -20e3,1 1 10e3,6 6 -10
Rx s(x) Fx τ+(x) τ−(x) τ (x) τ ′(x)1 10 001 24.6 -4.6 0 24.63 -44 110 -36.6 -51.3 36.6 51.35 32 110 39.3 24.6 24.6 39.36 -45 111 -45 -45 45 45
• The coordinator updates the bounds foreach item it has ever received.
• Score absolute value bound τ (5) = min(39.3, 24.6).
k = 1
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 120: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/120.jpg)
Exact Top-k Wavelet Coefficients: Our Solution
node 1id x s1(x)e1,1 5 20e1,2 2 7e1,3 1 6e1,4 4 -2e1,5 6 -15e1,6 3 -30
node 2id x s2(x)e2,1 5 12e2,2 4 7e2,3 1 2e2,4 2 -5e2,5 3 -14e2,6 6 -20
node 3id x s3(x)e3,1 1 10e3,2 3 6e3,3 4 5e3,4 2 -3e3,5 5 -6e3,6 6 -10
T1 = 22, T1/m = 22/3
√
√
√
√
√
√
Rid x sj(x)e1,1 5 20e1,5 6 -15e1,6 3 -30e2,1 5 12e2,5 3 -14e2,6 6 -20e3,1 1 10e3,6 6 -10
Rx s(x) Fx τ+(x) τ−(x) τ (x) τ ′(x)1 10 001 24.6 -4.6 0 24.63 -44 110 -36.6 -51.3 36.6 51.35 32 110 39.3 24.6 24.6 39.36 -45 111 -45 -45 45 45
• τ ′(x) is an upper bound on |s(x)|,τ ′(x) = max{|τ+(x)|, |τ−(x)|}
• The coordinator updates the bounds foreach item it has ever received.
k = 1
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 121: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/121.jpg)
Exact Top-k Wavelet Coefficients: Our Solution
node 1id x s1(x)e1,1 5 20e1,2 2 7e1,3 1 6e1,4 4 -2e1,5 6 -15e1,6 3 -30
node 2id x s2(x)e2,1 5 12e2,2 4 7e2,3 1 2e2,4 2 -5e2,5 3 -14e2,6 6 -20
node 3id x s3(x)e3,1 1 10e3,2 3 6e3,3 4 5e3,4 2 -3e3,5 5 -6e3,6 6 -10
T1 = 22, T1/m = 22/3
√
√
√
√
√
√
Rid x sj(x)e1,1 5 20e1,5 6 -15e1,6 3 -30e2,1 5 12e2,5 3 -14e2,6 6 -20e3,1 1 10e3,6 6 -10
Rx s(x) Fx τ+(x) τ−(x) τ (x) τ ′(x)1 10 001 24.6 -4.6 0 24.63 -44 110 -36.6 -51.3 36.6 51.35 32 110 39.3 24.6 24.6 39.36 -45 111 -45 -45 45 45
• τ ′(x) is an upper bound on |s(x)|,τ ′(x) = max{|τ+(x)|, |τ−(x)|}
• The coordinator updates the bounds foreach item it has ever received.
k = 1
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 122: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/122.jpg)
Exact Top-k Wavelet Coefficients: Our Solution
node 1id x s1(x)e1,1 5 20e1,2 2 7e1,3 1 6e1,4 4 -2e1,5 6 -15e1,6 3 -30
node 2id x s2(x)e2,1 5 12e2,2 4 7e2,3 1 2e2,4 2 -5e2,5 3 -14e2,6 6 -20
node 3id x s3(x)e3,1 1 10e3,2 3 6e3,3 4 5e3,4 2 -3e3,5 5 -6e3,6 6 -10
T1 = 22, T1/m = 22/3
√
√
√
√
√
√
Rid x sj(x)e1,1 5 20e1,5 6 -15e1,6 3 -30e2,1 5 12e2,5 3 -14e2,6 6 -20e3,1 1 10e3,6 6 -10
Rx s(x) Fx τ+(x) τ−(x) τ (x) τ ′(x)1 10 001 24.6 -4.6 0 24.63 -44 110 -36.6 -51.3 36.6 51.35 32 110 39.3 24.6 24.6 39.36 -45 111 -45 -45 45 45
•We select the item x with the kth largest τ (x), which serves as anew lower bound T2 on |s(x)| for any item.
k = 1
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 123: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/123.jpg)
Exact Top-k Wavelet Coefficients: Our Solution
node 1id x s1(x)e1,1 5 20e1,2 2 7e1,3 1 6e1,4 4 -2e1,5 6 -15e1,6 3 -30
node 2id x s2(x)e2,1 5 12e2,2 4 7e2,3 1 2e2,4 2 -5e2,5 3 -14e2,6 6 -20
node 3id x s3(x)e3,1 1 10e3,2 3 6e3,3 4 5e3,4 2 -3e3,5 5 -6e3,6 6 -10
T1 = 22, T1/m = 22/3
√
√
√
√
√
√
Rid x sj(x)e1,1 5 20e1,5 6 -15e1,6 3 -30e2,1 5 12e2,5 3 -14e2,6 6 -20e3,1 1 10e3,6 6 -10
Rx s(x) Fx τ+(x) τ−(x) τ (x) τ ′(x)1 10 001 24.6 -4.6 0 24.63 -44 110 -36.6 -51.3 36.6 51.35 32 110 39.3 24.6 24.6 39.36 -45 111 -45 -45 45 45
T2 = 45
•We select the item x with the kth largest τ (x), which serves as anew lower bound T2 on |s(x)| for any item.
k = 1
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 124: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/124.jpg)
Exact Top-k Wavelet Coefficients: Our Solution
node 1id x s1(x)e1,1 5 20e1,2 2 7e1,3 1 6e1,4 4 -2e1,5 6 -15e1,6 3 -30
node 2id x s2(x)e2,1 5 12e2,2 4 7e2,3 1 2e2,4 2 -5e2,5 3 -14e2,6 6 -20
node 3id x s3(x)e3,1 1 10e3,2 3 6e3,3 4 5e3,4 2 -3e3,5 5 -6e3,6 6 -10
T1 = 22, T1/m = 22/3
√
√
√
√
√
√
Rid x sj(x)e1,1 5 20e1,5 6 -15e1,6 3 -30e2,1 5 12e2,5 3 -14e2,6 6 -20e3,1 1 10e3,6 6 -10
Rx s(x) Fx τ+(x) τ−(x) τ (x) τ ′(x)1 10 001 24.6 -4.6 0 24.63 -44 110 -36.6 -51.3 36.6 51.35 32 110 39.3 24.6 24.6 39.36 -45 111 -45 -45 45 45
T2 = 45
• Any item with τ ′(x) < T2 cannot be in the top-k and is pruned fromR .
k = 1
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 125: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/125.jpg)
Exact Top-k Wavelet Coefficients: Our Solution
node 1id x s1(x)e1,1 5 20e1,2 2 7e1,3 1 6e1,4 4 -2e1,5 6 -15e1,6 3 -30
node 2id x s2(x)e2,1 5 12e2,2 4 7e2,3 1 2e2,4 2 -5e2,5 3 -14e2,6 6 -20
node 3id x s3(x)e3,1 1 10e3,2 3 6e3,3 4 5e3,4 2 -3e3,5 5 -6e3,6 6 -10
T1 = 22, T1/m = 22/3
√
√
√
√
√
√
Rid x sj(x)e1,1 5 20e1,5 6 -15e1,6 3 -30e2,1 5 12e2,5 3 -14e2,6 6 -20e3,1 1 10e3,6 6 -10
Rx s(x) Fx τ+(x) τ−(x) τ (x) τ ′(x)1 10 001 24.6 -4.6 0 24.63 -44 110 -36.6 -51.3 36.6 51.35 32 110 39.3 24.6 24.6 39.36 -45 111 -45 -45 45 45
T2 = 45
• Any remaining items with a 0 in vector Fx are selected.
k = 1
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 126: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/126.jpg)
Exact Top-k Wavelet Coefficients: Our Solution
node 1id x s1(x)e1,1 5 20e1,2 2 7e1,3 1 6e1,4 4 -2e1,5 6 -15e1,6 3 -30
node 2id x s2(x)e2,1 5 12e2,2 4 7e2,3 1 2e2,4 2 -5e2,5 3 -14e2,6 6 -20
node 3id x s3(x)e3,1 1 10e3,2 3 6e3,3 4 5e3,4 2 -3e3,5 5 -6e3,6 6 -10
T1 = 22, T1/m = 22/3
√
√
√
√
√
√
Rid x sj(x)e1,1 5 20e1,5 6 -15e1,6 3 -30e2,1 5 12e2,5 3 -14e2,6 6 -20e3,1 1 10e3,6 6 -10
Rx s(x) Fx τ+(x) τ−(x) τ (x) τ ′(x)1 10 001 24.6 -4.6 0 24.63 -44 110 -36.6 -51.3 36.6 51.35 32 110 39.3 24.6 24.6 39.36 -45 111 -45 -45 45 45
T2 = 45
Round 2 End
√ √
k = 1
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 127: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/127.jpg)
Exact Top-k Wavelet Coefficients: Our Solution
node 1id x s1(x)e1,1 5 20e1,2 2 7e1,3 1 6e1,4 4 -2e1,5 6 -15e1,6 3 -30
node 2id x s2(x)e2,1 5 12e2,2 4 7e2,3 1 2e2,4 2 -5e2,5 3 -14e2,6 6 -20
node 3id x s3(x)e3,1 1 10e3,2 3 6e3,3 4 5e3,4 2 -3e3,5 5 -6e3,6 6 -10
T1 = 22, T1/m = 22/3
√
√
√
√
√
√
Rid x sj(x)e1,1 5 20e1,5 6 -15e1,6 3 -30e2,1 5 12e2,5 3 -14e2,6 6 -20e3,1 1 10e3,6 6 -10
Rx s(x) Fx τ+(x) τ−(x) τ (x) τ ′(x)1 10 001 24.6 -4.6 0 24.63 -44 110 -36.6 -51.3 36.6 51.35 32 110 39.3 24.6 24.6 39.36 -45 111 -45 -45 45 45
T2 = 45
√ √
s3(3) =?
• The coordinator asks for missing scoresfor items still in R .
k = 1
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 128: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/128.jpg)
Exact Top-k Wavelet Coefficients: Our Solution
node 1id x s1(x)e1,1 5 20e1,2 2 7e1,3 1 6e1,4 4 -2e1,5 6 -15e1,6 3 -30
node 2id x s2(x)e2,1 5 12e2,2 4 7e2,3 1 2e2,4 2 -5e2,5 3 -14e2,6 6 -20
node 3id x s3(x)e3,1 1 10e3,2 3 6e3,3 4 5e3,4 2 -3e3,5 5 -6e3,6 6 -10
T1 = 22, T1/m = 22/3
√
√
√
√
√
√
Rid x sj(x)e1,1 5 20e1,5 6 -15e1,6 3 -30e2,1 5 12e2,5 3 -14e2,6 6 -20e3,1 1 10e3,6 6 -10
Rx s(x) Fx τ+(x) τ−(x) τ (x) τ ′(x)1 10 001 24.6 -4.6 0 24.63 -44 110 -36.6 -51.3 36.6 51.35 32 110 39.3 24.6 24.6 39.36 -45 111 -45 -45 45 45
T2 = 45
√ √
s3(3) =?
k = 1
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 129: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/129.jpg)
Exact Top-k Wavelet Coefficients: Our Solution
node 1id x s1(x)e1,1 5 20e1,2 2 7e1,3 1 6e1,4 4 -2e1,5 6 -15e1,6 3 -30
node 2id x s2(x)e2,1 5 12e2,2 4 7e2,3 1 2e2,4 2 -5e2,5 3 -14e2,6 6 -20
node 3id x s3(x)e3,1 1 10e3,2 3 6e3,3 4 5e3,4 2 -3e3,5 5 -6e3,6 6 -10
T1 = 22, T1/m = 22/3
√
√
√
√
√
√
Rx s(x) Fx τ+(x) τ−(x) τ (x) τ ′(x)1 10 001 24.6 -4.6 0 24.63 -44 110 -36.6 -51.3 36.6 51.35 32 110 39.3 24.6 24.6 39.36 -45 111 -45 -45 45 45
T2 = 45
√ √
s3(3) =?
Rid x sj(x)e1,1 5 20e1,5 6 -15e1,6 3 -30e2,1 5 12e2,5 3 -14e2,6 6 -20e3,1 1 10e3,2 3 6e3,6 6 -10
k = 1
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 130: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/130.jpg)
Exact Top-k Wavelet Coefficients: Our Solution
node 1id x s1(x)e1,1 5 20e1,2 2 7e1,3 1 6e1,4 4 -2e1,5 6 -15e1,6 3 -30
node 2id x s2(x)e2,1 5 12e2,2 4 7e2,3 1 2e2,4 2 -5e2,5 3 -14e2,6 6 -20
node 3id x s3(x)e3,1 1 10e3,2 3 6e3,3 4 5e3,4 2 -3e3,5 5 -6e3,6 6 -10
T1 = 22, T1/m = 22/3
√
√
√
√
√
√
Rx s(x) Fx τ+(x) τ−(x) τ (x) τ ′(x)1 10 001 24.6 -4.6 0 24.63 -44 110 -36.6 -51.3 36.6 51.35 32 110 39.3 24.6 24.6 39.36 -45 111 -45 -45 45 45
T2 = 45
√ √
s3(3) =?
Rid x sj(x)e1,1 5 20e1,5 6 -15e1,6 3 -30e2,1 5 12e2,5 3 -14e2,6 6 -20e3,1 1 10e3,2 3 6e3,6 6 -10
k = 1
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 131: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/131.jpg)
Exact Top-k Wavelet Coefficients: Our Solution
node 1id x s1(x)e1,1 5 20e1,2 2 7e1,3 1 6e1,4 4 -2e1,5 6 -15e1,6 3 -30
node 2id x s2(x)e2,1 5 12e2,2 4 7e2,3 1 2e2,4 2 -5e2,5 3 -14e2,6 6 -20
node 3id x s3(x)e3,1 1 10e3,2 3 6e3,3 4 5e3,4 2 -3e3,5 5 -6e3,6 6 -10
T1 = 22, T1/m = 22/3
√
√
√
√
√
√
T2 = 45
√ √
s3(3) =?
Rid x sj(x)e1,1 5 20e1,5 6 -15e1,6 3 -30e2,1 5 12e2,5 3 -14e2,6 6 -20e3,1 1 10e3,2 3 6e3,6 6 -10
Rx s(x) Fi1 10 0013 -38 1115 32 1106 -45 111
• After collecting all scores, the coordinatorcan determine top-k |s(x)|.
k = 1
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 132: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/132.jpg)
Exact Top-k Wavelet Coefficients: Our Solution
node 1id x s1(x)e1,1 5 20e1,2 2 7e1,3 1 6e1,4 4 -2e1,5 6 -15e1,6 3 -30
node 2id x s2(x)e2,1 5 12e2,2 4 7e2,3 1 2e2,4 2 -5e2,5 3 -14e2,6 6 -20
node 3id x s3(x)e3,1 1 10e3,2 3 6e3,3 4 5e3,4 2 -3e3,5 5 -6e3,6 6 -10
T1 = 22, T1/m = 22/3
√
√
√
√
√
√
T2 = 45
√ √
s3(3) =?
Rid x sj(x)e1,1 5 20e1,5 6 -15e1,6 3 -30e2,1 5 12e2,5 3 -14e2,6 6 -20e3,1 1 10e3,2 3 6e3,6 6 -10
Rx s(x) Fi1 10 0013 -38 1115 32 1106 -45 111
• After collecting all scores, the coordinatorcan determine top-k |s(x)|.
k = 1
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 133: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/133.jpg)
Exact Top-k Wavelet Coefficients: Our Solution
node 1id x s1(x)e1,1 5 20e1,2 2 7e1,3 1 6e1,4 4 -2e1,5 6 -15e1,6 3 -30
node 2id x s2(x)e2,1 5 12e2,2 4 7e2,3 1 2e2,4 2 -5e2,5 3 -14e2,6 6 -20
node 3id x s3(x)e3,1 1 10e3,2 3 6e3,3 4 5e3,4 2 -3e3,5 5 -6e3,6 6 -10
T1 = 22, T1/m = 22/3
√
√
√
√
√
√
T2 = 45
√ √
s3(3) =?
Rid x sj(x)e1,1 5 20e1,5 6 -15e1,6 3 -30e2,1 5 12e2,5 3 -14e2,6 6 -20e3,1 1 10e3,2 3 6e3,6 6 -10
Rx s(x) Fi1 10 0013 -38 1115 32 1106 -45 111
s(6) = −45?
• After collecting all scores, the coordinatorcan determine top-k |s(x)|.
k = 1
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 134: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/134.jpg)
Exact Top-k Wavelet Coefficients: Our Solution
node 1id x s1(x)e1,1 5 20e1,2 2 7e1,3 1 6e1,4 4 -2e1,5 6 -15e1,6 3 -30
node 2id x s2(x)e2,1 5 12e2,2 4 7e2,3 1 2e2,4 2 -5e2,5 3 -14e2,6 6 -20
node 3id x s3(x)e3,1 1 10e3,2 3 6e3,3 4 5e3,4 2 -3e3,5 5 -6e3,6 6 -10
T1 = 22, T1/m = 22/3
√
√
√
√
√
√
T2 = 45
√ √
Rid x sj(x)e1,1 5 20e1,5 6 -15e1,6 3 -30e2,1 5 12e2,5 3 -14e2,6 6 -20e3,1 1 10e3,2 3 6e3,6 6 -10
Rx s(x) Fi1 10 0013 -38 1115 32 1106 -45 111
s(6) = −45?
Round 3 End
k = 1
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 135: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/135.jpg)
Outline
1 Introduction and MotivationHistogramsMapReduce and Hadoop
2 Exact Top-k Wavelet CoefficientsNaive SolutionHadoop Wavelet Top-k : Our Efficient Exact Solution
3 Approximate Top-k Wavelet CoefficientsLinearly Combinable Sketch MethodOur First Sampling Based ApproachAn Improved Sampling ApproachTwo-Level Sampling
4 Experiments
5 ConclusionsHadoop Wavelet Top-k in Hadoop
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 136: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/136.jpg)
Approximate Top-k Wavelet Coefficients
Hadoop Wavelet Top-k is a good solution if the exact top-k |wi |must be retrieved, but requires multiple phases.
If we are allowed an approximation, we could further improve:1 communication cost2 number of MapReduce rounds3 amount of I/O incurred
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 137: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/137.jpg)
Approximate Top-k Wavelet Coefficients
Hadoop Wavelet Top-k is a good solution if the exact top-k |wi |must be retrieved, but requires multiple phases.
If we are allowed an approximation, we could further improve:
1 communication cost2 number of MapReduce rounds3 amount of I/O incurred
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 138: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/138.jpg)
Approximate Top-k Wavelet Coefficients
Hadoop Wavelet Top-k is a good solution if the exact top-k |wi |must be retrieved, but requires multiple phases.
If we are allowed an approximation, we could further improve:1 communication cost
2 number of MapReduce rounds3 amount of I/O incurred
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 139: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/139.jpg)
Approximate Top-k Wavelet Coefficients
Hadoop Wavelet Top-k is a good solution if the exact top-k |wi |must be retrieved, but requires multiple phases.
If we are allowed an approximation, we could further improve:1 communication cost2 number of MapReduce rounds
3 amount of I/O incurred
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 140: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/140.jpg)
Approximate Top-k Wavelet Coefficients
Hadoop Wavelet Top-k is a good solution if the exact top-k |wi |must be retrieved, but requires multiple phases.
If we are allowed an approximation, we could further improve:1 communication cost2 number of MapReduce rounds3 amount of I/O incurred
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 141: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/141.jpg)
Approximate Top-k Wavelet Coefficients
Some natural improvement attempts:
1 Approximate distributed top-k.2 Approximating local coefficients with a linearly combinable sketch.
For set A = A1 ∪ A2,Sketch(A) = Sketch(A1) op Sketch(A2) for operator op.The state of the art wavelet sketch is the GCS Sketch [CGS06].The GCS gives us, for v = v1 + v2GCS(v) = GCS(v1) + GCS(v2)
3 Random sampling techniques.
[GCS06] G. Cormode, et al. Fast approximate wavelet tracking on streams. In EDBT, 2006.
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 142: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/142.jpg)
Approximate Top-k Wavelet Coefficients
Some natural improvement attempts:1 Approximate distributed top-k.
2 Approximating local coefficients with a linearly combinable sketch.
For set A = A1 ∪ A2,Sketch(A) = Sketch(A1) op Sketch(A2) for operator op.The state of the art wavelet sketch is the GCS Sketch [CGS06].The GCS gives us, for v = v1 + v2GCS(v) = GCS(v1) + GCS(v2)
3 Random sampling techniques.
[GCS06] G. Cormode, et al. Fast approximate wavelet tracking on streams. In EDBT, 2006.
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 143: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/143.jpg)
Approximate Top-k Wavelet Coefficients
Some natural improvement attempts:1 Approximate distributed top-k.2 Approximating local coefficients with a linearly combinable sketch.
For set A = A1 ∪ A2,Sketch(A) = Sketch(A1) op Sketch(A2) for operator op.The state of the art wavelet sketch is the GCS Sketch [CGS06].The GCS gives us, for v = v1 + v2GCS(v) = GCS(v1) + GCS(v2)
3 Random sampling techniques.
[GCS06] G. Cormode, et al. Fast approximate wavelet tracking on streams. In EDBT, 2006.
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 144: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/144.jpg)
Approximate Top-k Wavelet Coefficients
Some natural improvement attempts:1 Approximate distributed top-k.2 Approximating local coefficients with a linearly combinable sketch.
For set A = A1 ∪ A2,Sketch(A) = Sketch(A1) op Sketch(A2) for operator op.
The state of the art wavelet sketch is the GCS Sketch [CGS06].The GCS gives us, for v = v1 + v2GCS(v) = GCS(v1) + GCS(v2)
3 Random sampling techniques.
[GCS06] G. Cormode, et al. Fast approximate wavelet tracking on streams. In EDBT, 2006.
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 145: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/145.jpg)
Approximate Top-k Wavelet Coefficients
Some natural improvement attempts:1 Approximate distributed top-k.2 Approximating local coefficients with a linearly combinable sketch.
For set A = A1 ∪ A2,Sketch(A) = Sketch(A1) op Sketch(A2) for operator op.The state of the art wavelet sketch is the GCS Sketch [CGS06].
The GCS gives us, for v = v1 + v2GCS(v) = GCS(v1) + GCS(v2)
3 Random sampling techniques.
[GCS06] G. Cormode, et al. Fast approximate wavelet tracking on streams. In EDBT, 2006.
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 146: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/146.jpg)
Approximate Top-k Wavelet Coefficients
Some natural improvement attempts:1 Approximate distributed top-k.2 Approximating local coefficients with a linearly combinable sketch.
For set A = A1 ∪ A2,Sketch(A) = Sketch(A1) op Sketch(A2) for operator op.The state of the art wavelet sketch is the GCS Sketch [CGS06].The GCS gives us, for v = v1 + v2GCS(v) = GCS(v1) + GCS(v2)
3 Random sampling techniques.
[GCS06] G. Cormode, et al. Fast approximate wavelet tracking on streams. In EDBT, 2006.
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 147: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/147.jpg)
Approximate Top-k Wavelet Coefficients
Some natural improvement attempts:1 Approximate distributed top-k.2 Approximating local coefficients with a linearly combinable sketch.
For set A = A1 ∪ A2,Sketch(A) = Sketch(A1) op Sketch(A2) for operator op.The state of the art wavelet sketch is the GCS Sketch [CGS06].The GCS gives us, for v = v1 + v2GCS(v) = GCS(v1) + GCS(v2)
3 Random sampling techniques.
[GCS06] G. Cormode, et al. Fast approximate wavelet tracking on streams. In EDBT, 2006.
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 148: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/148.jpg)
Outline
1 Introduction and MotivationHistogramsMapReduce and Hadoop
2 Exact Top-k Wavelet CoefficientsNaive SolutionHadoop Wavelet Top-k : Our Efficient Exact Solution
3 Approximate Top-k Wavelet CoefficientsLinearly Combinable Sketch MethodOur First Sampling Based ApproachAn Improved Sampling ApproachTwo-Level Sampling
4 Experiments
5 ConclusionsHadoop Wavelet Top-k in Hadoop
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 149: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/149.jpg)
Approximate Top-k Wavelet Coefficients: Sketch
split 1split 2split 3split 4
JobTracker
(x , null)Mapper
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 150: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/150.jpg)
Approximate Top-k Wavelet Coefficients: Sketch
split 1split 2split 3split 4
JobTracker
(x , null)Mapper
(x , 1)
In-MemoryMap
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 151: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/151.jpg)
Approximate Top-k Wavelet Coefficients: Sketch
split 1split 2split 3split 4
JobTracker
(x , null)Mapper
In-MemoryMap
close()
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 152: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/152.jpg)
Approximate Top-k Wavelet Coefficients: Sketch
split 1split 2split 3split 4
JobTracker
(x , null)Mapper
In-MemoryMap
(x , vj(x))
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 153: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/153.jpg)
Approximate Top-k Wavelet Coefficients: Sketch
split 1split 2split 3split 4
JobTracker
(x , null)Mapper
In-MemoryMap
(x , vj(x))
Sketch(x , vj(x))
close()
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 154: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/154.jpg)
Approximate Top-k Wavelet Coefficients: Sketch
split 1split 2split 3split 4
JobTracker
(x , null)Mapper
wavelet sketch
p1
In-MemoryMap
(x , vj(x))
Sketch(x , vj(x))
close()
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 155: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/155.jpg)
Approximate Top-k Wavelet Coefficients: Sketch
split 1split 2split 3split 4
JobTracker
(x , null)Mapper
wavelet sketch
Reducerp1
In-MemoryMap
wavelet sketch(x , vj(x))
Sketch(x , vj(x))
close()
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 156: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/156.jpg)
Approximate Top-k Wavelet Coefficients: Sketch
split 1split 2split 3split 4
JobTracker
(x , null)Mapper
wavelet sketch
Reducerp1
In-MemoryMap
wavelet sketch(x , vj(x))
Combine Sketches
wavelet sketch
Sketch(x , vj(x))
close()
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 157: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/157.jpg)
Approximate Top-k Wavelet Coefficients: Sketch
split 1split 2split 3split 4
JobTracker
(x , null)Mapper
wavelet sketch
Reducerp1
In-MemoryMap
wavelet sketch(x , vj(x))
Combine Sketches
wavelet sketch
Sketch(x , vj(x))
Wavelet Sketch top-k
close()
close()
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 158: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/158.jpg)
Approximate Top-k Wavelet Coefficients: Sketch
split 1split 2split 3split 4
JobTracker
(x , null)Mapper
wavelet sketch
Reducerp1
In-MemoryMap
wavelet sketch(x , vj(x))
Combine Sketches
wavelet sketch
Sketch(x , vj(x))
Wavelet Sketch top-k
top-k |wi |close()
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 159: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/159.jpg)
Approximate Top-k Wavelet Coefficients: Sketch
split 1split 2split 3split 4
JobTracker
(x , null)Mapper
wavelet sketch
Reducerp1
In-MemoryMap
wavelet sketch(x , vj(x))
Combine Sketches
wavelet sketch
o1top-k |wi |
Sketch(x , vj(x))
Wavelet Sketch top-k
top-k |wi |close()
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 160: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/160.jpg)
Outline
1 Introduction and MotivationHistogramsMapReduce and Hadoop
2 Exact Top-k Wavelet CoefficientsNaive SolutionHadoop Wavelet Top-k : Our Efficient Exact Solution
3 Approximate Top-k Wavelet CoefficientsLinearly Combinable Sketch MethodOur First Sampling Based ApproachAn Improved Sampling ApproachTwo-Level Sampling
4 Experiments
5 ConclusionsHadoop Wavelet Top-k in Hadoop
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 161: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/161.jpg)
Approximate Top-k Wavelet Coefficients: Basic RandomSampling
nj Records in split j
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 162: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/162.jpg)
Approximate Top-k Wavelet Coefficients: Basic RandomSampling
nj Records in split j
Well known fact: to approximate each v(x) with standarddeviation σ = O(εn) a sample of size Θ(1/ε2) is required.
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 163: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/163.jpg)
Approximate Top-k Wavelet Coefficients: Basic RandomSampling
Node j samples tj = nj · p records where p = 1/ε2n.
nj Records in split j
Well known fact: to approximate each v(x) with standarddeviation σ = O(εn) a sample of size Θ(1/ε2) is required.
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 164: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/164.jpg)
Approximate Top-k Wavelet Coefficients: Basic RandomSampling
Node j samples tj = nj · p records where p = 1/ε2n.
nj Records in split j
Well known fact: to approximate each v(x) with standarddeviation σ = O(εn) a sample of size Θ(1/ε2) is required.
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 165: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/165.jpg)
Approximate Top-k Wavelet Coefficients: Basic RandomSampling
sj(x): Sampled Frequency Counts
Node j samples tj = nj · p records where p = 1/ε2n.
nj Records in split j
Sample
Well known fact: to approximate each v(x) with standarddeviation σ = O(εn) a sample of size Θ(1/ε2) is required.
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 166: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/166.jpg)
Approximate Top-k Wavelet Coefficients: Basic RandomSampling
sj(x): Sampled Frequency Counts
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 167: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/167.jpg)
Approximate Top-k Wavelet Coefficients: Basic RandomSampling
sj(x): Sampled Frequency Counts
Emitted sj(x)
Emit
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 168: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/168.jpg)
Approximate Top-k Wavelet Coefficients: Basic RandomSampling
sj(x): Sampled Frequency Counts
Emitted sj(x)
Emit
Coordinator
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 169: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/169.jpg)
Approximate Top-k Wavelet Coefficients: Basic RandomSampling
Coordinator
Emitted sm(x)
Emitted s1(x)
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 170: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/170.jpg)
Approximate Top-k Wavelet Coefficients: Basic RandomSampling
Coordinator
Emitted sm(x)
Construct
s(x)
s(x) = ∑mj=1 sj(x)
234518971673
189104
176215433451
20121356
Emitted s1(x)
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 171: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/171.jpg)
Approximate Top-k Wavelet Coefficients: Basic RandomSampling
Coordinator
Emitted sm(x)
Construct
s(x)
s(x) = ∑mj=1 sj(x)
234518971673
189104
176215433451
20121356
Construct
v(x)
Emitted s1(x)2345/p
1897/p
1673/p
189/p
10/p
4/p
1762/p
1543/p
3451/p
20/p
12/p
1356/p
v(x) = s(x)/p is v(x)’sunbiased estimator
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 172: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/172.jpg)
Approximate Top-k Wavelet Coefficients: Basic RandomSampling
Note: ε must be small for v to approximate v well.
Typical values for ε are 10−4 to 10−6.
The communication for basic sampling is O(1/ε2).
With 1 byte keys, 100MB to 1TB of data must be communicated!
We improve basic random sampling with Improved Sampling.
Key idea: ignore sampled keys with small frequencies in a split.
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 173: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/173.jpg)
Approximate Top-k Wavelet Coefficients: Basic RandomSampling
Note: ε must be small for v to approximate v well.
Typical values for ε are 10−4 to 10−6.
The communication for basic sampling is O(1/ε2).
With 1 byte keys, 100MB to 1TB of data must be communicated!
We improve basic random sampling with Improved Sampling.
Key idea: ignore sampled keys with small frequencies in a split.
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 174: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/174.jpg)
Approximate Top-k Wavelet Coefficients: Basic RandomSampling
Note: ε must be small for v to approximate v well.
Typical values for ε are 10−4 to 10−6.
The communication for basic sampling is O(1/ε2).
With 1 byte keys, 100MB to 1TB of data must be communicated!
We improve basic random sampling with Improved Sampling.
Key idea: ignore sampled keys with small frequencies in a split.
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 175: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/175.jpg)
Approximate Top-k Wavelet Coefficients: Basic RandomSampling
Note: ε must be small for v to approximate v well.
Typical values for ε are 10−4 to 10−6.
The communication for basic sampling is O(1/ε2).
With 1 byte keys, 100MB to 1TB of data must be communicated!
We improve basic random sampling with Improved Sampling.
Key idea: ignore sampled keys with small frequencies in a split.
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 176: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/176.jpg)
Approximate Top-k Wavelet Coefficients: Basic RandomSampling
Note: ε must be small for v to approximate v well.
Typical values for ε are 10−4 to 10−6.
The communication for basic sampling is O(1/ε2).
With 1 byte keys, 100MB to 1TB of data must be communicated!
We improve basic random sampling with Improved Sampling.
Key idea: ignore sampled keys with small frequencies in a split.
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 177: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/177.jpg)
Approximate Top-k Wavelet Coefficients: Basic RandomSampling
Note: ε must be small for v to approximate v well.
Typical values for ε are 10−4 to 10−6.
The communication for basic sampling is O(1/ε2).
With 1 byte keys, 100MB to 1TB of data must be communicated!
We improve basic random sampling with Improved Sampling.
Key idea: ignore sampled keys with small frequencies in a split.
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 178: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/178.jpg)
Outline
1 Introduction and MotivationHistogramsMapReduce and Hadoop
2 Exact Top-k Wavelet CoefficientsNaive SolutionHadoop Wavelet Top-k : Our Efficient Exact Solution
3 Approximate Top-k Wavelet CoefficientsLinearly Combinable Sketch MethodOur First Sampling Based ApproachAn Improved Sampling ApproachTwo-Level Sampling
4 Experiments
5 ConclusionsHadoop Wavelet Top-k in Hadoop
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 179: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/179.jpg)
Approximate Top-k Wavelet Coefficients: ImprovedSampling
nj Records in split
Each node sends at most tj/(εtj) = 1/ε keys.
The total communication is O(m/ε).
E[v(x)] may be εn away from v(x) as sj(x) < εtj are ignored.
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 180: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/180.jpg)
Approximate Top-k Wavelet Coefficients: ImprovedSampling
Node j samples tj = nj · p records using
Basic Sampling, where p = 1/ε2n.
nj Records in split
Each node sends at most tj/(εtj) = 1/ε keys.
The total communication is O(m/ε).
E[v(x)] may be εn away from v(x) as sj(x) < εtj are ignored.
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 181: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/181.jpg)
Approximate Top-k Wavelet Coefficients: ImprovedSampling
Node j samples tj = nj · p records using
Basic Sampling, where p = 1/ε2n.
nj Records in split
Each node sends at most tj/(εtj) = 1/ε keys.
The total communication is O(m/ε).
E[v(x)] may be εn away from v(x) as sj(x) < εtj are ignored.
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 182: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/182.jpg)
Approximate Top-k Wavelet Coefficients: ImprovedSampling
sj(x): Sampled Frequency Counts
Node j samples tj = nj · p records using
Basic Sampling, where p = 1/ε2n.
nj Records in split
Sample
Each node sends at most tj/(εtj) = 1/ε keys.
The total communication is O(m/ε).
E[v(x)] may be εn away from v(x) as sj(x) < εtj are ignored.
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 183: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/183.jpg)
Approximate Top-k Wavelet Coefficients: ImprovedSampling
sj(x): Sampled Frequency Counts
Node j sends (x , sj(x)) only if sj(x) > εtj .
• The error in s(x) is ≤ ∑mj=1 εtj = εpn = 1/ε.
Node j sends (x , sj(x)) only if sj(x) > εtj .
• The error in s(x) is ≤ ∑mj=1 εtj = εpn = 1/ε.
Each node sends at most tj/(εtj) = 1/ε keys.
The total communication is O(m/ε).
E[v(x)] may be εn away from v(x) as sj(x) < εtj are ignored.
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 184: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/184.jpg)
Approximate Top-k Wavelet Coefficients: ImprovedSampling
sj(x): Sampled Frequency Counts
Node j sends (x , sj(x)) only if sj(x) > εtj .
• The error in s(x) is ≤ ∑mj=1 εtj = εpn = 1/ε.
Node j sends (x , sj(x)) only if sj(x) > εtj .
• The error in s(x) is ≤ ∑mj=1 εtj = εpn = 1/ε.
Each node sends at most tj/(εtj) = 1/ε keys.
The total communication is O(m/ε).
E[v(x)] may be εn away from v(x) as sj(x) < εtj are ignored.
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 185: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/185.jpg)
Approximate Top-k Wavelet Coefficients: ImprovedSampling
sj(x): Sampled Frequency Counts
Emitted sj(x)
Emit
Node j sends (x , sj(x)) only if sj(x) > εtj .
• The error in s(x) is ≤ ∑mj=1 εtj = εpn = 1/ε.
Node j sends (x , sj(x)) only if sj(x) > εtj .
• The error in s(x) is ≤ ∑mj=1 εtj = εpn = 1/ε.
Each node sends at most tj/(εtj) = 1/ε keys.
The total communication is O(m/ε).
E[v(x)] may be εn away from v(x) as sj(x) < εtj are ignored.
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 186: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/186.jpg)
Approximate Top-k Wavelet Coefficients: ImprovedSampling
sj(x): Sampled Frequency Counts
Emitted sj(x)
Emit
Coordinator
Node j sends (x , sj(x)) only if sj(x) > εtj .
• The error in s(x) is ≤ ∑mj=1 εtj = εpn = 1/ε.
Node j sends (x , sj(x)) only if sj(x) > εtj .
• The error in s(x) is ≤ ∑mj=1 εtj = εpn = 1/ε.
Each node sends at most tj/(εtj) = 1/ε keys.
The total communication is O(m/ε).
E[v(x)] may be εn away from v(x) as sj(x) < εtj are ignored.
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 187: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/187.jpg)
Approximate Top-k Wavelet Coefficients: ImprovedSampling
Coordinator
Emitted sm(x)
Emitted s1(x)
Each node sends at most tj/(εtj) = 1/ε keys.
The total communication is O(m/ε).
E[v(x)] may be εn away from v(x) as sj(x) < εtj are ignored.
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 188: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/188.jpg)
Approximate Top-k Wavelet Coefficients: ImprovedSampling
Coordinator
Emitted sm(x)
Construct
s(x)
s(x) = ∑mj=1 sj(x)
234518971673
189
176215433451
1356
Emitted s1(x)
234
Each node sends at most tj/(εtj) = 1/ε keys.
The total communication is O(m/ε).
E[v(x)] may be εn away from v(x) as sj(x) < εtj are ignored.
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 189: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/189.jpg)
Approximate Top-k Wavelet Coefficients: ImprovedSampling
Coordinator
Emitted sm(x)
Construct
s(x)
s(x) = ∑mj=1 sj(x)
234518971673
189
176215433451
1356
Construct
v(x)
Emitted s1(x)
v(x) = s(x)/p
2345/p
1897/p
1673/p
189/p
1762/p
1543/p
3451/p
1356/p
234 234/p
Each node sends at most tj/(εtj) = 1/ε keys.
The total communication is O(m/ε).
E[v(x)] may be εn away from v(x) as sj(x) < εtj are ignored.
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 190: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/190.jpg)
Approximate Top-k Wavelet Coefficients: ImprovedSampling
Coordinator
Emitted sm(x)
Construct
s(x)
s(x) = ∑mj=1 sj(x)
234518971673
189
176215433451
1356
Construct
v(x)
Emitted s1(x)
v(x) = s(x)/p
2345/p
1897/p
1673/p
189/p
1762/p
1543/p
3451/p
1356/p
234 234/p
Each node sends at most tj/(εtj) = 1/ε keys.
The total communication is O(m/ε).
E[v(x)] may be εn away from v(x) as sj(x) < εtj are ignored.
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 191: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/191.jpg)
Approximate Top-k Wavelet Coefficients: ImprovedSampling
Coordinator
Emitted sm(x)
Construct
s(x)
s(x) = ∑mj=1 sj(x)
234518971673
189
176215433451
1356
Construct
v(x)
Emitted s1(x)
v(x) = s(x)/p
2345/p
1897/p
1673/p
189/p
1762/p
1543/p
3451/p
1356/p
234 234/p
Each node sends at most tj/(εtj) = 1/ε keys.
The total communication is O(m/ε).
E[v(x)] may be εn away from v(x) as sj(x) < εtj are ignored.
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 192: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/192.jpg)
Approximate Top-k Wavelet Coefficients: ImprovedSampling
Coordinator
Emitted sm(x)
Construct
s(x)
s(x) = ∑mj=1 sj(x)
234518971673
189
176215433451
1356
Construct
v(x)
Emitted s1(x)
v(x) = s(x)/p
2345/p
1897/p
1673/p
189/p
1762/p
1543/p
3451/p
1356/p
234 234/p
Each node sends at most tj/(εtj) = 1/ε keys.
The total communication is O(m/ε).
E[v(x)] may be εn away from v(x) as sj(x) < εtj are ignored.
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 193: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/193.jpg)
Outline
1 Introduction and MotivationHistogramsMapReduce and Hadoop
2 Exact Top-k Wavelet CoefficientsNaive SolutionHadoop Wavelet Top-k : Our Efficient Exact Solution
3 Approximate Top-k Wavelet CoefficientsLinearly Combinable Sketch MethodOur First Sampling Based ApproachAn Improved Sampling ApproachTwo-Level Sampling
4 Experiments
5 ConclusionsHadoop Wavelet Top-k in Hadoop
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 194: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/194.jpg)
Approximate Top-k Wavelet Coefficients: Two-LevelSampling
nj Records in split
To construct s(x).If (x , sj(x)) received, ρ(x) = ρ(x) + sj(x).Else if (x , null) received, M(x) = M(x) + 1.
Finally, s(x) = ρ(x) + M(x)/ε√
m.
Then, v(x) = s(x)/p is an unbiased estimator for v(x).
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 195: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/195.jpg)
Approximate Top-k Wavelet Coefficients: Two-LevelSampling
Node j samples tj = nj · p records using
Basic Sampling, where p = 1/ε2n.
nj Records in split
To construct s(x).If (x , sj(x)) received, ρ(x) = ρ(x) + sj(x).Else if (x , null) received, M(x) = M(x) + 1.
Finally, s(x) = ρ(x) + M(x)/ε√
m.
Then, v(x) = s(x)/p is an unbiased estimator for v(x).
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 196: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/196.jpg)
Approximate Top-k Wavelet Coefficients: Two-LevelSampling
Node j samples tj = nj · p records using
Basic Sampling, where p = 1/ε2n.
nj Records in split
To construct s(x).If (x , sj(x)) received, ρ(x) = ρ(x) + sj(x).Else if (x , null) received, M(x) = M(x) + 1.
Finally, s(x) = ρ(x) + M(x)/ε√
m.
Then, v(x) = s(x)/p is an unbiased estimator for v(x).
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 197: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/197.jpg)
Approximate Top-k Wavelet Coefficients: Two-LevelSampling
sj(x): Sampled Frequency Counts
Node j samples tj = nj · p records using
Basic Sampling, where p = 1/ε2n.
nj Records in split
Sample
To construct s(x).If (x , sj(x)) received, ρ(x) = ρ(x) + sj(x).Else if (x , null) received, M(x) = M(x) + 1.
Finally, s(x) = ρ(x) + M(x)/ε√
m.
Then, v(x) = s(x)/p is an unbiased estimator for v(x).
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 198: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/198.jpg)
Approximate Top-k Wavelet Coefficients: Two-LevelSampling
sj(x): Sampled Frequency Counts
To construct s(x).If (x , sj(x)) received, ρ(x) = ρ(x) + sj(x).Else if (x , null) received, M(x) = M(x) + 1.
Finally, s(x) = ρ(x) + M(x)/ε√
m.
Then, v(x) = s(x)/p is an unbiased estimator for v(x).
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 199: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/199.jpg)
Approximate Top-k Wavelet Coefficients: Two-LevelSampling
Sample record x with probability min{ε√m · sj(x), 1}.• If sj(x) ≥ 1/(ε
√m), emit (x , sj(x)).
• Else emit (x , null) with probability ε√m · sj(x).
sj(x): Sampled Frequency Counts
To construct s(x).If (x , sj(x)) received, ρ(x) = ρ(x) + sj(x).Else if (x , null) received, M(x) = M(x) + 1.
Finally, s(x) = ρ(x) + M(x)/ε√
m.
Then, v(x) = s(x)/p is an unbiased estimator for v(x).
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 200: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/200.jpg)
Approximate Top-k Wavelet Coefficients: Two-LevelSampling
Sample record x with probability min{ε√m · sj(x), 1}.• If sj(x) ≥ 1/(ε
√m), emit (x , sj(x)).
• Else emit (x , null) with probability ε√m · sj(x).
sj(x): Sampled Frequency Counts
To construct s(x).If (x , sj(x)) received, ρ(x) = ρ(x) + sj(x).Else if (x , null) received, M(x) = M(x) + 1.
Finally, s(x) = ρ(x) + M(x)/ε√
m.
Then, v(x) = s(x)/p is an unbiased estimator for v(x).
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 201: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/201.jpg)
Approximate Top-k Wavelet Coefficients: Two-LevelSampling
Sample record x with probability min{ε√m · sj(x), 1}.• If sj(x) ≥ 1/(ε
√m), emit (x , sj(x)).
• Else emit (x , null) with probability ε√m · sj(x).
sj(x): Sampled Frequency Counts
To construct s(x).If (x , sj(x)) received, ρ(x) = ρ(x) + sj(x).Else if (x , null) received, M(x) = M(x) + 1.
Finally, s(x) = ρ(x) + M(x)/ε√
m.
Then, v(x) = s(x)/p is an unbiased estimator for v(x).
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 202: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/202.jpg)
Approximate Top-k Wavelet Coefficients: Two-LevelSampling
Sample record x with probability min{ε√m · sj(x), 1}.• If sj(x) ≥ 1/(ε
√m), emit (x , sj(x)).
• Else emit (x , null) with probability ε√m · sj(x).
sj(x): Sampled Frequency Counts
sj(x): sampled sj(x)
nullnull null
Emit
To construct s(x).If (x , sj(x)) received, ρ(x) = ρ(x) + sj(x).Else if (x , null) received, M(x) = M(x) + 1.
Finally, s(x) = ρ(x) + M(x)/ε√
m.
Then, v(x) = s(x)/p is an unbiased estimator for v(x).
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 203: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/203.jpg)
Approximate Top-k Wavelet Coefficients: Two-LevelSampling
Sample record x with probability min{ε√m · sj(x), 1}.• If sj(x) ≥ 1/(ε
√m), emit (x , sj(x)).
• Else emit (x , null) with probability ε√m · sj(x).
sj(x): Sampled Frequency Counts
sj(x): sampled sj(x)
nullnull null
Emit
Coordinator
To construct s(x).If (x , sj(x)) received, ρ(x) = ρ(x) + sj(x).Else if (x , null) received, M(x) = M(x) + 1.
Finally, s(x) = ρ(x) + M(x)/ε√
m.
Then, v(x) = s(x)/p is an unbiased estimator for v(x).
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 204: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/204.jpg)
Approximate Top-k Wavelet Coefficients: Two-LevelSampling
Coordinator
s1(x): sampled s1(x)
nullnull null
sm(x): sampled sm(x)
null
null
null
To construct s(x).If (x , sj(x)) received, ρ(x) = ρ(x) + sj(x).Else if (x , null) received, M(x) = M(x) + 1.
Finally, s(x) = ρ(x) + M(x)/ε√
m.
Then, v(x) = s(x)/p is an unbiased estimator for v(x).
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 205: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/205.jpg)
Approximate Top-k Wavelet Coefficients: Two-LevelSampling
Coordinator
s1(x): sampled s1(x)
nullnull null
sm(x): sampled sm(x)
null
null
null
Construct
s(x)
s(x): estimator of s(x)
234518971673
18953
176215433451
237431356
To construct s(x).
If (x , sj(x)) received, ρ(x) = ρ(x) + sj(x).Else if (x , null) received, M(x) = M(x) + 1.
Finally, s(x) = ρ(x) + M(x)/ε√
m.
Then, v(x) = s(x)/p is an unbiased estimator for v(x).
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 206: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/206.jpg)
Approximate Top-k Wavelet Coefficients: Two-LevelSampling
Coordinator
s1(x): sampled s1(x)
nullnull null
sm(x): sampled sm(x)
null
null
null
Construct
s(x)
s(x): estimator of s(x)
234518971673
18953
176215433451
237431356
To construct s(x).If (x , sj(x)) received, ρ(x) = ρ(x) + sj(x).
Else if (x , null) received, M(x) = M(x) + 1.
Finally, s(x) = ρ(x) + M(x)/ε√
m.
Then, v(x) = s(x)/p is an unbiased estimator for v(x).
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 207: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/207.jpg)
Approximate Top-k Wavelet Coefficients: Two-LevelSampling
Coordinator
s1(x): sampled s1(x)
nullnull null
sm(x): sampled sm(x)
null
null
null
Construct
s(x)
s(x): estimator of s(x)
234518971673
18953
176215433451
237431356
To construct s(x).If (x , sj(x)) received, ρ(x) = ρ(x) + sj(x).Else if (x , null) received, M(x) = M(x) + 1.
Finally, s(x) = ρ(x) + M(x)/ε√
m.
Then, v(x) = s(x)/p is an unbiased estimator for v(x).
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 208: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/208.jpg)
Approximate Top-k Wavelet Coefficients: Two-LevelSampling
Coordinator
s1(x): sampled s1(x)
nullnull null
sm(x): sampled sm(x)
null
null
null
Construct
s(x)
s(x): estimator of s(x)
234518971673
18953
176215433451
237431356
To construct s(x).If (x , sj(x)) received, ρ(x) = ρ(x) + sj(x).Else if (x , null) received, M(x) = M(x) + 1.
Finally, s(x) = ρ(x) + M(x)/ε√
m.
Then, v(x) = s(x)/p is an unbiased estimator for v(x).
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 209: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/209.jpg)
Approximate Top-k Wavelet Coefficients: Two-LevelSampling
Coordinator
s1(x): sampled s1(x)
nullnull null
sm(x): sampled sm(x)
null
null
null
Construct
s(x)
s(x): estimator of s(x)
234518971673
18953
176215433451
237431356
Construct
v(x)
v(x) = s(x)/p
2345/p
1897/p
1673/p
189/p
53/p
1762/p
1543/p
3451/p
237/p
43/p
1356/p
To construct s(x).If (x , sj(x)) received, ρ(x) = ρ(x) + sj(x).Else if (x , null) received, M(x) = M(x) + 1.
Finally, s(x) = ρ(x) + M(x)/ε√
m.
Then, v(x) = s(x)/p is an unbiased estimator for v(x).
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 210: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/210.jpg)
Approximate Top-k Wavelet Coefficients: Two-LevelSampling
Theorem
s(x) is an unbiased estimator of s(x) with standard deviation at most 1/ε.
Corollary
v(x) is an unbiased estimator of v(x) with standard deviation at most εn.
Theorem
• wi is an unbiased estimator for any wi .• Recall wi = 〈v, ψi 〉, for ψi = (−φj+1,2k + φj+1,2k+1)/
√u/2j where φ is
a [0, 1] vector defined for j = 1, . . . , log u and k = 0, . . . , 2j − 1. The
variance of wi is bounded by ε2jnu√m
∑(2k+2)u/2j+1
x=2ku/2j+1+1s(x).
Theorem
The expected total communication cost of our two-level samplingalgorithm is O(
√m/ε).
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 211: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/211.jpg)
Approximate Top-k Wavelet Coefficients: Two-LevelSampling
Theorem
s(x) is an unbiased estimator of s(x) with standard deviation at most 1/ε.
Corollary
v(x) is an unbiased estimator of v(x) with standard deviation at most εn.
Theorem
• wi is an unbiased estimator for any wi .• Recall wi = 〈v, ψi 〉, for ψi = (−φj+1,2k + φj+1,2k+1)/
√u/2j where φ is
a [0, 1] vector defined for j = 1, . . . , log u and k = 0, . . . , 2j − 1. The
variance of wi is bounded by ε2jnu√m
∑(2k+2)u/2j+1
x=2ku/2j+1+1s(x).
Theorem
The expected total communication cost of our two-level samplingalgorithm is O(
√m/ε).
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 212: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/212.jpg)
Approximate Top-k Wavelet Coefficients: Two-LevelSampling
Theorem
s(x) is an unbiased estimator of s(x) with standard deviation at most 1/ε.
Corollary
v(x) is an unbiased estimator of v(x) with standard deviation at most εn.
Theorem
• wi is an unbiased estimator for any wi .• Recall wi = 〈v, ψi 〉, for ψi = (−φj+1,2k + φj+1,2k+1)/
√u/2j where φ is
a [0, 1] vector defined for j = 1, . . . , log u and k = 0, . . . , 2j − 1. The
variance of wi is bounded by ε2jnu√m
∑(2k+2)u/2j+1
x=2ku/2j+1+1s(x).
Theorem
The expected total communication cost of our two-level samplingalgorithm is O(
√m/ε).
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 213: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/213.jpg)
Approximate Top-k Wavelet Coefficients: Two-LevelSampling
Theorem
s(x) is an unbiased estimator of s(x) with standard deviation at most 1/ε.
Corollary
v(x) is an unbiased estimator of v(x) with standard deviation at most εn.
Theorem
• wi is an unbiased estimator for any wi .• Recall wi = 〈v, ψi 〉, for ψi = (−φj+1,2k + φj+1,2k+1)/
√u/2j where φ is
a [0, 1] vector defined for j = 1, . . . , log u and k = 0, . . . , 2j − 1. The
variance of wi is bounded by ε2jnu√m
∑(2k+2)u/2j+1
x=2ku/2j+1+1s(x).
Theorem
The expected total communication cost of our two-level samplingalgorithm is O(
√m/ε).
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 214: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/214.jpg)
Approximate Top-k Wavelet Coefficients: Two-LevelSampling
split 1split 2split 3split 4
JobTracker
MapRunner
nj= records in split jsj= split j sample frequency vector
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 215: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/215.jpg)
Approximate Top-k Wavelet Coefficients: Two-LevelSampling
split 1split 2split 3split 4
JobTracker
MapRunner
RandomizedRecordReader
(x , null)
nj= records in split jsj= split j sample frequency vector
1 RandomizedRecordReader j (RRj) samples nj/ε2n records.
1 RRj randomly selects nj/ε2n offsets in split j .
2 RRj sorts the offsets in ascending order then seeks the record at eachsampled offset.
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 216: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/216.jpg)
Approximate Top-k Wavelet Coefficients: Two-LevelSampling
split 1split 2split 3split 4
JobTracker
MapRunner
RandomizedRecordReader
(x , null)
nj= records in split jsj= split j sample frequency vector
1 RandomizedRecordReader j (RRj) samples nj/ε2n records.
1 RRj randomly selects nj/ε2n offsets in split j .
2 RRj sorts the offsets in ascending order then seeks the record at eachsampled offset.
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 217: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/217.jpg)
Approximate Top-k Wavelet Coefficients: Two-LevelSampling
split 1split 2split 3split 4
JobTracker
MapRunner
RandomizedRecordReader
(x , null)
nj= records in split jsj= split j sample frequency vector
1 RandomizedRecordReader j (RRj) samples nj/ε2n records.
1 RRj randomly selects nj/ε2n offsets in split j .
2 RRj sorts the offsets in ascending order then seeks the record at eachsampled offset.
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 218: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/218.jpg)
Approximate Top-k Wavelet Coefficients: Two-LevelSampling
split 1split 2split 3split 4
JobTracker
MapRunner
RandomizedRecordReader
(x , null)
(x , null)Mapper
nj= records in split jsj= split j sample frequency vector
1 RandomizedRecordReader j (RRj) samples nj/ε2n records.
1 RRj randomly selects nj/ε2n offsets in split j .
2 RRj sorts the offsets in ascending order then seeks the record at eachsampled offset.
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 219: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/219.jpg)
Approximate Top-k Wavelet Coefficients: Two-LevelSampling
split 1split 2split 3split 4
JobTracker
MapRunner
RandomizedRecordReader
(x , null)
(x , null)Mapper
(x , 1)
In-MemoryMap
nj= records in split jsj= split j sample frequency vector
1 RandomizedRecordReader j (RRj) samples nj/ε2n records.
1 RRj randomly selects nj/ε2n offsets in split j .
2 RRj sorts the offsets in ascending order then seeks the record at eachsampled offset.
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 220: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/220.jpg)
Approximate Top-k Wavelet Coefficients: Two-LevelSampling
split 1split 2split 3split 4
JobTracker
MapRunner
RandomizedRecordReader
(x , null)
(x , null)Mapper
In-MemoryMap
close()
nj= records in split jsj= split j sample frequency vector
1 RandomizedRecordReader j (RRj) samples nj/ε2n records.
1 RRj randomly selects nj/ε2n offsets in split j .
2 RRj sorts the offsets in ascending order then seeks the record at eachsampled offset.
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 221: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/221.jpg)
Approximate Top-k Wavelet Coefficients: Two-LevelSampling
split 1split 2split 3split 4
JobTracker
MapRunner
RandomizedRecordReader
(x , null)
(x , null)Mapper
In-MemoryMap
(x , sj(x))
nj= records in split jsj= split j sample frequency vector
1 RandomizedRecordReader j (RRj) samples nj/ε2n records.
1 RRj randomly selects nj/ε2n offsets in split j .
2 RRj sorts the offsets in ascending order then seeks the record at eachsampled offset.
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 222: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/222.jpg)
Approximate Top-k Wavelet Coefficients: Two-LevelSampling
split 1split 2split 3split 4
JobTracker
MapRunner
RandomizedRecordReader
(x , null)
(x , null)Mapper
(x , sj(x)|0)p1
In-MemoryMap
(x , sj(x))
nj= records in split jsj= split j sample frequency vector
close()
2 Mapper j samples key x from s with probability min{ε√m · sj(x), 1}.
If sj(x) ≥ 1/(ε√m), emit (x , sj(x)).
Else emit (x , 0) with probability ε√m · sj(x).
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 223: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/223.jpg)
Approximate Top-k Wavelet Coefficients: Two-LevelSampling
split 1split 2split 3split 4
JobTracker
MapRunner
RandomizedRecordReader
(x , null)
(x , null)Mapper
(x , sj(x)|0)p1
In-MemoryMap
(x , sj(x))
nj= records in split jsj= split j sample frequency vector
close()
2 Mapper j samples key x from s with probability min{ε√m · sj(x), 1}.If sj(x) ≥ 1/(ε
√m), emit (x , sj(x)).
Else emit (x , 0) with probability ε√m · sj(x).
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 224: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/224.jpg)
Approximate Top-k Wavelet Coefficients: Two-LevelSampling
split 1split 2split 3split 4
JobTracker
MapRunner
RandomizedRecordReader
(x , null)
(x , null)Mapper
(x , sj(x)|0)p1
In-MemoryMap
(x , sj(x))
nj= records in split jsj= split j sample frequency vector
close()
2 Mapper j samples key x from s with probability min{ε√m · sj(x), 1}.If sj(x) ≥ 1/(ε
√m), emit (x , sj(x)).
Else emit (x , 0) with probability ε√m · sj(x).
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 225: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/225.jpg)
Approximate Top-k Wavelet Coefficients: Two-LevelSampling
split 1split 2split 3split 4
JobTracker
MapRunner
RandomizedRecordReader
(x , null)
(x , null)Mapper
(x , sj(x)|0)Reducerp1
In-MemoryMap
(x , sj(x)|0)(x , sj(x))
nj= records in split jsj= split j sample frequency vector
close()
2 Mapper j samples key x from s with probability min{ε√m · sj(x), 1}.If sj(x) ≥ 1/(ε
√m), emit (x , sj(x)).
Else emit (x , 0) with probability ε√m · sj(x).
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 226: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/226.jpg)
Approximate Top-k Wavelet Coefficients: Two-LevelSampling
split 1split 2split 3split 4
JobTracker
MapRunner
RandomizedRecordReader
(x , null)
(x , null)Mapper
(x , sj(x)|0)Reducerp1
In-MemoryMap
(x , sj(x)|0)(x , sj(x))
nj= records in split jsj= split j sample frequency vector
close()
(x , s(x))
3 Construct s(x).
If (x , sj(x)) received, ρ(x) = ρ(x) + sj(x).
Else if (x , 0) received, M(x) = M(x) + 1.
4 Finally, s(x) = ρ(x) + M(x)/ε√
m.5 Reducer uses v(x) = s(x)/p, our unbiased estimator for v(x).
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 227: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/227.jpg)
Approximate Top-k Wavelet Coefficients: Two-LevelSampling
split 1split 2split 3split 4
JobTracker
MapRunner
RandomizedRecordReader
(x , null)
(x , null)Mapper
(x , sj(x)|0)Reducerp1
In-MemoryMap
(x , sj(x)|0)(x , sj(x))
nj= records in split jsj= split j sample frequency vector
close()
(x , s(x))
3 Construct s(x).If (x , sj(x)) received, ρ(x) = ρ(x) + sj(x).
Else if (x , 0) received, M(x) = M(x) + 1.4 Finally, s(x) = ρ(x) + M(x)/ε
√m.
5 Reducer uses v(x) = s(x)/p, our unbiased estimator for v(x).
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 228: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/228.jpg)
Approximate Top-k Wavelet Coefficients: Two-LevelSampling
split 1split 2split 3split 4
JobTracker
MapRunner
RandomizedRecordReader
(x , null)
(x , null)Mapper
(x , sj(x)|0)Reducerp1
In-MemoryMap
(x , sj(x)|0)(x , sj(x))
nj= records in split jsj= split j sample frequency vector
close()
(x , s(x))
3 Construct s(x).If (x , sj(x)) received, ρ(x) = ρ(x) + sj(x).Else if (x , 0) received, M(x) = M(x) + 1.
4 Finally, s(x) = ρ(x) + M(x)/ε√
m.5 Reducer uses v(x) = s(x)/p, our unbiased estimator for v(x).
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 229: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/229.jpg)
Approximate Top-k Wavelet Coefficients: Two-LevelSampling
split 1split 2split 3split 4
JobTracker
MapRunner
RandomizedRecordReader
(x , null)
(x , null)Mapper
(x , sj(x)|0)Reducerp1
In-MemoryMap
(x , sj(x)|0)(x , sj(x))
nj= records in split jsj= split j sample frequency vector
close()
(x , s(x))
3 Construct s(x).If (x , sj(x)) received, ρ(x) = ρ(x) + sj(x).Else if (x , 0) received, M(x) = M(x) + 1.
4 Finally, s(x) = ρ(x) + M(x)/ε√
m.
5 Reducer uses v(x) = s(x)/p, our unbiased estimator for v(x).
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 230: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/230.jpg)
Approximate Top-k Wavelet Coefficients: Two-LevelSampling
split 1split 2split 3split 4
JobTracker
MapRunner
RandomizedRecordReader
(x , null)
(x , null)Mapper
(x , sj(x)|0)Reducerp1
In-MemoryMap
(x , sj(x)|0)(x , sj(x))
nj= records in split jsj= split j sample frequency vector
close()
(x , s(x))
(x , v(x))
3 Construct s(x).If (x , sj(x)) received, ρ(x) = ρ(x) + sj(x).Else if (x , 0) received, M(x) = M(x) + 1.
4 Finally, s(x) = ρ(x) + M(x)/ε√
m.5 Reducer uses v(x) = s(x)/p, our unbiased estimator for v(x).
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 231: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/231.jpg)
Approximate Top-k Wavelet Coefficients: Two-LevelSampling
split 1split 2split 3split 4
JobTracker
MapRunner
RandomizedRecordReader
(x , null)
(x , null)Mapper
(x , sj(x)|0)Reducerp1
In-MemoryMap
(x , sj(x)|0)(x , sj(x))
nj= records in split jsj= split j sample frequency vector
close()
(x , s(x))
(x , v(x))
Centralized Wavelet Top-k
3 Construct s(x).If (x , sj(x)) received, ρ(x) = ρ(x) + sj(x).Else if (x , 0) received, M(x) = M(x) + 1.
4 Finally, s(x) = ρ(x) + M(x)/ε√
m.5 Reducer uses v(x) = s(x)/p, our unbiased estimator for v(x).
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 232: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/232.jpg)
Approximate Top-k Wavelet Coefficients: Two-LevelSampling
split 1split 2split 3split 4
JobTracker
MapRunner
RandomizedRecordReader
(x , null)
(x , null)Mapper
(x , sj(x)|0)Reducerp1
In-MemoryMap
(x , sj(x)|0)(x , sj(x))
nj= records in split jsj= split j sample frequency vector
close()
(x , s(x))
(x , v(x))
Centralized Wavelet Top-k
close()
3 Construct s(x).If (x , sj(x)) received, ρ(x) = ρ(x) + sj(x).Else if (x , 0) received, M(x) = M(x) + 1.
4 Finally, s(x) = ρ(x) + M(x)/ε√
m.5 Reducer uses v(x) = s(x)/p, our unbiased estimator for v(x).
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 233: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/233.jpg)
Approximate Top-k Wavelet Coefficients: Two-LevelSampling
split 1split 2split 3split 4
JobTracker
MapRunner
RandomizedRecordReader
(x , null)
(x , null)Mapper
(x , sj(x)|0)Reducerp1
In-MemoryMap
(x , sj(x)|0)(x , sj(x))
nj= records in split jsj= split j sample frequency vector
close()
(x , s(x))
(x , v(x))
Centralized Wavelet Top-k
top-k |wi |
3 Construct s(x).If (x , sj(x)) received, ρ(x) = ρ(x) + sj(x).Else if (x , 0) received, M(x) = M(x) + 1.
4 Finally, s(x) = ρ(x) + M(x)/ε√
m.5 Reducer uses v(x) = s(x)/p, our unbiased estimator for v(x).
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 234: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/234.jpg)
Approximate Top-k Wavelet Coefficients: Two-LevelSampling
split 1split 2split 3split 4
JobTracker
MapRunner
RandomizedRecordReader
(x , null)
(x , null)Mapper
(x , sj(x)|0)Reducerp1
In-MemoryMap
(x , sj(x)|0)(x , sj(x))
o1top-k |wi |
nj= records in split jsj= split j sample frequency vector
close()
(x , s(x))
(x , v(x))
Centralized Wavelet Top-k
top-k |wi |
3 Construct s(x).If (x , sj(x)) received, ρ(x) = ρ(x) + sj(x).Else if (x , 0) received, M(x) = M(x) + 1.
4 Finally, s(x) = ρ(x) + M(x)/ε√
m.5 Reducer uses v(x) = s(x)/p, our unbiased estimator for v(x).
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 235: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/235.jpg)
Approximate Top-k Wavelet Coefficients: Two-LevelSampling
Consider: ε = 10−4, m = 103, and 4-byte keys .
The communication for basic sampling is O(1/ε2).
Approximately 400MB of data must be communicated!
The communication for improved sampling is O(m/ε).
Approximately 40MB of data must be communicated.
The communication for two-level sampling is O(√
m/ε).
Only 1.2MB of data needs to be communicated!330-fold reduction over basic sampling and 33-fold reduction overimproved sampling!
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 236: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/236.jpg)
Approximate Top-k Wavelet Coefficients: Two-LevelSampling
Consider: ε = 10−4, m = 103, and 4-byte keys .
The communication for basic sampling is O(1/ε2).
Approximately 400MB of data must be communicated!
The communication for improved sampling is O(m/ε).
Approximately 40MB of data must be communicated.
The communication for two-level sampling is O(√
m/ε).
Only 1.2MB of data needs to be communicated!330-fold reduction over basic sampling and 33-fold reduction overimproved sampling!
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 237: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/237.jpg)
Approximate Top-k Wavelet Coefficients: Two-LevelSampling
Consider: ε = 10−4, m = 103, and 4-byte keys .
The communication for basic sampling is O(1/ε2).
Approximately 400MB of data must be communicated!
The communication for improved sampling is O(m/ε).
Approximately 40MB of data must be communicated.
The communication for two-level sampling is O(√
m/ε).
Only 1.2MB of data needs to be communicated!330-fold reduction over basic sampling and 33-fold reduction overimproved sampling!
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 238: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/238.jpg)
Approximate Top-k Wavelet Coefficients: Two-LevelSampling
Consider: ε = 10−4, m = 103, and 4-byte keys .
The communication for basic sampling is O(1/ε2).
Approximately 400MB of data must be communicated!
The communication for improved sampling is O(m/ε).
Approximately 40MB of data must be communicated.
The communication for two-level sampling is O(√
m/ε).
Only 1.2MB of data needs to be communicated!330-fold reduction over basic sampling and 33-fold reduction overimproved sampling!
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 239: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/239.jpg)
Approximate Top-k Wavelet Coefficients: Two-LevelSampling
Consider: ε = 10−4, m = 103, and 4-byte keys .
The communication for basic sampling is O(1/ε2).
Approximately 400MB of data must be communicated!
The communication for improved sampling is O(m/ε).
Approximately 40MB of data must be communicated.
The communication for two-level sampling is O(√
m/ε).
Only 1.2MB of data needs to be communicated!330-fold reduction over basic sampling and 33-fold reduction overimproved sampling!
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 240: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/240.jpg)
Approximate Top-k Wavelet Coefficients: Two-LevelSampling
Consider: ε = 10−4, m = 103, and 4-byte keys .
The communication for basic sampling is O(1/ε2).
Approximately 400MB of data must be communicated!
The communication for improved sampling is O(m/ε).
Approximately 40MB of data must be communicated.
The communication for two-level sampling is O(√
m/ε).
Only 1.2MB of data needs to be communicated!330-fold reduction over basic sampling and 33-fold reduction overimproved sampling!
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 241: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/241.jpg)
Approximate Top-k Wavelet Coefficients: Two-LevelSampling
Consider: ε = 10−4, m = 103, and 4-byte keys .
The communication for basic sampling is O(1/ε2).
Approximately 400MB of data must be communicated!
The communication for improved sampling is O(m/ε).
Approximately 40MB of data must be communicated.
The communication for two-level sampling is O(√
m/ε).
Only 1.2MB of data needs to be communicated!
330-fold reduction over basic sampling and 33-fold reduction overimproved sampling!
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 242: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/242.jpg)
Approximate Top-k Wavelet Coefficients: Two-LevelSampling
Consider: ε = 10−4, m = 103, and 4-byte keys .
The communication for basic sampling is O(1/ε2).
Approximately 400MB of data must be communicated!
The communication for improved sampling is O(m/ε).
Approximately 40MB of data must be communicated.
The communication for two-level sampling is O(√
m/ε).
Only 1.2MB of data needs to be communicated!330-fold reduction over basic sampling and 33-fold reduction overimproved sampling!
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 243: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/243.jpg)
Outline
1 Introduction and MotivationHistogramsMapReduce and Hadoop
2 Exact Top-k Wavelet CoefficientsNaive SolutionHadoop Wavelet Top-k : Our Efficient Exact Solution
3 Approximate Top-k Wavelet CoefficientsLinearly Combinable Sketch MethodOur First Sampling Based ApproachAn Improved Sampling ApproachTwo-Level Sampling
4 Experiments
5 ConclusionsHadoop Wavelet Top-k in Hadoop
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 244: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/244.jpg)
Experiments: Algorithms
We implement the following methods in Hadoop 0.20.2:
Exact Methods:
The baseline solution is denoted Send-V,Our three round exact solution is denoted H-WTopk, (meaning”Hadoop Wavelet Top-k”).
Approximate Methods:
Improved Sampling is denoted Improved-S.Two-Level Sampling is denoted TwoLevel-S.The Sketch-Based Approximation using the GCS-Sketch is denotedSend-Sketch.
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 245: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/245.jpg)
Experiments: Algorithms
We implement the following methods in Hadoop 0.20.2:Exact Methods:
The baseline solution is denoted Send-V,
Our three round exact solution is denoted H-WTopk, (meaning”Hadoop Wavelet Top-k”).
Approximate Methods:
Improved Sampling is denoted Improved-S.Two-Level Sampling is denoted TwoLevel-S.The Sketch-Based Approximation using the GCS-Sketch is denotedSend-Sketch.
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 246: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/246.jpg)
Experiments: Algorithms
We implement the following methods in Hadoop 0.20.2:Exact Methods:
The baseline solution is denoted Send-V,Our three round exact solution is denoted H-WTopk, (meaning”Hadoop Wavelet Top-k”).
Approximate Methods:
Improved Sampling is denoted Improved-S.Two-Level Sampling is denoted TwoLevel-S.The Sketch-Based Approximation using the GCS-Sketch is denotedSend-Sketch.
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 247: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/247.jpg)
Experiments: Algorithms
We implement the following methods in Hadoop 0.20.2:Exact Methods:
The baseline solution is denoted Send-V,Our three round exact solution is denoted H-WTopk, (meaning”Hadoop Wavelet Top-k”).
Approximate Methods:
Improved Sampling is denoted Improved-S.Two-Level Sampling is denoted TwoLevel-S.The Sketch-Based Approximation using the GCS-Sketch is denotedSend-Sketch.
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 248: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/248.jpg)
Experiments: Algorithms
We implement the following methods in Hadoop 0.20.2:Exact Methods:
The baseline solution is denoted Send-V,Our three round exact solution is denoted H-WTopk, (meaning”Hadoop Wavelet Top-k”).
Approximate Methods:
Improved Sampling is denoted Improved-S.
Two-Level Sampling is denoted TwoLevel-S.The Sketch-Based Approximation using the GCS-Sketch is denotedSend-Sketch.
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 249: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/249.jpg)
Experiments: Algorithms
We implement the following methods in Hadoop 0.20.2:Exact Methods:
The baseline solution is denoted Send-V,Our three round exact solution is denoted H-WTopk, (meaning”Hadoop Wavelet Top-k”).
Approximate Methods:
Improved Sampling is denoted Improved-S.Two-Level Sampling is denoted TwoLevel-S.
The Sketch-Based Approximation using the GCS-Sketch is denotedSend-Sketch.
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 250: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/250.jpg)
Experiments: Algorithms
We implement the following methods in Hadoop 0.20.2:Exact Methods:
The baseline solution is denoted Send-V,Our three round exact solution is denoted H-WTopk, (meaning”Hadoop Wavelet Top-k”).
Approximate Methods:
Improved Sampling is denoted Improved-S.Two-Level Sampling is denoted TwoLevel-S.The Sketch-Based Approximation using the GCS-Sketch is denotedSend-Sketch.
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 251: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/251.jpg)
Experiments: Setup
Experiments are performed in a heterogeneous Hadoop cluster with16 machines:
1 9 machines with 2GB of RAM and an Intel Xeon 1.86GHz CPU2 4 machines with 4GB of RAM and an Intel Xeon 2GHz CPU
One is reserved for the master (running JobTracker and NameNode).3 2 machines with 6GB of RAM and an Intel Xeon 2.13GHz CPU
One is reserved for the (only) Reducer.
4 1 machine with 2GB of RAM and an Intel Core 2 1.86GHz CPU
All machines are directly connected to a 1000Mbps switch.
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 252: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/252.jpg)
Experiments: Setup
Experiments are performed in a heterogeneous Hadoop cluster with16 machines:
1 9 machines with 2GB of RAM and an Intel Xeon 1.86GHz CPU
2 4 machines with 4GB of RAM and an Intel Xeon 2GHz CPU
One is reserved for the master (running JobTracker and NameNode).3 2 machines with 6GB of RAM and an Intel Xeon 2.13GHz CPU
One is reserved for the (only) Reducer.
4 1 machine with 2GB of RAM and an Intel Core 2 1.86GHz CPU
All machines are directly connected to a 1000Mbps switch.
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 253: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/253.jpg)
Experiments: Setup
Experiments are performed in a heterogeneous Hadoop cluster with16 machines:
1 9 machines with 2GB of RAM and an Intel Xeon 1.86GHz CPU2 4 machines with 4GB of RAM and an Intel Xeon 2GHz CPU
One is reserved for the master (running JobTracker and NameNode).3 2 machines with 6GB of RAM and an Intel Xeon 2.13GHz CPU
One is reserved for the (only) Reducer.
4 1 machine with 2GB of RAM and an Intel Core 2 1.86GHz CPU
All machines are directly connected to a 1000Mbps switch.
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 254: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/254.jpg)
Experiments: Setup
Experiments are performed in a heterogeneous Hadoop cluster with16 machines:
1 9 machines with 2GB of RAM and an Intel Xeon 1.86GHz CPU2 4 machines with 4GB of RAM and an Intel Xeon 2GHz CPU
One is reserved for the master (running JobTracker and NameNode).
3 2 machines with 6GB of RAM and an Intel Xeon 2.13GHz CPU
One is reserved for the (only) Reducer.
4 1 machine with 2GB of RAM and an Intel Core 2 1.86GHz CPU
All machines are directly connected to a 1000Mbps switch.
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 255: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/255.jpg)
Experiments: Setup
Experiments are performed in a heterogeneous Hadoop cluster with16 machines:
1 9 machines with 2GB of RAM and an Intel Xeon 1.86GHz CPU2 4 machines with 4GB of RAM and an Intel Xeon 2GHz CPU
One is reserved for the master (running JobTracker and NameNode).3 2 machines with 6GB of RAM and an Intel Xeon 2.13GHz CPU
One is reserved for the (only) Reducer.
4 1 machine with 2GB of RAM and an Intel Core 2 1.86GHz CPU
All machines are directly connected to a 1000Mbps switch.
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 256: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/256.jpg)
Experiments: Setup
Experiments are performed in a heterogeneous Hadoop cluster with16 machines:
1 9 machines with 2GB of RAM and an Intel Xeon 1.86GHz CPU2 4 machines with 4GB of RAM and an Intel Xeon 2GHz CPU
One is reserved for the master (running JobTracker and NameNode).3 2 machines with 6GB of RAM and an Intel Xeon 2.13GHz CPU
One is reserved for the (only) Reducer.
4 1 machine with 2GB of RAM and an Intel Core 2 1.86GHz CPU
All machines are directly connected to a 1000Mbps switch.
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 257: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/257.jpg)
Experiments: Setup
Experiments are performed in a heterogeneous Hadoop cluster with16 machines:
1 9 machines with 2GB of RAM and an Intel Xeon 1.86GHz CPU2 4 machines with 4GB of RAM and an Intel Xeon 2GHz CPU
One is reserved for the master (running JobTracker and NameNode).3 2 machines with 6GB of RAM and an Intel Xeon 2.13GHz CPU
One is reserved for the (only) Reducer.
4 1 machine with 2GB of RAM and an Intel Core 2 1.86GHz CPU
All machines are directly connected to a 1000Mbps switch.
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 258: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/258.jpg)
Experiments: Setup
Experiments are performed in a heterogeneous Hadoop cluster with16 machines:
1 9 machines with 2GB of RAM and an Intel Xeon 1.86GHz CPU2 4 machines with 4GB of RAM and an Intel Xeon 2GHz CPU
One is reserved for the master (running JobTracker and NameNode).3 2 machines with 6GB of RAM and an Intel Xeon 2.13GHz CPU
One is reserved for the (only) Reducer.
4 1 machine with 2GB of RAM and an Intel Core 2 1.86GHz CPU
All machines are directly connected to a 1000Mbps switch.
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 259: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/259.jpg)
Experiments: Datasets
We utilize the WorldCup dataset to test all algorithms on real data.
There are a total of 1.35 billion records.Each record has 10 4 byte integer attributes including a client id andobject id.We assign each record a clientobject 4 byte integer id in u = 229
which is distinct for unique parings of a client id and object id.WorldCup is stored in binary format, in total it is 50GB.
We utilize large synthetic Zipfian datasets to evaluate all methods.
Keys are randomly permuted and discontiguous in a dataset.Each key is a 4-byte integer and stored in binary format.
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 260: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/260.jpg)
Experiments: Datasets
We utilize the WorldCup dataset to test all algorithms on real data.
There are a total of 1.35 billion records.
Each record has 10 4 byte integer attributes including a client id andobject id.We assign each record a clientobject 4 byte integer id in u = 229
which is distinct for unique parings of a client id and object id.WorldCup is stored in binary format, in total it is 50GB.
We utilize large synthetic Zipfian datasets to evaluate all methods.
Keys are randomly permuted and discontiguous in a dataset.Each key is a 4-byte integer and stored in binary format.
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 261: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/261.jpg)
Experiments: Datasets
We utilize the WorldCup dataset to test all algorithms on real data.
There are a total of 1.35 billion records.Each record has 10 4 byte integer attributes including a client id andobject id.
We assign each record a clientobject 4 byte integer id in u = 229
which is distinct for unique parings of a client id and object id.WorldCup is stored in binary format, in total it is 50GB.
We utilize large synthetic Zipfian datasets to evaluate all methods.
Keys are randomly permuted and discontiguous in a dataset.Each key is a 4-byte integer and stored in binary format.
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 262: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/262.jpg)
Experiments: Datasets
We utilize the WorldCup dataset to test all algorithms on real data.
There are a total of 1.35 billion records.Each record has 10 4 byte integer attributes including a client id andobject id.We assign each record a clientobject 4 byte integer id in u = 229
which is distinct for unique parings of a client id and object id.
WorldCup is stored in binary format, in total it is 50GB.
We utilize large synthetic Zipfian datasets to evaluate all methods.
Keys are randomly permuted and discontiguous in a dataset.Each key is a 4-byte integer and stored in binary format.
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 263: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/263.jpg)
Experiments: Datasets
We utilize the WorldCup dataset to test all algorithms on real data.
There are a total of 1.35 billion records.Each record has 10 4 byte integer attributes including a client id andobject id.We assign each record a clientobject 4 byte integer id in u = 229
which is distinct for unique parings of a client id and object id.WorldCup is stored in binary format, in total it is 50GB.
We utilize large synthetic Zipfian datasets to evaluate all methods.
Keys are randomly permuted and discontiguous in a dataset.Each key is a 4-byte integer and stored in binary format.
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 264: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/264.jpg)
Experiments: Datasets
We utilize the WorldCup dataset to test all algorithms on real data.
There are a total of 1.35 billion records.Each record has 10 4 byte integer attributes including a client id andobject id.We assign each record a clientobject 4 byte integer id in u = 229
which is distinct for unique parings of a client id and object id.WorldCup is stored in binary format, in total it is 50GB.
We utilize large synthetic Zipfian datasets to evaluate all methods.
Keys are randomly permuted and discontiguous in a dataset.Each key is a 4-byte integer and stored in binary format.
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 265: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/265.jpg)
Experiments: Datasets
We utilize the WorldCup dataset to test all algorithms on real data.
There are a total of 1.35 billion records.Each record has 10 4 byte integer attributes including a client id andobject id.We assign each record a clientobject 4 byte integer id in u = 229
which is distinct for unique parings of a client id and object id.WorldCup is stored in binary format, in total it is 50GB.
We utilize large synthetic Zipfian datasets to evaluate all methods.
Keys are randomly permuted and discontiguous in a dataset.
Each key is a 4-byte integer and stored in binary format.
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 266: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/266.jpg)
Experiments: Datasets
We utilize the WorldCup dataset to test all algorithms on real data.
There are a total of 1.35 billion records.Each record has 10 4 byte integer attributes including a client id andobject id.We assign each record a clientobject 4 byte integer id in u = 229
which is distinct for unique parings of a client id and object id.WorldCup is stored in binary format, in total it is 50GB.
We utilize large synthetic Zipfian datasets to evaluate all methods.
Keys are randomly permuted and discontiguous in a dataset.Each key is a 4-byte integer and stored in binary format.
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 267: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/267.jpg)
Experiments: Defaults
Default values:
Symbol Definition Defaultα Zipfian skewness 1.1u max key in domain log2 u = 29n total records 13.4 billion
dataset size 50GBβ split size 256MBm number of splits 200B network bandwidth 500Mbps
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 268: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/268.jpg)
Experiments: Vary k
10 20 30 40 50
106
108
1010
1012
Number of Coefficients (k)
Com
mun
icat
ion
(Byt
es)
Send−V H−WTopk Send−Sketch
Improved−S TwoLevel−S
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 269: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/269.jpg)
Experiments: Vary k
10 20 30 40 5010
2
103
104
105
Number of Coefficients (k)
Tim
e (S
econ
ds)
Send−V H−WTopk Send−Sketch
Improved−S TwoLevel−S
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 270: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/270.jpg)
Experiments: Vary k
10 20 30 40 5010
14
1015
1016
1017
1018
Number of Coefficients (k)
SS
E
Send−V H−WTopk Send−Sketch
Improved−S TwoLevel−S
Ideal SSE
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 271: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/271.jpg)
Experiments: Vary ε
10−5
10−4
10−3
10−2
10−1
1014
1016
1018
1020
1022
ε
SS
E
H−WTopkImproved−STwoLevel−S
Ideal SSE
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 272: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/272.jpg)
Experiments: Vary ε
10−5
10−4
10−3
10−2
10−1
103
104
105
106
107
108
ε
Com
mun
icat
ion
(Byt
es)
Improved−STwoLevel−S
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 273: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/273.jpg)
Experiments: Vary ε
10−5
10−4
10−3
10−2
10−1
101
102
103
ε
Tim
e (S
econ
ds)
Improved−STwoLevel−S
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 274: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/274.jpg)
Experiments: Vary n
0 50 100 150 20010
5
107
109
1011
Dataset Size (GB)
Com
mun
icat
ion
(Byt
es)
Send−V H−WTopk Send−Sketch
Improved−S TwoLevel−S
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 275: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/275.jpg)
Experiments: Vary n
0 50 100 150 20010
2
103
104
105
106
Dataset Size (GB)
Tim
e (S
econ
ds)
Send−V H−WTopk Send−Sketch
Improved−S TwoLevel−S
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 276: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/276.jpg)
Experiments: Vary u
20 22 24 26 28 30 3210
6
108
1010
1012
log2 u
Com
mun
icat
ion
(Byt
es)
Send−V H−WTopk Send−Sketch
Improved−S TwoLevel−S
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 277: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/277.jpg)
Experiments: Vary u
20 22 24 26 28 30 3210
2
103
104
105
106
log2 u
Tim
e (S
econ
ds)
Send−V H−WTopk Send−Sketch
Improved−S TwoLevel−S
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 278: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/278.jpg)
Experiments: Vary β
64 128 256 512
106
108
1010
1012
Split Size (MB)
Com
mun
icat
ion
(Byt
es)
Send−V H−WTopk Send−Sketch
Improved−S TwoLevel−S
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 279: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/279.jpg)
Experiments: Vary β
64 128 256 51210
2
103
104
105
106
Split Size (MB)
Tim
e (S
econ
ds)
Send−V H−WTopk Send−Sketch
Improved−S TwoLevel−S
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 280: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/280.jpg)
Experiments: Vary α
100
102
104
106
108
1010
1012
1014
α=0.8 α=1.1 α=1.4
Com
mun
icat
ion
(Byt
es)
Send−V H−WTopk Send−Sketch
Improved−S TwoLevel−S
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 281: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/281.jpg)
Experiments: Vary α
100
102
104
106
108
α=0.8 α=1.1 α=1.4
Tim
e (S
econ
ds)
Send−V H−WTopk Send−Sketch
Improved−S TwoLevel−S
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 282: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/282.jpg)
Experiments: Vary B
0 20 40 60 80 10010
2
103
104
105
106
Bandwidth (% of 100Mbps)
Tim
e (S
econ
ds)
Send−V H−WTopk Send−Sketch
Improved−S TwoLevel−S
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 283: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/283.jpg)
Experiments: WorldCup Dataset
100
102
104
106
108
1010
Com
mun
icat
ion
(Byt
es)
Send−V
H−WTopk
Send−Sketch
Improved−S TwoLevel−S
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 284: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/284.jpg)
Experiments: WorldCup Dataset
100
101
102
103
104
105
106T
ime
(Sec
onds
)
Send−V
H−WTopk
Send−Sketch
Improved−S TwoLevel−S
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 285: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/285.jpg)
Conclusions
We study the problem of efficiently computing wavelet histograms inMapReduce clusters.
We present both exact and approximate algorithms.TwoLevel-S is especially easy to implement and ideal in practice.
For 200GB of data with log2 u = 29 it takes 10 minutes with only2MB of communication!
Our work is just the tip of the iceberg for data summarizationtechniques in MapReduce.
Many others remain including:
other histograms including the V-optimal histogram,sketches and synopsis,geometric summaries (ε-approximations and coresets),graph summaries (distance oracles).
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 286: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/286.jpg)
Conclusions
We study the problem of efficiently computing wavelet histograms inMapReduce clusters.
We present both exact and approximate algorithms.
TwoLevel-S is especially easy to implement and ideal in practice.
For 200GB of data with log2 u = 29 it takes 10 minutes with only2MB of communication!
Our work is just the tip of the iceberg for data summarizationtechniques in MapReduce.
Many others remain including:
other histograms including the V-optimal histogram,sketches and synopsis,geometric summaries (ε-approximations and coresets),graph summaries (distance oracles).
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 287: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/287.jpg)
Conclusions
We study the problem of efficiently computing wavelet histograms inMapReduce clusters.
We present both exact and approximate algorithms.TwoLevel-S is especially easy to implement and ideal in practice.
For 200GB of data with log2 u = 29 it takes 10 minutes with only2MB of communication!
Our work is just the tip of the iceberg for data summarizationtechniques in MapReduce.
Many others remain including:
other histograms including the V-optimal histogram,sketches and synopsis,geometric summaries (ε-approximations and coresets),graph summaries (distance oracles).
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 288: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/288.jpg)
Conclusions
We study the problem of efficiently computing wavelet histograms inMapReduce clusters.
We present both exact and approximate algorithms.TwoLevel-S is especially easy to implement and ideal in practice.
For 200GB of data with log2 u = 29 it takes 10 minutes with only2MB of communication!
Our work is just the tip of the iceberg for data summarizationtechniques in MapReduce.
Many others remain including:
other histograms including the V-optimal histogram,sketches and synopsis,geometric summaries (ε-approximations and coresets),graph summaries (distance oracles).
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 289: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/289.jpg)
Conclusions
We study the problem of efficiently computing wavelet histograms inMapReduce clusters.
We present both exact and approximate algorithms.TwoLevel-S is especially easy to implement and ideal in practice.
For 200GB of data with log2 u = 29 it takes 10 minutes with only2MB of communication!
Our work is just the tip of the iceberg for data summarizationtechniques in MapReduce.
Many others remain including:
other histograms including the V-optimal histogram,sketches and synopsis,geometric summaries (ε-approximations and coresets),graph summaries (distance oracles).
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 290: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/290.jpg)
Conclusions
We study the problem of efficiently computing wavelet histograms inMapReduce clusters.
We present both exact and approximate algorithms.TwoLevel-S is especially easy to implement and ideal in practice.
For 200GB of data with log2 u = 29 it takes 10 minutes with only2MB of communication!
Our work is just the tip of the iceberg for data summarizationtechniques in MapReduce.
Many others remain including:
other histograms including the V-optimal histogram,sketches and synopsis,geometric summaries (ε-approximations and coresets),graph summaries (distance oracles).
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 291: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/291.jpg)
Conclusions
We study the problem of efficiently computing wavelet histograms inMapReduce clusters.
We present both exact and approximate algorithms.TwoLevel-S is especially easy to implement and ideal in practice.
For 200GB of data with log2 u = 29 it takes 10 minutes with only2MB of communication!
Our work is just the tip of the iceberg for data summarizationtechniques in MapReduce.
Many others remain including:
other histograms including the V-optimal histogram,sketches and synopsis,geometric summaries (ε-approximations and coresets),graph summaries (distance oracles).
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 292: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/292.jpg)
The End
Thank You
Q and A
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 293: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/293.jpg)
Introduction: Histograms
We may also compute wi with the wavelet basis vectors ψi .
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 294: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/294.jpg)
Introduction: Histograms
We may also compute wi with the wavelet basis vectors ψi .
3 5 10 8 2 2 10 14 ` = 3
x 1 2 3 4 5 6 7 8v(x) 3 5 10 8 2 2 10 14
v(1) v(2) v(3) v(4) v(5) v(6) v(7) v(8)
w5 w6 w7 w8a4 a5 a6 a72 120 2-1 9
w3 a22.5 6.5 w4 a35 7
w2 a10.3 6.8
w1 6.8 total average
` = 2
` = 1
` = 0
1 4
w6 = 〈v, ψ6〉
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 295: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/295.jpg)
Introduction: Histograms
We may also compute wi with the wavelet basis vectors ψi .
3 5 10 8 2 2 10 14 ` = 3
x 1 2 3 4 5 6 7 8v(x) 3 5 10 8 2 2 10 14
v(1) v(2) v(3) v(4) v(5) v(6) v(7) v(8)
w5 w6 w7 w8a4 a5 a6 a72 120 2-1 9
w3 a22.5 6.5 w4 a35 7
w2 a10.3 6.8
w1 6.8 total average
` = 2
` = 1
` = 0
1 4
w6 = 〈v, ψ6〉w6 = 〈v, ψ6〉 =〈[0 0 v(3) 0 0 0 0 0], ψ6〉 + 〈[ 0 0 0 v(4) 0 0 0 0], ψ6〉
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 296: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/296.jpg)
Introduction: Histograms
We may also compute wi with the wavelet basis vectors ψi .
3 5 10 8 2 2 10 14 ` = 3
x 1 2 3 4 5 6 7 8v(x) 3 5 10 8 2 2 10 14
v(1) v(2) v(3) v(4) v(5) v(6) v(7) v(8)
w5 w6 w7 w8a4 a5 a6 a72 120 2-1 9
w3 a22.5 6.5 w4 a35 7
w2 a10.3 6.8
w1 6.8 total average
` = 2
` = 1
` = 0
1 4
w3 = 〈v, ψ3〉
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 297: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/297.jpg)
Introduction: Histograms
We may also compute wi with the wavelet basis vectors ψi .
3 5 10 8 2 2 10 14 ` = 3
x 1 2 3 4 5 6 7 8v(x) 3 5 10 8 2 2 10 14
v(1) v(2) v(3) v(4) v(5) v(6) v(7) v(8)
w5 w6 w7 w8a4 a5 a6 a72 120 2-1 9
w3 a22.5 6.5 w4 a35 7
w2 a10.3 6.8
w1 6.8 total average
` = 2
` = 1
` = 0
1 4
w2 = 〈v, ψ2〉
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 298: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/298.jpg)
Introduction: Histograms
We may also compute wi with the wavelet basis vectors ψi .
3 5 10 8 2 2 10 14 ` = 3
x 1 2 3 4 5 6 7 8v(x) 3 5 10 8 2 2 10 14
v(1) v(2) v(3) v(4) v(5) v(6) v(7) v(8)
w5 w6 w7 w8a4 a5 a6 a72 120 2-1 9
w3 a22.5 6.5 w4 a35 7
w2 a10.3 6.8
w1 6.8 total average
` = 2
` = 1
` = 0
1 4
w1 = 〈v, ψ1〉
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 299: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/299.jpg)
Background: Hadoop MapReduce, Map Phase
split 1split 2split 3split 4
JobTracker
MapRunner
The JobTracker assigns an InputSplit to a TaskTracker, aMapRunner task runs on the TaskTracker to process the split.
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 300: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/300.jpg)
Background: Hadoop MapReduce, Map Phase
split 1split 2split 3split 4
JobTracker
MapRunner
RecordReader
(k1, v1)
The MapRunner acquires a RecordReader from the InputFormat forthe file to view the InputSplit as a stream of records, (k1, v1).
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 301: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/301.jpg)
Background: Hadoop MapReduce, Map Phase
split 1split 2split 3split 4
JobTracker
MapRunner
RecordReader
(k1, v1)
(k1, v1)Mapper
Buffer
(k2, v2)
The MapRunner invokes the user specified Mapper for each (k1, v1),the Mapper emits (k2, v2) and stores in an in-memory buffer.
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 302: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/302.jpg)
Background: Hadoop MapReduce, Map Phase
split 1split 2split 3split 4
JobTracker
MapRunner
RecordReader
(k1, v1)
(k1, v1)Mapper
Buffer
(k2, v2) (k2, list(v2)) (k2, v2)Combiner
p1p2
When the buffer fills, the optional Combiner is executed over(k2, list(v2)), and a (k2, v2) is dumped to a partition on disk.
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 303: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/303.jpg)
Background: Hadoop MapReduce, Shuffle and Sort Phase
JobTracker
p1p2
Reducer:Copy
(k2, v2)
p1p2
(k2, v2)
Reducer:Sort
(k2, v2)
The JobTracker assigns Reducers to TaskTrackers for each partition,each reducer first copies on (k2, v2) and then sorts on k2.
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 304: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/304.jpg)
Background: Hadoop MapReduce, Reduce Phase
JobTracker
p1p2
Reducer:Copy
(k2, v2)
p1p2
(k2, v2)
Reducer:Sort
(k2, v2) (k3, v3)
oiReducer:Reduce
(k2, list(v2))
The sorting output (k2, list(v2)) is processed one k2 at a time andreduced, the reduced output (k3, v3) is written to reducer output oi .
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 305: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/305.jpg)
Exact Top-k Wavelet Coefficients: Hadoop Phase 1
split 1split 2split 3
(x , null)
(x , null)
(x , null)
JobTracker
Mapper
Mapper
Mapper
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 306: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/306.jpg)
Exact Top-k Wavelet Coefficients: Hadoop Phase 1
split 1split 2split 3
(x , null)
(x , null)
(x , null)
JobTracker
Mapper
Mapper
Mapper
(x , 1)
(x , 1)
(x , 1)
In-MemoryMap
In-MemoryMap
In-MemoryMap
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 307: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/307.jpg)
Exact Top-k Wavelet Coefficients: Hadoop Phase 1
split 1split 2split 3
(x , null)
(x , null)
(x , null)
JobTracker
Mapper
Mapper
Mapper
In-MemoryMap
In-MemoryMap
In-MemoryMap
close()
close()
close()
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 308: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/308.jpg)
Exact Top-k Wavelet Coefficients: Hadoop Phase 1
split 1split 2split 3
(x , null)
(x , null)
(x , null)
JobTracker
Mapper
Mapper
Mapper
close()
close()
close()
StreamingCompute wi
StreamingCompute wi
StreamingCompute wi
(x , v1(x))
(x , v2(x))
(x , v3(x))
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 309: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/309.jpg)
Exact Top-k Wavelet Coefficients: Hadoop Phase 1
split 1split 2split 3
(x , null)
(x , null)
(x , null)
JobTracker
Mapper
Mapper
Mapper
close()
close()
close()
StreamingCompute wi
StreamingCompute wi
StreamingCompute wi
(x , v1(x))
(x , v2(x))
(x , v3(x))
Priority Queue:Largest wi ,j
Priority Queue:Smallest wi ,j
HDFS:wi ,j Streamed to disk
Priority Queue:Largest wi ,j
Priority Queue:Smallest wi ,j
HDFS:wi ,j Streamed to disk
Priority Queue:Largest wi ,j
Priority Queue:Smallest wi ,j
HDFS:wi ,j Streamed to disk
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 310: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/310.jpg)
Exact Top-k Wavelet Coefficients: Hadoop Phase 1
split 1split 2split 3
(x , null)
(x , null)
(x , null)
JobTracker
Mapper
Mapper
Mapper
close()
close()
close()
StreamingCompute wi
StreamingCompute wi
StreamingCompute wi
(x , v1(x))
(x , v2(x))
(x , v3(x))
split 1i wi ,1
5 202 71 64 -26 -153 -30
split 2i wi ,2
5 124 71 22 -53 -146 -20
split 3i wi ,3
1 103 64 52 -35 -66 -10
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 311: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/311.jpg)
Exact Top-k Wavelet Coefficients: Hadoop Phase 1
split 1split 2split 3
(x , null)
(x , null)
(x , null)
JobTracker
Mapper
Mapper
Mapper
close()
close()
close()
StreamingCompute wi
StreamingCompute wi
StreamingCompute wi
(x , v1(x))
(x , v2(x))
(x , v3(x))
split 1i wi ,1
5 202 71 64 -26 -153 -30
split 2i wi ,2
5 124 71 22 -53 -146 -20
split 3i wi ,3
1 103 64 52 -35 -66 -10
p1
p1
p1
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 312: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/312.jpg)
Exact Top-k Wavelet Coefficients: Hadoop Phase 1
split 1split 2split 3
(x , null)
(x , null)
(x , null)
JobTracker
Mapper
Mapper
Mapper
i wi ,1
5 20
i wi ,1
3 -30
i wi ,2
5 12
i wi ,2
6 -20
i wi ,3
1 10
i wi ,3
6 -10
p1
(i , (split 1,wi ,1))
p1
(i , (split 2,wi ,2))
p1
(i , (split 3,wi ,3))
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 313: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/313.jpg)
Exact Top-k Wavelet Coefficients: Hadoop Phase 1
split 1split 2split 3
(x , null)
(x , null)
(x , null)
JobTracker
Mapper
Mapper
Mapper
i wi ,1
5 20
i wi ,1
3 -30
i wi ,2
5 12
i wi ,2
6 -20
i wi ,3
1 10
i wi ,3
6 -10
p1
(i , (split 1,wi ,1))
p1
(i , (split 2,wi ,2))
p1
(i , (split 3,wi ,3))
Reducer
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 314: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/314.jpg)
Exact Top-k Wavelet Coefficients: Hadoop Phase 1
split 1split 2split 3
(x , null)
(x , null)
(x , null)
JobTracker
Mapper
Mapper
Mapper
i wi ,1
5 20
i wi ,1
3 -30
i wi ,2
5 12
i wi ,2
6 -20
i wi ,3
1 10
i wi ,3
6 -10
p1
(i , (split 1,wi ,1))
p1
(i , (split 2,wi ,2))
p1
(i , (split 3,wi ,3))
Reducer
Ri (split) j wi ,j
1 3 103 1 -305 1 205 2 126 2 -206 3 -10
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 315: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/315.jpg)
Exact Top-k Wavelet Coefficients: Hadoop Phase 1
split 1split 2split 3
(x , null)
(x , null)
(x , null)
JobTracker
Mapper
Mapper
Mapper
i wi ,1
5 20
i wi ,1
3 -30
i wi ,2
5 12
i wi ,2
6 -20
i wi ,3
1 10
i wi ,3
6 -10
p1
(i , (split 1,wi ,1))
p1
(i , (split 2,wi ,2))
p1
(i , (split 3,wi ,3))
Reducer
Ri (split) j wi ,j
1 3 103 1 -305 1 205 2 126 2 -206 3 -10
Ri wi Fx τ+(wi) τ−(wi) τ (wi)1 10 001 42 -40 03 -30 100 -8 -60 85 32 110 42 22 226 -30 011 -10 -60 10
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 316: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/316.jpg)
Exact Top-k Wavelet Coefficients: Hadoop Phase 1
split 1split 2split 3
(x , null)
(x , null)
(x , null)
JobTracker
Mapper
Mapper
Mapper
i wi ,1
5 20
i wi ,1
3 -30
i wi ,2
5 12
i wi ,2
6 -20
i wi ,3
1 10
i wi ,3
6 -10
p1
(i , (split 1,wi ,1))
p1
(i , (split 2,wi ,2))
p1
(i , (split 3,wi ,3))
Reducer
Ri (split) j wi ,j
1 3 103 1 -305 1 205 2 126 2 -206 3 -10
Ri wi Fx τ+(wi) τ−(wi) τ (wi)1 10 001 42 -40 03 -30 100 -8 -60 85 32 110 42 22 226 -30 011 -10 -60 10
Reduce Phase
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 317: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/317.jpg)
Exact Top-k Wavelet Coefficients: Hadoop Phase 1
split 1split 2split 3
(x , null)
(x , null)
(x , null)
JobTracker
Mapper
Mapper
Mapper
i wi ,1
5 20
i wi ,1
3 -30
i wi ,2
5 12
i wi ,2
6 -20
i wi ,3
1 10
i wi ,3
6 -10
p1
(i , (split 1,wi ,1))
p1
(i , (split 2,wi ,2))
p1
(i , (split 3,wi ,3))
Reducer
Ri (split) j wi ,j
1 3 103 1 -305 1 205 2 126 2 -206 3 -10
Ri wi Fx τ+(wi) τ−(wi) τ (wi)1 10 001 42 -40 03 -30 100 -8 -60 85 32 110 42 22 226 -30 011 -10 -60 10
Close Phase
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 318: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/318.jpg)
Exact Top-k Wavelet Coefficients: Hadoop Phase 1
split 1split 2split 3
(x , null)
(x , null)
(x , null)
JobTracker
Mapper
Mapper
Mapper
i wi ,1
5 20
i wi ,1
3 -30
i wi ,2
5 12
i wi ,2
6 -20
i wi ,3
1 10
i wi ,3
6 -10
p1
(i , (split 1,wi ,1))
p1
(i , (split 2,wi ,2))
p1
(i , (split 3,wi ,3))
Reducer
Ri (split) j wi ,j
1 3 103 1 -305 1 205 2 126 2 -206 3 -10
Ri wi Fx τ+(wi) τ−(wi) τ (wi)1 10 001 42 -40 03 -30 100 -8 -60 85 32 110 42 22 226 -30 011 -10 -60 10
Close Phase
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 319: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/319.jpg)
Exact Top-k Wavelet Coefficients: Hadoop Phase 1
split 1split 2split 3
(x , null)
(x , null)
(x , null)
JobTracker
Mapper
Mapper
Mapper
i wi ,1
5 20
i wi ,1
3 -30
i wi ,2
5 12
i wi ,2
6 -20
i wi ,3
1 10
i wi ,3
6 -10
p1
(i , (split 1,wi ,1))
p1
(i , (split 2,wi ,2))
p1
(i , (split 3,wi ,3))
Reducer
Ri (split) j wi ,j
1 3 103 1 -305 1 205 2 126 2 -206 3 -10
Ri wi Fx τ+(wi) τ−(wi) τ (wi)1 10 001 42 -40 03 -30 100 -8 -60 85 32 110 42 22 226 -30 011 -10 -60 10
Close Phase
T1 = 22, T1/m = 22/3
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 320: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/320.jpg)
Exact Top-k Wavelet Coefficients: Hadoop Phase 1
split 1split 2split 3
(x , null)
(x , null)
(x , null)
JobTracker
Mapper
Mapper
Mapper
i wi ,1
5 20
i wi ,1
3 -30
i wi ,2
5 12
i wi ,2
6 -20
i wi ,3
1 10
i wi ,3
6 -10
p1
(i , (split 1,wi ,1))
p1
(i , (split 2,wi ,2))
p1
(i , (split 3,wi ,3))
Reducer
Ri (split) j wi ,j
1 3 103 1 -305 1 205 2 126 2 -206 3 -10
Ri wi Fx τ+(wi) τ−(wi) τ (wi)1 10 001 42 -40 03 -30 100 -8 -60 85 32 110 42 22 226 -30 011 -10 -60 10
Close Phase
T1 = 22, T1/m = 22/3
CoordinatorState
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 321: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/321.jpg)
Exact Top-k Wavelet Coefficients: Hadoop Phase 1
split 1split 2split 3
(x , null)
(x , null)
(x , null)
JobTracker
Mapper
Mapper
Mapper
i wi ,1
5 20
i wi ,1
3 -30
i wi ,2
5 12
i wi ,2
6 -20
i wi ,3
1 10
i wi ,3
6 -10
p1
(i , (split 1,wi ,1))
p1
(i , (split 2,wi ,2))
p1
(i , (split 3,wi ,3))
Reducer
Ri (split) j wi ,j
1 3 103 1 -305 1 205 2 126 2 -206 3 -10
Ri wi Fx τ+(wi) τ−(wi) τ (wi)1 10 001 42 -40 03 -30 100 -8 -60 85 32 110 42 22 226 -30 011 -10 -60 10
Close Phase
T1 = 22, T1/m = 22/3
HDFS or Local
CoordinatorState
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 322: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/322.jpg)
Exact Top-k Wavelet Coefficients: Hadoop Phase 1
split 1split 2split 3
(x , null)
(x , null)
(x , null)
JobTracker
Mapper
Mapper
Mapper
i wi ,1
5 20
i wi ,1
3 -30
i wi ,2
5 12
i wi ,2
6 -20
i wi ,3
1 10
i wi ,3
6 -10
p1
(i , (split 1,wi ,1))
p1
(i , (split 2,wi ,2))
p1
(i , (split 3,wi ,3))
Reducer
Ri (split) j wi ,j
1 3 103 1 -305 1 205 2 126 2 -206 3 -10
Ri wi Fx τ+(wi) τ−(wi) τ (wi)1 10 001 42 -40 03 -30 100 -8 -60 85 32 110 42 22 226 -30 011 -10 -60 10
Close Phase
T1 = 22, T1/m = 22/3
T1/m
CoordinatorState
HDFS
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 323: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/323.jpg)
Exact Top-k Wavelet Coefficients: Hadoop Phase 1
split 1split 2split 3
JobTracker
T1/m = 22/3
Job Configuration
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 324: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/324.jpg)
Exact Top-k Wavelet Coefficients: Hadoop Phase 1
split 1split 2split 3
JobTracker
T1/m = 22/3
Job Configuration
bypass
bypass
bypass
split 1:saved wi ,j
split 2:saved wi ,j
split 3:saved wi ,j
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 325: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/325.jpg)
Exact Top-k Wavelet Coefficients: Hadoop Phase 3
split 1split 2split 3
JobTracker
T1/m = 22/3
Job Configuration
bypass
Mapper
Mapper
Mapper
split 1i wi ,1
5 202 71 64 -26 -153 -30
split 2i wi ,2
5 124 71 22 -53 -146 -20
split 3i wi ,3
1 103 64 52 -35 -66 -10
bypass
bypass
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 326: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/326.jpg)
Exact Top-k Wavelet Coefficients: Hadoop Phase 3
split 1split 2split 3
JobTracker
T1/m = 22/3
Job Configuration
bypass
Mapper
Mapper
Mapper
split 1i wi ,1
5 202 71 64 -26 -153 -30
split 2i wi ,2
5 124 71 22 -53 -146 -20
split 3i wi ,3
1 103 64 52 -35 -66 -10
bypass
bypass
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 327: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/327.jpg)
Exact Top-k Wavelet Coefficients: Hadoop Phase 3
split 1split 2split 3
JobTracker
T1/m = 22/3
Job Configuration
bypass
Mapper
Mapper
Mapper
split 1i wi ,1
5 202 71 64 -26 -153 -30
split 2i wi ,2
5 124 71 22 -53 -146 -20
split 3i wi ,3
1 103 64 52 -35 -66 -10
bypass
bypass
Reducer
(6(split 1,-15))
(3,(split 2,-14))
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 328: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/328.jpg)
Exact Top-k Wavelet Coefficients: Hadoop Phase 3
split 1split 2split 3
JobTracker
T1/m = 22/3
Job Configuration
bypass
Mapper
Mapper
Mapper
split 1i wi ,1
5 202 71 64 -26 -153 -30
split 2i wi ,2
5 124 71 22 -53 -146 -20
split 3i wi ,3
1 103 64 52 -35 -66 -10
bypass
bypass
Reducer
(6(split 1,-15))
(3,(split 2,-14))
CoordinatorState
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 329: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/329.jpg)
Exact Top-k Wavelet Coefficients: Hadoop Phase 3
split 1split 2split 3
JobTracker
T1/m = 22/3
Job Configuration
bypass
Mapper
Mapper
Mapper
split 1i wi ,1
5 202 71 64 -26 -153 -30
split 2i wi ,2
5 124 71 22 -53 -146 -20
split 3i wi ,3
1 103 64 52 -35 -66 -10
bypass
bypass
Reducer
(6(split 1,-15))
(3,(split 2,-14))
Ri wi Fx1 10 0013 -30 1005 32 1106 -30 011
update
CoordinatorState
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 330: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/330.jpg)
Exact Top-k Wavelet Coefficients: Hadoop Phase 3
split 1split 2split 3
JobTracker
T1/m = 22/3
Job Configuration
bypass
Mapper
Mapper
Mapper
split 1i wi ,1
5 202 71 64 -26 -153 -30
split 2i wi ,2
5 124 71 22 -53 -146 -20
split 3i wi ,3
1 103 64 52 -35 -66 -10
bypass
bypass
Reducer
(6(split 1,-15))
(3,(split 2,-14))update
Ri wi Fx τ+(wi) τ−(wi) τ (wi) τ ′(wi)1 10 001 24.6 -4.6 0 24.63 -44 110 -36.6 -51.3 36.6 51.35 32 110 39.3 24.6 24.6 39.36 -45 111 -45 -45 45 45
CoordinatorState
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 331: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/331.jpg)
Exact Top-k Wavelet Coefficients: Hadoop Phase 3
split 1split 2split 3
JobTracker
T1/m = 22/3
Job Configuration
bypass
Mapper
Mapper
Mapper
split 1i wi ,1
5 202 71 64 -26 -153 -30
split 2i wi ,2
5 124 71 22 -53 -146 -20
split 3i wi ,3
1 103 64 52 -35 -66 -10
bypass
bypass
Reducer
(6(split 1,-15))
(3,(split 2,-14))update
Ri wi Fx τ+(wi) τ−(wi) τ (wi) τ ′(wi)1 10 001 24.6 -4.6 0 24.63 -44 110 -36.6 -51.3 36.6 51.35 32 110 39.3 24.6 24.6 39.36 -45 111 -45 -45 45 45
reduce phase
CoordinatorState
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 332: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/332.jpg)
Exact Top-k Wavelet Coefficients: Hadoop Phase 3
split 1split 2split 3
JobTracker
T1/m = 22/3
Job Configuration
bypass
Mapper
Mapper
Mapper
split 1i wi ,1
5 202 71 64 -26 -153 -30
split 2i wi ,2
5 124 71 22 -53 -146 -20
split 3i wi ,3
1 103 64 52 -35 -66 -10
bypass
bypass
Reducer
(6(split 1,-15))
(3,(split 2,-14))update
Ri wi Fx τ+(wi) τ−(wi) τ (wi) τ ′(wi)1 10 001 24.6 -4.6 0 24.63 -44 110 -36.6 -51.3 36.6 51.35 32 110 39.3 24.6 24.6 39.36 -45 111 -45 -45 45 45
close phase
CoordinatorState
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 333: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/333.jpg)
Exact Top-k Wavelet Coefficients: Hadoop Phase 3
split 1split 2split 3
JobTracker
T1/m = 22/3
Job Configuration
bypass
Mapper
Mapper
Mapper
split 1i wi ,1
5 202 71 64 -26 -153 -30
split 2i wi ,2
5 124 71 22 -53 -146 -20
split 3i wi ,3
1 103 64 52 -35 -66 -10
bypass
bypass
Reducer
(6(split 1,-15))
(3,(split 2,-14))
Ri wi Fx τ+(wi) τ−(wi) τ (wi) τ ′(wi)1 10 001 24.6 -4.6 0 24.63 -44 110 -36.6 -51.3 36.6 51.35 32 110 39.3 24.6 24.6 39.36 -45 111 -45 -45 45 45
close phase
CoordinatorState
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 334: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/334.jpg)
Exact Top-k Wavelet Coefficients: Hadoop Phase 3
split 1split 2split 3
JobTracker
T1/m = 22/3
Job Configuration
bypass
Mapper
Mapper
Mapper
split 1i wi ,1
5 202 71 64 -26 -153 -30
split 2i wi ,2
5 124 71 22 -53 -146 -20
split 3i wi ,3
1 103 64 52 -35 -66 -10
bypass
bypass
Reducer
(6(split 1,-15))
(3,(split 2,-14))
Ri wi Fx τ+(wi) τ−(wi) τ (wi) τ ′(wi)1 10 001 24.6 -4.6 0 24.63 -44 110 -36.6 -51.3 36.6 51.35 32 110 39.3 24.6 24.6 39.36 -45 111 -45 -45 45 45
close phase
T2 = 45
CoordinatorState
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 335: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/335.jpg)
Exact Top-k Wavelet Coefficients: Hadoop Phase 3
split 1split 2split 3
JobTracker
T1/m = 22/3
Job Configuration
bypass
Mapper
Mapper
Mapper
split 1i wi ,1
5 202 71 64 -26 -153 -30
split 2i wi ,2
5 124 71 22 -53 -146 -20
split 3i wi ,3
1 103 64 52 -35 -66 -10
bypass
bypass
Reducer
(6(split 1,-15))
(3,(split 2,-14))
Ri wi Fx τ+(wi) τ−(wi) τ (wi) τ ′(wi)1 10 001 24.6 -4.6 0 24.63 -44 110 -36.6 -51.3 36.6 51.35 32 110 39.3 24.6 24.6 39.36 -45 111 -45 -45 45 45
close phase
T2 = 45
CoordinatorState
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 336: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/336.jpg)
Exact Top-k Wavelet Coefficients: Hadoop Phase 3
split 1split 2split 3
JobTracker
T1/m = 22/3
Job Configuration
bypass
Mapper
Mapper
Mapper
split 1i wi ,1
5 202 71 64 -26 -153 -30
split 2i wi ,2
5 124 71 22 -53 -146 -20
split 3i wi ,3
1 103 64 52 -35 -66 -10
bypass
bypass
Reducer
(6(split 1,-15))
(3,(split 2,-14))
Ri wi Fx τ+(wi) τ−(wi) τ (wi) τ ′(wi)1 10 001 24.6 -4.6 0 24.63 -44 110 -36.6 -51.3 36.6 51.35 32 110 39.3 24.6 24.6 39.36 -45 111 -45 -45 45 45
close phase
T2 = 45
CoordinatorState
CoordinatorState
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 337: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/337.jpg)
Exact Top-k Wavelet Coefficients: Hadoop Phase 3
split 1split 2split 3
JobTracker
T1/m = 22/3
Job Configuration
bypass
Mapper
Mapper
Mapper
split 1i wi ,1
5 202 71 64 -26 -153 -30
split 2i wi ,2
5 124 71 22 -53 -146 -20
split 3i wi ,3
1 103 64 52 -35 -66 -10
bypass
bypass
Reducer
(6(split 1,-15))
(3,(split 2,-14))
Ri wi Fx τ+(wi) τ−(wi) τ (wi) τ ′(wi)1 10 001 24.6 -4.6 0 24.63 -44 110 -36.6 -51.3 36.6 51.35 32 110 39.3 24.6 24.6 39.36 -45 111 -45 -45 45 45
close phase
T2 = 45
HDFS or Local
CoordinatorState
CoordinatorState
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 338: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/338.jpg)
Exact Top-k Wavelet Coefficients: Hadoop Phase 3
split 1split 2split 3
JobTracker
T1/m = 22/3
Job Configuration
bypass
Mapper
Mapper
Mapper
split 1i wi ,1
5 202 71 64 -26 -153 -30
split 2i wi ,2
5 124 71 22 -53 -146 -20
split 3i wi ,3
1 103 64 52 -35 -66 -10
bypass
bypass
Reducer
(6(split 1,-15))
(3,(split 2,-14))
Ri wi Fx τ+(wi) τ−(wi) τ (wi) τ ′(wi)1 10 001 24.6 -4.6 0 24.63 -44 110 -36.6 -51.3 36.6 51.35 32 110 39.3 24.6 24.6 39.36 -45 111 -45 -45 45 45
close phase
T2 = 45
CoordinatorState
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 339: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/339.jpg)
Exact Top-k Wavelet Coefficients: Hadoop Phase 3
split 1split 2split 3
JobTracker
T1/m = 22/3
Job Configuration
bypass
Mapper
Mapper
Mapper
split 1i wi ,1
5 202 71 64 -26 -153 -30
split 2i wi ,2
5 124 71 22 -53 -146 -20
split 3i wi ,3
1 103 64 52 -35 -66 -10
bypass
bypass
Reducer
(6(split 1,-15))
(3,(split 2,-14))
Ri wi Fx τ+(wi) τ−(wi) τ (wi) τ ′(wi)1 10 001 24.6 -4.6 0 24.63 -44 110 -36.6 -51.3 36.6 51.35 32 110 39.3 24.6 24.6 39.36 -45 111 -45 -45 45 45
close phase
T2 = 45
CoordinatorState
wi ∈ Rmissing a wi ,j
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 340: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/340.jpg)
Exact Top-k Wavelet Coefficients: Hadoop Phase 3
split 1split 2split 3
JobTracker
T1/m = 22/3
Job Configuration
bypass
Mapper
Mapper
Mapper
split 1i wi ,1
5 202 71 64 -26 -153 -30
split 2i wi ,2
5 124 71 22 -53 -146 -20
split 3i wi ,3
1 103 64 52 -35 -66 -10
bypass
bypass
Reducer
(6(split 1,-15))
(3,(split 2,-14))
Ri wi Fx τ+(wi) τ−(wi) τ (wi) τ ′(wi)1 10 001 24.6 -4.6 0 24.63 -44 110 -36.6 -51.3 36.6 51.35 32 110 39.3 24.6 24.6 39.36 -45 111 -45 -45 45 45
close phase
T2 = 45
HDFS
CoordinatorState
wi ∈ Rmissing a wi ,j
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 341: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/341.jpg)
Exact Top-k Wavelet Coefficients: Hadoop Phase 3
split 1split 2split 3
JobTracker
i3
Distributed Cache
T1/m = 22/3
Job Configuration
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 342: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/342.jpg)
Exact Top-k Wavelet Coefficients: Hadoop Phase 3
split 1split 2split 3
JobTracker
i3
Distributed Cache
bypass
bypass
bypass
T1/m = 22/3
Job Configuration
split 1:saved wi ,j
split 2:saved wi ,j
split 3:saved wi ,j
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 343: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/343.jpg)
Exact Top-k Wavelet Coefficients: Hadoop Phase 3
split 1split 2split 3
JobTracker
i3
Distributed Cache
bypass
Mapper
Mapper
Mapper
split 1i wi ,1
5 202 71 64 -26 -153 -30
split 2i wi ,2
5 124 71 22 -53 -146 -20
split 3i wi ,3
1 103 64 52 -35 -66 -10
bypass
bypass
T1/m = 22/3
Job Configuration
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 344: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/344.jpg)
Exact Top-k Wavelet Coefficients: Hadoop Phase 3
split 1split 2split 3
JobTracker
i3
Distributed Cache
bypass
Mapper
Mapper
Mapper
split 1i wi ,1
5 202 71 64 -26 -153 -30
split 2i wi ,2
5 124 71 22 -53 -146 -20
split 3i wi ,3
1 103 64 52 -35 -66 -10
bypass
bypass
T1/m = 22/3
Job Configuration
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 345: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/345.jpg)
Exact Top-k Wavelet Coefficients: Hadoop Phase 3
split 1split 2split 3
JobTracker
i3
Distributed Cache
bypass
Mapper
Mapper
Mapper
split 1i wi ,1
5 202 71 64 -26 -153 -30
split 2i wi ,2
5 124 71 22 -53 -146 -20
split 3i wi ,3
1 103 64 52 -35 -66 -10
bypass
bypass
T1/m = 22/3
Job Configuration
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 346: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/346.jpg)
Exact Top-k Wavelet Coefficients: Hadoop Phase 3
split 1split 2split 3
JobTracker
i3
Distributed Cache
bypass
Mapper
Mapper
Mapper
split 1i wi ,1
5 202 71 64 -26 -153 -30
split 2i wi ,2
5 124 71 22 -53 -146 -20
split 3i wi ,3
1 103 64 52 -35 -66 -10
bypass
bypass
Reducer
(3,(site 3,6))
T1/m = 22/3
Job Configuration
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 347: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/347.jpg)
Exact Top-k Wavelet Coefficients: Hadoop Phase 3
split 1split 2split 3
JobTracker
i3
Distributed Cache
bypass
Mapper
Mapper
Mapper
split 1i wi ,1
5 202 71 64 -26 -153 -30
split 2i wi ,2
5 124 71 22 -53 -146 -20
split 3i wi ,3
1 103 64 52 -35 -66 -10
bypass
bypass
Reducer
(3,(site 3,6))
T1/m = 22/3
Job Configuration
CoordinatorState
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 348: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/348.jpg)
Exact Top-k Wavelet Coefficients: Hadoop Phase 3
split 1split 2split 3
JobTracker
i3
Distributed Cache
bypass
Mapper
Mapper
Mapper
split 1i wi ,1
5 202 71 64 -26 -153 -30
split 2i wi ,2
5 124 71 22 -53 -146 -20
split 3i wi ,3
1 103 64 52 -35 -66 -10
bypass
bypass
Reducer
(3,(site 3,6))
Ri wi Fx3 -44 1106 -45 111
T1/m = 22/3
Job Configuration
CoordinatorState
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 349: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/349.jpg)
Exact Top-k Wavelet Coefficients: Hadoop Phase 3
split 1split 2split 3
JobTracker
i3
Distributed Cache
bypass
Mapper
Mapper
Mapper
split 1i wi ,1
5 202 71 64 -26 -153 -30
split 2i wi ,2
5 124 71 22 -53 -146 -20
split 3i wi ,3
1 103 64 52 -35 -66 -10
bypass
bypass
Reducer
(3,(site 3,6))
Ri wi Fx3 -44 1106 -45 111
update
T1/m = 22/3
Job Configuration
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 350: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/350.jpg)
Exact Top-k Wavelet Coefficients: Hadoop Phase 3
split 1split 2split 3
JobTracker
i3
Distributed Cache
bypass
Mapper
Mapper
Mapper
split 1i wi ,1
5 202 71 64 -26 -153 -30
split 2i wi ,2
5 124 71 22 -53 -146 -20
split 3i wi ,3
1 103 64 52 -35 -66 -10
bypass
bypass
Reducer
(3,(site 3,6)) update
Ri wi Fx3 -38 1116 -45 111
T1/m = 22/3
Job Configuration
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 351: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/351.jpg)
Exact Top-k Wavelet Coefficients: Hadoop Phase 3
split 1split 2split 3
JobTracker
i3
Distributed Cache
bypass
Mapper
Mapper
Mapper
split 1i wi ,1
5 202 71 64 -26 -153 -30
split 2i wi ,2
5 124 71 22 -53 -146 -20
split 3i wi ,3
1 103 64 52 -35 -66 -10
bypass
bypass
Reducer
(3,(site 3,6))
Ri wi Fx3 -38 1116 -45 111
T1/m = 22/3
Job Configuration
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 352: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/352.jpg)
Exact Top-k Wavelet Coefficients: Hadoop Phase 3
split 1split 2split 3
JobTracker
i3
Distributed Cache
bypass
Mapper
Mapper
Mapper
split 1i wi ,1
5 202 71 64 -26 -153 -30
split 2i wi ,2
5 124 71 22 -53 -146 -20
split 3i wi ,3
1 103 64 52 -35 -66 -10
bypass
bypass
Reducer
(3,(site 3,6))
Ri wi Fx3 -38 1116 -45 111
w6 = −45?
T1/m = 22/3
Job Configuration
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 353: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/353.jpg)
Exact Top-k Wavelet Coefficients: Hadoop Phase 3
split 1split 2split 3
JobTracker
i3
Distributed Cache
bypass
Mapper
Mapper
Mapper
split 1i wi ,1
5 202 71 64 -26 -153 -30
split 2i wi ,2
5 124 71 22 -53 -146 -20
split 3i wi ,3
1 103 64 52 -35 -66 -10
bypass
bypass
Reducer
(3,(site 3,6))
Ri wi Fx3 -38 1116 -45 111
w6 = −45?
w6 = −45
T1/m = 22/3
Job Configuration
Results
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 354: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/354.jpg)
Outline
1 Introduction and MotivationHistogramsMapReduce and Hadoop
2 Exact Top-k Wavelet CoefficientsNaive SolutionHadoop Wavelet Top-k : Our Efficient Exact Solution
3 Approximate Top-k Wavelet CoefficientsLinearly Combinable Sketch MethodOur First Sampling Based ApproachAn Improved Sampling ApproachTwo-Level Sampling
4 Experiments
5 ConclusionsHadoop Wavelet Top-k in Hadoop
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 355: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/355.jpg)
Exact Top-k Wavelet Coefficients: Hadoop Phase 1
split 1split 2split 3
(x , null)
(x , null)
(x , null)
JobTracker
Mapper
Mapper
Mapper
k = 1
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 356: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/356.jpg)
Exact Top-k Wavelet Coefficients: Hadoop Phase 1
split 1split 2split 3
(x , null)
(x , null)
(x , null)
JobTracker
Mapper
Mapper
Mapper
(x , 1)
(x , 1)
(x , 1)
In-MemoryMap
In-MemoryMap
In-MemoryMap
k = 1
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 357: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/357.jpg)
Exact Top-k Wavelet Coefficients: Hadoop Phase 1
split 1split 2split 3
(x , null)
(x , null)
(x , null)
JobTracker
Mapper
Mapper
Mapper
In-MemoryMap
In-MemoryMap
In-MemoryMap
close()
close()
close()
k = 1
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 358: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/358.jpg)
Exact Top-k Wavelet Coefficients: Hadoop Phase 1
split 1split 2split 3
(x , null)
(x , null)
(x , null)
JobTracker
Mapper
Mapper
Mapper
close()
close()
close()
StreamingCompute wi
StreamingCompute wi
StreamingCompute wi
(x , v1(x))
(x , v2(x))
(x , v3(x))
k = 1
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 359: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/359.jpg)
Exact Top-k Wavelet Coefficients: Hadoop Phase 1
split 1split 2split 3
(x , null)
(x , null)
(x , null)
JobTracker
Mapper
Mapper
Mapper
close()
close()
close()
StreamingCompute wi
StreamingCompute wi
StreamingCompute wi
(x , v1(x))
(x , v2(x))
(x , v3(x))
Priority Queue:Largest wi ,j
Priority Queue:Smallest wi ,j
HDFS:wi ,j Streamed to disk
Priority Queue:Largest wi ,j
Priority Queue:Smallest wi ,j
HDFS:wi ,j Streamed to disk
Priority Queue:Largest wi ,j
Priority Queue:Smallest wi ,j
HDFS:wi ,j Streamed to disk
k = 1
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 360: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/360.jpg)
Exact Top-k Wavelet Coefficients: Hadoop Phase 1
split 1split 2split 3
(x , null)
(x , null)
(x , null)
JobTracker
Mapper
Mapper
Mapper
close()
close()
close()
StreamingCompute wi
StreamingCompute wi
StreamingCompute wi
(x , v1(x))
(x , v2(x))
(x , v3(x))
split 1w2,1
w1,1
w4,1
w5,1
w6,1
w3,1
split 2w1,2
w2,2
w6,2
w3,2
w5,2
w4,2
split 3w2,3
w1,3
w6,3
w4,3
w3,3
w5,3
k = 1
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 361: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/361.jpg)
Exact Top-k Wavelet Coefficients: Hadoop Phase 1
split 1split 2split 3
(x , null)
(x , null)
(x , null)
JobTracker
Mapper
Mapper
Mapper
close()
close()
close()
StreamingCompute wi
StreamingCompute wi
StreamingCompute wi
(x , v1(x))
(x , v2(x))
(x , v3(x))
split 1w2,1
w1,1
w4,1
w5,1
w6,1
w3,1
p1
p1
p1
split 2w1,2
w2,2
w6,2
w3,2
w5,2
w4,2
split 3w2,3
w1,3
w6,3
w4,3
w3,3
w5,3
k largest
k largest
k largest
k smallest
k smallest
k smallest
k = 1
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 362: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/362.jpg)
Exact Top-k Wavelet Coefficients: Hadoop Phase 1
split 1split 2split 3
(x , null)
(x , null)
(x , null)
JobTracker
Mapper
Mapper
Mapper
w+1 w2,1
p1
(i , (split 1,wi ,1))
p1
(i , (split 2,wi ,2))
p1
(i , (split 3,wi ,3))
w−1 w3,1
w+2 w1,2
w−2 w4,2
w+3 w2,3
w−3 w5,3
k = 1
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 363: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/363.jpg)
Exact Top-k Wavelet Coefficients: Hadoop Phase 1
split 1split 2split 3
(x , null)
(x , null)
(x , null)
JobTracker
Mapper
Mapper
Mapper
w+1 w2,1
p1
(i , (split 1,wi ,1))
p1
(i , (split 2,wi ,2))
p1
(i , (split 3,wi ,3))
Reducer
w−1 w3,1
w+2 w1,2
w−2 w4,2
w+3 w2,3
w−3 w5,3
k = 1
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 364: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/364.jpg)
Exact Top-k Wavelet Coefficients: Hadoop Phase 1
split 1split 2split 3
(x , null)
(x , null)
(x , null)
JobTracker
Mapper
Mapper
Mapper
w+1 w2,1
p1
(i , (split 1,wi ,1))
p1
(i , (split 2,wi ,2))
p1
(i , (split 3,wi ,3))
Reducer
w−1 w3,1
w+2 w1,2
w−2 w4,2
w+3 w2,3
w−3 w5,3
Compute
R
wi
Fiτ+(wi)τ−(wi)τ (wi)
k = 1
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 365: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/365.jpg)
Exact Top-k Wavelet Coefficients: Hadoop Phase 1
split 1split 2split 3
(x , null)
(x , null)
(x , null)
JobTracker
Mapper
Mapper
Mapper
w+1 w2,1
p1
(i , (split 1,wi ,1))
p1
(i , (split 2,wi ,2))
p1
(i , (split 3,wi ,3))
Reducer
w−1 w3,1
w+2 w1,2
w−2 w4,2
w+3 w2,3
w−3 w5,3
Compute
R
wi
Fiτ+(wi)τ−(wi)τ (wi)
Reduce phase
k = 1
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 366: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/366.jpg)
Exact Top-k Wavelet Coefficients: Hadoop Phase 1
split 1split 2split 3
(x , null)
(x , null)
(x , null)
JobTracker
Mapper
Mapper
Mapper
w+1 w2,1
p1
(i , (split 1,wi ,1))
p1
(i , (split 2,wi ,2))
p1
(i , (split 3,wi ,3))
Reducer
w−1 w3,1
w+2 w1,2
w−2 w4,2
w+3 w2,3
w−3 w5,3
Compute
R
wi
Fiτ+(wi)τ−(wi)τ (wi)
Close phase
k = 1
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 367: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/367.jpg)
Exact Top-k Wavelet Coefficients: Hadoop Phase 1
split 1split 2split 3
(x , null)
(x , null)
(x , null)
JobTracker
Mapper
Mapper
Mapper
w+1 w2,1
p1
(i , (split 1,wi ,1))
p1
(i , (split 2,wi ,2))
p1
(i , (split 3,wi ,3))
Reducer
w−1 w3,1
w+2 w1,2
w−2 w4,2
w+3 w2,3
w−3 w5,3
Compute
T1
k ’th largest τ (wi)
R
wi
Fiτ+(wi)τ−(wi)τ (wi)
Close phase
k = 1
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 368: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/368.jpg)
Exact Top-k Wavelet Coefficients: Hadoop Phase 1
split 1split 2split 3
(x , null)
(x , null)
(x , null)
JobTracker
Mapper
Mapper
Mapper
w+1 w2,1
p1
(i , (split 1,wi ,1))
p1
(i , (split 2,wi ,2))
p1
(i , (split 3,wi ,3))
Reducer
w−1 w3,1
w+2 w1,2
w−2 w4,2
w+3 w2,3
w−3 w5,3
ComputeCoordinatorState
HDFS
Local or HDFS
R
wi
Fiτ+(wi)τ−(wi)τ (wi)
Close phase
T1
k ’th largest τ (wi)
k = 1
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 369: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/369.jpg)
Exact Top-k Wavelet Coefficients: Hadoop Phase 2
split 1split 2split 3
JobTracker
T1/m
Job Configuration
k = 1
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 370: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/370.jpg)
Exact Top-k Wavelet Coefficients: Hadoop Phase 2
split 1split 2split 3
JobTracker
T1/m
Job Configuration
bypass
bypass
bypass
split 1:saved wi ,j
split 2:saved wi ,j
split 3:saved wi ,j
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 371: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/371.jpg)
Exact Top-k Wavelet Coefficients: Hadoop Phase 2
split 1split 2split 3
JobTracker
T1/m
Job Configuration
bypass
Mapper
Mapper
Mapper
bypass
bypass
split 1w2,1
w1,1
w4,1
w5,1
w6,1
w3,1
split 2w1,2
w2,2
w6,2
w3,2
w5,2
w4,2
split 3w2,3
w1,3
w6,3
w4,3
w3,3
w5,3
|wi ,j | > T1/m
|wi ,j | > T1/m
|wi ,j | > T1/m
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 372: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/372.jpg)
Exact Top-k Wavelet Coefficients: Hadoop Phase 2
split 1split 2split 3
JobTracker
T1/m
Job Configuration
bypass
Mapper
Mapper
Mapper
bypass
bypass
Reducer
split 1w2,1
w1,1
w4,1
w5,1
w6,1
w3,1
split 2w1,2
w2,2
w6,2
w3,2
w5,2
w4,2
split 3w2,3
w1,3
w6,3
w4,3
w3,3
w5,3
|wi ,j | > T1/m
|wi ,j | > T1/m
|wi ,j | > T1/m
(i , (split i ,wi ,j))
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 373: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/373.jpg)
Exact Top-k Wavelet Coefficients: Hadoop Phase 2
split 1split 2split 3
JobTracker
T1/m
Job Configuration
bypass
Mapper
Mapper
Mapper
bypass
bypass
ReducerCoordinatorState
split 1w2,1
w1,1
w4,1
w5,1
w6,1
w3,1
split 2w1,2
w2,2
w6,2
w3,2
w5,2
w4,2
split 3w2,3
w1,3
w6,3
w4,3
w3,3
w5,3
|wi ,j | > T1/m
|wi ,j | > T1/m
|wi ,j | > T1/m
(i , (split i ,wi ,j))
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 374: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/374.jpg)
Exact Top-k Wavelet Coefficients: Hadoop Phase 2
split 1split 2split 3
JobTracker
T1/m
Job Configuration
bypass
Mapper
Mapper
Mapper
bypass
bypass
ReducerCoordinatorState
split 1w2,1
w1,1
w4,1
w5,1
w6,1
w3,1
split 2w1,2
w2,2
w6,2
w3,2
w5,2
w4,2
split 3w2,3
w1,3
w6,3
w4,3
w3,3
w5,3
|wi ,j | > T1/m
|wi ,j | > T1/m
|wi ,j | > T1/m
(i , (split i ,wi ,j))
R
wi
Fiτ+(wi)τ−(wi)τ (wi)
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 375: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/375.jpg)
Exact Top-k Wavelet Coefficients: Hadoop Phase 2
split 1split 2split 3
JobTracker
T1/m
Job Configuration
bypass
Mapper
Mapper
Mapper
bypass
bypass
ReducerCoordinatorState
split 1w2,1
w1,1
w4,1
w5,1
w6,1
w3,1
split 2w1,2
w2,2
w6,2
w3,2
w5,2
w4,2
split 3w2,3
w1,3
w6,3
w4,3
w3,3
w5,3
|wi ,j | > T1/m
|wi ,j | > T1/m
|wi ,j | > T1/m
(i , (split i ,wi ,j))
R
wi
Fiτ+(wi)τ−(wi)τ (wi)
Refine
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 376: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/376.jpg)
Exact Top-k Wavelet Coefficients: Hadoop Phase 2
split 1split 2split 3
JobTracker
T1/m
Job Configuration
bypass
Mapper
Mapper
Mapper
bypass
bypass
ReducerCoordinatorState
split 1w2,1
w1,1
w4,1
w5,1
w6,1
w3,1
split 2w1,2
w2,2
w6,2
w3,2
w5,2
w4,2
split 3w2,3
w1,3
w6,3
w4,3
w3,3
w5,3
|wi ,j | > T1/m
|wi ,j | > T1/m
|wi ,j | > T1/m
(i , (split i ,wi ,j))
R
wi
Fiτ+(wi)τ−(wi)τ (wi)
Refine Reduce phase
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 377: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/377.jpg)
Exact Top-k Wavelet Coefficients: Hadoop Phase 2
split 1split 2split 3
JobTracker
T1/m
Job Configuration
bypass
Mapper
Mapper
Mapper
bypass
bypass
ReducerCoordinatorState
split 1w2,1
w1,1
w4,1
w5,1
w6,1
w3,1
split 2w1,2
w2,2
w6,2
w3,2
w5,2
w4,2
split 3w2,3
w1,3
w6,3
w4,3
w3,3
w5,3
|wi ,j | > T1/m
|wi ,j | > T1/m
|wi ,j | > T1/m
(i , (split i ,wi ,j))
R
wi
Fiτ+(wi)τ−(wi)τ (wi)
Refine Close phase
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 378: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/378.jpg)
Exact Top-k Wavelet Coefficients: Hadoop Phase 2
split 1split 2split 3
JobTracker
T1/m
Job Configuration
bypass
Mapper
Mapper
Mapper
bypass
bypass
ReducerCoordinatorState
split 1w2,1
w1,1
w4,1
w5,1
w6,1
w3,1
split 2w1,2
w2,2
w6,2
w3,2
w5,2
w4,2
split 3w2,3
w1,3
w6,3
w4,3
w3,3
w5,3
|wi ,j | > T1/m
|wi ,j | > T1/m
|wi ,j | > T1/m
(i , (split i ,wi ,j))
R
wi
Fiτ+(wi)τ−(wi)τ (wi)
Refine Close phase
T2
k ’th largest τ (wi)
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 379: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/379.jpg)
Exact Top-k Wavelet Coefficients: Hadoop Phase 2
split 1split 2split 3
JobTracker
T1/m
Job Configuration
bypass
Mapper
Mapper
Mapper
bypass
bypass
ReducerCoordinatorState
split 1w2,1
w1,1
w4,1
w5,1
w6,1
w3,1
split 2w1,2
w2,2
w6,2
w3,2
w5,2
w4,2
split 3w2,3
w1,3
w6,3
w4,3
w3,3
w5,3
|wi ,j | > T1/m
|wi ,j | > T1/m
|wi ,j | > T1/m
(i , (split i ,wi ,j))
Close phase
• Reducer computes τ ′(wi)and prunes wi from R withτ ′(wi) < T2R
wi
Fiτ+(wi)τ−(wi)τ (wi)
T2
k ’th largest τ (wi)
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 380: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/380.jpg)
Exact Top-k Wavelet Coefficients: Hadoop Phase 2
split 1split 2split 3
JobTracker
T1/m
Job Configuration
bypass
Mapper
Mapper
Mapper
bypass
bypass
ReducerCoordinatorState
split 1w2,1
w1,1
w4,1
w5,1
w6,1
w3,1
split 2w1,2
w2,2
w6,2
w3,2
w5,2
w4,2
split 3w2,3
w1,3
w6,3
w4,3
w3,3
w5,3
|wi ,j | > T1/m
|wi ,j | > T1/m
|wi ,j | > T1/m
(i , (split i ,wi ,j))
Close phase
CoordinatorState
Local or HDFS
R
wi
Fiτ+(wi)τ−(wi)τ (wi)
T2
k ’th largest τ (wi)
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 381: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/381.jpg)
Exact Top-k Wavelet Coefficients: Hadoop Phase 2
split 1split 2split 3
JobTracker
T1/m
Job Configuration
bypass
Mapper
Mapper
Mapper
bypass
bypass
ReducerCoordinatorState
split 1w2,1
w1,1
w4,1
w5,1
w6,1
w3,1
split 2w1,2
w2,2
w6,2
w3,2
w5,2
w4,2
split 3w2,3
w1,3
w6,3
w4,3
w3,3
w5,3
|wi ,j | > T1/m
|wi ,j | > T1/m
|wi ,j | > T1/m
(i , (split i ,wi ,j))
Close phase
i for wi ∈ Rmissing a wi ,j
HDFS
R
wi
Fiτ+(wi)τ−(wi)τ (wi)
T2
k ’th largest τ (wi)
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 382: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/382.jpg)
Exact Top-k Wavelet Coefficients: Hadoop Phase 3
split 1split 2split 3
JobTracker
i ∈ Rmissing wi ,j
Distributed Cache
T1/m
Job Configuration
k = 1
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 383: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/383.jpg)
Exact Top-k Wavelet Coefficients: Hadoop Phase 3
split 1split 2split 3
JobTracker
i ∈ Rmissing wi ,j
Distributed Cache
bypass
bypass
bypass
T1/m
Job Configuration
split 1:saved wi ,j
split 2:saved wi ,j
split 3:saved wi ,j
k = 1
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 384: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/384.jpg)
Exact Top-k Wavelet Coefficients: Hadoop Phase 3
split 1split 2split 3
JobTracker
i ∈ Rmissing wi ,j
Distributed Cache
bypass
bypass
bypass
T1/m
Job Configuration
Mapper
Mapper
Mapper
split 1w2,1
w1,1
w4,1
w5,1
w6,1
w3,1
split 2w1,2
w2,2
w6,2
w3,2
w5,2
w4,2
split 3w2,3
w1,3
w6,3
w4,3
w3,3
w5,3
i ∈ R and|wi ,j | ≤ T1/m
i ∈ R and|wi ,j | ≤ T1/m
i ∈ R and|wi ,j | ≤ T1/m
k = 1
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 385: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/385.jpg)
Exact Top-k Wavelet Coefficients: Hadoop Phase 3
split 1split 2split 3
JobTracker
i ∈ Rmissing wi ,j
Distributed Cache
bypass
bypass
bypass
Reducer
T1/m
Job Configuration
Mapper
Mapper
Mapper
split 1w2,1
w1,1
w4,1
w5,1
w6,1
w3,1
split 2w1,2
w2,2
w6,2
w3,2
w5,2
w4,2
split 3w2,3
w1,3
w6,3
w4,3
w3,3
w5,3
i ∈ R and|wi ,j | ≤ T1/m
i ∈ R and|wi ,j | ≤ T1/m
i ∈ R and|wi ,j | ≤ T1/m
(i , (split i ,wi ,j))
k = 1
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 386: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/386.jpg)
Exact Top-k Wavelet Coefficients: Hadoop Phase 3
split 1split 2split 3
JobTracker
i ∈ Rmissing wi ,j
Distributed Cache
bypass
bypass
bypass
Reducer
T1/m
Job Configuration
CoordinatorState
Mapper
Mapper
Mapper
split 1w2,1
w1,1
w4,1
w5,1
w6,1
w3,1
split 2w1,2
w2,2
w6,2
w3,2
w5,2
w4,2
split 3w2,3
w1,3
w6,3
w4,3
w3,3
w5,3
i ∈ R and|wi ,j | ≤ T1/m
i ∈ R and|wi ,j | ≤ T1/m
i ∈ R and|wi ,j | ≤ T1/m
(i , (split i ,wi ,j))
k = 1
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 387: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/387.jpg)
Exact Top-k Wavelet Coefficients: Hadoop Phase 3
split 1split 2split 3
JobTracker
i ∈ Rmissing wi ,j
Distributed Cache
bypass
bypass
bypass
Reducer
T1/m
Job Configuration
CoordinatorState
Mapper
Mapper
Mapper
split 1w2,1
w1,1
w4,1
w5,1
w6,1
w3,1
split 2w1,2
w2,2
w6,2
w3,2
w5,2
w4,2
split 3w2,3
w1,3
w6,3
w4,3
w3,3
w5,3
i ∈ R and|wi ,j | ≤ T1/m
i ∈ R and|wi ,j | ≤ T1/m
i ∈ R and|wi ,j | ≤ T1/m
(i , (split i ,wi ,j))
R
wi
Fiτ+(wi)τ−(wi)τ (wi)
k = 1
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 388: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/388.jpg)
Exact Top-k Wavelet Coefficients: Hadoop Phase 3
split 1split 2split 3
JobTracker
i ∈ Rmissing wi ,j
Distributed Cache
bypass
bypass
bypass
Reducer
Update
T1/m
Job Configuration
CoordinatorState
Mapper
Mapper
Mapper
split 1w2,1
w1,1
w4,1
w5,1
w6,1
w3,1
split 2w1,2
w2,2
w6,2
w3,2
w5,2
w4,2
split 3w2,3
w1,3
w6,3
w4,3
w3,3
w5,3
i ∈ R and|wi ,j | ≤ T1/m
i ∈ R and|wi ,j | ≤ T1/m
i ∈ R and|wi ,j | ≤ T1/m
(i , (split i ,wi ,j))
R
wi
Fiτ+(wi)τ−(wi)τ (wi)
k = 1
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 389: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/389.jpg)
Exact Top-k Wavelet Coefficients: Hadoop Phase 3
split 1split 2split 3
JobTracker
i ∈ Rmissing wi ,j
Distributed Cache
bypass
bypass
bypass
Reducer
Update
T1/m
Job Configuration
Mapper
Mapper
Mapper
split 1w2,1
w1,1
w4,1
w5,1
w6,1
w3,1
split 2w1,2
w2,2
w6,2
w3,2
w5,2
w4,2
split 3w2,3
w1,3
w6,3
w4,3
w3,3
w5,3
i ∈ R and|wi ,j | ≤ T1/m
i ∈ R and|wi ,j | ≤ T1/m
i ∈ R and|wi ,j | ≤ T1/m
(i , (split i ,wi ,j))
R wi
Exact wi
k = 1
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 390: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/390.jpg)
Exact Top-k Wavelet Coefficients: Hadoop Phase 3
split 1split 2split 3
JobTracker
i ∈ Rmissing wi ,j
Distributed Cache
bypass
bypass
bypass
Reducer
T1/m
Job Configuration
Results
Mapper
Mapper
Mapper
split 1w2,1
w1,1
w4,1
w5,1
w6,1
w3,1
split 2w1,2
w2,2
w6,2
w3,2
w5,2
w4,2
split 3w2,3
w1,3
w6,3
w4,3
w3,3
w5,3
i ∈ R and|wi ,j | ≤ T1/m
i ∈ R and|wi ,j | ≤ T1/m
i ∈ R and|wi ,j | ≤ T1/m
(i , (split i ,wi ,j))
R wi
Exact wi
top-k (i , |wi |)
k = 1
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 391: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/391.jpg)
Approximate Top-k Wavelet Coefficients: Basic RandomSampling
split 1split 2split 3split 4
JobTracker
MapRunner
nj= records in split jsj= split j sample frequency vector
1 RandomizedRecordReader j (RRj) samples nj/ε2n records.
1 RRj randomly selects nj/ε2n offsets in split j .
2 RRj sorts the offsets in ascending order then seeks the record at eachsampled offset.
2 Reducer uses v(x) = s(x)/p, our unbiased estimator for v(x).
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 392: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/392.jpg)
Approximate Top-k Wavelet Coefficients: Basic RandomSampling
split 1split 2split 3split 4
JobTracker
MapRunner
RandomizedRecordReader
(x , null)
nj= records in split jsj= split j sample frequency vector
1 RandomizedRecordReader j (RRj) samples nj/ε2n records.
1 RRj randomly selects nj/ε2n offsets in split j .
2 RRj sorts the offsets in ascending order then seeks the record at eachsampled offset.
2 Reducer uses v(x) = s(x)/p, our unbiased estimator for v(x).
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 393: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/393.jpg)
Approximate Top-k Wavelet Coefficients: Basic RandomSampling
split 1split 2split 3split 4
JobTracker
MapRunner
RandomizedRecordReader
(x , null)
nj= records in split jsj= split j sample frequency vector
1 RandomizedRecordReader j (RRj) samples nj/ε2n records.
1 RRj randomly selects nj/ε2n offsets in split j .
2 RRj sorts the offsets in ascending order then seeks the record at eachsampled offset.
2 Reducer uses v(x) = s(x)/p, our unbiased estimator for v(x).
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 394: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/394.jpg)
Approximate Top-k Wavelet Coefficients: Basic RandomSampling
split 1split 2split 3split 4
JobTracker
MapRunner
RandomizedRecordReader
(x , null)
nj= records in split jsj= split j sample frequency vector
1 RandomizedRecordReader j (RRj) samples nj/ε2n records.
1 RRj randomly selects nj/ε2n offsets in split j .
2 RRj sorts the offsets in ascending order then seeks the record at eachsampled offset.
2 Reducer uses v(x) = s(x)/p, our unbiased estimator for v(x).
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 395: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/395.jpg)
Approximate Top-k Wavelet Coefficients: Basic RandomSampling
split 1split 2split 3split 4
JobTracker
MapRunner
RandomizedRecordReader
(x , null)
(x , null)Mapper
nj= records in split jsj= split j sample frequency vector
1 RandomizedRecordReader j (RRj) samples nj/ε2n records.
1 RRj randomly selects nj/ε2n offsets in split j .
2 RRj sorts the offsets in ascending order then seeks the record at eachsampled offset.
2 Reducer uses v(x) = s(x)/p, our unbiased estimator for v(x).
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 396: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/396.jpg)
Approximate Top-k Wavelet Coefficients: Basic RandomSampling
split 1split 2split 3split 4
JobTracker
MapRunner
RandomizedRecordReader
(x , null)
(x , null)Mapper
(x , 1)
In-MemoryMap
nj= records in split jsj= split j sample frequency vector
1 RandomizedRecordReader j (RRj) samples nj/ε2n records.
1 RRj randomly selects nj/ε2n offsets in split j .
2 RRj sorts the offsets in ascending order then seeks the record at eachsampled offset.
2 Reducer uses v(x) = s(x)/p, our unbiased estimator for v(x).
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 397: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/397.jpg)
Approximate Top-k Wavelet Coefficients: Basic RandomSampling
split 1split 2split 3split 4
JobTracker
MapRunner
RandomizedRecordReader
(x , null)
(x , null)Mapper
In-MemoryMap
close()
nj= records in split jsj= split j sample frequency vector
1 RandomizedRecordReader j (RRj) samples nj/ε2n records.
1 RRj randomly selects nj/ε2n offsets in split j .
2 RRj sorts the offsets in ascending order then seeks the record at eachsampled offset.
2 Reducer uses v(x) = s(x)/p, our unbiased estimator for v(x).
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 398: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/398.jpg)
Approximate Top-k Wavelet Coefficients: Basic RandomSampling
split 1split 2split 3split 4
JobTracker
MapRunner
RandomizedRecordReader
(x , null)
(x , null)Mapper
In-MemoryMap
(x , sj(x))
nj= records in split jsj= split j sample frequency vector
1 RandomizedRecordReader j (RRj) samples nj/ε2n records.
1 RRj randomly selects nj/ε2n offsets in split j .
2 RRj sorts the offsets in ascending order then seeks the record at eachsampled offset.
2 Reducer uses v(x) = s(x)/p, our unbiased estimator for v(x).
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 399: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/399.jpg)
Approximate Top-k Wavelet Coefficients: Basic RandomSampling
split 1split 2split 3split 4
JobTracker
MapRunner
RandomizedRecordReader
(x , null)
(x , null)Mapper
(x , sj(x))p1
In-MemoryMap
(x , sj(x))
nj= records in split jsj= split j sample frequency vector
close()
1 RandomizedRecordReader j (RRj) samples nj/ε2n records.
1 RRj randomly selects nj/ε2n offsets in split j .
2 RRj sorts the offsets in ascending order then seeks the record at eachsampled offset.
2 Reducer uses v(x) = s(x)/p, our unbiased estimator for v(x).
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 400: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/400.jpg)
Approximate Top-k Wavelet Coefficients: Basic RandomSampling
split 1split 2split 3split 4
JobTracker
MapRunner
RandomizedRecordReader
(x , null)
(x , null)Mapper
(x , sj(x))Reducerp1
In-MemoryMap
(x , sj(x))
(x , sj(x))
nj= records in split jsj= split j sample frequency vector
close()
1 RandomizedRecordReader j (RRj) samples nj/ε2n records.
1 RRj randomly selects nj/ε2n offsets in split j .
2 RRj sorts the offsets in ascending order then seeks the record at eachsampled offset.
2 Reducer uses v(x) = s(x)/p, our unbiased estimator for v(x).
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 401: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/401.jpg)
Approximate Top-k Wavelet Coefficients: Basic RandomSampling
split 1split 2split 3split 4
JobTracker
MapRunner
RandomizedRecordReader
(x , null)
(x , null)Mapper
(x , sj(x))Reducerp1
In-MemoryMap
(x , sj(x))
(x , sj(x))
nj= records in split jsj= split j sample frequency vector
close()
(x , s(x))
1 RandomizedRecordReader j (RRj) samples nj/ε2n records.
1 RRj randomly selects nj/ε2n offsets in split j .
2 RRj sorts the offsets in ascending order then seeks the record at eachsampled offset.
2 Reducer uses v(x) = s(x)/p, our unbiased estimator for v(x).
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 402: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/402.jpg)
Approximate Top-k Wavelet Coefficients: Basic RandomSampling
split 1split 2split 3split 4
JobTracker
MapRunner
RandomizedRecordReader
(x , null)
(x , null)Mapper
(x , sj(x))Reducerp1
In-MemoryMap
(x , sj(x))
(x , sj(x))
(x , v(x))
nj= records in split jsj= split j sample frequency vector
close()
(x , s(x))
1 RandomizedRecordReader j (RRj) samples nj/ε2n records.
1 RRj randomly selects nj/ε2n offsets in split j .
2 RRj sorts the offsets in ascending order then seeks the record at eachsampled offset.
2 Reducer uses v(x) = s(x)/p, our unbiased estimator for v(x).
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 403: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/403.jpg)
Approximate Top-k Wavelet Coefficients: Basic RandomSampling
split 1split 2split 3split 4
JobTracker
MapRunner
RandomizedRecordReader
(x , null)
(x , null)Mapper
(x , sj(x))Reducerp1
In-MemoryMap
(x , sj(x))
(x , sj(x))
Centralized Wavelet Top-k
(x , v(x))
nj= records in split jsj= split j sample frequency vector
close()
(x , s(x))
1 RandomizedRecordReader j (RRj) samples nj/ε2n records.
1 RRj randomly selects nj/ε2n offsets in split j .
2 RRj sorts the offsets in ascending order then seeks the record at eachsampled offset.
2 Reducer uses v(x) = s(x)/p, our unbiased estimator for v(x).
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 404: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/404.jpg)
Approximate Top-k Wavelet Coefficients: Basic RandomSampling
split 1split 2split 3split 4
JobTracker
MapRunner
RandomizedRecordReader
(x , null)
(x , null)Mapper
(x , sj(x))Reducerp1
In-MemoryMap
(x , sj(x))
(x , sj(x))
Centralized Wavelet Top-k
(x , v(x))
close()
nj= records in split jsj= split j sample frequency vector
close()
(x , s(x))
1 RandomizedRecordReader j (RRj) samples nj/ε2n records.
1 RRj randomly selects nj/ε2n offsets in split j .
2 RRj sorts the offsets in ascending order then seeks the record at eachsampled offset.
2 Reducer uses v(x) = s(x)/p, our unbiased estimator for v(x).
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 405: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/405.jpg)
Approximate Top-k Wavelet Coefficients: Basic RandomSampling
split 1split 2split 3split 4
JobTracker
MapRunner
RandomizedRecordReader
(x , null)
(x , null)Mapper
(x , sj(x))Reducerp1
In-MemoryMap
(x , sj(x))
(x , sj(x))
Centralized Wavelet Top-k
(x , v(x))
top-k |wi |
nj= records in split jsj= split j sample frequency vector
close()
(x , s(x))
1 RandomizedRecordReader j (RRj) samples nj/ε2n records.
1 RRj randomly selects nj/ε2n offsets in split j .
2 RRj sorts the offsets in ascending order then seeks the record at eachsampled offset.
2 Reducer uses v(x) = s(x)/p, our unbiased estimator for v(x).
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 406: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/406.jpg)
Approximate Top-k Wavelet Coefficients: Basic RandomSampling
split 1split 2split 3split 4
JobTracker
MapRunner
RandomizedRecordReader
(x , null)
(x , null)Mapper
(x , sj(x))Reducerp1
In-MemoryMap
(x , sj(x))
(x , sj(x))
Centralized Wavelet Top-k
(x , v(x))
top-k |wi |
o1top-k |wi |
nj= records in split jsj= split j sample frequency vector
close()
(x , s(x))
1 RandomizedRecordReader j (RRj) samples nj/ε2n records.
1 RRj randomly selects nj/ε2n offsets in split j .
2 RRj sorts the offsets in ascending order then seeks the record at eachsampled offset.
2 Reducer uses v(x) = s(x)/p, our unbiased estimator for v(x).
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 407: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/407.jpg)
Approximate Top-k Wavelet Coefficients: Two-LevelSampling
Theorem
s(x) is an unbiased estimator of s(x).
Proof.
1 Our estimator is s(x) = ρ(x) + M/ε√
m.
Assume in the first m′ splits sj(x) < 1/(ε√
m).
Let Xj = 1 if x is sampled in split j and 0 otherwise.2 E[Xj ] = ε
√m · sj(x).
Let M =∑m′
j=1 Xj .
3 E[M] =∑m′
j=1 ε√m · sj(x) = ε
√m(s(x)− ρ(x)).
4 E[s(x)] = E[ρ(x) + M/ε√
m] = ρ(x) + (s(x)− ρ(x)) = s(x).
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 408: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/408.jpg)
Approximate Top-k Wavelet Coefficients: Two-LevelSampling
Theorem
s(x) is an unbiased estimator of s(x).
Proof.
1 Our estimator is s(x) = ρ(x) + M/ε√
m.
Assume in the first m′ splits sj(x) < 1/(ε√
m).
Let Xj = 1 if x is sampled in split j and 0 otherwise.2 E[Xj ] = ε
√m · sj(x).
Let M =∑m′
j=1 Xj .
3 E[M] =∑m′
j=1 ε√m · sj(x) = ε
√m(s(x)− ρ(x)).
4 E[s(x)] = E[ρ(x) + M/ε√
m] = ρ(x) + (s(x)− ρ(x)) = s(x).
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 409: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/409.jpg)
Approximate Top-k Wavelet Coefficients: Two-LevelSampling
Theorem
s(x) is an unbiased estimator of s(x).
Proof.
1 Our estimator is s(x) = ρ(x) + M/ε√
m.
Assume in the first m′ splits sj(x) < 1/(ε√
m).
Let Xj = 1 if x is sampled in split j and 0 otherwise.2 E[Xj ] = ε
√m · sj(x).
Let M =∑m′
j=1 Xj .
3 E[M] =∑m′
j=1 ε√m · sj(x) = ε
√m(s(x)− ρ(x)).
4 E[s(x)] = E[ρ(x) + M/ε√
m] = ρ(x) + (s(x)− ρ(x)) = s(x).
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 410: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/410.jpg)
Approximate Top-k Wavelet Coefficients: Two-LevelSampling
Theorem
s(x) is an unbiased estimator of s(x).
Proof.
1 Our estimator is s(x) = ρ(x) + M/ε√
m.
Assume in the first m′ splits sj(x) < 1/(ε√
m).
Let Xj = 1 if x is sampled in split j and 0 otherwise.
2 E[Xj ] = ε√m · sj(x).
Let M =∑m′
j=1 Xj .
3 E[M] =∑m′
j=1 ε√m · sj(x) = ε
√m(s(x)− ρ(x)).
4 E[s(x)] = E[ρ(x) + M/ε√
m] = ρ(x) + (s(x)− ρ(x)) = s(x).
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 411: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/411.jpg)
Approximate Top-k Wavelet Coefficients: Two-LevelSampling
Theorem
s(x) is an unbiased estimator of s(x).
Proof.
1 Our estimator is s(x) = ρ(x) + M/ε√
m.
Assume in the first m′ splits sj(x) < 1/(ε√
m).
Let Xj = 1 if x is sampled in split j and 0 otherwise.2 E[Xj ] = ε
√m · sj(x).
Let M =∑m′
j=1 Xj .
3 E[M] =∑m′
j=1 ε√m · sj(x) = ε
√m(s(x)− ρ(x)).
4 E[s(x)] = E[ρ(x) + M/ε√
m] = ρ(x) + (s(x)− ρ(x)) = s(x).
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 412: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/412.jpg)
Approximate Top-k Wavelet Coefficients: Two-LevelSampling
Theorem
s(x) is an unbiased estimator of s(x).
Proof.
1 Our estimator is s(x) = ρ(x) + M/ε√
m.
Assume in the first m′ splits sj(x) < 1/(ε√
m).
Let Xj = 1 if x is sampled in split j and 0 otherwise.2 E[Xj ] = ε
√m · sj(x).
Let M =∑m′
j=1 Xj .
3 E[M] =∑m′
j=1 ε√m · sj(x) = ε
√m(s(x)− ρ(x)).
4 E[s(x)] = E[ρ(x) + M/ε√
m] = ρ(x) + (s(x)− ρ(x)) = s(x).
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 413: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/413.jpg)
Approximate Top-k Wavelet Coefficients: Two-LevelSampling
Theorem
s(x) is an unbiased estimator of s(x).
Proof.
1 Our estimator is s(x) = ρ(x) + M/ε√
m.
Assume in the first m′ splits sj(x) < 1/(ε√
m).
Let Xj = 1 if x is sampled in split j and 0 otherwise.2 E[Xj ] = ε
√m · sj(x).
Let M =∑m′
j=1 Xj .
3 E[M] =∑m′
j=1 ε√m · sj(x)
= ε√m(s(x)− ρ(x)).
4 E[s(x)] = E[ρ(x) + M/ε√
m] = ρ(x) + (s(x)− ρ(x)) = s(x).
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 414: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/414.jpg)
Approximate Top-k Wavelet Coefficients: Two-LevelSampling
Theorem
s(x) is an unbiased estimator of s(x).
Proof.
1 Our estimator is s(x) = ρ(x) + M/ε√
m.
Assume in the first m′ splits sj(x) < 1/(ε√
m).
Let Xj = 1 if x is sampled in split j and 0 otherwise.2 E[Xj ] = ε
√m · sj(x).
Let M =∑m′
j=1 Xj .
3 E[M] =∑m′
j=1 ε√m · sj(x) = ε
√m(s(x)− ρ(x)).
4 E[s(x)] = E[ρ(x) + M/ε√
m] = ρ(x) + (s(x)− ρ(x)) = s(x).
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 415: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/415.jpg)
Approximate Top-k Wavelet Coefficients: Two-LevelSampling
Theorem
s(x) is an unbiased estimator of s(x).
Proof.
1 Our estimator is s(x) = ρ(x) + M/ε√
m.
Assume in the first m′ splits sj(x) < 1/(ε√
m).
Let Xj = 1 if x is sampled in split j and 0 otherwise.2 E[Xj ] = ε
√m · sj(x).
Let M =∑m′
j=1 Xj .
3 E[M] =∑m′
j=1 ε√m · sj(x) = ε
√m(s(x)− ρ(x)).
4 E[s(x)] = E[ρ(x) + M/ε√
m]
= ρ(x) + (s(x)− ρ(x)) = s(x).
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 416: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/416.jpg)
Approximate Top-k Wavelet Coefficients: Two-LevelSampling
Theorem
s(x) is an unbiased estimator of s(x).
Proof.
1 Our estimator is s(x) = ρ(x) + M/ε√
m.
Assume in the first m′ splits sj(x) < 1/(ε√
m).
Let Xj = 1 if x is sampled in split j and 0 otherwise.2 E[Xj ] = ε
√m · sj(x).
Let M =∑m′
j=1 Xj .
3 E[M] =∑m′
j=1 ε√m · sj(x) = ε
√m(s(x)− ρ(x)).
4 E[s(x)] = E[ρ(x) + M/ε√
m] = ρ(x) + (s(x)− ρ(x))
= s(x).
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 417: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/417.jpg)
Approximate Top-k Wavelet Coefficients: Two-LevelSampling
Theorem
s(x) is an unbiased estimator of s(x).
Proof.
1 Our estimator is s(x) = ρ(x) + M/ε√
m.
Assume in the first m′ splits sj(x) < 1/(ε√
m).
Let Xj = 1 if x is sampled in split j and 0 otherwise.2 E[Xj ] = ε
√m · sj(x).
Let M =∑m′
j=1 Xj .
3 E[M] =∑m′
j=1 ε√m · sj(x) = ε
√m(s(x)− ρ(x)).
4 E[s(x)] = E[ρ(x) + M/ε√
m] = ρ(x) + (s(x)− ρ(x)) = s(x).
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 418: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/418.jpg)
Approximate Top-k Wavelet Coefficients: Two-LevelSampling
Theorem
s(x) is an estimator of s(x) with standard deviation at most 1/ε.
Proof.
1 Our estimator is s(x) = ρ(x) + M/ε√
m.
Assume in the first m′ splits sj(x) < 1/(ε√
m).
Let Xj = 1 if x is sampled in split j and 0 otherwise.2 Var[Xj ] = ε
√m · sj(x)(1− ε
√m · sj(x)) ≤ ε
√m · sj(x).
Let M =∑m′
j=1 Xj .
3 Var[M] ≤∑m′
j=1 Var[Xj ] ≤∑m′
j=1 ε√m · sj(x) ≤ m′ · ε
√m · 1/(ε
√m)
= m′.
4 Var [s(x)] = Var[M/ε√
m] = Var[M]/ε2m ≤ m′/ε2m ≤ 1/ε2 ≤ 1/ε.
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 419: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/419.jpg)
Approximate Top-k Wavelet Coefficients: Two-LevelSampling
Theorem
s(x) is an estimator of s(x) with standard deviation at most 1/ε.
Proof.
1 Our estimator is s(x) = ρ(x) + M/ε√
m.
Assume in the first m′ splits sj(x) < 1/(ε√
m).
Let Xj = 1 if x is sampled in split j and 0 otherwise.2 Var[Xj ] = ε
√m · sj(x)(1− ε
√m · sj(x)) ≤ ε
√m · sj(x).
Let M =∑m′
j=1 Xj .
3 Var[M] ≤∑m′
j=1 Var[Xj ] ≤∑m′
j=1 ε√m · sj(x) ≤ m′ · ε
√m · 1/(ε
√m)
= m′.
4 Var [s(x)] = Var[M/ε√
m] = Var[M]/ε2m ≤ m′/ε2m ≤ 1/ε2 ≤ 1/ε.
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 420: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/420.jpg)
Approximate Top-k Wavelet Coefficients: Two-LevelSampling
Theorem
s(x) is an estimator of s(x) with standard deviation at most 1/ε.
Proof.
1 Our estimator is s(x) = ρ(x) + M/ε√
m.
Assume in the first m′ splits sj(x) < 1/(ε√
m).
Let Xj = 1 if x is sampled in split j and 0 otherwise.2 Var[Xj ] = ε
√m · sj(x)(1− ε
√m · sj(x)) ≤ ε
√m · sj(x).
Let M =∑m′
j=1 Xj .
3 Var[M] ≤∑m′
j=1 Var[Xj ] ≤∑m′
j=1 ε√m · sj(x) ≤ m′ · ε
√m · 1/(ε
√m)
= m′.
4 Var [s(x)] = Var[M/ε√
m] = Var[M]/ε2m ≤ m′/ε2m ≤ 1/ε2 ≤ 1/ε.
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 421: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/421.jpg)
Approximate Top-k Wavelet Coefficients: Two-LevelSampling
Theorem
s(x) is an estimator of s(x) with standard deviation at most 1/ε.
Proof.
1 Our estimator is s(x) = ρ(x) + M/ε√
m.
Assume in the first m′ splits sj(x) < 1/(ε√
m).
Let Xj = 1 if x is sampled in split j and 0 otherwise.
2 Var[Xj ] = ε√m · sj(x)(1− ε
√m · sj(x)) ≤ ε
√m · sj(x).
Let M =∑m′
j=1 Xj .
3 Var[M] ≤∑m′
j=1 Var[Xj ] ≤∑m′
j=1 ε√m · sj(x) ≤ m′ · ε
√m · 1/(ε
√m)
= m′.
4 Var [s(x)] = Var[M/ε√
m] = Var[M]/ε2m ≤ m′/ε2m ≤ 1/ε2 ≤ 1/ε.
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 422: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/422.jpg)
Approximate Top-k Wavelet Coefficients: Two-LevelSampling
Theorem
s(x) is an estimator of s(x) with standard deviation at most 1/ε.
Proof.
1 Our estimator is s(x) = ρ(x) + M/ε√
m.
Assume in the first m′ splits sj(x) < 1/(ε√
m).
Let Xj = 1 if x is sampled in split j and 0 otherwise.2 Var[Xj ] = ε
√m · sj(x)(1− ε
√m · sj(x))
≤ ε√m · sj(x).
Let M =∑m′
j=1 Xj .
3 Var[M] ≤∑m′
j=1 Var[Xj ] ≤∑m′
j=1 ε√m · sj(x) ≤ m′ · ε
√m · 1/(ε
√m)
= m′.
4 Var [s(x)] = Var[M/ε√
m] = Var[M]/ε2m ≤ m′/ε2m ≤ 1/ε2 ≤ 1/ε.
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 423: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/423.jpg)
Approximate Top-k Wavelet Coefficients: Two-LevelSampling
Theorem
s(x) is an estimator of s(x) with standard deviation at most 1/ε.
Proof.
1 Our estimator is s(x) = ρ(x) + M/ε√
m.
Assume in the first m′ splits sj(x) < 1/(ε√
m).
Let Xj = 1 if x is sampled in split j and 0 otherwise.2 Var[Xj ] = ε
√m · sj(x)(1− ε
√m · sj(x)) ≤ ε
√m · sj(x).
Let M =∑m′
j=1 Xj .
3 Var[M] ≤∑m′
j=1 Var[Xj ] ≤∑m′
j=1 ε√m · sj(x) ≤ m′ · ε
√m · 1/(ε
√m)
= m′.
4 Var [s(x)] = Var[M/ε√
m] = Var[M]/ε2m ≤ m′/ε2m ≤ 1/ε2 ≤ 1/ε.
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 424: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/424.jpg)
Approximate Top-k Wavelet Coefficients: Two-LevelSampling
Theorem
s(x) is an estimator of s(x) with standard deviation at most 1/ε.
Proof.
1 Our estimator is s(x) = ρ(x) + M/ε√
m.
Assume in the first m′ splits sj(x) < 1/(ε√
m).
Let Xj = 1 if x is sampled in split j and 0 otherwise.2 Var[Xj ] = ε
√m · sj(x)(1− ε
√m · sj(x)) ≤ ε
√m · sj(x).
Let M =∑m′
j=1 Xj .
3 Var[M] ≤∑m′
j=1 Var[Xj ] ≤∑m′
j=1 ε√m · sj(x) ≤ m′ · ε
√m · 1/(ε
√m)
= m′.
4 Var [s(x)] = Var[M/ε√
m] = Var[M]/ε2m ≤ m′/ε2m ≤ 1/ε2 ≤ 1/ε.
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 425: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/425.jpg)
Approximate Top-k Wavelet Coefficients: Two-LevelSampling
Theorem
s(x) is an estimator of s(x) with standard deviation at most 1/ε.
Proof.
1 Our estimator is s(x) = ρ(x) + M/ε√
m.
Assume in the first m′ splits sj(x) < 1/(ε√
m).
Let Xj = 1 if x is sampled in split j and 0 otherwise.2 Var[Xj ] = ε
√m · sj(x)(1− ε
√m · sj(x)) ≤ ε
√m · sj(x).
Let M =∑m′
j=1 Xj .
3 Var[M] ≤∑m′
j=1 Var[Xj ]
≤∑m′
j=1 ε√m · sj(x) ≤ m′ · ε
√m · 1/(ε
√m)
= m′.
4 Var [s(x)] = Var[M/ε√
m] = Var[M]/ε2m ≤ m′/ε2m ≤ 1/ε2 ≤ 1/ε.
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 426: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/426.jpg)
Approximate Top-k Wavelet Coefficients: Two-LevelSampling
Theorem
s(x) is an estimator of s(x) with standard deviation at most 1/ε.
Proof.
1 Our estimator is s(x) = ρ(x) + M/ε√
m.
Assume in the first m′ splits sj(x) < 1/(ε√
m).
Let Xj = 1 if x is sampled in split j and 0 otherwise.2 Var[Xj ] = ε
√m · sj(x)(1− ε
√m · sj(x)) ≤ ε
√m · sj(x).
Let M =∑m′
j=1 Xj .
3 Var[M] ≤∑m′
j=1 Var[Xj ] ≤∑m′
j=1 ε√m · sj(x)
≤ m′ · ε√m · 1/(ε
√m)
= m′.
4 Var [s(x)] = Var[M/ε√
m] = Var[M]/ε2m ≤ m′/ε2m ≤ 1/ε2 ≤ 1/ε.
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 427: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/427.jpg)
Approximate Top-k Wavelet Coefficients: Two-LevelSampling
Theorem
s(x) is an estimator of s(x) with standard deviation at most 1/ε.
Proof.
1 Our estimator is s(x) = ρ(x) + M/ε√
m.
Assume in the first m′ splits sj(x) < 1/(ε√
m).
Let Xj = 1 if x is sampled in split j and 0 otherwise.2 Var[Xj ] = ε
√m · sj(x)(1− ε
√m · sj(x)) ≤ ε
√m · sj(x).
Let M =∑m′
j=1 Xj .
3 Var[M] ≤∑m′
j=1 Var[Xj ] ≤∑m′
j=1 ε√m · sj(x) ≤ m′ · ε
√m · 1/(ε
√m)
= m′.
4 Var [s(x)] = Var[M/ε√
m] = Var[M]/ε2m ≤ m′/ε2m ≤ 1/ε2 ≤ 1/ε.
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 428: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/428.jpg)
Approximate Top-k Wavelet Coefficients: Two-LevelSampling
Theorem
s(x) is an estimator of s(x) with standard deviation at most 1/ε.
Proof.
1 Our estimator is s(x) = ρ(x) + M/ε√
m.
Assume in the first m′ splits sj(x) < 1/(ε√
m).
Let Xj = 1 if x is sampled in split j and 0 otherwise.2 Var[Xj ] = ε
√m · sj(x)(1− ε
√m · sj(x)) ≤ ε
√m · sj(x).
Let M =∑m′
j=1 Xj .
3 Var[M] ≤∑m′
j=1 Var[Xj ] ≤∑m′
j=1 ε√m · sj(x) ≤ m′ · ε
√m · 1/(ε
√m)
= m′.
4 Var [s(x)] = Var[M/ε√
m] = Var[M]/ε2m ≤ m′/ε2m ≤ 1/ε2 ≤ 1/ε.
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 429: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/429.jpg)
Approximate Top-k Wavelet Coefficients: Two-LevelSampling
Theorem
s(x) is an estimator of s(x) with standard deviation at most 1/ε.
Proof.
1 Our estimator is s(x) = ρ(x) + M/ε√
m.
Assume in the first m′ splits sj(x) < 1/(ε√
m).
Let Xj = 1 if x is sampled in split j and 0 otherwise.2 Var[Xj ] = ε
√m · sj(x)(1− ε
√m · sj(x)) ≤ ε
√m · sj(x).
Let M =∑m′
j=1 Xj .
3 Var[M] ≤∑m′
j=1 Var[Xj ] ≤∑m′
j=1 ε√m · sj(x) ≤ m′ · ε
√m · 1/(ε
√m)
= m′.
4 Var [s(x)] = Var[M/ε√
m]
= Var[M]/ε2m ≤ m′/ε2m ≤ 1/ε2 ≤ 1/ε.
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 430: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/430.jpg)
Approximate Top-k Wavelet Coefficients: Two-LevelSampling
Theorem
s(x) is an estimator of s(x) with standard deviation at most 1/ε.
Proof.
1 Our estimator is s(x) = ρ(x) + M/ε√
m.
Assume in the first m′ splits sj(x) < 1/(ε√
m).
Let Xj = 1 if x is sampled in split j and 0 otherwise.2 Var[Xj ] = ε
√m · sj(x)(1− ε
√m · sj(x)) ≤ ε
√m · sj(x).
Let M =∑m′
j=1 Xj .
3 Var[M] ≤∑m′
j=1 Var[Xj ] ≤∑m′
j=1 ε√m · sj(x) ≤ m′ · ε
√m · 1/(ε
√m)
= m′.
4 Var [s(x)] = Var[M/ε√
m] = Var[M]/ε2m
≤ m′/ε2m ≤ 1/ε2 ≤ 1/ε.
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 431: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/431.jpg)
Approximate Top-k Wavelet Coefficients: Two-LevelSampling
Theorem
s(x) is an estimator of s(x) with standard deviation at most 1/ε.
Proof.
1 Our estimator is s(x) = ρ(x) + M/ε√
m.
Assume in the first m′ splits sj(x) < 1/(ε√
m).
Let Xj = 1 if x is sampled in split j and 0 otherwise.2 Var[Xj ] = ε
√m · sj(x)(1− ε
√m · sj(x)) ≤ ε
√m · sj(x).
Let M =∑m′
j=1 Xj .
3 Var[M] ≤∑m′
j=1 Var[Xj ] ≤∑m′
j=1 ε√m · sj(x) ≤ m′ · ε
√m · 1/(ε
√m)
= m′.
4 Var [s(x)] = Var[M/ε√
m] = Var[M]/ε2m ≤ m′/ε2m
≤ 1/ε2 ≤ 1/ε.
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 432: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/432.jpg)
Approximate Top-k Wavelet Coefficients: Two-LevelSampling
Theorem
s(x) is an estimator of s(x) with standard deviation at most 1/ε.
Proof.
1 Our estimator is s(x) = ρ(x) + M/ε√
m.
Assume in the first m′ splits sj(x) < 1/(ε√
m).
Let Xj = 1 if x is sampled in split j and 0 otherwise.2 Var[Xj ] = ε
√m · sj(x)(1− ε
√m · sj(x)) ≤ ε
√m · sj(x).
Let M =∑m′
j=1 Xj .
3 Var[M] ≤∑m′
j=1 Var[Xj ] ≤∑m′
j=1 ε√m · sj(x) ≤ m′ · ε
√m · 1/(ε
√m)
= m′.
4 Var [s(x)] = Var[M/ε√
m] = Var[M]/ε2m ≤ m′/ε2m ≤ 1/ε2
≤ 1/ε.
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 433: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/433.jpg)
Approximate Top-k Wavelet Coefficients: Two-LevelSampling
Theorem
s(x) is an estimator of s(x) with standard deviation at most 1/ε.
Proof.
1 Our estimator is s(x) = ρ(x) + M/ε√
m.
Assume in the first m′ splits sj(x) < 1/(ε√
m).
Let Xj = 1 if x is sampled in split j and 0 otherwise.2 Var[Xj ] = ε
√m · sj(x)(1− ε
√m · sj(x)) ≤ ε
√m · sj(x).
Let M =∑m′
j=1 Xj .
3 Var[M] ≤∑m′
j=1 Var[Xj ] ≤∑m′
j=1 ε√m · sj(x) ≤ m′ · ε
√m · 1/(ε
√m)
= m′.
4 Var [s(x)] = Var[M/ε√
m] = Var[M]/ε2m ≤ m′/ε2m ≤ 1/ε2 ≤ 1/ε.
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 434: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/434.jpg)
Approximate Top-k Wavelet Coefficients: Two-LevelSampling
Theorem
The expected total communication cost of our two-level samplingalgorithm is O(
√m/ε).
Proof.
1 Our estimator is s(x) = ρ(x) + M/ε√
m.
The first-level sample size is pn = 1/ε2.
If sj(x) ≥ 1/(ε√
m) we emit (x , sj(x)).2 There are ≤ (1/ε2)/(1/ε
√m) =
√m/ε such keys.
If sj(x) < 1/(ε√
m),we emit (x , null) with probability ε
√m · sjx .
3 On expectation there are,∑j
∑x ε√m · sj(x) ≤ ε
√m · 1/ε2 =
√m/ε.
By (2) and (3), the total number of emitted keys is O(√
m/ε).
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 435: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/435.jpg)
Approximate Top-k Wavelet Coefficients: Two-LevelSampling
Theorem
The expected total communication cost of our two-level samplingalgorithm is O(
√m/ε).
Proof.
1 Our estimator is s(x) = ρ(x) + M/ε√
m.
The first-level sample size is pn = 1/ε2.
If sj(x) ≥ 1/(ε√
m) we emit (x , sj(x)).2 There are ≤ (1/ε2)/(1/ε
√m) =
√m/ε such keys.
If sj(x) < 1/(ε√
m),we emit (x , null) with probability ε
√m · sjx .
3 On expectation there are,∑j
∑x ε√m · sj(x) ≤ ε
√m · 1/ε2 =
√m/ε.
By (2) and (3), the total number of emitted keys is O(√
m/ε).
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 436: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/436.jpg)
Approximate Top-k Wavelet Coefficients: Two-LevelSampling
Theorem
The expected total communication cost of our two-level samplingalgorithm is O(
√m/ε).
Proof.
1 Our estimator is s(x) = ρ(x) + M/ε√
m.
The first-level sample size is pn = 1/ε2.
If sj(x) ≥ 1/(ε√
m) we emit (x , sj(x)).2 There are ≤ (1/ε2)/(1/ε
√m) =
√m/ε such keys.
If sj(x) < 1/(ε√
m),we emit (x , null) with probability ε
√m · sjx .
3 On expectation there are,∑j
∑x ε√m · sj(x) ≤ ε
√m · 1/ε2 =
√m/ε.
By (2) and (3), the total number of emitted keys is O(√
m/ε).
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 437: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/437.jpg)
Approximate Top-k Wavelet Coefficients: Two-LevelSampling
Theorem
The expected total communication cost of our two-level samplingalgorithm is O(
√m/ε).
Proof.
1 Our estimator is s(x) = ρ(x) + M/ε√
m.
The first-level sample size is pn = 1/ε2.
If sj(x) ≥ 1/(ε√
m) we emit (x , sj(x)).
2 There are ≤ (1/ε2)/(1/ε√m) =
√m/ε such keys.
If sj(x) < 1/(ε√
m),we emit (x , null) with probability ε
√m · sjx .
3 On expectation there are,∑j
∑x ε√m · sj(x) ≤ ε
√m · 1/ε2 =
√m/ε.
By (2) and (3), the total number of emitted keys is O(√
m/ε).
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 438: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/438.jpg)
Approximate Top-k Wavelet Coefficients: Two-LevelSampling
Theorem
The expected total communication cost of our two-level samplingalgorithm is O(
√m/ε).
Proof.
1 Our estimator is s(x) = ρ(x) + M/ε√
m.
The first-level sample size is pn = 1/ε2.
If sj(x) ≥ 1/(ε√
m) we emit (x , sj(x)).2 There are ≤ (1/ε2)/(1/ε
√m) =
√m/ε such keys.
If sj(x) < 1/(ε√
m),we emit (x , null) with probability ε
√m · sjx .
3 On expectation there are,∑j
∑x ε√m · sj(x) ≤ ε
√m · 1/ε2 =
√m/ε.
By (2) and (3), the total number of emitted keys is O(√
m/ε).
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 439: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/439.jpg)
Approximate Top-k Wavelet Coefficients: Two-LevelSampling
Theorem
The expected total communication cost of our two-level samplingalgorithm is O(
√m/ε).
Proof.
1 Our estimator is s(x) = ρ(x) + M/ε√
m.
The first-level sample size is pn = 1/ε2.
If sj(x) ≥ 1/(ε√
m) we emit (x , sj(x)).2 There are ≤ (1/ε2)/(1/ε
√m) =
√m/ε such keys.
If sj(x) < 1/(ε√
m),we emit (x , null) with probability ε
√m · sjx .
3 On expectation there are,∑j
∑x ε√m · sj(x) ≤ ε
√m · 1/ε2 =
√m/ε.
By (2) and (3), the total number of emitted keys is O(√
m/ε).
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 440: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/440.jpg)
Approximate Top-k Wavelet Coefficients: Two-LevelSampling
Theorem
The expected total communication cost of our two-level samplingalgorithm is O(
√m/ε).
Proof.
1 Our estimator is s(x) = ρ(x) + M/ε√
m.
The first-level sample size is pn = 1/ε2.
If sj(x) ≥ 1/(ε√
m) we emit (x , sj(x)).2 There are ≤ (1/ε2)/(1/ε
√m) =
√m/ε such keys.
If sj(x) < 1/(ε√
m),we emit (x , null) with probability ε
√m · sjx .
3 On expectation there are,∑j
∑x ε√m · sj(x) ≤ ε
√m · 1/ε2
=√m/ε.
By (2) and (3), the total number of emitted keys is O(√
m/ε).
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 441: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/441.jpg)
Approximate Top-k Wavelet Coefficients: Two-LevelSampling
Theorem
The expected total communication cost of our two-level samplingalgorithm is O(
√m/ε).
Proof.
1 Our estimator is s(x) = ρ(x) + M/ε√
m.
The first-level sample size is pn = 1/ε2.
If sj(x) ≥ 1/(ε√
m) we emit (x , sj(x)).2 There are ≤ (1/ε2)/(1/ε
√m) =
√m/ε such keys.
If sj(x) < 1/(ε√
m),we emit (x , null) with probability ε
√m · sjx .
3 On expectation there are,∑j
∑x ε√m · sj(x) ≤ ε
√m · 1/ε2 =
√m/ε.
By (2) and (3), the total number of emitted keys is O(√
m/ε).
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 442: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/442.jpg)
Approximate Top-k Wavelet Coefficients: Two-LevelSampling
Theorem
The expected total communication cost of our two-level samplingalgorithm is O(
√m/ε).
Proof.
1 Our estimator is s(x) = ρ(x) + M/ε√
m.
The first-level sample size is pn = 1/ε2.
If sj(x) ≥ 1/(ε√
m) we emit (x , sj(x)).2 There are ≤ (1/ε2)/(1/ε
√m) =
√m/ε such keys.
If sj(x) < 1/(ε√
m),we emit (x , null) with probability ε
√m · sjx .
3 On expectation there are,∑j
∑x ε√m · sj(x) ≤ ε
√m · 1/ε2 =
√m/ε.
By (2) and (3), the total number of emitted keys is O(√
m/ε).
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 443: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/443.jpg)
Approximate Top-k Wavelet Coefficients: ImprovedSampling
split 1split 2split 3split 4
JobTracker
MapRunner
nj= records in split jsj= split j sample frequency vector
1 RandomizedRecordReader j samples tj = nj/ε2n records.
2 If sj(x) > εtj , the Mapper emits (x , sj(x)).
3 Reducer uses v(x) = s(x)/p, our estimator for v(x).
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 444: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/444.jpg)
Approximate Top-k Wavelet Coefficients: ImprovedSampling
split 1split 2split 3split 4
JobTracker
MapRunner
RandomizedRecordReader
(x , null)
nj= records in split jsj= split j sample frequency vector
1 RandomizedRecordReader j samples tj = nj/ε2n records.
2 If sj(x) > εtj , the Mapper emits (x , sj(x)).
3 Reducer uses v(x) = s(x)/p, our estimator for v(x).
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 445: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/445.jpg)
Approximate Top-k Wavelet Coefficients: ImprovedSampling
split 1split 2split 3split 4
JobTracker
MapRunner
RandomizedRecordReader
(x , null)
(x , null)Mapper
nj= records in split jsj= split j sample frequency vector
1 RandomizedRecordReader j samples tj = nj/ε2n records.
2 If sj(x) > εtj , the Mapper emits (x , sj(x)).
3 Reducer uses v(x) = s(x)/p, our estimator for v(x).
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 446: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/446.jpg)
Approximate Top-k Wavelet Coefficients: ImprovedSampling
split 1split 2split 3split 4
JobTracker
MapRunner
RandomizedRecordReader
(x , null)
(x , null)Mapper
(x , 1)
In-MemoryMap
nj= records in split jsj= split j sample frequency vector
1 RandomizedRecordReader j samples tj = nj/ε2n records.
2 If sj(x) > εtj , the Mapper emits (x , sj(x)).
3 Reducer uses v(x) = s(x)/p, our estimator for v(x).
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 447: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/447.jpg)
Approximate Top-k Wavelet Coefficients: ImprovedSampling
split 1split 2split 3split 4
JobTracker
MapRunner
RandomizedRecordReader
(x , null)
(x , null)Mapper
In-MemoryMap
close()
nj= records in split jsj= split j sample frequency vector
1 RandomizedRecordReader j samples tj = nj/ε2n records.
2 If sj(x) > εtj , the Mapper emits (x , sj(x)).
3 Reducer uses v(x) = s(x)/p, our estimator for v(x).
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 448: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/448.jpg)
Approximate Top-k Wavelet Coefficients: ImprovedSampling
split 1split 2split 3split 4
JobTracker
MapRunner
RandomizedRecordReader
(x , null)
(x , null)Mapper
In-MemoryMap
(x , sj(x))
nj= records in split jsj= split j sample frequency vector
1 RandomizedRecordReader j samples tj = nj/ε2n records.
2 If sj(x) > εtj , the Mapper emits (x , sj(x)).
3 Reducer uses v(x) = s(x)/p, our estimator for v(x).
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 449: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/449.jpg)
Approximate Top-k Wavelet Coefficients: ImprovedSampling
split 1split 2split 3split 4
JobTracker
MapRunner
RandomizedRecordReader
(x , null)
(x , null)Mapper
(x , sj(x))p1
In-MemoryMap
(x , sj(x))
nj= records in split jsj= split j sample frequency vector
close()
1 RandomizedRecordReader j samples tj = nj/ε2n records.
2 If sj(x) > εtj , the Mapper emits (x , sj(x)).
3 Reducer uses v(x) = s(x)/p, our estimator for v(x).
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 450: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/450.jpg)
Approximate Top-k Wavelet Coefficients: ImprovedSampling
split 1split 2split 3split 4
JobTracker
MapRunner
RandomizedRecordReader
(x , null)
(x , null)Mapper
(x , sj(x))Reducerp1
In-MemoryMap
(x , sj(x))
(x , sj(x))
nj= records in split jsj= split j sample frequency vector
close()
1 RandomizedRecordReader j samples tj = nj/ε2n records.
2 If sj(x) > εtj , the Mapper emits (x , sj(x)).
3 Reducer uses v(x) = s(x)/p, our estimator for v(x).
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 451: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/451.jpg)
Approximate Top-k Wavelet Coefficients: ImprovedSampling
split 1split 2split 3split 4
JobTracker
MapRunner
RandomizedRecordReader
(x , null)
(x , null)Mapper
(x , sj(x))Reducerp1
In-MemoryMap
(x , sj(x))
(x , sj(x))
nj= records in split jsj= split j sample frequency vector
close()
(x , s(x))
1 RandomizedRecordReader j samples tj = nj/ε2n records.
2 If sj(x) > εtj , the Mapper emits (x , sj(x)).
3 Reducer uses v(x) = s(x)/p, our estimator for v(x).
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 452: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/452.jpg)
Approximate Top-k Wavelet Coefficients: ImprovedSampling
split 1split 2split 3split 4
JobTracker
MapRunner
RandomizedRecordReader
(x , null)
(x , null)Mapper
(x , sj(x))Reducerp1
In-MemoryMap
(x , sj(x))
(x , sj(x))
(x , v(x))
nj= records in split jsj= split j sample frequency vector
close()
(x , s(x))
1 RandomizedRecordReader j samples tj = nj/ε2n records.
2 If sj(x) > εtj , the Mapper emits (x , sj(x)).
3 Reducer uses v(x) = s(x)/p, our estimator for v(x).
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 453: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/453.jpg)
Approximate Top-k Wavelet Coefficients: ImprovedSampling
split 1split 2split 3split 4
JobTracker
MapRunner
RandomizedRecordReader
(x , null)
(x , null)Mapper
(x , sj(x))Reducerp1
In-MemoryMap
(x , sj(x))
(x , sj(x))
Centralized Wavelet Top-k
(x , v(x))
nj= records in split jsj= split j sample frequency vector
close()
(x , s(x))
1 RandomizedRecordReader j samples tj = nj/ε2n records.
2 If sj(x) > εtj , the Mapper emits (x , sj(x)).
3 Reducer uses v(x) = s(x)/p, our estimator for v(x).
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 454: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/454.jpg)
Approximate Top-k Wavelet Coefficients: ImprovedSampling
split 1split 2split 3split 4
JobTracker
MapRunner
RandomizedRecordReader
(x , null)
(x , null)Mapper
(x , sj(x))Reducerp1
In-MemoryMap
(x , sj(x))
(x , sj(x))
Centralized Wavelet Top-k
(x , v(x))
close()
nj= records in split jsj= split j sample frequency vector
close()
(x , s(x))
1 RandomizedRecordReader j samples tj = nj/ε2n records.
2 If sj(x) > εtj , the Mapper emits (x , sj(x)).
3 Reducer uses v(x) = s(x)/p, our estimator for v(x).
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 455: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/455.jpg)
Approximate Top-k Wavelet Coefficients: ImprovedSampling
split 1split 2split 3split 4
JobTracker
MapRunner
RandomizedRecordReader
(x , null)
(x , null)Mapper
(x , sj(x))Reducerp1
In-MemoryMap
(x , sj(x))
(x , sj(x))
Centralized Wavelet Top-k
(x , v(x))
top-k |wi |
nj= records in split jsj= split j sample frequency vector
close()
(x , s(x))
1 RandomizedRecordReader j samples tj = nj/ε2n records.
2 If sj(x) > εtj , the Mapper emits (x , sj(x)).
3 Reducer uses v(x) = s(x)/p, our estimator for v(x).
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce
![Page 456: Building Wavelet Histograms on Large Data in MapReducelifeifei/histogram/histogramSlides.pdfIntroduction: Wavelet Histograms A common choice for a histogram is the Haar wavelet histogram](https://reader035.vdocuments.us/reader035/viewer/2022062608/60af6c58c59ce91ad534d321/html5/thumbnails/456.jpg)
Approximate Top-k Wavelet Coefficients: ImprovedSampling
split 1split 2split 3split 4
JobTracker
MapRunner
RandomizedRecordReader
(x , null)
(x , null)Mapper
(x , sj(x))Reducerp1
In-MemoryMap
(x , sj(x))
(x , sj(x))
Centralized Wavelet Top-k
(x , v(x))
top-k |wi |
o1top-k |wi |
nj= records in split jsj= split j sample frequency vector
close()
(x , s(x))
1 RandomizedRecordReader j samples tj = nj/ε2n records.
2 If sj(x) > εtj , the Mapper emits (x , sj(x)).
3 Reducer uses v(x) = s(x)/p, our estimator for v(x).
Jeffrey Jestes, Ke Yi, Feifei Li Building Wavelet Histograms on Large Data in MapReduce