Sampling Based Range Partition for Big Data Analytics
+ Some Extras
Milan Vojnović, Microsoft Research
Cambridge, United Kingdom
Joint work with Charalampos Tsourakakis, Bozidar Radunovic, Zhenming Liu, Fei Xu, Jingren Zhou
INQUEST Workshop, September 2012
Big Data Analytics
• Our goal: innovation in algorithms for large-scale computation, advancing the frontier of the computer science of big data
• Some figures of scale
– Peta/terabytes of online services data processed daily
– 200M tweets per day (Twitter)
– 1B content pieces shared per day (Facebook)
– 8,000 exabytes of global data by 2015 (The Economist)
Research Agenda
(figure: machine learning, optimization, and database queries layered on a distributed computing system)
Outline
• Range Partition (with Fei Xu and Jingren Zhou)
• Count Tracking (with Zhenming Liu and Bozidar Radunovic)
• Graph Partitioning, definition only (with Charalampos Tsourakakis and Bozidar Radunovic)
Range Partition
• Special interest: balanced range partition
(figure: data items with keys in 1-1024, spread over k sites, are split into m ranges, e.g. 1-100, 101-250, ..., 950-1024)
Range Partition Requirements
• Given k sites, m ranges, and desired relative partition sizes p_1, ..., p_m
• ε-accurate range partition: Q_j ≤ (1 + ε) p_j n for every range j, with probability at least 1 - δ, where Q_j = number of data items assigned to range j and n = total number of data items
Two Approaches
• Sampling based methods
– Take a sample of data items
– Compute partition boundaries using the sample
• Quantile summary methods
– At each node, compute a local quantile summary
– Merge at the coordinator node
Related Work
• Sampling based estimation of histograms, studied by Chaudhuri, Motwani and Narasayya (ACM SIGMOD 1998), which bounds the required sample size
• Tirthapura and Woodruff (2011) bound the communication cost of drawing samples without replacement from distributed sites
Related Work (cont’d)
• Quantile summaries based approach (Greenwald and Khanna, 2001), with communication cost growing with the number of sites and with the accuracy parameter
• Pros
– Deterministic guarantee
• Cons
– Requires sorting of data items
– The largest frequency of an item must be small
Problem
• Range partition the data in one pass through the data, with minimal communication between the coordinator and the sites
Sampling Based Method
(figure: sites 1, 2, ..., k send samples to a coordinator)
• Collect samples and partition using the samples
• Pros
– Simplicity, scalability
• Cons
– How many samples to take from each site?
– Data size imbalance: the number of data input records per machine may differ from one machine to another
Data Sizes Imbalance
Dataset Records Bytes Sites
DataSet-1 62M 150G 262
DataSet-2 37M 25G 80
DataSet-3 13M 0.26G 1
DataSet-4 7M 1.2T 301
DataSet-5 106M 7T 5652
Origins of Data Sizes Imbalance
• JOIN
SELECT ... FROM A INNER JOIN B ON A.KEY == B.KEY ORDER BY COL
• Lookup table
If the record value of column X is in the lookup table, then return the row
• UNPIVOT
Input:
Col 1 | Col 2
1 | 2, 3
2 | 3, 9, 8, 13
...
Output: (1,2), (1,3), (2,3), (2,9), ...
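Operations like UNPIVOT multiply record counts non-uniformly across machines, which is one way the imbalance arises. A minimal sketch of the expansion shown above (the `unpivot` helper is illustrative, not part of any system mentioned here):

```python
def unpivot(rows):
    """Expand (key, values) rows into individual (key, value) pairs.

    Each input row produces one output record per value, so machines
    holding rows with long value lists end up with many more records.
    """
    return [(key, v) for key, values in rows for v in values]

pairs = unpivot([(1, [2, 3]), (2, [3, 9, 8, 13])])
# First output pairs match the slide's example: (1,2), (1,3), (2,3), (2,9), ...
```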
Weighted Sampling Scheme
• SAMPLE: each site reports a random sample of t/k data items and its total number of items
• MERGE: the coordinator builds a summary in which each data item from site i is replicated a number of times determined by the site's total count n_i
• PARTITION: use the summary to determine the partition boundaries
Note: the total number of data items is reported by a site only once it is available, i.e., after the site has made one pass through its local data
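The three steps above can be sketched as a toy, single-process simulation. Details here are assumptions: equal target range sizes, and weighting each sampled item by how many local items it represents, which plays the role of the replication in the MERGE step.

```python
import random

def weighted_range_boundaries(sites, t, m):
    """Toy sketch of the weighted sampling scheme (assumed details).

    sites: list of non-empty local datasets (one list of keys per site).
    t:     total sample size; each of the k sites contributes ~t/k items.
    m:     desired number of ranges (equal target sizes assumed).
    Returns m-1 boundary keys.
    """
    k = len(sites)
    per_site = max(1, t // k)
    weighted = []
    for data in sites:
        n_i = len(data)                        # reported alongside the sample
        sample = random.sample(data, min(per_site, n_i))
        # MERGE: each sampled item stands in for n_i / |sample| local items,
        # i.e. items from larger sites carry proportionally larger weight.
        w = n_i / len(sample)
        weighted.extend((x, w) for x in sample)
    weighted.sort()
    total = sum(w for _, w in weighted)
    # PARTITION: place boundaries at the m-quantiles of the weighted summary.
    boundaries, acc, j = [], 0.0, 1
    for x, w in weighted:
        acc += w
        while j < m and acc >= j * total / m:
            boundaries.append(x)
            j += 1
    return boundaries
```

The weighting makes the summary behave as if each site had reported a number of items proportional to its actual data size, which is the point of the scheme when sites are imbalanced.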
SAMPLE
(figure: sites 1, 2, ..., k and a coordinator)
Each site i sends the coordinator the pair (n_i, S_i), where n_i is its total item count and S_i = {a_1^i, a_2^i, ..., a_t^i} is its local random sample.
MERGE
The coordinator receives (n_i, S_i = {a_1^i, a_2^i, ..., a_t^i}) from each site i and forms the summary S = {..., a_2^i, a_2^i, ..., a_2^i, ...} by inserting replicas of each received item, with the number of replicas of an item from site i determined by the site's count n_i.
PARTITION
(figure: the empirical CDF of the data summary, on a 0 to 1 scale, cut into ranges 1, 2, 3, 4, 5)
The coordinator chooses the partition boundaries from the empirical CDF of the data summary.
Sufficient Sample Size
• For a large enough sample size, growing with 1/ε² and log(1/δ), the scheme yields an ε-accurate range partition with probability at least 1 - δ
• The required size also depends on the largest frequency of a data value
Constant Factor Imbalance
• Suppose that every site's data size is within a constant factor a of the average
• Then the sufficient sample size grows by at most a factor depending only on a
Proof Outline
• Large deviation analysis of the error exponent ν_j
Performance
• DataSet-1, 100K data records per range
Performance (cont’d)
(figure: results for data size imbalance factors a = 1, a = 2, and a = 4)
Summary for Range Partitioning
• Novel weighted sampling scheme
• Provable performance guarantees
• Simple and practical
– Code transferred to Cosmos
• More info: Sampling Based Range Partition Methods for Big Data Analytics, V., Xu, Zhou, MSR-TR-2012-18, Mar 2012
Outline
• Range Partition (with Fei Xu and Jingren Zhou)
• Count Tracking (with Zhenming Liu and Bozidar Radunovic)
• Graph Partitioning, definition only (with Charalampos Tsourakakis and Bozidar Radunovic)
SUM Tracking Problem
• Maintain an estimate of SUM = X_1 + X_2 + X_3 + X_4 + X_5 + ... over value updates arriving at k distributed sites 1, 2, ..., k
SUM Tracking
(figure: the estimate tracks the true sum S_t over time t, staying within the band between (1 - ε) S_t and (1 + ε) S_t)
Applications
• Ex 1: database queries
SELECT SUM(AdBids) FROM Ads
• Ex 2: iterative solving over input data
State of the Art
• Count tracking [Huang, Yi and Zhang, 2011]
– Worst-case input, monotonic sum
– Expected total communication: O((√k / ε) log n) messages
• Lower bound for worst-case input [Arackaparambil, Brody and Chakrabarti, 2009]
– Lower bound on the expected total number of messages
The Challenge
• Q: What are communication cost efficient algorithms for the sum tracking problem with random input streams?
– Random permutation
– Random i.i.d.
– Fractional Brownian motion
Communication Complexity Bounds
• Matching lower and upper bounds on the expected total communication
• The cost is sublinear in the stream length; the extra factor over the monotonic case is the "price of non-monotonicity"
Communication Complexity Bounds: Unknown Drift Case
• Input: i.i.d. Bernoulli updates with an unknown drift parameter
• Bound on the expected total communication, in terms of the drift
• Generalizes the monotonic case to the constant drift case
Our Tracker Algorithm
• Each site reports to the coordinator upon receiving a value update, with a suitably chosen probability
• Whenever the coordinator receives an update from a site, all sites are synchronized: the coordinator collects the local sums S_1, ..., S_k and broadcasts S = S_1 + ... + S_k
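A toy simulation of this reporting pattern is below. The fixed reporting probability `p` is a placeholder: in the actual algorithm the probability is chosen as a function of the accuracy target and the current sum.

```python
import random

def track_sum(stream, k, p, seed=0):
    """Toy simulation of the randomized reporting pattern sketched above.

    stream: list of (site, value) updates.
    k:      number of sites.
    p:      per-update reporting probability (placeholder for the
            accuracy-dependent probability used by the algorithm).
    Returns (final_estimate, true_sum, messages_sent).
    """
    rng = random.Random(seed)
    local = [0.0] * k          # exact local sums S_1, ..., S_k at the sites
    estimate = 0.0             # coordinator's view S = S_1 + ... + S_k
    messages = 0
    for site, value in stream:
        local[site] += value
        if rng.random() < p:
            messages += 1              # site -> coordinator report
            estimate = sum(local)      # sync: coordinator polls all sites
            messages += 2 * k          # poll requests + broadcast of S
    return estimate, sum(local), messages
```

Between syncs the coordinator's estimate drifts from the true sum; the analysis in the talk bounds how often syncs are needed so the estimate stays within the (1 ± ε) band.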
Two Applications
• Second Frequency Moment
• Bayesian Linear Regression
App 1: Second Frequency Moment
• Input: a stream of data values
• Counter of value v: m_v = number of occurrences of v
• Second frequency moment: F_2 = Σ_v m_v²
• Goal: track F_2 within relative accuracy 1 ± ε
AMS Sketch
(figure: an s_1 × s_2 array of counters S_t^{i,j}, i = 1, ..., s_2, j = 1, ..., s_1, built from hash functions of the input values)
• Row estimate: S_t^i = (1/s_1) Σ_j S_t^{i,j}
• Final estimate: S_t = median_i(S_t^i)
• For suitable choices of s_1 and s_2, S_t is within the target relative accuracy of F_2 with the desired probability
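The sketch can be illustrated with a minimal, self-contained implementation. This uses the classical ±1-valued form of the AMS hash, and a plain dictionary of random signs stands in for the 4-wise independent hash families used in the formal analysis.

```python
import random
import statistics

def ams_f2(stream, s1, s2, seed=0):
    """Classical AMS-style estimator of the second frequency moment F2.

    s1 counters are averaged within each of s2 independent rows,
    and the median over rows is returned.
    """
    rng = random.Random(seed)
    signs = [[{} for _ in range(s1)] for _ in range(s2)]
    counters = [[0 for _ in range(s1)] for _ in range(s2)]

    def sign(i, j, v):
        # Lazily drawn ±1 sign per (row, column, value); a stand-in for a
        # 4-wise independent hash function.
        if v not in signs[i][j]:
            signs[i][j][v] = rng.choice((-1, 1))
        return signs[i][j][v]

    for v in stream:
        for i in range(s2):
            for j in range(s1):
                counters[i][j] += sign(i, j, v)

    # Each squared counter is an unbiased estimate of F2; average within a
    # row, then take the median across rows to boost the success probability.
    row_estimates = [sum(c * c for c in row) / s1 for row in counters]
    return statistics.median(row_estimates)
```

Each counter is itself a (non-monotonic) sum over the stream, which is what makes this sketch a natural client of the sum tracking algorithm.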
App 1: Second Frequency Moment (cont’d)
• Each sketch counter is a (generally non-monotonic) sum over the stream, so it can be maintained with the sum tracker
• This yields a bound on the expected total communication for tracking F_2
App 2: Bayesian Linear Regression
• Feature vector x_t, output y_t
• Prior and posterior distributions over the regression coefficients
App 2: Bayesian Linear Regression (cont’d)
• The posterior mean and precision are determined by sums over the stream (of the terms x_t x_t^T and x_t y_t)
• Each of these sums can be maintained with the sum tracker
• Under random permutation input, this gives a bound on the expected communication cost
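The reduction to sum tracking can be made concrete in the scalar case, using the standard conjugate update for y = w·x + Gaussian noise (the function and variable names here are illustrative):

```python
def posterior_params(xs, ys, beta=1.0, prior_prec=1.0, prior_mean=0.0):
    """Posterior over the weight w in the scalar model y ~ N(w*x, 1/beta).

    Both statistics below are plain sums over the stream, so in the
    distributed setting each could be maintained by a sum tracker.
    """
    sxx = sum(x * x for x in xs)                 # tracked sum of x_t^2
    sxy = sum(x * y for x, y in zip(xs, ys))     # tracked sum of x_t * y_t
    prec = prior_prec + beta * sxx               # posterior precision
    mean = (prior_prec * prior_mean + beta * sxy) / prec
    return mean, prec
```

In the vector case the same structure holds entrywise: the posterior precision accumulates x_t x_t^T and the mean involves the accumulated x_t y_t, so each matrix and vector entry is a separately trackable sum.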
Summary for Sum Tracking
• Studied the sum tracking problem with non-monotonic distributed streams under random permutation, random i.i.d. and fractional Brownian motion inputs
• Proposed a novel algorithm with nearly optimal communication complexity
• Details: ACM PODS 2012
Outline
• Range Partition (with Fei Xu and Jingren Zhou)
• Count Tracking (with Zhenming Liu and Bozidar Radunovic)
• Graph Partitioning, definition only (with Charalampos Tsourakakis and Bozidar Radunovic)
Problem
• Partition a graph with two objectives
– Sparsely connected components
– Balanced number of vertices per component
• Applications
– Parallel processing
– Community detection
Problem (cont’d)
(figure: a stream of vertices assigned to machines 1, 2, 3, ..., k)
• Requirements
– Streaming algorithm
– Single pass / incremental
– Efficient computing
• Desired
– Approximation guarantees
– Average-case efficient
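For concreteness, here is one generic one-pass heuristic that meets the requirements above. It is an illustrative greedy rule, not the algorithm from the talk (whose details are not yet published): each arriving vertex joins the part holding most of its already-placed neighbors, subject to a capacity cap, with ties broken toward the smaller part.

```python
def stream_partition(vertices, adj, k, capacity):
    """Greedy one-pass graph partitioner (a generic heuristic sketch).

    vertices: arrival order of vertices.
    adj:      dict mapping vertex -> set of neighbor vertices.
    k:        number of parts; capacity: max vertices per part
              (assumes len(vertices) <= k * capacity).
    Returns dict mapping vertex -> part index.
    """
    part = {}                  # vertex -> assigned part
    sizes = [0] * k
    for v in vertices:
        feasible = [i for i in range(k) if sizes[i] < capacity]
        # Prefer the part with most already-placed neighbors of v
        # (fewer cut edges); break ties toward the smaller part (balance).
        best = max(
            feasible,
            key=lambda i: (sum(1 for u in adj.get(v, ()) if part.get(u) == i),
                           -sizes[i]),
        )
        part[v] = best
        sizes[best] += 1
    return part
```

The two terms of the key mirror the two objectives stated earlier: sparse connectivity between components and a balanced number of vertices per component.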
Summary for Graph Partitioning
• Designed a streaming algorithm whose average-case performance appears superior to previously proposed online heuristics
• Provable approximation guarantees
• More details available soon