Sampling Based Range Partition for Big Data Analytics
+ Some Extras
Milan Vojnović, Microsoft Research
Cambridge, United Kingdom
Joint work with Charalampos Tsourakakis, Bozidar Radunovic, Zhenming Liu, Fei Xu, Jingren Zhou
INQUEST Workshop, September 2012
Big Data Analytics
• Our goal: innovation in algorithms for large-scale computation, advancing the frontier of the computer science of big data
• Some figures of scale
– Peta/terabytes of online services data processed daily
– 200M tweets per day (Twitter)
– 1B content pieces shared per day (Facebook)
– 8,000 exabytes of global data by 2015 (The Economist)
Research Agenda
(figure: machine learning, optimization, and database queries layered on a distributed computing system)
Outline
• Range Partition (with Fei Xu and Jingren Zhou)
• Count Tracking (with Zhenming Liu and Bozidar Radunovic)
• Graph Partitioning, definition only (with Charalampos Tsourakakis and Bozidar Radunovic)
Range Partition
• Special interest: balanced range partition
(figure: data items with keys in 1-1024, spread over k sites, are split into m ranges, e.g. 1-100, 101-250, ..., 950-1024)
Range Partition Requirements
• Given k sites, m ranges, and desired relative partition sizes p_1, ..., p_m
• ε-accurate range partition: Q_j ≤ (1 + ε) p_j n for every range j, with probability at least 1 - δ, where Q_j = number of data items assigned to range j and n = total number of data items
Two Approaches
• Sampling based methods
– Take a sample of data items
– Compute partition boundaries using the sample
• Quantile summary methods
– At each node, compute a local quantile summary
– Merge at the coordinator node
Related Work
• Sampling based estimation of histograms, studied by Chaudhuri, Motwani and Narasayya (ACM SIGMOD 1998), which bounds the required sample size
• Tirthapura and Woodruff (2011) bound the communication cost of drawing samples without replacement from distributed sites
Related Work (cont’d)
• Quantile summaries based approach (Greenwald and Khanna, 2001), with communication cost growing with the number of sites and with the accuracy parameter
• Pros
– Deterministic guarantee
• Cons
– Requires sorting of data items
– The largest frequency of an item must be small
Problem
• Range partition the data in one pass through the data, with minimal communication between the coordinator and the sites
Sampling Based Method
(figure: sites 1, 2, ..., k send samples to a coordinator)
• Collect samples and partition using the samples
• Pros
– Simplicity, scalability
• Cons
– How many samples to take from each site?
– Data size imbalance: the number of data input records per machine may differ from one machine to another
Data Sizes Imbalance
Dataset Records Bytes Sites
DataSet-1 62M 150G 262
DataSet-2 37M 25G 80
DataSet-3 13M 0.26G 1
DataSet-4 7M 1.2T 301
DataSet-5 106M 7T 5652
Origins of Data Sizes Imbalance
• JOIN
SELECT ... FROM A INNER JOIN B ON A.KEY == B.KEY ORDER BY COL
• Lookup table
If the record value of column X is in the lookup table, then return the row
• UNPIVOT
Input:
Col 1 | Col 2
1 | 2, 3
2 | 3, 9, 8, 13
...
Output: (1,2), (1,3), (2,3), (2,9), ...
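Operations like UNPIVOT multiply record counts non-uniformly across machines, which is one way the imbalance arises. A minimal sketch of the expansion shown above (the `unpivot` helper is illustrative, not part of any system mentioned here):

```python
def unpivot(rows):
    """Expand (key, values) rows into individual (key, value) pairs.

    Each input row produces one output record per value, so machines
    holding rows with long value lists end up with many more records.
    """
    return [(key, v) for key, values in rows for v in values]

pairs = unpivot([(1, [2, 3]), (2, [3, 9, 8, 13])])
# First output pairs match the slide's example: (1,2), (1,3), (2,3), (2,9), ...
```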
Weighted Sampling Scheme
• SAMPLE: each site reports a random sample of t/k data items and its total number of items
• MERGE: the coordinator builds a summary in which each data item from site i is replicated a number of times determined by the site's total count n_i
• PARTITION: use the summary to determine the partition boundaries
Note: the total number of data items is reported by a site only once it is available, i.e., after the site has made one pass through its local data
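The three steps above can be sketched as a toy, single-process simulation. Details here are assumptions: equal target range sizes, and weighting each sampled item by how many local items it represents, which plays the role of the replication in the MERGE step.

```python
import random

def weighted_range_boundaries(sites, t, m):
    """Toy sketch of the weighted sampling scheme (assumed details).

    sites: list of non-empty local datasets (one list of keys per site).
    t:     total sample size; each of the k sites contributes ~t/k items.
    m:     desired number of ranges (equal target sizes assumed).
    Returns m-1 boundary keys.
    """
    k = len(sites)
    per_site = max(1, t // k)
    weighted = []
    for data in sites:
        n_i = len(data)                        # reported alongside the sample
        sample = random.sample(data, min(per_site, n_i))
        # MERGE: each sampled item stands in for n_i / |sample| local items,
        # i.e. items from larger sites carry proportionally larger weight.
        w = n_i / len(sample)
        weighted.extend((x, w) for x in sample)
    weighted.sort()
    total = sum(w for _, w in weighted)
    # PARTITION: place boundaries at the m-quantiles of the weighted summary.
    boundaries, acc, j = [], 0.0, 1
    for x, w in weighted:
        acc += w
        while j < m and acc >= j * total / m:
            boundaries.append(x)
            j += 1
    return boundaries
```

The weighting makes the summary behave as if each site had reported a number of items proportional to its actual data size, which is the point of the scheme when sites are imbalanced.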
SAMPLE
(figure: sites 1, 2, ..., k and a coordinator)
Each site i sends the coordinator the pair (n_i, S_i), where n_i is its total item count and S_i = {a_1^i, a_2^i, ..., a_t^i} is its local random sample.
MERGE
The coordinator receives (n_i, S_i = {a_1^i, a_2^i, ..., a_t^i}) from each site i and forms the summary S = {..., a_2^i, a_2^i, ..., a_2^i, ...} by inserting replicas of each received item, with the number of replicas of an item from site i determined by the site's count n_i.
PARTITION
(figure: the empirical CDF of the data summary, on a 0 to 1 scale, cut into ranges 1, 2, 3, 4, 5)
The coordinator chooses the partition boundaries from the empirical CDF of the data summary.
Sufficient Sample Size
• For a large enough sample size, growing with 1/ε² and log(1/δ), the scheme yields an ε-accurate range partition with probability at least 1 - δ
• The required size also depends on the largest frequency of a data value
Constant Factor Imbalance
• Suppose that every site's data size is within a constant factor a of the average
• Then the sufficient sample size grows by at most a factor depending only on a
Proof Outline
• Large deviation analysis of the error exponent ν_j
Performance
• DataSet-1, 100K data records per range
Performance (cont’d)
(figure: results for data size imbalance factors a = 1, a = 2, and a = 4)
Summary for Range Partitioning
• Novel weighted sampling scheme
• Provable performance guarantees
• Simple and practical
– Code transferred to Cosmos
• More info: Sampling Based Range Partition Methods for Big Data Analytics, V., Xu, Zhou, MSR-TR-2012-18, Mar 2012
Outline
• Range Partition (with Fei Xu and Jingren Zhou)
• Count Tracking (with Zhenming Liu and Bozidar Radunovic)
• Graph Partitioning, definition only (with Charalampos Tsourakakis and Bozidar Radunovic)
SUM Tracking Problem
• Maintain an estimate of SUM = X_1 + X_2 + X_3 + X_4 + X_5 + ... over value updates arriving at k distributed sites 1, 2, ..., k
SUM Tracking
(figure: the estimate tracks the true sum S_t over time t, staying within the band between (1 - ε) S_t and (1 + ε) S_t)
Applications
• Ex 1: database queries
SELECT SUM(AdBids) FROM Ads
• Ex 2: iterative solving over input data
State of the Art
• Count tracking [Huang, Yi and Zhang, 2011]
– Worst-case input, monotonic sum
– Expected total communication: O((√k / ε) log n) messages
• Lower bound for worst-case input [Arackaparambil, Brody and Chakrabarti, 2009]
– Lower bound on the expected total number of messages
The Challenge
• Q: What are communication cost efficient algorithms for the sum tracking problem with random input streams?
– Random permutation
– Random i.i.d.
– Fractional Brownian motion
Communication Complexity Bounds
• Matching lower and upper bounds on the expected total communication
• The cost is sublinear in the stream length; the extra factor over the monotonic case is the "price of non-monotonicity"
Communication Complexity Bounds: Unknown Drift Case
• Input: i.i.d. Bernoulli updates with an unknown drift parameter
• Bound on the expected total communication, in terms of the drift
• Generalizes the monotonic case to the constant drift case
Our Tracker Algorithm
• Each site reports to the coordinator upon receiving a value update, with a suitably chosen probability
• Whenever the coordinator receives an update from a site, all sites are synchronized: the coordinator collects the local sums S_1, ..., S_k and broadcasts S = S_1 + ... + S_k
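A toy simulation of this reporting pattern is below. The fixed reporting probability `p` is a placeholder: in the actual algorithm the probability is chosen as a function of the accuracy target and the current sum.

```python
import random

def track_sum(stream, k, p, seed=0):
    """Toy simulation of the randomized reporting pattern sketched above.

    stream: list of (site, value) updates.
    k:      number of sites.
    p:      per-update reporting probability (placeholder for the
            accuracy-dependent probability used by the algorithm).
    Returns (final_estimate, true_sum, messages_sent).
    """
    rng = random.Random(seed)
    local = [0.0] * k          # exact local sums S_1, ..., S_k at the sites
    estimate = 0.0             # coordinator's view S = S_1 + ... + S_k
    messages = 0
    for site, value in stream:
        local[site] += value
        if rng.random() < p:
            messages += 1              # site -> coordinator report
            estimate = sum(local)      # sync: coordinator polls all sites
            messages += 2 * k          # poll requests + broadcast of S
    return estimate, sum(local), messages
```

Between syncs the coordinator's estimate drifts from the true sum; the analysis in the talk bounds how often syncs are needed so the estimate stays within the (1 ± ε) band.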
Two Applications
• Second Frequency Moment
• Bayesian Linear Regression
App 1: Second Frequency Moment
• Input: a stream of data values
• Counter of value v: m_v = number of occurrences of v
• Second frequency moment: F_2 = Σ_v m_v²
• Goal: track F_2 within relative accuracy 1 ± ε
AMS Sketch
(figure: an s_1 × s_2 array of counters S_t^{i,j}, i = 1, ..., s_2, j = 1, ..., s_1, built from hash functions of the input values)
• Row estimate: S_t^i = (1/s_1) Σ_j S_t^{i,j}
• Final estimate: S_t = median_i(S_t^i)
• For suitable choices of s_1 and s_2, S_t is within the target relative accuracy of F_2 with the desired probability
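The sketch can be illustrated with a minimal, self-contained implementation. This uses the classical ±1-valued form of the AMS hash, and a plain dictionary of random signs stands in for the 4-wise independent hash families used in the formal analysis.

```python
import random
import statistics

def ams_f2(stream, s1, s2, seed=0):
    """Classical AMS-style estimator of the second frequency moment F2.

    s1 counters are averaged within each of s2 independent rows,
    and the median over rows is returned.
    """
    rng = random.Random(seed)
    signs = [[{} for _ in range(s1)] for _ in range(s2)]
    counters = [[0 for _ in range(s1)] for _ in range(s2)]

    def sign(i, j, v):
        # Lazily drawn ±1 sign per (row, column, value); a stand-in for a
        # 4-wise independent hash function.
        if v not in signs[i][j]:
            signs[i][j][v] = rng.choice((-1, 1))
        return signs[i][j][v]

    for v in stream:
        for i in range(s2):
            for j in range(s1):
                counters[i][j] += sign(i, j, v)

    # Each squared counter is an unbiased estimate of F2; average within a
    # row, then take the median across rows to boost the success probability.
    row_estimates = [sum(c * c for c in row) / s1 for row in counters]
    return statistics.median(row_estimates)
```

Each counter is itself a (non-monotonic) sum over the stream, which is what makes this sketch a natural client of the sum tracking algorithm.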
App 1: Second Frequency Moment (cont’d)
• Each sketch counter is a (generally non-monotonic) sum over the stream, so it can be maintained with the sum tracker
• This yields a bound on the expected total communication for tracking F_2
App 2: Bayesian Linear Regression
• Feature vector x_t, output y_t
• Prior and posterior distributions over the regression coefficients
App 2: Bayesian Linear Regression (cont’d)
• The posterior mean and precision are determined by sums over the stream (of the terms x_t x_t^T and x_t y_t)
• Each of these sums can be maintained with the sum tracker
• Under random permutation input, this gives a bound on the expected communication cost
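The reduction to sum tracking can be made concrete in the scalar case, using the standard conjugate update for y = w·x + Gaussian noise (the function and variable names here are illustrative):

```python
def posterior_params(xs, ys, beta=1.0, prior_prec=1.0, prior_mean=0.0):
    """Posterior over the weight w in the scalar model y ~ N(w*x, 1/beta).

    Both statistics below are plain sums over the stream, so in the
    distributed setting each could be maintained by a sum tracker.
    """
    sxx = sum(x * x for x in xs)                 # tracked sum of x_t^2
    sxy = sum(x * y for x, y in zip(xs, ys))     # tracked sum of x_t * y_t
    prec = prior_prec + beta * sxx               # posterior precision
    mean = (prior_prec * prior_mean + beta * sxy) / prec
    return mean, prec
```

In the vector case the same structure holds entrywise: the posterior precision accumulates x_t x_t^T and the mean involves the accumulated x_t y_t, so each matrix and vector entry is a separately trackable sum.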
Summary for Sum Tracking
• Studied the sum tracking problem with non-monotonic distributed streams under random permutation, random i.i.d. and fractional Brownian motion inputs
• Proposed a novel algorithm with nearly optimal communication complexity
• Details: ACM PODS 2012
Outline
• Range Partition (with Fei Xu and Jingren Zhou)
• Count Tracking (with Zhenming Liu and Bozidar Radunovic)
• Graph Partitioning, definition only (with Charalampos Tsourakakis and Bozidar Radunovic)
Problem
• Partition a graph with two objectives
– Sparsely connected components
– Balanced number of vertices per component
• Applications
– Parallel processing
– Community detection
Problem (cont’d)
(figure: a stream of vertices assigned to machines 1, 2, 3, ..., k)
• Requirements
– Streaming algorithm
– Single pass / incremental
– Efficient computing
• Desired
– Approximation guarantees
– Average-case efficient
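For concreteness, here is one generic one-pass heuristic that meets the requirements above. It is an illustrative greedy rule, not the algorithm from the talk (whose details are not yet published): each arriving vertex joins the part holding most of its already-placed neighbors, subject to a capacity cap, with ties broken toward the smaller part.

```python
def stream_partition(vertices, adj, k, capacity):
    """Greedy one-pass graph partitioner (a generic heuristic sketch).

    vertices: arrival order of vertices.
    adj:      dict mapping vertex -> set of neighbor vertices.
    k:        number of parts; capacity: max vertices per part
              (assumes len(vertices) <= k * capacity).
    Returns dict mapping vertex -> part index.
    """
    part = {}                  # vertex -> assigned part
    sizes = [0] * k
    for v in vertices:
        feasible = [i for i in range(k) if sizes[i] < capacity]
        # Prefer the part with most already-placed neighbors of v
        # (fewer cut edges); break ties toward the smaller part (balance).
        best = max(
            feasible,
            key=lambda i: (sum(1 for u in adj.get(v, ()) if part.get(u) == i),
                           -sizes[i]),
        )
        part[v] = best
        sizes[best] += 1
    return part
```

The two terms of the key mirror the two objectives stated earlier: sparse connectivity between components and a balanced number of vertices per component.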
Summary for Graph Partitioning
• Designed a streaming algorithm whose average-case performance appears superior to previously proposed online heuristics
• Provable approximation guarantees
• More details available soon