blinkdb

Nitish Upreti

Posted on 08-Jul-2015

DESCRIPTION

Introduction to BlinkDB : Queries with Bounded Errors and Bounded Response Times on Very Large Data

TRANSCRIPT

Page 2: Blinkdb

Goal : Solve Big Data !

2

How to achieve the best Performance ?

Page 3: Blinkdb

[chart: time to scan 100 TB on 1000 machines — Hard Disks: ½ - 1 hour; Memory: 1 - 5 minutes; ?: 1 second]

3

Page 4: Blinkdb

Evolved To Evolves To ?

Better and Faster Frameworks ?

4

Page 5: Blinkdb

If we cannot do better than In-Memory, then what?

5

Page 6: Blinkdb

Can we use Approximate Computing ?

6

Page 7: Blinkdb

Can you tolerate Errors ? Well, it depends on the scenario, right…

7

Page 8: Blinkdb

Overview of Big Data Space

8

Page 9: Blinkdb

Batch processing of massive logs

9

Page 10: Blinkdb

Can we use Approximate Computing ? Answer : YES / NO

10

Page 11: Blinkdb

Streaming data processing

11

Page 12: Blinkdb

Can we use Approximate Computing ? Answer : MAYBE

12

Page 13: Blinkdb

Exploratory Data Analysis

13

Page 14: Blinkdb

Exploratory / Interactive Data Processing

-- Getting a sense of data (Data Scientists)
-- Debugging ? (SREs / DevOps)

14

Page 15: Blinkdb

Can we use Approximate Computing ? Answer : YES !

15

Page 16: Blinkdb

1) BlinkDB : Queries with Bounded Errors and Bounded Response Times on Very Large Data.

2) Blink and It's Done : Interactive Queries on Very Large Data.

3) A General Bootstrap Performance Diagnostic.

4) Knowing When You're Wrong : Building Fast and Reliable Approximate Query Processing Systems.

Sameer Agarwal, Ariel Kleiner, Henry Milner, Barzan Mozafari, Ameet Talwalkar, Michael Jordan, Samuel Madden, Ion Stoica

Page 17: Blinkdb

Our Goal : Support interactive SQL-like aggregate queries over massive sets of data

17

Page 18: Blinkdb

Our Goal : Support interactive SQL-like aggregate queries over massive sets of data

blinkdb> SELECT AVG(jobtime)
         FROM very_big_log

AVG, COUNT, SUM, STDEV, PERCENTILE etc.

18

Page 19: Blinkdb

Support interactive SQL-like aggregate queries over massive sets of data

blinkdb> SELECT AVG(jobtime)

FROM very_big_log

WHERE src = ‘hadoop’

FILTERS, GROUP BY clauses

Our Goal

19

Page 20: Blinkdb

Support interactive SQL-like aggregate queries over massive sets of data

blinkdb> SELECT AVG(jobtime)
         FROM very_big_log
         LEFT OUTER JOIN logs2
         ON very_big_log.id = logs2.id
         WHERE src = 'hadoop'

JOINS, Nested Queries etc.

Our Goal

20

Page 21: Blinkdb

Support interactive SQL-like aggregate queries over massive sets of data

blinkdb> SELECT my_function(jobtime)
         FROM very_big_log
         LEFT OUTER JOIN logs2
         ON very_big_log.id = logs2.id
         WHERE src = 'hadoop'

ML Primitives, User Defined Functions

Our Goal

21

Page 22: Blinkdb

Support interactive SQL-like aggregate queries over massive sets of data

blinkdb> SELECT my_function(jobtime)

FROM very_big_log

WHERE src = ‘hadoop’

ERROR WITHIN 10% AT CONFIDENCE 95%

Our Goal

22

Page 23: Blinkdb

Support interactive SQL-like aggregate queries over massive sets of data

blinkdb> SELECT my_function(jobtime)

FROM very_big_log

WHERE src = ‘hadoop’

WITHIN 5 SECONDS

Our Goal

23

Page 24: Blinkdb

ID City Buff Ratio

1 NYC 0.78

2 NYC 0.13

3 Berkeley 0.25

4 NYC 0.19

5 NYC 0.11

6 Berkeley 0.09

7 NYC 0.18

8 NYC 0.15

9 Berkeley 0.13

10 Berkeley 0.49

11 NYC 0.19

12 Berkeley 0.10

Query Execution on Samples (Exploration Query)

What is the average buffering ratio in the table?

0.2325 (Precise)

24

Page 25: Blinkdb

ID City Buff Ratio

1 NYC 0.78

2 NYC 0.13

3 Berkeley 0.25

4 NYC 0.19

5 NYC 0.11

6 Berkeley 0.09

7 NYC 0.18

8 NYC 0.15

9 Berkeley 0.13

10 Berkeley 0.49

11 NYC 0.19

12 Berkeley 0.10

Query Execution on Samples

What is the average buffering ratio in the table?

ID City Buff Ratio Sampling Rate

2 NYC 0.13 1/4

6 Berkeley 0.25 1/4

8 NYC 0.19 1/4

Uniform Sample

Estimate: 0.19 | 0.2325 (Precise)

25

Page 26: Blinkdb

ID City Buff Ratio

1 NYC 0.78

2 NYC 0.13

3 Berkeley 0.25

4 NYC 0.19

5 NYC 0.11

6 Berkeley 0.09

7 NYC 0.18

8 NYC 0.15

9 Berkeley 0.13

10 Berkeley 0.49

11 NYC 0.19

12 Berkeley 0.10

Query Execution on Samples

What is the average buffering ratio in the table?

ID City Buff Ratio Sampling Rate

2 NYC 0.13 1/4

6 Berkeley 0.25 1/4

8 NYC 0.19 1/4

Uniform Sample

Estimate: 0.19 +/- 0.05 | 0.2325 (Precise)

26

Page 27: Blinkdb

ID City Buff Ratio

1 NYC 0.78

2 NYC 0.13

3 Berkeley 0.25

4 NYC 0.19

5 NYC 0.11

6 Berkeley 0.09

7 NYC 0.18

8 NYC 0.15

9 Berkeley 0.13

10 Berkeley 0.49

11 NYC 0.19

12 Berkeley 0.10

Query Execution on Samples

What is the average buffering ratio in the table?

ID City Buff Ratio Sampling Rate

2 NYC 0.13 1/2

3 Berkeley 0.25 1/2

5 NYC 0.19 1/2

6 Berkeley 0.09 1/2

8 NYC 0.18 1/2

12 Berkeley 0.49 1/2

Uniform Sample

Estimate: 0.22 +/- 0.02 (earlier 1/4 sample: 0.19 +/- 0.05) | 0.2325 (Precise)

27
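The slides' uniform-sample estimates can be reproduced with a short sketch. The Buff Ratio values come from the 12-row table above; the 1.96 quantile for a ~95% CLT interval is a standard statistical assumption, not something the slides specify.

```python
import math
import random

# Buff Ratio column from the slide's 12-row table.
ratios = [0.78, 0.13, 0.25, 0.19, 0.11, 0.09,
          0.18, 0.15, 0.13, 0.49, 0.19, 0.10]

def estimate_avg(sample):
    """Mean of a uniform sample plus a ~95% CLT confidence half-width."""
    n = len(sample)
    mean = sum(sample) / n
    var = sum((x - mean) ** 2 for x in sample) / (n - 1)  # sample variance
    half_width = 1.96 * math.sqrt(var / n)                # 1.96 ~ 95% quantile
    return mean, half_width

exact = sum(ratios) / len(ratios)       # 0.2325, the precise answer
random.seed(0)
sample = random.sample(ratios, 6)       # a 1/2 uniform sample, as on the slide
mean, err = estimate_avg(sample)
print(f"exact={exact:.4f} estimate={mean:.4f} +/- {err:.4f}")
```

As on the slides, growing the sample from 1/4 to 1/2 of the table shrinks the ± term, which is the speed/accuracy trade-off discussed next.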

Page 28: Blinkdb

Speed/Accuracy Trade-off

[chart: Error vs Execution Time (Sample Size); time to execute on the entire dataset ~30 mins; interactive queries ~2 sec]

28

Page 29: Blinkdb

Speed/Accuracy Trade-off

[chart: Error vs Execution Time (Sample Size), now showing a "Pre-Existing Noise" floor on the error axis; entire dataset ~30 mins; interactive queries ~2 sec]

29

Page 30: Blinkdb

Where do you want to be on the curve ?

30

Page 31: Blinkdb

Sampling Vs No Sampling on 100 Machines

[chart: Query Response Time (seconds) vs fraction of full data (10 TB), for fractions 1, 10^-1, …, 10^-5; response times ~1020, 102, 18, 13, 10, 8 s]

10x, as response time is dominated by I/O

31

Page 32: Blinkdb

Sampling Vs No Sampling

[chart: Query Response Time (seconds) vs fraction of full data, for fractions 1, 10^-1, …, 10^-5; response times ~1020, 103, 18, 13, 10, 8 s; error bars on the sampled fractions: (0.02%) (0.07%) (1.1%) (3.4%) (11%)]

32

Page 33: Blinkdb

Okay, so you can tolerate errors…

What are some of the fundamental challenges ?

What types of Sample to create ? (cannot sample everything)
This boils down to : What is our assumption on the nature of the future query workload ?

33

Page 34: Blinkdb

Usual Assumption : Future queries are SIMILAR to past queries.

What is Similarity ?
( Choosing the wrong notion has a heavy penalty : Under / Over fitting )

34

Page 35: Blinkdb

Workload Taxonomy

35

Page 36: Blinkdb

Predictable QCS

• Fits well the model of exploratory queries. (Queries are usually distinct, but most will use the same column sets.)

• What kind of videos are popular for a region ? Requires looking at data from thousands of videos and hundreds of geographical regions, yet has fixed column sets : "video titles" (for grouping) and "viewer location" (for filtering).

• Backed by empirical evidence from Conviva & Facebook. Key reason for BlinkDB's efficiency. ( Lots of work in Database theory )

36

Page 37: Blinkdb

37

Page 38: Blinkdb

BlinkDB Overview

38

Page 39: Blinkdb

What is BlinkDB ?
A framework built on Shark and Spark that …

- Creates and maintains a variety of offline samples from underlying data.

- Returns fast, approximate answers with error bars by executing queries on samples of data ( a runtime Error Latency Profile drives sample selection ).

- Verifies the correctness of the error bars that it returns at runtime.

39

Page 40: Blinkdb

1) Sample Creation

40

Page 41: Blinkdb

Building Samples for Queries

• Uniform Sampling Vs Stratified Sampling.

• Uniform sampling is inefficient for queries that compute aggregates per group :
- We could simply miss an under-represented group.
- We care about the error of each query equally; with uniform sampling we would assign more samples to the groups that are already well represented.

• Solution : make the per-group sample size deterministic rather than random. This is achieved with Stratified sampling.

41
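The contrast above can be seen in a tiny sketch. This is an illustration only: the group names, sizes and the per-group count of 25 are invented, not from the slides.

```python
import random
from collections import defaultdict

# A skewed dataset: 9900 NYC rows, only 100 Berkeley rows.
rows = [("NYC", random.random()) for _ in range(9900)] + \
       [("Berkeley", random.random()) for _ in range(100)]

def stratified_sample(rows, k):
    """Deterministically take up to k rows per group (stratified sampling)."""
    groups = defaultdict(list)
    for city, v in rows:
        groups[city].append((city, v))
    sample = []
    for members in groups.values():
        sample.extend(random.sample(members, min(k, len(members))))
    return sample

random.seed(1)
uniform = random.sample(rows, 50)    # a uniform sample may nearly miss Berkeley
strat = stratified_sample(rows, 25)  # stratified: exactly 25 rows per group here
print(sum(1 for c, _ in uniform if c == "Berkeley"),
      sum(1 for c, _ in strat if c == "Berkeley"))
```

A uniform sample of 50 rows contains less than one Berkeley row in expectation, so a per-group AVG over it is useless for that group; the stratified sample pins the per-group count.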

Page 42: Blinkdb

Some Terminology …

42

Page 43: Blinkdb

QCS to Sample On

43

Page 44: Blinkdb

What QCS to sample on ?

• Formulated as an optimization problem, where the three major factors to consider are the "sparsity" of the data, the "workload characteristics" and the "storage cost of samples".

• Sparsity : define a sparsity function as the number of groups whose size in 'T' is less than some number 'M'.

44

Page 45: Blinkdb

QCS to sample on (Contd.)…

• Workload : A query with QCS 'qj' has some unknown probability 'pj'. The best estimate of pj is the past frequency of queries with QCS qj.

• Storage Cost : Assume a simple formulation where K is the same for each group. For a set of columns ϕ, the storage cost is |S(ϕ,K)|.

45

Page 46: Blinkdb

Goal : Maximize the weighted sum of coverage, where 'coverage' for a query 'qi' given a sample is defined as the probability that a given value 'x' for the columns is also present among the rows of S(ϕi,K).

46

Page 47: Blinkdb

Optimization Problem

Optimize the following MILP (the formulation itself is not reproduced in this transcript), where 'm' ranges over all possible QCS, 'j' indexes over all queries and 'i' over all column sets.

47

Page 48: Blinkdb

How to sample ?

48

Page 49: Blinkdb

Given a known QCS …

• Compute the sample count for a group :
- K = min( n′ / D(ϕ) , |Tx| )

• Take samples as :
- For each group x, sample K rows uniformly at random without replacement, forming sample Sx.

• The entire sample S(ϕ,K) is the disjoint union of the Sx :
- If |Tx| > K, we answer based on K random tuples; otherwise we can provide an exact answer.

• For the aggregate functions AVG, SUM, COUNT and QUANTILE, K directly determines the error.

49
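Under a simple reading of the formula above (a row budget n′ split evenly across the D(ϕ) distinct groups, capped by each group's actual size |Tx|), the sample-count step might look like the sketch below. The group names and the budget are invented for illustration.

```python
from collections import Counter

def per_group_cap(qcs_values, n_budget):
    """Split a row budget n_budget evenly across the D(phi) distinct groups,
    capping each group's share at that group's actual size |T_x|."""
    sizes = Counter(qcs_values)           # |T_x| for each group x
    d = len(sizes)                        # D(phi): number of distinct groups
    k = n_budget // d
    return {g: min(k, sz) for g, sz in sizes.items()}

caps = per_group_cap(["NYC"] * 8 + ["Berkeley"] * 4, n_budget=6)
print(caps)   # each group gets min(3, |T_x|) -> {'NYC': 3, 'Berkeley': 3}
```

Groups smaller than the cap are taken whole, which is exactly the "otherwise we can provide an exact answer" case above.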

Page 50: Blinkdb

Sharing QCS

• Multiple queries with different time bounds 't' and sample sizes 'n' will share the same QCS, so we need to select an appropriate subset of the stored sample.

• We need a storage technique that allows such subsets to be identified at runtime.

50

Page 51: Blinkdb

Storage Technique

• The rows of a stratified sample S(ϕ,K) are stored sequentially according to the order of columns in ϕ.

• When Sx is spread over multiple HDFS blocks, each block contains a random subset of Sx.

• It is then enough to read any subset of the blocks comprising Sx, as long as these blocks contain the minimum number of needed records.

51

Page 52: Blinkdb

[diagram: stratified sample laid out over HDFS data blocks; Bij = Data Block (HDFS)]

52

Page 53: Blinkdb

Storage Requirement

Consider a table with 1 billion tuples and a column set with a Zipf (heavy-tailed) distribution with an exponent of 1.5. It turns out that the storage required by sample S(ϕ,K) is only 2.4% of the original table for K = 10^4, 5.2% for K = 10^5, and 11.4% for K = 10^6. This is also consistent with real-world data from Conviva & Facebook.

53

Page 54: Blinkdb

What is BlinkDB?A framework built on Shark and Spark that …

- creates and maintains a variety of samples from underlying data.

- returns fast, approximate answers by executing queries on samples of data with error bars.

- verifies the correctness of the error bars that it returns at runtime.

54

Page 55: Blinkdb

2) BlinkDB Runtime

55

Page 56: Blinkdb

Selecting a Sample

• If BlinkDB finds one or more stratified samples on a set of columns ϕi such that our query's column set q ⊆ ϕi, we pick the ϕi with the smallest number of columns.

• If no such stratified sample exists, run 'q' on in-memory subsets of all samples maintained by the system, and select those with high selectivity.
- Selectivity = number of rows selected by q / number of rows read by q

56
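The first rule above (pick the covering QCS with the fewest columns) can be sketched in a few lines; the column names are hypothetical.

```python
def pick_sample(query_cols, available_qcs):
    """Pick the stratified sample whose QCS covers the query's column set,
    preferring the QCS with the fewest columns; None if no QCS covers it."""
    q = set(query_cols)
    covering = [set(phi) for phi in available_qcs if q <= set(phi)]
    return min(covering, key=len) if covering else None

qcs = [("city",), ("city", "os"), ("city", "os", "browser")]
print(pick_sample(["city", "os"], qcs))   # -> {'city', 'os'}
print(pick_sample(["user"], qcs))         # -> None (fall back to selectivity rule)
```

The None case is where the second rule kicks in: run the query on in-memory subsets of all samples and keep the high-selectivity ones.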

Page 57: Blinkdb

Selecting the right Sample Size

• Construct an ELP (Error Latency Profile) that characterizes the rate at which the error decreases (and the time increases) with increasing sample size, by running the query on smaller samples.

• The scaling rate depends on query structure (JOINs, GROUP BYs), physical data placement and the underlying data distribution.

57

Page 58: Blinkdb

Error Profile

• Given Q's error constraints : the idea is to predict the size of the smallest sample that satisfies those constraints.

• For closed-form aggregate functions, the variance is estimated using standard closed-form formulas.

• BlinkDB also estimates query selectivity and the input data distribution by running the query on smaller subsamples.

• The number of rows needed is then calculated from these statistical error estimates.

58
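For the closed-form case, the smallest sufficient sample size for an AVG query can be inverted directly from the CLT half-width z·s/√n ≤ e. The pilot standard deviation of 0.2 and the 0.01 error target below are invented numbers, not from the slides.

```python
import math

def required_n(std_estimate, error_bound, confidence_z=1.96):
    """Smallest n whose CLT half-width z*s/sqrt(n) is <= error_bound
    (closed-form error profile for an AVG query)."""
    return math.ceil((confidence_z * std_estimate / error_bound) ** 2)

# e.g. a pilot run on a small subsample estimated s ~= 0.2;
# target: error within 0.01 at ~95% confidence (z = 1.96).
print(required_n(0.2, 0.01))   # -> 1537
```

This is the "predict the smallest sample" step: run a cheap pilot to estimate s, then solve for n instead of searching.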

Page 59: Blinkdb

59

Page 60: Blinkdb

Latency Profile

• Given Q's time constraints : the idea is to predict the maximum sample size that we can run the query on within those constraints.

• The value of 'n' depends on the input data, physical placement on disk, query structure and available resources. As a simplification, BlinkDB predicts 'n' by assuming latency scales linearly in input size.

• For very small in-memory samples : BlinkDB runs a few smaller samples until performance starts to grow linearly, and then estimates the linear scaling constants.

60
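The linear-scaling simplification above can be sketched as: fit t ≈ a·n + b from a few pilot runs, then invert for the time budget. The pilot (rows, seconds) pairs and the 5-second budget are invented numbers.

```python
def fit_linear(points):
    """Least-squares fit of t ~= a*n + b from (sample_size, latency) pilots."""
    n = len(points)
    sx = sum(x for x, _ in points); sy = sum(y for _, y in points)
    sxx = sum(x * x for x, _ in points); sxy = sum(x * y for x, y in points)
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - a * sx) / n
    return a, b

def max_rows_within(points, time_budget):
    """Largest sample size whose predicted latency fits the time budget."""
    a, b = fit_linear(points)
    return int(round((time_budget - b) / a))

pilots = [(1000, 0.5), (2000, 0.9), (4000, 1.7)]   # (rows, seconds) pilot runs
print(max_rows_within(pilots, time_budget=5.0))    # -> 12250
```

The intercept b captures fixed startup cost, which is exactly why latency only "starts to grow linearly" once samples are big enough.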

Page 61: Blinkdb

Correcting Bias

• Running a query on a non-uniform sample introduces a certain amount of statistical bias, and with it subtle inaccuracies.

• Solution : a low-priority background task periodically (daily) re-samples the original data, and these fresh samples replace the ones in use.

61

Page 62: Blinkdb

Error Estimation

Closed-Form Aggregate Functions
- Central Limit Theorem
- Applicable to AVG, COUNT, SUM, VARIANCE and STDEV

62

Page 63: Blinkdb

Error Estimation

Closed-Form Aggregate Functions
- Central Limit Theorem
- Applicable to AVG, COUNT, SUM, VARIANCE and STDEV

Generalized Aggregate Functions
- Statistical Bootstrap
- Applicable to complex and nested queries, UDFs, joins etc.
- Very computationally expensive.

63
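The bootstrap branch can be sketched generically: resample the sample with replacement, re-run the aggregate on each resample, and take the spread of the results as the error. The data, the trimmed-mean UDF and the trial count are invented for illustration; a real system would run the resamples in parallel rather than in a loop.

```python
import random
import statistics

def bootstrap_error(sample, aggregate, trials=200):
    """Std deviation of an arbitrary aggregate over with-replacement
    resamples of the sample: an error estimate that needs no closed form."""
    n = len(sample)
    stats = [aggregate([random.choice(sample) for _ in range(n)])
             for _ in range(trials)]
    return statistics.stdev(stats)

random.seed(42)
data = [random.gauss(10, 2) for _ in range(500)]
# works for aggregates with no closed-form error, e.g. a trimmed-mean UDF
trimmed = lambda xs: statistics.mean(sorted(xs)[25:-25])
err = bootstrap_error(data, trimmed)
print(f"bootstrap error: {err:.3f}")
```

The "very computationally expensive" bullet is visible here: each error estimate costs trials × n aggregate evaluations, which motivates the single-pass framework discussed later.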

Page 64: Blinkdb

But we are not done yet …

• Statistical procedures like the CLT and the Bootstrap operate under a set of assumptions on the query / data.

• We need some correctness verifiers!

64

Page 65: Blinkdb

What is BlinkDB?A framework built on Shark and Spark that …

- creates and maintains a variety of samples from underlying data

- returns fast, approximate answers with error bars by executing queries on samples of data

- verifies the correctness of the error bars that it returns at runtime

65

Page 66: Blinkdb

Kleiner's Diagnostics [KDD 13]

[chart: Error vs Sample Size; more data gives higher accuracy; ~300 data points give 97% accuracy]

66

Page 67: Blinkdb

300 Data Points ≈ 30K Queries for Bootstrap !

67

Page 68: Blinkdb

So in an Approximate QP :

• One query that estimates the answer.

• Hundreds of queries on resamples of the data that compute the error.

• Tens of thousands of queries to verify that this error is correct.

• BAD PERFORMANCE !

• Solution : a Single Pass Execution framework.

68

Page 69: Blinkdb

What is BlinkDB?A framework built on Shark and Spark that …

- creates and maintains a variety of samples from underlying data

- returns fast, approximate answers with error bars by executing queries on samples of data

- verifies the correctness of the error bars that it returns at runtime

69

Page 70: Blinkdb

BlinkDB Implementation

70

Page 71: Blinkdb

BlinkDB Architecture

[diagram: Command-line Shell and Thrift/JDBC clients talk to a Driver; a SQL Parser, Query Optimizer, Physical Plan, UDFs and Execution layer run on Hadoop/Spark/Presto with a Metastore, over Hadoop storage (e.g., HDFS, HBase, Presto)]

Page 72: Blinkdb

72

Page 73: Blinkdb

Implementation Changes

• Additions to the Query Language Parser.

• The parser can trigger a sample creation and maintenance module.

• A sample selection module re-writes the query and assigns it an appropriately sized sample.

• An uncertainty module modifies all pre-existing aggregation functions to return error bars and confidence intervals.

• A module periodically samples from the original data, creating new samples which are then used by the system. (Correlation + Workload Changes)

73

Page 74: Blinkdb

BlinkDB Evaluation

74

Page 75: Blinkdb

BlinkDB Vs. No Sampling

[chart, log scale: query response times for 2.5 TB served from cache vs 7.5 TB from disk]

75

Page 76: Blinkdb

Scaling BlinkDB

[chart: each query operates on 100N GB of data]

76

Page 77: Blinkdb

Response Time and Error Bounds …

[chart: 20 Conviva queries averaged over 10 runs]

77

Page 78: Blinkdb

Play with BlinkDB! https://github.com/sameeragarwal/blinkdb

78

Page 79: Blinkdb

Take Away …

• For now, the only way to escape the in-memory performance barrier is Approximate Computing.

• It has a huge role to play in exploratory data analysis.

• BlinkDB provides a framework for AQP that adds error bars and verifies them.

• Great performance on real-world workloads.

79

Page 80: Blinkdb

Personal Takeaway : Take a STATISTICS class!

80

Page 81: Blinkdb

Credits

These slides are derived from Sameer Agarwal's presentation : http://goo.gl/cvVb1X

81

Page 82: Blinkdb

Questions ? THANK YOU!

82