Benchmark Databases for Testing Big-Data Analytics In Cloud Environments

Rong Huang, Rada Chirkova, Yahya Fathi

ICA CON 2012, April 20, 2012

North Carolina State University, Graduate Program in Operations Research




Background

One major advantage of using computing clouds lies in their applicability to large-scale data warehousing and analytics.

Computing clouds can host very large amounts of data and provide efficient parallelized processing of complex analytics queries on the data.

Enterprise data-cloud solutions for large-scale data warehousing and analytics are highly desirable.

Our goal is to provide synthetically generated benchmark databases for testing the performance and other processing aspects of database systems in a computing-cloud environment.

Rong Huang, Rada Chirkova, and Yahya Fathi Benchmark Databases for Testing Big-Data Analytics In Cloud Environments

Relational Storage of Data

Pos: itemID, storeID, date, amount

Items: itemID, name, category

Query Processing

“Give me recent total sales for all products in the Bay Area”

Q: “Give me total sales by store ID for all appliances”

Query Processing (cont’d)

SELECT storeID, SUM(amount)
FROM pos P, items I
WHERE P.itemID = I.itemID
  AND category = 'appliances'
GROUP BY storeID;

storeID   SUM(amount)
13357     $27,142.98
28690     $54,124.14
11561     $41,225.26
…         …
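The query for Q can be tried out on a toy database. The following is a minimal sketch using Python's built-in sqlite3 module; the sample rows, the integer amounts, and the helper names `build_toy_database` and `total_appliance_sales_by_store` are illustrative assumptions, not data or code from the talk.

```python
import sqlite3

def build_toy_database():
    """Create an in-memory database with the Pos and Items relations (made-up rows)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE items (itemID INTEGER PRIMARY KEY, name TEXT, category TEXT)")
    conn.execute("CREATE TABLE pos (itemID INTEGER, storeID INTEGER, date TEXT, amount INTEGER)")
    conn.executemany("INSERT INTO items VALUES (?, ?, ?)", [
        (1, "toaster", "appliances"),
        (2, "blender", "appliances"),
        (3, "jacket", "clothing"),
    ])
    conn.executemany("INSERT INTO pos VALUES (?, ?, ?, ?)", [
        (1, 13357, "2012-04-01", 40),
        (2, 13357, "2012-04-02", 60),
        (3, 13357, "2012-04-02", 120),  # clothing sale, excluded by the filter
        (1, 28690, "2012-04-03", 40),
    ])
    return conn

def total_appliance_sales_by_store(conn):
    """Run query Q: total sales by store ID for all appliances."""
    cur = conn.execute(
        """
        SELECT P.storeID, SUM(P.amount)
        FROM pos P, items I
        WHERE P.itemID = I.itemID
          AND I.category = 'appliances'
        GROUP BY P.storeID
        ORDER BY P.storeID
        """
    )
    return cur.fetchall()
```

The ORDER BY clause is added only to make the output order predictable; it is not part of the query on the slide.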

Queries and Views

Q: “Give me total sales by store ID for all appliances”

V: “total sales by store ID and by item category”

View V:
storeID   category      SUM(amount)
13357     appliances    $27,142.98
13357     clothing      $45,135.24
13357     electronics   $50,245.64
28690     appliances    $54,124.14
28690     clothing      $60,938.21
28690     electronics   $82,623.64
11561     appliances    $41,225.26
…         …             …

Answer to Q:
storeID   SUM(amount)
13357     $27,142.98
28690     $54,124.14
11561     $41,225.26
…         …
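To illustrate how the stored view V can answer Q without touching the base relations, here is a small sketch. It holds only the rows of V that are visible on the slide, and the names `VIEW_V` and `answer_q_from_view` are hypothetical.

```python
from collections import defaultdict
from decimal import Decimal

# Visible rows of view V: (storeID, category, SUM(amount)).
VIEW_V = [
    (13357, "appliances", Decimal("27142.98")),
    (13357, "clothing", Decimal("45135.24")),
    (13357, "electronics", Decimal("50245.64")),
    (28690, "appliances", Decimal("54124.14")),
    (28690, "clothing", Decimal("60938.21")),
    (28690, "electronics", Decimal("82623.64")),
    (11561, "appliances", Decimal("41225.26")),
]

def answer_q_from_view(view_rows, category):
    """Total sales by store for one category, computed from the aggregated view only."""
    totals = defaultdict(Decimal)  # Decimal() is 0, so missing stores start at zero
    for store_id, row_category, total in view_rows:
        if row_category == category:
            totals[store_id] += total
    return dict(totals)
```

Because V is already grouped by store and category, answering Q reduces to a selection on category; the summation only matters for queries that group more coarsely than V.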

View lattice

Views with grouping and aggregation on a given relation. Given a $K$-attribute dataset, the number of views is $2^K$. Measure the size of each view by its number of rows.

[Figure: part of the view lattice over attributes a, b, c, d, with the size of each view in rows.
Top view {a,b,c,d}: 25
Three-attribute views {a,b,c}, {a,b,d}, {a,c,d}, {b,c,d}: 13, 20, 15, 16
Two-attribute views {a,b}, {a,c}, {b,c}, {b,d}, {c,d}: 7, 12, 10, 8, 9
One-attribute views {b}, {c}: 4, 5]
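The lattice structure can be sketched in a few lines. The code below enumerates every group-by view of a relation, lists the direct descendants of a view, and measures a view's size as its number of distinct value combinations. The function names and the toy rows in the usage note are assumptions made for illustration.

```python
from itertools import combinations

def all_views(attributes):
    """Every subset of the attributes (including the empty one): 2^K views in total."""
    return [
        frozenset(combo)
        for k in range(len(attributes) + 1)
        for combo in combinations(attributes, k)
    ]

def children(view):
    """Direct descendants in the lattice: drop exactly one grouping attribute."""
    return [view - {a} for a in sorted(view)]

def view_size(rows, attributes, view):
    """Number of rows of the group-by view = number of distinct projected tuples."""
    idx = [attributes.index(a) for a in sorted(view)]
    return len({tuple(row[i] for i in idx) for row in rows})
```

For example, with `attributes = ["a", "b", "c", "d"]`, `all_views(attributes)` has 16 entries, and a view is never larger than any of its ancestors, since projecting onto fewer attributes can only merge rows.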


TPC-H Datasets

The TPC-H synthetic database generator is widely recognized as a standard benchmark database generator for data analytics.

In our work, we have found that the TPC-H benchmark has potential shortcomings when used to test the quality of algorithms developed for efficient processing of complex analytics queries.

The TPC-H dataset does not distinguish between view sizes.

The Potential Shortcomings of TPC-H datasets

The size of a great number of views is close to that of the largest view.

We would always prefer to store the largest view.

Total number    Number of views within 0.1% size    Total number    Ratio
of attributes   difference from the largest view    of views
7               52                                  128             40.60%
13              6,192                               8,192           75.60%
15              27,318                              32,768          83.40%
17              115,162                             131,072         87.90%
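The ratios in the table can be rechecked with a short script. The sketch below also includes a helper that computes the same statistic from a list of view sizes; the name `share_near_largest` and the reading of "within 0.1% size difference" as `largest - size <= 0.001 * largest` are assumptions.

```python
# (number of attributes, views within 0.1% of the largest view, total views, reported ratio in %)
TABLE = [
    (7, 52, 128, 40.6),
    (13, 6192, 8192, 75.6),
    (15, 27318, 32768, 83.4),
    (17, 115162, 131072, 87.9),
]

def share_near_largest(view_sizes, tolerance=0.001):
    """Fraction of views whose size is within `tolerance` (relative) of the largest view."""
    largest = max(view_sizes)
    near = sum(1 for size in view_sizes if largest - size <= tolerance * largest)
    return near / len(view_sizes)

def recomputed_ratios(table):
    """Ratio column recomputed from the two count columns, in percent."""
    return [100.0 * near / total for _, near, total, _ in table]
```

Each total in the table equals 2 raised to the number of attributes, and the recomputed ratios agree with the reported ones to one decimal place.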

[Figure: values shown are 300,000; 300,000; 300,000; 300; 50]

Our Contribution

We define three types of synthetic datasets, which do not have the shortcomings that we have observed in the TPC-H data.

We introduce algorithms for generating all three types of datasets in any range of data sizes, which allows one to use the datasets in a variety of configurations and scales of cloud environments.

Our datasets are complementary to the TPC-H datasets in testing the processing performance of complex analytics queries in cloud environments.

The Symmetric Synthetic Datasets

$K$: total number of attributes in the dataset
$n$: number of values for each attribute
Number of rows: $n^K$

Example 1: $K = 3$, $n = 2$

Attribute values:
A  B  C
0  2  4
1  3  5

Number of rows: $2^3 = 8$

A  B  C
0  2  4
0  2  5
0  3  4
0  3  5
1  2  4
1  2  5
1  3  4
1  3  5
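A symmetric dataset of this kind is the full cross product of the attribute domains. The sketch below is one way to generate it; the function name and the choice of consecutive integer values per attribute (which happens to match Example 1) are assumptions, not the authors' generator.

```python
from itertools import product

def symmetric_dataset(num_attributes, values_per_attribute):
    """All combinations of values; attribute i takes the values i*n .. (i+1)*n - 1."""
    n = values_per_attribute
    domains = [range(i * n, (i + 1) * n) for i in range(num_attributes)]
    return list(product(*domains))
```

Calling `symmetric_dataset(3, 2)` yields the eight rows of Example 1 in the order shown on the slide.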

Views in the Symmetric Synthetic Datasets

The size of each $k$-attribute view over a symmetric dataset with $n$ values per attribute is $n^k$. The size of an ancestor is at least $n$ times the size of its descendant.

Example 1 (cont’d): $K = 3$, $n = 2$

View lattice with view sizes (number of rows):
{A,B,C}: 8
{A,B}: 4    {A,C}: 4    {B,C}: 4
{A}: 2      {B}: 2      {C}: 2
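The two size properties follow directly from the cross-product construction. As a short worked derivation, writing $V_k$ for any view with $k$ grouping attributes (a notation assumed here for illustration):

```latex
\[
|V_k| = \underbrace{n \cdot n \cdots n}_{k \text{ factors}} = n^k,
\qquad
\frac{|V_j|}{|V_k|} = n^{\,j-k} \ge n \quad \text{for } j > k .
\]
```

With $n = 2$ this gives sizes 2, 4, and 8 for views with one, two, and three attributes, as in the lattice of Example 1.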


Symmetric Synthetic Datasets

Symmetric properties of the datasets

Significant size difference between each pair of ancestor-descendant views.

The datasets do not distinguish between the sizes of views with the same number of attributes.

Type I Non-Symmetric Synthetic Datasets

The number of values of each attribute differs: $n_1, n_2, \ldots, n_K$

Example 2: $K = 3$; $n_1, n_2, n_3 = 2, 3, 4$

Attribute values:
A  B  C
0  2  5
1  3  6
   4  7
      8

Number of rows: $2 \cdot 3 \cdot 4 = 24$

A  B  C
0  2  5
0  2  6
0  2  7
0  2  8
⋮  ⋮  ⋮
1  4  7
1  4  8

Views in the Type I Non-Symmetric Synthetic Datasets

A $k$-attribute view $\{A_{i_1}, A_{i_2}, \ldots, A_{i_k}\}$ with $n_{i_1}, n_{i_2}, \ldots, n_{i_k}$ values

The size (number of rows): $n_{i_1} \cdot n_{i_2} \cdots n_{i_k}$

Example 2 (cont’d): $K = 3$; $n_1, n_2, n_3 = 2, 3, 4$

View lattice with view sizes (number of rows):
{A,B,C}: 24
{A,B}: 6    {A,C}: 8    {B,C}: 12
{A}: 2      {B}: 3      {C}: 4
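The product formula for view sizes can be checked against a generated dataset. In the sketch below, `type1_dataset`, `counted_view_size`, and `formula_view_size` are hypothetical helper names, views are given as tuples of attribute positions, and the consecutive integer domains are an assumption chosen to match Example 2.

```python
from itertools import combinations, product
from math import prod

def type1_dataset(domain_sizes):
    """Cross product of consecutive integer domains, e.g. [2, 3, 4] -> 0..1, 2..4, 5..8."""
    domains, start = [], 0
    for n in domain_sizes:
        domains.append(range(start, start + n))
        start += n
    return list(product(*domains))

def counted_view_size(rows, view):
    """Size of a view measured on the data: distinct projections onto the view's positions."""
    return len({tuple(row[i] for i in view) for row in rows})

def formula_view_size(domain_sizes, view):
    """Size of a view from the closed form: product of the domain sizes of its attributes."""
    return prod(domain_sizes[i] for i in view)

def sizes_agree(domain_sizes):
    """True if the counted size equals the product formula for every view."""
    rows = type1_dataset(domain_sizes)
    positions = range(len(domain_sizes))
    return all(
        counted_view_size(rows, view) == formula_view_size(domain_sizes, view)
        for k in range(len(domain_sizes) + 1)
        for view in combinations(positions, k)
    )
```

For the domain sizes of Example 2, `sizes_agree([2, 3, 4])` holds, and the two-attribute views have sizes 6, 8, and 12.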


Type I Non-Symmetric Synthetic Datasets

A type I non-symmetric synthetic dataset distinguishes between any pair of view sizes.

Relatively large difference in size between each pair of ancestor-descendant views. The size of each view is at least twice the size of its descendant.

We would always prefer to store the answer to each query.


Type II Non-Symmetric Synthetic Datasets

Objectives: break the symmetric properties and reduce the size difference between adjacent ancestor-descendant pairs of views.

Conduct an elimination procedure over the rows of a given type I non-symmetric synthetic dataset.

For each attribute $A_i$, we conduct a two-step sub-elimination process:

Step 1: Eliminate each row with a given probability.

Step 2: For each row r that is eliminated in step 1, also eliminate the rows in the master table that have the same values as r on all attributes except $A_i$.
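The two-step procedure can be sketched as follows. This is one possible reading, not the authors' implementation: it assumes the master table is the table being processed, that the attributes are handled one after another, and that each attribute has its own step 1 elimination probability `elim_probs[i]` (a hypothetical parameter; the parameterization used in the talk is not reproduced here).

```python
import itertools
import random

def type1_rows(domain_sizes):
    """Cross product of consecutive integer domains, e.g. [2, 3, 4] -> 0..1, 2..4, 5..8."""
    domains, start = [], 0
    for n in domain_sizes:
        domains.append(range(start, start + n))
        start += n
    return list(itertools.product(*domains))

def eliminate(rows, elim_probs, rng):
    """Apply the two-step sub-elimination once per attribute (assumed reading)."""
    remaining = set(rows)
    for i, q in enumerate(elim_probs):
        # Step 1: pick each remaining row independently with probability q.
        picked = [r for r in sorted(remaining) if rng.random() < q]
        # Step 2: drop every row that matches a picked row on all attributes
        # except attribute i (this also drops the picked rows themselves).
        keys = {r[:i] + r[i + 1:] for r in picked}
        remaining = {r for r in remaining if r[:i] + r[i + 1:] not in keys}
    return sorted(remaining)
```

A call such as `eliminate(type1_rows([2, 3, 4]), [0.1, 0.1, 0.1], random.Random(42))` returns a subset of the 24 type I rows; passing a seeded `random.Random` keeps the run reproducible.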


Type II Non-Symmetric Synthetic Datasets

Input: $K$; $n_1, \ldots, n_K$; and a lower bound on the expected number of rows

Choose , … , , such that

Output: a type II non-symmetric synthetic dataset whose expected number of rows is greater than or equal to the lower bound given as input

Type II Non-Symmetric Synthetic Datasets

Example 3: Input $K = 3$; $n_1, n_2, n_3 = 2, 3, 4$; and 10. Choose 0.9.

Type I dataset (24 rows)      Resulting type II dataset (10 rows)
A  B  C      A  B  C          A  B  C
0  2  5      1  2  5          0  2  6
0  2  6      1  2  6          0  2  8
0  2  7      1  2  7          0  3  6
0  2  8      1  2  8          0  3  8
0  3  5      1  3  5          0  4  5
0  3  6      1  3  6          0  4  8
0  3  7      1  3  7          1  2  6
0  3  8      1  3  8          1  2  7
0  4  5      1  4  5          1  3  6
0  4  6      1  4  6          1  3  7
0  4  7      1  4  7
0  4  8      1  4  8

Views in the Type II Non-Symmetric Synthetic Datasets

Example 3 (cont’d):

View lattice with view sizes (number of rows):
{A,B,C}: 10
{A,B}: 5    {A,C}: 5    {B,C}: 8
{A}: 2      {B}: 3      {C}: 4
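The view sizes in this lattice can be recomputed from the ten rows of Example 3, together with the size ratio of each adjacent ancestor-descendant pair. The helper names below are assumptions made for illustration.

```python
from itertools import combinations

ATTRIBUTES = ("A", "B", "C")

# The ten rows of the type II dataset in Example 3.
EXAMPLE_3_ROWS = [
    (0, 2, 6), (0, 2, 8), (0, 3, 6), (0, 3, 8), (0, 4, 5),
    (0, 4, 8), (1, 2, 6), (1, 2, 7), (1, 3, 6), (1, 3, 7),
]

def lattice_sizes(rows, attributes):
    """Size (distinct projected rows) of every non-empty view, keyed by attribute set."""
    sizes = {}
    for k in range(1, len(attributes) + 1):
        for idx in combinations(range(len(attributes)), k):
            view = frozenset(attributes[i] for i in idx)
            sizes[view] = len({tuple(row[i] for i in idx) for row in rows})
    return sizes

def adjacent_ratios(sizes):
    """Size ratio for every ancestor/descendant pair that differs by exactly one attribute."""
    return {
        (parent, child): sizes[parent] / sizes[child]
        for parent in sizes
        for child in sizes
        if child < parent and len(parent) - len(child) == 1
    }
```

On these rows the smallest adjacent ratio is 1.25 (for example, {A,B,C} over {B,C} is 10/8), compared with a minimum of 2 in the type I dataset of Example 2.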

Experimental Results

[Figure: a performance measure (vertical axis, 0.00% to 600.00%) plotted against β (horizontal axis, 0 to 1) for the Type I non-symmetric dataset and the TPC-H dataset.]


Conclusion

We define a symmetric synthetic dataset and two types of non-symmetric synthetic datasets.

We studied shortcomings of the TPC-H datasets in testing algorithms devised for improving query-processing performance for complex queries posed on large-scale data.

We compare these datasets experimentally with our proposed synthetic datasets in a setting for testing such algorithms.

All the synthetic datasets that we proposed in this paper are beneficial for testing algorithms devised for improving query-processing performance in cloud computing.


Thank You!
