
Benchmark Databases for Testing Big-Data Analytics In Cloud Environments

Rong Huang, Rada Chirkova, and Yahya Fathi

ICA CON 2012, April 20, 2012

North Carolina State University

Graduate Program in Operations Research


Background

One major advantage of using computing clouds lies in their applicability to large-scale data warehousing and analytics.

Computing clouds can host very large amounts of data and provide efficient parallelized processing of complex analytics queries on the data.

Enterprise data-cloud solutions for large-scale data warehousing and analytics are highly desirable.

Our goal is to provide synthetically generated benchmark databases for testing the performance and other processing aspects of database systems in a computing-cloud environment.


Relational Storage of Data

Pos: itemID, storeID, date, amount

Items: itemID, name, category


“Give me recent total sales for all products in the Bay Area”

Q: “Give me total sales by store ID for all appliances”


Query Processing

SELECT storeID, SUM(amount)
FROM pos P, items I
WHERE P.itemID = I.itemID
  AND category = 'appliances'
GROUP BY storeID;

storeID   SUM(amount)
13357     $27,142.98
28690     $54,124.14
11561     $41,225.26
…         …
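The query can be tried on a toy database. The following is a minimal sketch using Python's built-in sqlite3 module. The table contents are hypothetical and chosen only for illustration, they are not the data behind the totals shown on the slide, and an ORDER BY clause is added so that the output order is deterministic.

```python
import sqlite3

# In-memory database with the two relations from the slides.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE items (itemID INTEGER PRIMARY KEY, name TEXT, category TEXT);
    CREATE TABLE pos (itemID INTEGER, storeID INTEGER, date TEXT, amount REAL);
""")

# Hypothetical rows, for illustration only.
conn.executemany("INSERT INTO items VALUES (?, ?, ?)", [
    (1, "toaster", "appliances"),
    (2, "blender", "appliances"),
    (3, "jacket", "clothing"),
])
conn.executemany("INSERT INTO pos VALUES (?, ?, ?, ?)", [
    (1, 13357, "2012-04-01", 30.0),
    (2, 13357, "2012-04-02", 45.5),
    (3, 13357, "2012-04-02", 80.0),
    (1, 28690, "2012-04-03", 30.0),
])

# Total sales by store ID for all appliances.
QUERY = """
    SELECT storeID, SUM(amount)
    FROM pos P, items I
    WHERE P.itemID = I.itemID
      AND category = 'appliances'
    GROUP BY storeID
    ORDER BY storeID;
"""
totals = conn.execute(QUERY).fetchall()
print(totals)
```

The clothing sale at store 13357 is excluded by the category filter, so only the two appliance sales contribute to that store's total.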

Queries and Views

Q: “Give me total sales by store ID for all appliances”

V: “total sales by store ID and by item category”

View V:

storeID   category      SUM(amount)
13357     appliances    $27,142.98
13357     clothing      $45,135.24
13357     electronics   $50,245.64
28690     appliances    $54,124.14
28690     clothing      $60,938.21
28690     electronics   $82,623.64
11561     appliances    $41,225.26
…         …             …

Answer to Q:

storeID   SUM(amount)
13357     $27,142.98
28690     $54,124.14
11561     $41,225.26
…         …
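Because V is already grouped by store ID and category, Q can be answered from V by a simple filter, without touching the base relations. The sketch below illustrates this with the rows of V that are visible on the slide. The function name answer_q_from_view is a hypothetical helper introduced here, and Decimal is used only to keep the currency amounts exact.

```python
from decimal import Decimal

# Rows of view V as shown on the slide: (storeID, category, SUM(amount)).
VIEW_V = [
    (13357, "appliances", Decimal("27142.98")),
    (13357, "clothing", Decimal("45135.24")),
    (13357, "electronics", Decimal("50245.64")),
    (28690, "appliances", Decimal("54124.14")),
    (28690, "clothing", Decimal("60938.21")),
    (28690, "electronics", Decimal("82623.64")),
    (11561, "appliances", Decimal("41225.26")),
]


def answer_q_from_view(view_rows, category):
    """Select one category from V; each (store, category) pair occurs once in V."""
    return {store: total for store, cat, total in view_rows if cat == category}


answer = answer_q_from_view(VIEW_V, "appliances")
for store, total in answer.items():
    print(store, total)
```

The filter scans only the rows of V, which is typically far smaller than the join of the base relations.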

View lattice

Views with grouping and aggregation on a given relation. Given an $n$-attribute dataset, the number of views is $2^n$. Measure the size of each view by its number of rows.

[Figure: view lattice over the attributes {a, b, c, d}. Each node is a view labeled with its number of rows; the sizes shown range from 4 to 25 rows, with {a,b,c,d} the largest at 25 rows.]
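The size of a view, measured in rows, is the number of distinct value combinations of its grouping attributes. The following minimal sketch enumerates every attribute subset of a small relation and counts those combinations. The function name view_sizes and the toy rows are hypothetical and are not the data behind the lattice in the figure.

```python
from itertools import combinations


def view_sizes(rows, attributes):
    """Map each attribute subset to the number of distinct projected rows."""
    sizes = {}
    positions = range(len(attributes))
    for k in range(len(attributes) + 1):
        for subset in combinations(positions, k):
            projected = {tuple(row[i] for i in subset) for row in rows}
            sizes[tuple(attributes[i] for i in subset)] = len(projected)
    return sizes


# Hypothetical toy relation over attributes a, b, c (one duplicate row).
toy_rows = [(0, 2, 4), (0, 2, 5), (1, 3, 4), (1, 3, 4)]
toy_sizes = view_sizes(toy_rows, ("a", "b", "c"))

for view, size in sorted(toy_sizes.items(), key=lambda item: (len(item[0]), item[0])):
    print(view, size)
```

Enumerating all subsets is exponential in the number of attributes, so this direct approach is only practical for small attribute counts.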


TPC-H Datasets

The TPC-H synthetic database generator is widely recognized as a standard benchmark database generator for data analytics.

In our work, we have found that the TPC-H benchmark has potential shortcomings when used to test the quality of algorithms developed for efficient processing of complex analytics queries.

The TPC-H dataset does not distinguish between view sizes.


The Potential Shortcomings of TPC-H datasets

The size of a great number of views is close to that of the largest view.

We would always prefer to store the largest view.

Total number    Number of views within 0.1% size    Total number    Ratio
of attributes   difference from the largest view    of views
7               52                                  128             40.60%
13              6,192                               8,192           75.60%
15              27,318                              32,768          83.40%
17              115,162                             131,072         87.90%
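The figures in the table can be checked with a few lines of arithmetic: the total number of views is 2 raised to the number of attributes, and the ratio is the near-largest count divided by that total. The sketch below also shows one possible way to count near-largest views. It assumes that "within 0.1%" means (largest - size) / largest <= 0.001, which is an interpretation and is not spelled out in the slides; the names ratio_percent and count_near_largest are hypothetical helpers.

```python
# Rows of the table: (attributes, views within 0.1% of the largest, total views, ratio in %).
TPCH_TABLE = [
    (7, 52, 128, 40.6),
    (13, 6192, 8192, 75.6),
    (15, 27318, 32768, 83.4),
    (17, 115162, 131072, 87.9),
]


def ratio_percent(near_largest, total_views):
    """Share of views close to the largest view, as a percentage with one decimal."""
    return round(100 * near_largest / total_views, 1)


def count_near_largest(sizes, per_mille=1):
    """Count sizes within per_mille/1000 of the largest size (integer arithmetic)."""
    largest = max(sizes)
    return sum(1 for s in sizes if (largest - s) * 1000 <= largest * per_mille)


# Recompute the total number of views and the ratio for every table row.
recomputed = [(n, 2 ** n, ratio_percent(near, total)) for n, near, total, _ in TPCH_TABLE]
for row in recomputed:
    print(row)
```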



Our Contribution

We define three types of synthetic datasets, which do not have the shortcomings that we have observed in the TPC-H data.

We introduce algorithms for generating all three types of datasets in any range of data sizes, which allows one to use the datasets in a variety of configurations and scales of cloud environments.

Our datasets are complementary to the TPC-H datasets in testing the processing performance of complex analytics queries in cloud environments.


The Symmetric Synthetic Datasets

$n$: total number of attributes in the dataset
$m$: number of values for each attribute
Number of rows: $m^n$

Example 1: $n = 3$, $m = 2$

Attributes and their values:

A  B  C
0  2  4
1  3  5

Number of rows: $2^3 = 8$

A  B  C
0  2  4
0  2  5
0  3  4
0  3  5
1  2  4
1  2  5
1  3  4
1  3  5
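A symmetric dataset of this kind is the Cartesian product of the attribute domains. The following is a minimal sketch, assuming that attribute i takes the m consecutive integers starting at i * m, which matches the values in Example 1. The function name symmetric_dataset is hypothetical.

```python
from itertools import product


def symmetric_dataset(n, m):
    """All m**n rows over n attributes with m distinct values each."""
    # Attribute i takes the values i*m, ..., i*m + m - 1 (as in Example 1).
    domains = [range(i * m, (i + 1) * m) for i in range(n)]
    return list(product(*domains))


rows = symmetric_dataset(3, 2)
for row in rows:
    print(row)
```

The number of rows grows as m to the power n, so for large parameters a generator that yields rows lazily would be preferable to building the full list.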


Views in the Symmetric Synthetic Datasets

The size of each $k$-attribute view over the dataset is $m^k$. The size of an ancestor is at least $m$ times the size of its descendant.

Example 1 (cont’d): $n = 3$, $m = 2$

View       Size (rows)
{A,B,C}    8
{A,B}      4
{A,C}      4
{B,C}      4
{A}        2
{B}        2
{C}        2



Symmetric Synthetic Datasets

Symmetric properties of the datasets

Significant size difference between each pair of ancestor-descendant views.

The datasets do not distinguish between the sizes of views with the same number of attributes.


Type I Non-Symmetric Synthetic Datasets

The number of values of each attribute differs: $m_1, m_2, \ldots, m_n$.

Example 2: $n = 3$; $(m_1, m_2, m_3) = (2, 3, 4)$

Attributes and their values:

A  B  C
0  2  5
1  3  6
   4  7
      8

Number of rows: $2 \cdot 3 \cdot 4 = 24$

A  B  C
0  2  5
0  2  6
0  2  7
0  2  8
…  …  …
1  4  7
1  4  8
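A Type I dataset can be produced in the same way as a Cartesian product, with one domain size per attribute. The sketch below assumes consecutive integer values numbered across attributes, which reproduces the values of Example 2. The function name type1_dataset is hypothetical.

```python
from itertools import product


def type1_dataset(value_counts):
    """Cartesian product of attribute domains with the given sizes."""
    domains, start = [], 0
    for count in value_counts:
        # Values are numbered consecutively across attributes, as in Example 2.
        domains.append(range(start, start + count))
        start += count
    return list(product(*domains))


rows = type1_dataset([2, 3, 4])
print(len(rows))
print(rows[:4], rows[-2:])
```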


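Views in the Type I Non-Symmetric Synthetic Datasets

A $k$-attribute view over attributes $A_{i_1}, A_{i_2}, \ldots, A_{i_k}$ with $m_{i_1}, m_{i_2}, \ldots, m_{i_k}$ values

The size (number of rows): $m_{i_1} \cdot m_{i_2} \cdots m_{i_k}$

Example 2 (cont’d): $n = 3$; $(m_1, m_2, m_3) = (2, 3, 4)$

View       Size (rows)
{A,B,C}    24
{A,B}      6
{A,C}      8
{B,C}      12
{A}        2
{B}        3
{C}        4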
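Since every combination of values occurs in a Type I dataset, the size of a view is the product of the value counts of its attributes. The following minimal sketch computes these products for Example 2. The function name predicted_view_sizes is a hypothetical helper.

```python
from itertools import combinations
from math import prod

# Number of values per attribute in Example 2.
VALUE_COUNTS = {"A": 2, "B": 3, "C": 4}


def predicted_view_sizes(value_counts):
    """Size of each non-empty view as the product of its attributes' value counts."""
    names = list(value_counts)
    return {
        subset: prod(value_counts[name] for name in subset)
        for k in range(1, len(names) + 1)
        for subset in combinations(names, k)
    }


sizes = predicted_view_sizes(VALUE_COUNTS)
for view, size in sizes.items():
    print(view, size)
```

For these value counts, all seven view sizes are different, which illustrates how unequal value counts let a dataset distinguish between views with the same number of attributes.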



Type I Non-Symmetric Synthetic Datasets

A Type I non-symmetric synthetic dataset distinguishes between any pair of view sizes.

Relatively large difference in size between each pair of ancestor-descendant views. The size of each view is at least twice the size of its descendant.

We would always prefer to store the answer of each query.



Type II Non-Symmetric Synthetic Datasets

Objectives: break the symmetric properties and reduce the size difference between adjacent ancestor-descendant pairs of views.

Conduct an elimination procedure over the rows in a given Type I non-symmetric synthetic dataset.

For each attribute $A_i$, we conduct a two-step sub-elimination process:

Step 1: Eliminate each row with a probability chosen for $A_i$.

Step 2: For each row r that is eliminated in step 1, we also eliminate the rows in the master table with the same values as r of all attributes except $A_i$.
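The two-step process can be sketched as follows. This is one possible reading of the steps above, not the authors' implementation: the parameter elimination_probs (one probability per attribute) is hypothetical, the rule used in the slides for choosing the probabilities is not reproduced, and step 1 is assumed to draw from the rows still present in the table when an attribute is processed.

```python
import random
from itertools import product


def type2_dataset(rows, elimination_probs, seed=None):
    """Apply the two-step elimination once per attribute (assumed reading)."""
    rng = random.Random(seed)
    table = list(rows)
    for i, prob in enumerate(elimination_probs):
        # Step 1: mark each remaining row for elimination with probability prob.
        eliminated = [row for row in table if rng.random() < prob]
        # Step 2: also drop every row that agrees with a marked row on all
        # attributes except attribute i.
        doomed = {row[:i] + row[i + 1:] for row in eliminated}
        table = [row for row in table if row[:i] + row[i + 1:] not in doomed]
    return table


# Type I dataset with 2, 3, and 4 values per attribute (24 rows).
full_table = list(product(range(0, 2), range(2, 5), range(5, 9)))
sample = type2_dataset(full_table, [0.05, 0.05, 0.05], seed=0)
print(len(sample), "of", len(full_table), "rows kept")
```

Under this reading, rows are removed in whole groups that differ only in the attribute being processed, so even small per-row probabilities can remove a noticeable share of the table.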



Type II Non-Symmetric Synthetic Datasets

Input: $n$; $m_1, \ldots, m_n$; and a target number of rows $N$

Choose one elimination parameter per attribute, such that […]

Output: a Type II non-symmetric synthetic dataset, such that its expected number of rows is greater than or equal to $N$


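Type II Non-Symmetric Synthetic Datasets

Example 3: Input $n = 3$; $(m_1, m_2, m_3) = (2, 3, 4)$; and a target of 10 rows. Choose 0.9 as the parameter value.

Type I dataset (24 rows):

A  B  C      A  B  C
0  2  5      1  2  5
0  2  6      1  2  6
0  2  7      1  2  7
0  2  8      1  2  8
0  3  5      1  3  5
0  3  6      1  3  6
0  3  7      1  3  7
0  3  8      1  3  8
0  4  5      1  4  5
0  4  6      1  4  6
0  4  7      1  4  7
0  4  8      1  4  8

Resulting Type II dataset (10 rows):

A  B  C
0  2  6
0  2  8
0  3  6
0  3  8
0  4  5
0  4  8
1  2  6
1  2  7
1  3  6
1  3  7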

Views in the Type II Non-Symmetric Synthetic Datasets

Example 3 (cont’d):

View       Size (rows)
{A,B,C}    10
{A,B}      5
{A,C}      5
{B,C}      8
{A}        2
{B}        3
{C}        4
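The view sizes of Example 3 can be recomputed from the ten rows of the resulting dataset by counting distinct projections. The sketch below also reports the smallest size ratio between a view and a view with one attribute fewer, as a rough indication of how close adjacent views have become. The helper name view_size and the ratio summary are additions for illustration.

```python
from itertools import combinations

ATTRIBUTES = ("A", "B", "C")

# The ten rows of the Type II dataset from Example 3.
TYPE2_ROWS = [
    (0, 2, 6), (0, 2, 8), (0, 3, 6), (0, 3, 8), (0, 4, 5),
    (0, 4, 8), (1, 2, 6), (1, 2, 7), (1, 3, 6), (1, 3, 7),
]


def view_size(rows, subset):
    """Number of distinct value combinations of the attributes in subset."""
    idx = [ATTRIBUTES.index(name) for name in subset]
    return len({tuple(row[i] for i in idx) for row in rows})


sizes = {
    subset: view_size(TYPE2_ROWS, subset)
    for k in range(1, len(ATTRIBUTES) + 1)
    for subset in combinations(ATTRIBUTES, k)
}

# Size ratio between each view and each view obtained by dropping one attribute.
ratios = {
    (parent, child): sizes[parent] / sizes[child]
    for parent in sizes
    if len(parent) > 1
    for child in combinations(parent, len(parent) - 1)
}
smallest_ratio = min(ratios.values())
print(sizes)
print(smallest_ratio)
```

In the Type I dataset of Example 2 every such ratio is at least 2, whereas here the smallest ratio is below 2, for instance between {A,B,C} with 10 rows and {B,C} with 8 rows.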

Experimental Results

[Figure: a performance measure (vertical axis, 0.00% to 600.00%) plotted against β (horizontal axis, 0 to 1) for the Type I non-symmetric dataset and the TPC-H dataset.]



Conclusion

We define a symmetric synthetic dataset and two types of non-symmetric synthetic datasets.

We study shortcomings of the TPC-H datasets in testing algorithms devised for improving query-processing performance for complex queries posed on large-scale data.

We compare the TPC-H datasets experimentally with our proposed synthetic datasets in a setting for testing such algorithms.

All the synthetic datasets that we propose in this paper are beneficial for testing algorithms devised for improving query-processing performance in cloud computing.



Thank You!
