Benchmark Databases for Testing Big-Data Analytics In Cloud Environments
Rong Huang, Rada Chirkova, Yahya Fathi
ICA CON 2012, April 20, 2012
North Carolina State University
Graduate Program in Operations Research
Background
One major advantage of using computing clouds lies in their applicability to large-scale data warehousing and analytics.
Computing clouds can host very large amounts of data and provide efficient parallelized processing of complex analytics queries on the data.
Enterprise data-cloud solutions for large-scale data warehousing and analytics are highly desirable.
Our goal is to provide synthetically generated benchmark databases for testing the performance and other processing aspects of database systems in a computing-cloud environment.
Rong Huang, Rada Chirkova, and Yahya Fathi Benchmark Databases for Testing Big-Data Analytics In Cloud Environments
Relational Storage of Data
Pos: itemID, storeID, date, amount
Items: itemID, name, category
Query Processing
“Give me recent total sales for all products in the Bay Area”
Q: “Give me total sales by store ID for all appliances”

SELECT storeID, SUM(amount)
FROM pos P, items I
WHERE P.itemID = I.itemID
  AND category = 'appliances'
GROUP BY storeID;

storeID   SUM(amount)
13357     $27,142.98
28690     $54,124.14
11561     $41,225.26
…         …
Queries and Views
Q: “Give me total sales by store ID for all appliances”
V: “total sales by store ID and by item category”
Result of V:
storeID   category      SUM(amount)
13357     appliances    $27,142.98
13357     clothing      $45,135.24
13357     electronics   $50,245.64
28690     appliances    $54,124.14
28690     clothing      $60,938.21
28690     electronics   $82,623.64
11561     appliances    $41,225.26
…         …             …

Result of Q:
storeID   SUM(amount)
13357     $27,142.98
28690     $54,124.14
11561     $41,225.26
…         …
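
The relationship between Q and V can be sketched in a few lines: once V is stored, Q can be answered by filtering V instead of joining and aggregating the base tables again. The sketch below uses Python's built-in `sqlite3` module. Only the `pos` and `items` schemas come from the slides; the toy rows and the stored-view name `v` are hypothetical and chosen for illustration.

```python
import sqlite3

# Hypothetical toy data; only the two table schemas come from the slides.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE items (itemID INTEGER, name TEXT, category TEXT);
    CREATE TABLE pos (itemID INTEGER, storeID INTEGER, date TEXT, amount REAL);
    INSERT INTO items VALUES (1, 'toaster', 'appliances'),
                             (2, 'jacket',  'clothing'),
                             (3, 'blender', 'appliances');
    INSERT INTO pos VALUES (1, 13357, '2012-04-01', 40.0),
                           (3, 13357, '2012-04-02', 60.0),
                           (2, 13357, '2012-04-02', 80.0),
                           (1, 28690, '2012-04-03', 45.0),
                           (2, 28690, '2012-04-03', 90.0);
    -- V: total sales by store ID and by item category, stored as a table.
    CREATE TABLE v AS
        SELECT storeID, category, SUM(amount) AS total
        FROM pos P, items I
        WHERE P.itemID = I.itemID
        GROUP BY storeID, category;
""")

# Q evaluated directly on the base tables.
direct = conn.execute("""
    SELECT storeID, SUM(amount)
    FROM pos P, items I
    WHERE P.itemID = I.itemID AND category = 'appliances'
    GROUP BY storeID
    ORDER BY storeID
""").fetchall()

# Q evaluated on the stored view V: a filter, with no join or aggregation.
from_view = conn.execute("""
    SELECT storeID, total
    FROM v
    WHERE category = 'appliances'
    ORDER BY storeID
""").fetchall()

print(direct == from_view)
```

Both evaluations return the same rows here because V groups by a superset of the attributes that Q needs.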
View lattice
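Views with grouping and aggregation on a given relation.
Given a $K$-attribute dataset, the number of views is $2^K$.
Measure the size of each view by its number of rows.
[Figure: part of the view lattice over attributes {a, b, c, d}, each view labeled with its size]
4-attribute view {a,b,c,d}; size shown: 25
3-attribute views {a,b,c}, {a,b,d}, {a,c,d}, {b,c,d}; sizes shown: 13, 20, 15, 16
2-attribute views {a,b}, {a,c}, {b,c}, {b,d}, {c,d}; sizes shown: 7, 12, 10, 8, 9
1-attribute views {b}, {c}; sizes shown: 4, 5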
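
As a minimal sketch of how such a lattice can be computed, the function below enumerates every subset of the attributes and measures each group-by view by its number of distinct rows. The function name `view_lattice_sizes` and the small 4-attribute relation are hypothetical; the empty attribute set is included so that the count matches $2^K$.

```python
from itertools import combinations

def view_lattice_sizes(rows, attributes):
    """Map each attribute subset to the row count of its group-by view."""
    sizes = {}
    for k in range(len(attributes) + 1):
        for idxs in combinations(range(len(attributes)), k):
            # A group-by view has one row per distinct value combination.
            groups = {tuple(row[i] for i in idxs) for row in rows}
            sizes[tuple(attributes[i] for i in idxs)] = len(groups)
    return sizes

# Hypothetical 4-attribute relation, not taken from the slides.
rows = [(0, 0, 0, 0), (0, 1, 0, 1), (1, 0, 1, 0), (1, 1, 1, 1), (0, 0, 1, 1)]
sizes = view_lattice_sizes(rows, ["a", "b", "c", "d"])

print(len(sizes))  # number of views in the lattice
for view, size in sorted(sizes.items(), key=lambda kv: (len(kv[0]), kv[0])):
    print(view, size)
```

A view can never have more rows than any of its ancestors, since dropping a grouping attribute can only merge groups.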
TPC-H Datasets
The TPC-H synthetic database generator is widely recognized as a standard benchmark database generator for data analytics.
In our work, we have found that the TPC-H benchmark has potential shortcomings when used to test the quality of algorithms developed for efficient processing of complex analytics queries.
The TPC-H dataset does not distinguish between view sizes.
The Potential Shortcomings of TPC-H Datasets
The size of a great number of views is close to that of the largest view.
We would always prefer to store the largest view.

Total number    Number of views within 0.1% size        Total number    Ratio
of attributes   difference from the largest view        of views
7               52                                      128             40.60%
13              6,192                                   8,192           75.60%
15              27,318                                  32,768          83.40%
17              115,162                                 131,072         87.90%
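
The arithmetic behind the table can be checked with a short sketch: the total number of views for $K$ attributes is $2^K$, and the ratio is the count of near-largest views divided by that total. The counts are copied from the table; the helper name `near_largest_ratio` is hypothetical.

```python
# (attributes, views within 0.1% of the largest view, ratio in % from the table)
reported = [(7, 52, 40.60), (13, 6192, 75.60), (15, 27318, 83.40), (17, 115162, 87.90)]

def near_largest_ratio(num_attributes, near_largest):
    """Return (total number of views, percentage of near-largest views)."""
    total_views = 2 ** num_attributes
    return total_views, 100.0 * near_largest / total_views

results = []
for k, near, ratio in reported:
    total, computed = near_largest_ratio(k, near)
    results.append((k, total, computed, ratio))
    print(f"K={k}: {near} of {total} views, computed {computed:.2f}%, reported {ratio:.2f}%")
```

The computed percentages agree with the reported ones to within rounding.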
[Figure: example view sizes of 300,000, 300,000, 300,000, 300, and 50]
Our Contribution
We define three types of synthetic datasets, which do not have the shortcomings that we have observed in the TPC-H data.
We introduce algorithms for generating all three types of datasets in any range of data sizes, which allows the datasets to be used in a variety of configurations and scales of cloud environments.
Our datasets are complementary to the TPC-H datasets in testing the processing performance of complex analytics queries in cloud environments.
The Symmetric Synthetic Datasets
$K$: total number of attributes in the dataset
$n$: number of values for each attribute
Number of rows: $n^K$
Example 1: $K = 3$, $n = 2$, dataset $D(3, 2)$
Attributes:
A B C
0 2 4
1 3 5
Number of rows: $2^3 = 8$
$D(3, 2)$:
A B C
0 2 4
0 2 5
0 3 4
0 3 5
1 2 4
1 2 5
1 3 4
1 3 5
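
A symmetric dataset of this kind is the Cartesian product of the attribute domains, so a minimal generator is short. The function name `symmetric_dataset` is hypothetical, and the value numbering (attribute number i takes the next block of values) is an assumption read off Example 1 rather than a stated rule.

```python
from itertools import product

def symmetric_dataset(num_attributes, num_values):
    """All combinations of `num_values` distinct values for each attribute."""
    # Assumed numbering, following Example 1: attribute i (0-based) takes
    # the values i*num_values, ..., (i+1)*num_values - 1.
    domains = [range(i * num_values, (i + 1) * num_values)
               for i in range(num_attributes)]
    return list(product(*domains))

rows = symmetric_dataset(3, 2)
for row in rows:
    print(*row)
```

For 3 attributes with 2 values each, this reproduces the rows of Example 1 in the same order.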
Views in the Symmetric Synthetic Datasets
The size of each $k$-attribute view over $D(K, n)$ is $n^k$.
The size of an ancestor is at least $n$ times the size of its descendant.
Example 1 (cont’d): $D(3, 2)$
[View lattice, each view labeled with its size]
{A,B,C}: 8
{A,B}: 4   {A,C}: 4   {B,C}: 4
{A}: 2   {B}: 2   {C}: 2
Symmetric Synthetic Datasets
Symmetric properties of the datasets
Significant size difference between each pair of ancestor-descendant views.
The datasets do not distinguish between the sizes of views with the same number of attributes.
Type I Non-Symmetric Synthetic Datasets $D(K; n_1, \ldots, n_K)$
The number of values of each attribute differs: $n_1, n_2, \ldots, n_K$
Example 2: $K = 3$; $n_1 = 2$, $n_2 = 3$, $n_3 = 4$, dataset $D(3; 2, 3, 4)$
Attributes:
A B C
0 2 5
1 3 6
  4 7
    8
Number of rows: $2 \cdot 3 \cdot 4 = 24$
$D(3; 2, 3, 4)$:
A B C
0 2 5
0 2 6
0 2 7
0 2 8
. . .
1 4 7
1 4 8
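
For large datasets it can help to produce any row directly from its position, without materializing the whole table. The sketch below does this with mixed-radix decoding of the row index. The function name `type1_row` and the value numbering (each attribute takes the next block of consecutive integers, as in Example 2) are assumptions for illustration, not the authors' stated algorithm.

```python
from math import prod

def type1_row(index, value_counts):
    """Return row number `index` (0-based) of the dataset without building it."""
    # Assumed numbering, following Example 2: attribute i starts at the
    # sum of the value counts of the attributes before it.
    offsets = [sum(value_counts[:i]) for i in range(len(value_counts))]
    digits = []
    for n in reversed(value_counts):
        # The last attribute varies fastest, as in the example table.
        index, digit = divmod(index, n)
        digits.append(digit)
    digits.reverse()
    return tuple(offset + digit for offset, digit in zip(offsets, digits))

value_counts = [2, 3, 4]
rows = [type1_row(i, value_counts) for i in range(prod(value_counts))]
print(rows[0], rows[3], rows[-1])
```

Because each row depends only on its index, disjoint index ranges can be generated independently, for example by separate workers.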
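Views in the Type I Non-Symmetric Synthetic Datasets
A $k$-attribute view $\{A_{i_1}, A_{i_2}, \ldots, A_{i_k}\}$ with $n_{i_1}, n_{i_2}, \ldots, n_{i_k}$ values
The size (number of rows): $n_{i_1} \cdot n_{i_2} \cdots n_{i_k}$
Example 2 (cont’d): $D(3; 2, 3, 4)$
[View lattice, each view labeled with its size]
{A,B,C}: 24
{A,B}: 6   {A,C}: 8   {B,C}: 12
{A}: 2   {B}: 3   {C}: 4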
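
The product formula makes the view sizes computable without touching the data. The sketch below applies it to Example 2 and also reports the smallest size ratio between a view and a descendant with one attribute fewer. The function name `type1_view_sizes` is hypothetical.

```python
from itertools import combinations
from math import prod

def type1_view_sizes(value_counts, names):
    """Size of every non-empty view as the product of its attributes' value counts."""
    sizes = {}
    for k in range(1, len(names) + 1):
        for idxs in combinations(range(len(names)), k):
            view = frozenset(names[i] for i in idxs)
            sizes[view] = prod(value_counts[i] for i in idxs)
    return sizes

sizes = type1_view_sizes([2, 3, 4], "ABC")

# Smallest ratio between a view and a descendant with exactly one attribute fewer.
min_ratio = min(sizes[parent] / sizes[child]
                for parent in sizes for child in sizes
                if child < parent and len(parent) == len(child) + 1)

for view, size in sorted(sizes.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print("".join(sorted(view)), size)
print(min_ratio)
```

Dropping one attribute divides the size by that attribute's number of values, so the ratio is never below the smallest value count.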
Type I Non-Symmetric Synthetic Datasets
A type I non-symmetric synthetic dataset distinguishes between any pair of view sizes.
There is a relatively large difference in size between each pair of ancestor-descendant views: the size of each view is at least twice the size of its descendant.
We would always prefer to store the answer to each query.
Type II Non-Symmetric Synthetic Datasets
Objectives: break the symmetric properties and reduce the size difference between adjacent ancestor-descendant pairs of views.
Conduct an elimination procedure over the rows in a given type I non-symmetric synthetic dataset.
For each attribute $A_i$, we conduct a two-step sub-elimination process:
Step 1: Eliminate each row with a given probability.
Step 2: For each row r that is eliminated in step 1, we also eliminate the rows in the master table with the same values as r of all attributes except $A_i$.
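
The two steps above can be sketched as follows. This is one possible reading, not the authors' implementation: the per-attribute elimination probabilities `elim_prob`, the function name `type2_eliminate`, and the choice to apply step 1 to the rows still remaining in the table are all assumptions made for illustration.

```python
import random
from itertools import product

def type2_eliminate(rows, elim_prob, seed=None):
    """Apply the two-step sub-elimination once per attribute.

    `elim_prob[i]` is an assumed per-row elimination probability used
    while processing attribute i.
    """
    rng = random.Random(seed)
    table = list(rows)
    for i, q in enumerate(elim_prob):
        # Step 1: mark each remaining row for elimination with probability q.
        marked = [row for row in table if rng.random() < q]
        # Step 2: also drop every row that agrees with a marked row on all
        # attributes except attribute i (this covers the marked rows too).
        doomed = {row[:i] + row[i + 1:] for row in marked}
        table = [row for row in table if row[:i] + row[i + 1:] not in doomed]
    return table

# Master table: 3 attributes with 2, 3, and 4 values.
master = list(product(range(0, 2), range(2, 5), range(5, 9)))
reduced = type2_eliminate(master, [0.1, 0.1, 0.1], seed=0)
print(len(master), len(reduced))
```

Under this reading, an elimination in the pass for one attribute removes a whole group of rows that differ only in that attribute, so the views that omit that attribute shrink together with the table.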
Type II Non-Symmetric Synthetic Datasets
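Input: $K$; $n_1, \ldots, n_K$ and a target number of rows
Choose one probability parameter per attribute so that the expected-size guarantee on the output holds
Output: a type II non-symmetric synthetic dataset such that its expected number of rows is greater than or equal to the target number of rows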
Type II Non-Symmetric Synthetic Datasets
Example 3: Input $D(3; 2, 3, 4)$ and a target of 10 rows. Choose 0.9 as the probability parameter.
Master table $D(3; 2, 3, 4)$ (24 rows, shown in two column groups):
A B C    A B C
0 2 5    1 2 5
0 2 6    1 2 6
0 2 7    1 2 7
0 2 8    1 2 8
0 3 5    1 3 5
0 3 6    1 3 6
0 3 7    1 3 7
0 3 8    1 3 8
0 4 5    1 4 5
0 4 6    1 4 6
0 4 7    1 4 7
0 4 8    1 4 8
Resulting dataset $D'(3; 2, 3, 4)$ (10 rows):
A B C
0 2 6
0 2 8
0 3 6
0 3 8
0 4 5
0 4 8
1 2 6
1 2 7
1 3 6
1 3 7
Views in the Type II Non-Symmetric Synthetic Datasets
Example 3 (cont’d):
[View lattice, each view labeled with its size]
{A,B,C}: 10
{A,B}: 5   {A,C}: 5   {B,C}: 8
{A}: 2   {B}: 3   {C}: 4
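
The view sizes in this lattice can be recomputed from the 10 rows of the Example 3 result by counting distinct value combinations for each attribute subset. The rows below are copied from the example; the variable names are illustrative.

```python
from itertools import combinations

# The 10 rows of the type II dataset from Example 3 (columns A, B, C).
rows = [(0, 2, 6), (0, 2, 8), (0, 3, 6), (0, 3, 8), (0, 4, 5),
        (0, 4, 8), (1, 2, 6), (1, 2, 7), (1, 3, 6), (1, 3, 7)]
names = "ABC"

sizes = {}
for k in range(1, len(names) + 1):
    for idxs in combinations(range(len(names)), k):
        view = "".join(names[i] for i in idxs)
        # View size = number of distinct value combinations on these attributes.
        sizes[view] = len({tuple(row[i] for i in idxs) for row in rows})

# Ratio between the top view and its largest child.
top_ratio = sizes["ABC"] / sizes["BC"]
print(sizes)
print(top_ratio)
```

The ratio between {A,B,C} and {B,C} drops from 24/12 = 2 in the type I dataset to 10/8 = 1.25 here, which is the reduced size difference that the elimination procedure aims for.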
Experimental Results
[Figure: a performance measure (vertical axis, 0.00% to 600.00%) plotted against β (horizontal axis, 0 to 1) for the Type I non-symmetric dataset and the TPC-H dataset]
Conclusion
We defined a symmetric synthetic dataset and two types of non-symmetric synthetic datasets.
We studied shortcomings of the TPC-H datasets in testing algorithms devised for improving query-processing performance for complex queries posed on large-scale data.
We compared the TPC-H datasets experimentally with our proposed synthetic datasets in a setting for testing such algorithms.
All the synthetic datasets that we proposed in this paper are beneficial for testing algorithms devised for improving query-processing performance in cloud computing.
Thank You!