pdgf: a data generator for cloud-scale benchmarking

20
T. Rabl, M. Frank, H. Mousselly Sergieh, and H. Kosch 17.09.2010 Second TPC Technology Conference on Performance Evaluation & Benchmarking

Upload: tilmann-rabl

Post on 11-May-2015

655 views

Category:

Technology


0 download

DESCRIPTION

This is a presentation that was held at the Second TPC Technology Conference on Performance Evaluation & Benchmarking, 17.09.2010, Singapore. Abstract: In many fields of research and business data sizes are breaking the petabyte barrier. This imposes new problems and research possibilities for the database community. Usually, data of this size is stored in large clusters or clouds. Although clouds have become very popular in recent years, there is only little work on benchmarking cloud applications. We present a data generator for cloud sized applications. Its architecture makes the data generator easy to extend and to configure. A key feature is the high degree of parallelism that allows linear scaling for arbitrary numbers of nodes.We show how distributions, relationships and dependencies in data can be computed in parallel with linear speed up.

TRANSCRIPT

Page 1: PDGF: A Data Generator for Cloud-Scale Benchmarking

T. Rabl, M. Frank, H. Mousselly Sergieh, and H. Kosch17.09.2010Second TPC Technology Conference on Performance Evaluation & Benchmarking

Page 2: PDGF: A Data Generator for Cloud-Scale Benchmarking

Data sizes grow beyond petabyte barrier Cloud computing to the rescue

Need for cloud-scale benchmarking Need for realistic, cloud-scale data sets

Problems: Generating petabytes Storing petabytes Transporting petabytes

Solution: Parallel, on-site generation

2A Data Generator for Cloud-Scale Benchmarking - Tilmann Rabl

Page 3: PDGF: A Data Generator for Cloud-Scale Benchmarking

3 tables with primary and foreign keys Non-uniform distributions (lognormal) Replicated data

How to generate consistent data in parallel?3A Data Generator for Cloud-Scale Benchmarking - Tilmann Rabl

Page 4: PDGF: A Data Generator for Cloud-Scale Benchmarking

No references Only uncorrelated data Simple statistical references

Scanned references Read data to generate reference Generate tuple pairs

Computed references Data is generated deterministically Compute data to generate reference

A Data Generator for Cloud-Scale Benchmarking - Tilmann Rabl 4

Parallel Data Generation Framework

Page 5: PDGF: A Data Generator for Cloud-Scale Benchmarking

Parallel pseudo random number generator Deterministic Fast skip ahead

Generation of values as a function Deterministic

Seeds + row id + generator allows recalculation Every value can be computed independently

A Data Generator for Cloud-Scale Benchmarking - Tilmann Rabl 5

Page 6: PDGF: A Data Generator for Cloud-Scale Benchmarking

Seeding strategy All seeds can be cached Generation of value in row n with n-th

random number

A Data Generator for Cloud-Scale Benchmarking - Tilmann Rabl 6

useruser_idrow_id degree_program …

123

Table RNGseed

Column RNGseed

Row RNGseed

Generatorrn

c_id

t_id

r_id

Page 7: PDGF: A Data Generator for Cloud-Scale Benchmarking

Reference generation Lognormal indexes row_id of user Equal seeds and lognormal generator for user_id

A Data Generator for Cloud-Scale Benchmarking - Tilmann Rabl 7

seminar_useruser_idrow_id degree_program …

123

Table RNGseed

Column RNGseed

Row RNGseed

Lognormalrn

id

id

r_id

Row RNGseed

Generatorrn

r_id

Page 8: PDGF: A Data Generator for Cloud-Scale Benchmarking

Java based Plug-in concept Generators Distributions RNGs Output Scheduler

TPC-H plug-in

8A Data Generator for Cloud-Scale Benchmarking - Tilmann Rabl

Page 9: PDGF: A Data Generator for Cloud-Scale Benchmarking

XML file Reflects SQL schema Tables Fields

Seed Size Scale factor Output

<project name="simpleUserSeminar">[..]<table name=“seminar_user">

<size>201754</size><fields><field name="user_id"><type>java.sql.Types.INTEGER</type><reference><referencedField>user_id</referencedField><referencedTable>user</referencedTable></reference><generator name="DefaultReferenceGenerator"><distribution name="LogNormal"><mu>7.60021</mu><sigma>1.40058</sigma></distribution></generator></field><field name="degree_program">

[..]

9A Data Generator for Cloud-Scale Benchmarking - Tilmann Rabl

Page 10: PDGF: A Data Generator for Cloud-Scale Benchmarking

Every value can be computed independently No communication Workload distribution Configuration for every node Automatic distribution

<?xml version="1.0" encoding="UTF-8"?><nodeConfig><nodeNumber>1</nodeNumber><nodeCount>2</nodeCount><workers>2</workers>

</nodeConfig>

Processors

Nodes

Cluster/Cloud seminar_user

100000 rows

Row 1 – 50000

Row 1 –25000

Row 25001 –50000

Row 50001 – 100000

Row 50001 –

75000Row 75001 –

10000010A Data Generator for Cloud-Scale Benchmarking - Tilmann Rabl

Page 11: PDGF: A Data Generator for Cloud-Scale Benchmarking

16 node HPC cluster 2 Intel Xeon QuadCore processors 16GB RAM 2 x 74GB HDD, RAID 0

SetQuery data set 1 table “Bench”, 21 columns 12 Random numbers, 8 strings

TPC-H data set 8 tables, 61 columns Data types: integer, char, varchar, decimal, date

11A Data Generator for Cloud-Scale Benchmarking - Tilmann Rabl

Page 12: PDGF: A Data Generator for Cloud-Scale Benchmarking

SetQuery data set 100 GB data set (SF 460) 1 to 16 nodes

12A Data Generator for Cloud-Scale Benchmarking - Tilmann Rabl

Page 13: PDGF: A Data Generator for Cloud-Scale Benchmarking

TPC-H data set Data sizes: 1 GB, 10 GB, 100 GB

Single node (8 cores)13A Data Generator for Cloud-Scale Benchmarking - Tilmann Rabl

Page 14: PDGF: A Data Generator for Cloud-Scale Benchmarking

Parallel data generation framework Linear speed up Fast generation Large data sets Realistic data

Independent generation of tables / colums / values

Easy configuration and extension

14A Data Generator for Cloud-Scale Benchmarking - Tilmann Rabl

Page 15: PDGF: A Data Generator for Cloud-Scale Benchmarking

More implementation generators, distributions Benchmarks Graphical user interface Scheduler

Query generator Consistent inserts, updates, deletes Precomputed query results Time series

15A Data Generator for Cloud-Scale Benchmarking - Tilmann Rabl

Page 16: PDGF: A Data Generator for Cloud-Scale Benchmarking
Page 17: PDGF: A Data Generator for Cloud-Scale Benchmarking

SetQuery data set Data sizes: 220 MB, 2.2 GB, 10 GB, 22 GB, 100 GB

Single node17A Data Generator for Cloud-Scale Benchmarking - Tilmann Rabl

Page 18: PDGF: A Data Generator for Cloud-Scale Benchmarking

TPC-H data set Data sizes: 1 GB, 10 GB, 100 GB, 1TB

1, 10, 16 nodes18A Data Generator for Cloud-Scale Benchmarking - Tilmann Rabl

Page 19: PDGF: A Data Generator for Cloud-Scale Benchmarking

19A Data Generator for Cloud-Scale Benchmarking - Tilmann Rabl

Page 20: PDGF: A Data Generator for Cloud-Scale Benchmarking

20A Data Generator for Cloud-Scale Benchmarking - Tilmann Rabl