flexible database generators nicolas bruno surajit chaudhuri dmx group microsoft research vldb’05

Flexible Database Generators

Nicolas Bruno Surajit Chaudhuri

DMX GroupMicrosoft Research

VLDB’05

2

DBMS Components

DBMS

Optimizer

Cost Model

Statistics Management

Statistics

Automatic Physical Design Tool

Multidimensional STHoles Histograms

Statistics on Query Expressions

DBMS are complex pieces of software Components still evolving/being added

3

DBMS

Optimizer

Cost Model

Statistics Management

Statistics

Automatic Physical Design ToolMultidimensional

STHoles Histograms

Statistics on Query Expressions

Evaluating New Components Functional vs. quality evaluation Black-box vs. gray-box evaluation Steps:

Generate data Generate workload Evaluate improvement

Manual task: Time Consuming, sometimes difficult, not reusable

4

DGL (Data Generation Language)

Special purpose specification languageBased on iteratorsFunctional flavor (can compose iterators) Interface with DBMS for scalabilityFlexible and extensible

SQL extensions using DGL annotationsExtend CREATE TABLE statementsSpecify how a database is populated

Inter-table dependencies are possible

5

Populating a Database with DGL

DGL Program

DGLCompiler

C++ Program

C++ Compiler/Linker

DGL Primitives and Runtime Library

Data Generator

Annotated Schema

Preprocessor

CREATE TABLE R ( a INT, b INT, c INT )POPULATE ( (a,d) = Step(1, 100) & UniformInt(0, 10), (b,c) = Duplicate( Query("SELECT DISTINCT(R.d) FROM R"),

100 ) )

LET R_ad_Prx = Persist ( Step(1, 100) & UniformInt(0, 10) ), R_ad = Query ( "SELECT * FROM <<0>>", R_ad_Prx ), R_bc = Duplicate ( Query ( "SELECT DISTINCT(<<0>>.v1) FROM <<0>>", R_ad_Prx ), 100 )IN PersistToExisting ( R_ad[0] & R_bc , "R" )

LET R_ad = Step(1, 100) & UniformInt(0, 10), R_bc = Duplicate ( Query ( "SELECT DISTINCT(<<0>>.v1) FROM <<0>>", R_ad ), 100 )IN PersistToExisting ( R_ad[0] & R_bc , "R" )

- User-defined Iterator plumbing.- Buffering of intermediate results.- DB bulk-loading, querying, etc.

6

DGL: Data Types Base: Integers, Strings, Dates, etc. Rows: heterogeneous sequence of scalars

Inherit operations from scalar types( [1, 4, 5.0] + [2, 3, 4.5] ) ++ [‘John’] = [3, 7, 9.5, ‘John’]

Iterators: key data type Step(1,5) returns <[1],[2],[3],[4],[5]> Constant( [1,2] ) returns <[1,2], [1,2], [1,2], … > Iterators inherit operations from rows:

Step(1,5) ++ Step(6,10) = <[1,6], [2,7], [3,8], [4,9], [5,10]>

Associative tables: main memory, random access

7

Primitive Iterators

Others: Duplicate elimination, union, etc.

Statistical distributions: Uniform, Gaussian, Zipfian, Poisson, etc. Uniform( Constant([0,0]), Constant ([10,10]) )= Uniform ( [0,0], [10,10] ) (implicit casts).

SQL: Bridge DGL and DBMSPersist (expression, [table name])Query (parameterized query, iterator1, …)

8

Expressions and Functions Expressions (acyclic reference graph)

Functions

9

Annotated Schemas with DGL Annotations: specify how to populate a

database

From annotations to DGL:Create single DGL fragment for all annotationsVertical partitions for inter-table dependenciesQuery rewritingProxy introduction

10

Populating a Database with DGL

DGL Program

DGLCompiler

C++ Program

C++ Compiler/Linker

DGL Primitives and Runtime Library

Data Generator

Annotated Schema

Preprocessor

CREATE TABLE R ( a INT, b INT, c INT )POPULATE ( (a,d) = Step(1, 100) & UniformInt(0, 10), (b,c) = Duplicate( Query("SELECT DISTINCT(R.d) FROM R"),

100 ) )

LET R_ad_Prx = Persist ( Step(1, 100) & UniformInt(0, 10) ), R_ad = Query ( "SELECT * FROM <<0>>", R_ad_Prx ), R_bc = Duplicate ( Query ( "SELECT DISTINCT(<<0>>.v1) FROM <<0>>", R_ad_Prx ), 100 )IN PersistToExisting ( R_ad[0] & R_bc , "R" )

LET R_ad = Step(1, 100) & UniformInt(0, 10), R_bc = Duplicate ( Query ( "SELECT DISTINCT(<<0>>.v1) FROM <<0>>", R_ad ), 100 )IN PersistToExisting ( R_ad[0] & R_bc , "R" )

- User-defined Iterator plumbing.- Buffering of intermediate results.- DB bulk-loading, querying, etc.

11

CREATE TABLE testDGL ( posX float, posY float, posZ float

) POPULATE 1000 (gaussian = multiGauss([0,0,0],

[1000,1000,1000], [@1,@1,@1], 1.0, @0, 10000 ),

block = Uniform([100,100,100], [200,300,400]),

(posX, posY, posZ) = ProbUnion(gaussian, block, @2)

)

FUN multiGauss(lo, hi, sigma, z, p, N) = LET ctrList = Top( Uniform(lo, hi), p ), idxs = Zipfian(z, p), centers = TApply(ctrList, idxs) IN Top( Normal(centers, sigma), N)

class ProbUnion:public CIterator {public: virtual void _open() { ... } virtual bool _getNext(CRow* row) { ... }};

SQ

L A

nnot

atio

nD

GL

Fun

ctio

nC

++

UD

I

parameterstemporary columns

User defined Iterator

traditional DDL

target size

Distributions

Lo Hi p z σ N

Uniform

Top Zipfian

TApply

Normal

Top

Example: Multidimensional Distributions

0

250

500

750

1000

0 250 500 750 1000

posX

0

250

500

750

1000

0 250 500 750 1000posX

0

250

500

750

1000

0 250 500 750 1000posY

0

250

500

750

1000

0 250 500 750 1000

posX

0

250

500

750

1000

0 250 500 750 1000posX

0

250

500

750

1000

0 250 500 750 1000posY

0

250

500

750

1000

0 250 500 750 1000

posX

0

250

500

750

1000

0 250 500 750 1000posX

0

250

500

750

1000

0 250 500 750 1000posY

12

Example: Inter-table dependencies Employees’ ages normally distributed around 40 Employees’ departments follows Zipfian distribution Employees’ bonus distributed around department 'category‘, which depends on budget

Dept budget is normally distributed around 10000 * size (number of employees) Dept building follows Zipfian distribution

CREATE TABLE employee ( empID int, age int, deptID int, bonus int) POPULATE 10000 ( empID = Step(0,10000), age = Normal(40.0,5.0), deptID = Zipfian(0.75, 50), empBaseBonus = Query(" SELECT D.budget / 1000 FROM employee JOIN dept ON employee.deptID = dept.deptID ORDER BY employee.empID"), bonus = empBaseBonus * Uniform(0.5,1.5) )

CREATE TABLE dept ( deptID int, budget float, building int) POPULATE ( (deptID, size) = Query(" SELECT employee.deptID, count(*) FROM employee GROUP BY employee.deptID") budget = Normal(10000*size, 5000) building = Zipfian(1.0, 20))

LET employeeempID = Top ( Step ( 0, 10000 ), employeeage = Top ( Normal ( 40.00, 5.00 ), 10000 ), deptbuilding = Zipfian ( 1.00, 20 ), employeedeptIDProxy = Persist ( Top ( Zipfian ( 0.75, 50 ), 10000 ) ), employeedeptID = Query ( "SELECT * FROM <<0>>", employeedeptIDProxy ), deptdeptIDsize = Query ( "SELECT <<0>>.v0, count(*) FROM <<0>> GROUP BY <<0>>.v0", employeedeptIDProxy ), deptbudget = Normal ( 10000 * deptdeptIDsize_1, 5000 ), tmp1Proxy = Persist ( deptbudget & deptdeptIDsize ), tmp2Proxy = Persist ( employeedeptID & employeeempID ), employeeempBaseBonus = Top ( Query ( "SELECT <<0>>.v0 / 1000 FROM <<1>> JOIN <<0>> ON <<1>>.v0 = <<0>>.v1 ORDER BY <<1>>.v1", tmp1Proxy, tmp2Proxy ), 10000 ),

...

IN Union ( Persist ( employeeempID & employeeage & employeedeptID & employeebonus, "employee" ), Persist ( deptdeptIDsize_0 & deptbudget & deptbuilding, "dept" ) )

13

Evaluation Model Iterator model (open/getNext/close)

Program is DAGDepending on consumers, buffering is required

In-memory circular queue that spills to disk

14

Examples

Wisconsin Benchmark

Multi-gaussian

Skewed primary/foreign key joins

15

Number of items per order follows a Zipfian distribution.Ship date occurs k days after order date, where k follows Zipfian.

Commit and receipt dates follow a bi-gaussian distribution after ship date.

Item discounts are correlated to the global number of parts soldTop 100 customers’ debt is normally distributed around 3*balances.Remaining customers, around balances/2.

All parts in an order are sold by suppliers that live in the samecountry as the customer.Orders arrivals follow a Poisson distribution starting in ‘1992/01/01’

Complex TPC-H Examples

16

Initial Performance Results

Populate 1GB databases with various generators

17

Conclusion

Creating datasets for quality evaluation of new database components is time-consuming

DGL is expressive and easy to use SQL annotations reduce time needed to create

and populate databases with non-trivial correlations

flexible database generators nicolas bruno surajit chaudhuri dmx group microsoft research vldb’05

Documents