
BigBench: Big Data Benchmark Proposal

Ahmad Ghazal, Tilmann Rabl, Minqing Hu, Francois Raab, Meikel Poess, Alain Crolotte, Hans-Arno Jacobsen

BigBench

• End-to-end benchmark
• Based on a product retailer
• Focus on
  - Parallel DBMS
  - MR engines
• Initial work presented at 1st WBDB, San Jose
• Collaboration with industry & academia
  - Teradata, University of Toronto, InfoSizing, Oracle
• Full spec at 3rd WBDB, Xian, China

Outline

• Data Model
  - Variety, Volume, Velocity
• Data Generator
  - PDGF for structured data
  - Enhancement: semi-structured & text generation
• Workload specification
  - Main driver: retail big data analytics
  - Covers: data source, declarative & procedural, and machine learning algorithms
• Metrics
• Evaluation
  - Done on Teradata Aster

Data Model

• Big Data is not about size only
• 3 V’s
  - Variety
    • Different types of data
  - Volume
    • Huge size
  - Velocity
    • Continuous updates

Data Model (Variety)

[Slide figure not recovered]

Data Model (continued)

• Volume
  - Based on scale factor
  - Similar to TPC-DS scaling
  - Web logs & product reviews also scaled
• Velocity
  - Periodic refreshes for all data
  - Different velocity for different areas
    • V_structured < V_unstructured < V_semistructured
  - Queries run with refresh

Data Generator

• "Parallel Data Generation Framework" (PDGF)
  - For the structured part of the model
  - Scale factor similar to TPC-DS
• Extensions to PDGF
  - Web logs
  - Product reviews
• Web logs: retail customers/guests visiting the site
  - Web logs similar to Apache web server logs
  - Coupled with structured data
• Product reviews: customers and guest users
  - Algorithm based on Markov chain
  - Real data set as sample input
  - Coupled with structured data
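To make "web logs coupled with structured data" concrete, here is a minimal sketch, not the PDGF extension itself. It assumes a click is tied to the structured tables through a customer key and an item key; the function names, the URL layout, and the 20% guest share are hypothetical choices for illustration.

```python
import random
from datetime import datetime, timedelta, timezone


def make_weblog_line(rng, customer_sk, item_sk, when):
    """Format one click as an Apache Common Log Format style line."""
    ip = ".".join(str(rng.randint(1, 254)) for _ in range(4))
    # Guests have no customer key; the log format uses "-" for a missing user.
    user = f"c{customer_sk}" if customer_sk is not None else "-"
    stamp = when.strftime("%d/%b/%Y:%H:%M:%S %z")
    size = rng.randint(500, 5000)
    return f'{ip} - {user} [{stamp}] "GET /item?id={item_sk} HTTP/1.1" 200 {size}'


def generate_weblog(seed, customer_sks, item_sks, n):
    """Generate n log lines whose keys come from the structured data."""
    rng = random.Random(seed)  # seeded, so the same seed repeats the same log
    start = datetime(2013, 1, 1, tzinfo=timezone.utc)
    lines = []
    for i in range(n):
        # Hypothetical assumption: about one click in five is from a guest.
        customer = None if rng.random() < 0.2 else rng.choice(customer_sks)
        item = rng.choice(item_sks)
        lines.append(make_weblog_line(rng, customer, item, start + timedelta(seconds=i)))
    return lines
```

Because every line carries an item key (and a customer key for registered users), the generated log can later be joined back to the structured tables.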

Data Generator: TextGen

• Input
  - Real review data
  - Category information
  - Rating
  - Review text
• Initial setup
  - Parse text
    • Produce tokens
    • Correlation between tokens
  - Build repository by category
• API for data generation
  - Integrated with PDGF
  - Based on category ID
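The Markov-chain idea behind review generation can be sketched as follows. This is a simplified illustration, not the actual TextGen code: it assumes a first-order word chain, whitespace tokenization, and a per-category repository keyed by category ID. Ratings are ignored, and the function names are hypothetical.

```python
import random

START, END = "<s>", "</s>"


def build_repository(reviews):
    """Build per-category token transitions from (category_id, text) pairs."""
    repo = {}
    for category_id, text in reviews:
        chain = repo.setdefault(category_id, {})
        tokens = [START] + text.split() + [END]
        # Record which token follows which; repeats keep the observed frequencies.
        for cur, nxt in zip(tokens, tokens[1:]):
            chain.setdefault(cur, []).append(nxt)
    return repo


def generate_review(repo, category_id, rng, max_tokens=50):
    """Walk the chain for one category until END or max_tokens is reached."""
    chain = repo[category_id]
    token, out = START, []
    while len(out) < max_tokens:
        token = rng.choice(chain[token])
        if token == END:
            break
        out.append(token)
    return " ".join(out)
```

For example, `generate_review(build_repository(sample), category_id, random.Random(7))` yields text whose vocabulary comes only from the sample reviews of that category.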

Workload (continued)

• Workload queries
  - 30 queries
  - Specified in English
  - No required syntax
• Business functions (adapted from McKinsey+)
  - Marketing
    • Cross-selling, customer micro-segmentation, sentiment analysis, enhancing multichannel consumer experiences
  - Merchandising
    • Assortment optimization, pricing optimization
  - Operations
    • Performance transparency, product return analysis
  - Supply chain
    • Inventory management
  - Reporting (customers and products)

Workload (continued)

Technical functions
• Data source dimension
  - Structured
  - Semi-structured
  - Unstructured
• Processing type dimension
  - Declarative (SQL, HQL)
  - Procedural
  - Mix of both
• Analytic technique dimension
  - Statistical analysis: correlation analysis, time series, regression
  - Data mining: classification, clustering, association mining, pattern analysis, and text analysis
  - Simple reporting: ad hoc queries not covered above

Workload (continued): Business Categories Query Breakdown

Business category       Number of queries   Percentage
Marketing               18                  60%
Merchandising           5                   17%
Operations              4                   13%
Supply chain            2                   7%
New business models     1                   3%

Workload (continued): Technical Dimensions Breakdown

Query processing type       Number of queries   Percentage
Declarative                 10                  33%
Procedural                  7                   23%
Declarative + Procedural    13                  43%

Technical Dimensions Breakdown (continued)

Data sources            Number of queries   Percentage
Structured              18                  60%
Semi-structured         7                   23%
Unstructured            5                   17%

Analytic techniques     Number of queries   Percentage
Statistical analysis    6                   20%
Data mining             17                  57%
Reporting               8                   27%

Metrics

• Future work
• Initial thoughts
  - Focus on loading and type of processing
  - MR engines good at loading
  - DBMS good at SQL
  - Metric = [formula not recovered]
• Applicable to single and multi-streams

Evaluation

• BigBench proof of concept
• DBMS
  - Typically data loaded into tables
  - Possibly parsing web logs to get schema
  - Reviews captured as VARCHAR or BLOB fields
  - Queries run using SQL + UDF
• MR engine
  - Data can be loaded on HDFS
  - MR, HQL, Pig Latin can be used
• DBMS & MR engine
  - DBMS with Hadoop connectors
  - Data can be placed and split among both
  - Processing can also be split between the two

Evaluation (continued)

• Teradata Aster
  - MPP database
  - Discovery platform
  - Supports SQL-MR
• System
  - Aster 5.0
  - 8-node cluster
  - 200 GB of data

• Data generation
  - DSDGen produced the structured part
  - PDGF+ produced the semi-structured and unstructured parts
• Data loaded into tables
  - Parsed web logs into Weblogs table
  - Product reviews table
• Queries
  - SQL-MR syntax
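Parsing web logs into table rows could look like the sketch below. The actual log layout and the Weblogs table schema are not given in the slides, so this assumes lines in Apache Common Log Format. The column names and the `parse_weblog_line` helper are hypothetical.

```python
import re

# Apache Common Log Format: host ident user [time] "request" status size
CLF = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_weblog_line(line):
    """Return one row as a dict, or None if the line does not match."""
    match = CLF.match(line)
    if match is None:
        return None
    row = match.groupdict()
    row["status"] = int(row["status"])
    # "-" means no response body was sent; store it as 0 bytes.
    row["size"] = 0 if row["size"] == "-" else int(row["size"])
    return row
```

Each returned dict maps to one row of a log table. Lines that do not match are returned as `None` rather than raising, so a loader can count or skip them.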


• Example query
• Product category affinity analysis
  - Computes the probability of browsing products from one category after customers viewed items from another category
  - One form of market basket analysis
• Business case
  - Marketing: cross-selling
• Type of source
  - Structured (web sales)
• Processing type
  - Mix of declarative and procedural
• Analytics type
  - Data mining


SELECT category_cd1 AS category1_cd,
       category_cd2 AS category2_cd,
       COUNT(*) AS cnt
FROM basket_generator(
    ON (
        SELECT i.i_category_id AS category_cd,
               s.ws_bill_customer_sk AS customer_id
        FROM web_sales s
        INNER JOIN item i ON s.ws_item_sk = i.i_item_sk
    )
    PARTITION BY customer_id
    BASKET_ITEM('category_cd')
    ITEM_SET_MAX(500)
)
GROUP BY 1, 2
ORDER BY 1, 3, 2;
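For readers without an Aster system, the counting step of the query above can be approximated in plain Python. This is a sketch under assumptions, not a reimplementation of `basket_generator`: it assumes each customer's basket is the set of distinct categories they bought from and that unordered pairs are counted once per customer. The function name is hypothetical.

```python
from collections import Counter
from itertools import combinations


def category_pair_counts(sales):
    """Count category pairs per customer from (customer_id, category_id) rows."""
    baskets = {}
    for customer_id, category_id in sales:
        # Mirrors PARTITION BY customer_id: one basket per customer.
        baskets.setdefault(customer_id, set()).add(category_id)
    counts = Counter()
    for categories in baskets.values():
        # Every unordered pair of distinct categories in the basket.
        for pair in combinations(sorted(categories), 2):
            counts[pair] += 1
    return counts
```

Dividing a pair count by the number of customers who bought from the first category gives one simple estimate of the affinity the slide describes.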


Evaluation (Sample Queries)

[Slide figure not recovered]

Evaluation (Processing Type)

[Slide figure not recovered]

Conclusion

• End-to-end big data benchmark
• Data model
  - Adding semi-structured and unstructured data
  - Different velocity for structured, semi-structured, and unstructured data
  - Volume based on scale factor
• Data generation
  - Unstructured data using Markov chain model
  - Enhancing PDGF for semi-structured data and integrating it with unstructured data
• Workload
  - 30 queries based on McKinsey report
  - Covers processing type, data source, and data mining algorithms
• Evaluation
  - Aster SQL-MR

Future Work

• Add late binding to web logs
  - No schema/keys upfront
• Finalizing
  - Data generator
  - Metric specifications
  - Downloadable kit
• More POCs
  - Hadoop ecosystem, e.g. Hive
