bigbench discussion tilmann rabl, msrg.org, uoft third workshop on big data benchmarking, xi’an...

11
BigBench Discussion Tilmann Rabl, msrg.org, UofT Third Workshop on Big Data Benchmarking, Xi’an July 16, 2013

Upload: dominick-french

Post on 28-Dec-2015

213 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: BigBench Discussion Tilmann Rabl, msrg.org, UofT Third Workshop on Big Data Benchmarking, Xi’an July 16, 2013

BigBench Discussion

Tilmann Rabl, msrg.org, UofT

Third Workshop on Big Data Benchmarking, Xi’an

July 16, 2013

Page 2: BigBench Discussion Tilmann Rabl, msrg.org, UofT Third Workshop on Big Data Benchmarking, Xi’an July 16, 2013

Introduction BigBench

• End to end benchmark• Application level

• Based on a product retailer (TPC-DS)

• Focus on • Parallel DBMS • MR engines

• History• Launched at 1st WBDB, San Jose• Published at SIGMOD 2013• Full spec at WBDB proceedings 2012 (in progress)

• Collaboration with Industry & Academia• Teradata, University of Toronto, InfoSizing , Oracle

Page 3: BigBench Discussion Tilmann Rabl, msrg.org, UofT Third Workshop on Big Data Benchmarking, Xi’an July 16, 2013

Outline

• Data Model• Variety, Volume, Velocity

• Workload• Main driver: retail big data analytics• Covers : data source, declarative & procedural

and machine learning algorithms.

• Metrics

• Discussion

Page 4: BigBench Discussion Tilmann Rabl, msrg.org, UofT Third Workshop on Big Data Benchmarking, Xi’an July 16, 2013

Data Model I

• 3 Parts• Structured: TPC-DS + market prices• Semi-structured: website click-stream• Unstructured: customers’ reviews

Page 5: BigBench Discussion Tilmann Rabl, msrg.org, UofT Third Workshop on Big Data Benchmarking, Xi’an July 16, 2013

Data Model II

• Variety• Different schema parts

• Volume• Based on scale factor• Similar to TPC-DS scaling• Weblogs & product reviews also scaled

• Velocity• Periodic refreshes for all data• Different velocity for different areas

• Vstructured < Vunstructured < Vsemistructured • Queries run with refresh

Page 6: BigBench Discussion Tilmann Rabl, msrg.org, UofT Third Workshop on Big Data Benchmarking, Xi’an July 16, 2013

Workload

• Workload Queries• 30 queries• Specified in English• No required syntax

• Business functions (Adapted from McKinsey+)• Marketing

• Cross-selling, Customer micro-segmentation, Sentiment analysis, Enhancing multichannel consumer experiences

• Merchandising• Assortment optimization, Pricing optimization

• Operations• Performance transparency, Product return analysis

• Supply chain• Inventory management

• Reporting (customers and products)

Page 7: BigBench Discussion Tilmann Rabl, msrg.org, UofT Third Workshop on Big Data Benchmarking, Xi’an July 16, 2013

SQL-MR Query 1

SELECTcategory_cd1 AS category1_cd ,category_cd2 AS category2_cd , COUNT (*) AS cntFROM basket_generator ( ON

( SELECT i. i_category_id AS category_cd ,s. ws_bill_customer_sk AS customer_idFROM web_sales s INNER JOIN item iON s. ws_item_sk = i_item_sk

)PARTITION BY customer_idBASKET_ITEM (' category_cd ')ITEM_SET_MAX (500)

)GROUP BY 1,2order by 1 ,3 ,2;

Page 8: BigBench Discussion Tilmann Rabl, msrg.org, UofT Third Workshop on Big Data Benchmarking, Xi’an July 16, 2013

Workload – Technical Aspects

Data Sources Number of Queries

Percentage

Structured 18 60%

Semi-structured 7 23%

Un-structured 5 17%

Analytic techniques Number of Queries

Percentage

Statistics analysis 6 20%

Data mining 17 57%

Reporting 8 27%

Page 9: BigBench Discussion Tilmann Rabl, msrg.org, UofT Third Workshop on Big Data Benchmarking, Xi’an July 16, 2013

Metrics

• Discussion topic..• Initial thoughts

• Focus on loading and type of processing• MR engines good at loading• DBMS good at SQL• Metric =

• Applicable to single and multi-streams

Page 10: BigBench Discussion Tilmann Rabl, msrg.org, UofT Third Workshop on Big Data Benchmarking, Xi’an July 16, 2013

Discussion Topics

• What did we miss? • Extending/completing BigBench• How to go beyond MapReduce?

• E.g. add social/graph component (e.g., LinkBench)• NoSQL, cube model, recommendations

• What kind of queries, schema extensions are necessary?

• What has to be measured?• Metrics and properties• Price performance, (energy) efficiency• What properties does the system have to fulfill?

• Any other concerns?

Page 11: BigBench Discussion Tilmann Rabl, msrg.org, UofT Third Workshop on Big Data Benchmarking, Xi’an July 16, 2013

Discussion – Organization

• Break-out groups (2)• Discussion leader• Scribe: Prepare 5 minute presentation

• Student groups (2)• Find an English speaking member (presenter)• Prepare 1+ slide on each question

• Schedule:• 14:15 – 15:15 Discussion 1• 15 min break• 15:30 – 16:15 Discussion 2• 16:15 – 17:00 Plenary