bigbench discussion tilmann rabl, msrg.org, uoft third workshop on big data benchmarking, xi’an...
TRANSCRIPT
![Page 1: BigBench Discussion Tilmann Rabl, msrg.org, UofT Third Workshop on Big Data Benchmarking, Xi’an July 16, 2013](https://reader035.vdocuments.us/reader035/viewer/2022072013/56649e5c5503460f94b54816/html5/thumbnails/1.jpg)
BigBench Discussion
Tilmann Rabl, msrg.org, UofT
Third Workshop on Big Data Benchmarking, Xi’an
July 16, 2013
![Page 2: BigBench Discussion Tilmann Rabl, msrg.org, UofT Third Workshop on Big Data Benchmarking, Xi’an July 16, 2013](https://reader035.vdocuments.us/reader035/viewer/2022072013/56649e5c5503460f94b54816/html5/thumbnails/2.jpg)
Introduction BigBench
• End to end benchmark• Application level
• Based on a product retailer (TPC-DS)
• Focus on • Parallel DBMS • MR engines
• History• Launched at 1st WBDB, San Jose• Published at SIGMOD 2013• Full spec at WBDB proceedings 2012 (in progress)
• Collaboration with Industry & Academia• Teradata, University of Toronto, InfoSizing , Oracle
![Page 3: BigBench Discussion Tilmann Rabl, msrg.org, UofT Third Workshop on Big Data Benchmarking, Xi’an July 16, 2013](https://reader035.vdocuments.us/reader035/viewer/2022072013/56649e5c5503460f94b54816/html5/thumbnails/3.jpg)
Outline
• Data Model• Variety, Volume, Velocity
• Workload• Main driver: retail big data analytics• Covers : data source, declarative & procedural
and machine learning algorithms.
• Metrics
• Discussion
![Page 4: BigBench Discussion Tilmann Rabl, msrg.org, UofT Third Workshop on Big Data Benchmarking, Xi’an July 16, 2013](https://reader035.vdocuments.us/reader035/viewer/2022072013/56649e5c5503460f94b54816/html5/thumbnails/4.jpg)
Data Model I
• 3 Parts• Structured: TPC-DS + market prices• Semi-structured: website click-stream• Unstructured: customers’ reviews
![Page 5: BigBench Discussion Tilmann Rabl, msrg.org, UofT Third Workshop on Big Data Benchmarking, Xi’an July 16, 2013](https://reader035.vdocuments.us/reader035/viewer/2022072013/56649e5c5503460f94b54816/html5/thumbnails/5.jpg)
Data Model II
• Variety• Different schema parts
• Volume• Based on scale factor• Similar to TPC-DS scaling• Weblogs & product reviews also scaled
• Velocity• Periodic refreshes for all data• Different velocity for different areas
• Vstructured < Vunstructured < Vsemistructured • Queries run with refresh
![Page 6: BigBench Discussion Tilmann Rabl, msrg.org, UofT Third Workshop on Big Data Benchmarking, Xi’an July 16, 2013](https://reader035.vdocuments.us/reader035/viewer/2022072013/56649e5c5503460f94b54816/html5/thumbnails/6.jpg)
Workload
• Workload Queries• 30 queries• Specified in English• No required syntax
• Business functions (Adapted from McKinsey+)• Marketing
• Cross-selling, Customer micro-segmentation, Sentiment analysis, Enhancing multichannel consumer experiences
• Merchandising• Assortment optimization, Pricing optimization
• Operations• Performance transparency, Product return analysis
• Supply chain• Inventory management
• Reporting (customers and products)
![Page 7: BigBench Discussion Tilmann Rabl, msrg.org, UofT Third Workshop on Big Data Benchmarking, Xi’an July 16, 2013](https://reader035.vdocuments.us/reader035/viewer/2022072013/56649e5c5503460f94b54816/html5/thumbnails/7.jpg)
SQL-MR Query 1
SELECTcategory_cd1 AS category1_cd ,category_cd2 AS category2_cd , COUNT (*) AS cntFROM basket_generator ( ON
( SELECT i. i_category_id AS category_cd ,s. ws_bill_customer_sk AS customer_idFROM web_sales s INNER JOIN item iON s. ws_item_sk = i_item_sk
)PARTITION BY customer_idBASKET_ITEM (' category_cd ')ITEM_SET_MAX (500)
)GROUP BY 1,2order by 1 ,3 ,2;
![Page 8: BigBench Discussion Tilmann Rabl, msrg.org, UofT Third Workshop on Big Data Benchmarking, Xi’an July 16, 2013](https://reader035.vdocuments.us/reader035/viewer/2022072013/56649e5c5503460f94b54816/html5/thumbnails/8.jpg)
Workload – Technical Aspects
Data Sources Number of Queries
Percentage
Structured 18 60%
Semi-structured 7 23%
Un-structured 5 17%
Analytic techniques Number of Queries
Percentage
Statistics analysis 6 20%
Data mining 17 57%
Reporting 8 27%
![Page 9: BigBench Discussion Tilmann Rabl, msrg.org, UofT Third Workshop on Big Data Benchmarking, Xi’an July 16, 2013](https://reader035.vdocuments.us/reader035/viewer/2022072013/56649e5c5503460f94b54816/html5/thumbnails/9.jpg)
Metrics
• Discussion topic..• Initial thoughts
• Focus on loading and type of processing• MR engines good at loading• DBMS good at SQL• Metric =
•
• Applicable to single and multi-streams
![Page 10: BigBench Discussion Tilmann Rabl, msrg.org, UofT Third Workshop on Big Data Benchmarking, Xi’an July 16, 2013](https://reader035.vdocuments.us/reader035/viewer/2022072013/56649e5c5503460f94b54816/html5/thumbnails/10.jpg)
Discussion Topics
• What did we miss? • Extending/completing BigBench• How to go beyond MapReduce?
• E.g. add social/graph component (e.g., LinkBench)• NoSQL, cube model, recommendations
• What kind of queries, schema extensions are necessary?
• What has to be measured?• Metrics and properties• Price performance, (energy) efficiency• What properties does the system have to fulfill?
• Any other concerns?
![Page 11: BigBench Discussion Tilmann Rabl, msrg.org, UofT Third Workshop on Big Data Benchmarking, Xi’an July 16, 2013](https://reader035.vdocuments.us/reader035/viewer/2022072013/56649e5c5503460f94b54816/html5/thumbnails/11.jpg)
Discussion – Organization
• Break-out groups (2)• Discussion leader• Scribe: Prepare 5 minute presentation
• Student groups (2)• Find an English speaking member (presenter)• Prepare 1+ slide on each question
• Schedule:• 14:15 – 15:15 Discussion 1• 15 min break• 15:30 – 16:15 Discussion 2• 16:15 – 17:00 Plenary