bigbench: big data benchmark proposal
DESCRIPTION
BigBench: Big Data Benchmark Proposal. Ahmad Ghazal, Tilmann Rabl, Minqing Hu, Francois Raab, Meikel Poess, Alain Crolotte, Hans-Arno Jacobsen. BigBench. End to end benchmark Based on a product retailer Focus on Parallel DBMS MR engines - PowerPoint PPT PresentationTRANSCRIPT
BigBench: Big Data Benchmark Proposal
Ahmad Ghazal, Tilmann Rabl, Minqing Hu, Francois Raab, Meikel Poess, Alain Crolotte, Hans-Arno Jacobsen
2
• End to end benchmark• Based on a product retailer• Focus on
- Parallel DBMS - MR engines
• Initial work presented at 1st WBDB, San Jose• Collaboration with Industry & Academia
- Teradata, University of Toronto, InfoSizing , Oracle• Full spec at 3rd WBDB, Xian, China
BigBench
3
• Data Model- Variety, Volume, Velocity
• Data Generator- PDGF for structured data- Enhancement : Semi-structured & Text generation
• Workload specification- Main driver: retail big data analytics- Covers : data source, declarative & procedural and machine learning
algorithms. • Metrics• Evaluation
- Done on Teradata Aster
Outline
4
• Big Data is not about size only
• 3 V’s- Variety
• Different types of data- Volume
• Huge size- Velocity
• Continuous updates
Data Model
5
Data Model(Variety)
6
• Volume- Based on scale factor- Similar to TPC-DS scaling- Weblogs & product reviews also scaled
• Velocity- Periodic refreshes for all data- Different velocity for different areas
• Vstructured < Vunstructured < Vsemistructured - Queries run with refresh
Data Model(continued)
7
• "Parallel Data Generation Framework“ PDGF- For the structured part of model- Scale factor similar to TPC-DS
• Extensions to PDGF - Web logs - Product reviews
• Web logs: retail customers/guests visiting site- Web logs similar to apache web server logs- Coupled with structured data
• Product reviews: Customers and guest users- Algorithm based on Markov chain- Real data set sample input- Coupled with structured data
Data Generator
8
• Input- real review data- Category information- Rating- Review text
• Initial setup- Parse text
• Produce tokens• Correlation between tokens
- Build repository by category• API for data generation
- Integrated with PDGF- Based on category ID
Data GeneratorTextGen
9
• Workload Queries- 30 queries- Specified in English- No required syntax
• Business functions (Adapted from McKinsey+)- Marketing
• Cross-selling, Customer micro-segmentation, Sentiment analysis, Enhancing multichannel consumer experiences
- Merchandising• Assortment optimization, Pricing optimization
- Operations• Performance transparency, Product return analysis
- Supply chain• Inventory management
- Reporting (customers and products)
Workload (continued)
10
Technical Functions• Data source dimension
- Structured - Semi-structured- Un-structured
• Processing type dimension- Declarative (SQL, HQL)- Procedural - Mix of both
• Analytic technique dimension- Statistical analysis: correlation analysis, time-series, regression - Data mining: classification, clustering, association mining, pattern
analysis and text analysis - Simple reporting: ad hoc queries not covered above
Workload (continued)
11
Workload (continued)Business Categories Query Breakdown
Business category Number of Queries
Percentage
Marketing 18 60%
Merchandising 5 17%
Operations 4 13%
Supply chain 2 7%
New business models
1 3%
12
Workload (continued)Technical Dimensions Breakdown
Query processing type Number of Queries
Percentage
Declarative 10 33%
Procedural 7 23%
Declarative + Procedural 13 43%
13
Technical Dimensions Breakdown(continued)
Data Sources Number of Queries
Percentage
Structured 18 60%Semi-structured 7 23%Un-structured 5 17%
Analytic techniques Number of Queries
Percentage
Statistics analysis 6 20%Data mining 17 57%
Reporting 8 27%
14
• Future Work• Initial thoughts
- Focus on loading and type of processing- MR engines good at loading- DBMS good at SQL- Metric =
•
• Applicable to single and multi-streams
Metrics
15
• BigBench proof of concept• DBMS
- Typically data loaded into tables- Possibly parsing weblogs to get schema- Reviews captured as VARCHAR or BLOB fields- Queries run using SQL + UDF
• MR engine - Data can be loaded on HDFS - MR, HQL, PigLatin can be used
• DBMS & MR engine - DBMS with Hadoop connectors- Data can be placed and split among both- Processing can also be split among two
Evaluation
16
• Teradata Aster- MPP database- Discovery platform- Supports SQL-MR
• System- Aster 5.0- 8 node cluster- 200 GB of data
Evaluation (continued)
17
• Data generation- DSDGen produced structured part- PDGF+ produced semi-structured and un-structured
• Data loaded into tables - Parsed weblogs into Weblogs table- Product reviews table
• Queries- SQL-MR syntax
Evaluation (continued)
18
• Example query• Product category affinity analysis
- Computes the probability of browsing products from a category after customers viewed items from another category.
- One form of market basket• Business case
- Marketing - cross-selling
• Type of source- Structured (web sales)
• Processing type - mix of declarative and procedural
• Analytics type- Data mining
Evaluation (continued)
19
SELECTcategory_cd1 AS category1_cd ,category_cd2 AS category2_cd , COUNT (*) AS cntFROM basket_generator ( ON
( SELECT i. i_category_id AS category_cd ,s. ws_bill_customer_sk AS customer_idFROM web_sales s INNER JOIN item iON s. ws_item_sk = i_item_sk
)PARTITION BY customer_idBASKET_ITEM (' category_cd ')ITEM_SET_MAX (500)
)GROUP BY 1,2order by 1 ,3 ,2;
Evaluation (continued)
20
Evaluation(Sample Queries)
21
Evaluation(Processing Type)
22
• End to end big data benchmark• Data Model
- Adding semi-structured and unstructured data- Different velocity for structured, semi-structured and unstructured- Volume based on scale factor
• Data Generation- Unstructured data using Markov chain model.- Enhancing PDGF for semi-structured and integrating it with
unstructured• Workload
- 30 queries based on McKinsey report- Covers processing type, data source and data mining algorithms
• Evaluation- Aster SQL-MR
Conclusion
23
• Add late binding to web logs- No schema/keys upfront
• Finalizing- Data generator- Metric specifications.- Downloadable kit
• More POC’s- Hadoop ecosystem like HIVE
Future Work