Current Trends and Challenges in Big Data Benchmarking
Kai Sachs - SPEC Research Group, May 2014


DESCRIPTION

Years ago, it was common to write a for-loop and call it a benchmark. Nowadays, benchmarks are complex pieces of software and specifications. This talk discusses the idea of benchmark engineering, trends in benchmarking research, and current efforts of the SPEC Research Group and the WBDB community focusing on Big Data. The way benchmarks are used has changed: traditionally they were mostly used to generate throughput numbers, whereas today they also serve, for example, as test frameworks to evaluate different aspects of systems such as scalability or performance. Since benchmarks provide standardized workloads and meaningful metrics, they are increasingly important for research. The benchmark community is currently focusing on new trends such as cloud computing, Big Data, power consumption, and large-scale, highly distributed systems. For several of these trends, traditional benchmarking approaches fail: how can we benchmark a highly distributed system with thousands of nodes and data sources? What does a typical Big Data workload look like, and how does it scale? How can we benchmark a real-world setup in a realistic way on limited resources? What does performance mean in the context of Big Data? What is the right metric?

Speaker: Kai Sachs is a member of the Lifecycle & Cloud Management group at SAP AG. He received a joint Diploma degree in business administration and computer science as well as a PhD degree from Technische Universität Darmstadt. His PhD thesis received the SPEC Distinguished Dissertation Award 2011 for outstanding contributions in the area of performance evaluation and benchmarking. His research interests include software performance engineering, capacity planning, cloud computing, and benchmarking. He is a co-founder of the ACM/SPEC International Conference on Performance Engineering (ICPE) and has served as a member of several program and organization committees and as a reviewer for many conferences and journals. Among others, he was PC Chair of the SPEC Benchmark Workshop 2010, Program Chair of the Workshop on Hot Topics in Cloud Services 2013, and Industrial PC Chair of ICPE 2011. Kai Sachs is currently serving on the editorial board of the CSI Transactions on ICT, as vice-chair of the SPEC Research Group, as PC Co-Chair of ACM/SPEC ICPE 2015, and as Co-Chair of the Workshop on Big Data Benchmarking 2014.

TRANSCRIPT

Page 1: Current Trends and Challenges in Big Data Benchmarking

Current Trends and Challenges in

Big Data Benchmarking Kai Sachs - SPEC Research Group

May 2014

Page 2

© 2014 Kai Sachs. All rights reserved. 2

Benchmark Use Cases & Stakeholders

Hard- & software vendors: publish results & marketing. Example: 27,500 results submitted for the SPEC CPU2006 benchmarks alone

Developers: analysis & product quality. Example: regression performance testing

Consumers: compare different products. Example: find the best video card for gaming

IT architects: cloud & hardware sizing. Example: choosing a configuration

Researchers: evaluate their own implementations using standardized workloads

Page 3

Standard Performance Evaluation Corporation

[Org chart] OSG: Open Systems Group | HPG: High Performance Group | GWPG: Graphics and Workstation Performance Group | RG: Research Group

> 80 member organizations & associates

Founded 1988

Page 4

Standard Performance Evaluation Corporation

Development of Industry Standard Benchmarks

[Org chart] OSG: Open Systems Group (CPU, Java, Virtualization, Power, …) | HPG: High Performance Group (OpenMP, MPI) | GWPG: Graphics and Workstation Performance Group | RG: Research Group

> 80 member organizations & associates

Founded 1988

Page 5

Standard Performance Evaluation Corporation

Research Platform

[Org chart] OSG: Open Systems Group | HPG: High Performance Group | GWPG: Graphics and Workstation Performance Group | RG: Research Group (Cloud, Intrusion Detection Systems, Big Data)

> 80 member organizations & associates

Founded 1988

Page 6

SPEC Research Group – Mission Statement

Provide a platform for collaborative research efforts in the areas of computer benchmarking and quantitative system analysis

Portal for all kinds of benchmarking-related resources

Provide research benchmarks, tools, metrics and scenarios

Page 7

Performance

Performance in a broad sense:

Classical performance metrics. Example: response time, throughput, scalability, efficiency, and elasticity

Non-functional system properties under the term dependability. Example: availability, reliability, and security
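To make the classical metrics concrete, here is a minimal Python sketch (the function name and the p95 cut-off are our own illustrative choices, not from the talk) that derives mean and 95th-percentile response time plus throughput from one measured run:

```python
import statistics

def classical_metrics(latencies_s, wall_clock_s):
    """Derive classical performance metrics from one measured run:
    response time (mean / p95) and throughput (completed ops per second).
    latencies_s: per-request response times in seconds;
    wall_clock_s: total elapsed time of the run."""
    n = len(latencies_s)
    ordered = sorted(latencies_s)
    # Simple nearest-rank p95; real benchmarks define percentiles precisely.
    p95 = ordered[max(0, int(0.95 * n) - 1)]
    return {
        "mean_response_s": statistics.mean(latencies_s),
        "p95_response_s": p95,
        "throughput_ops_per_s": n / wall_clock_s,
    }

# Example: 4 requests completed within a 2-second measurement window.
m = classical_metrics([0.10, 0.20, 0.20, 0.30], wall_clock_s=2.0)
```

Note that throughput is derived from wall-clock time, not from the sum of latencies: with concurrent clients the two differ, which is one reason run rules must pin down how a metric is measured.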

Page 8

Big Data Benchmarking Community (BDBC)

‘Incubator’ for Big Data standard benchmark(s) for industry

>200 members on the mailing list

Workshop on Big Data Benchmarking Series: 2012 in San Jose, CA & Pune, India; 2013 in San Jose, CA & Xi'an, China; 2014 in Potsdam, Germany

Post-proceedings published in LNCS

BDBC is joining the SPEC Research Group

RG working group focusing on Big Data in preparation; working group chairs: Chaitan Baru, Tilmann Rabl

Towards a Big Data Standard Benchmark

"WBDB 2012 Report: Setting the Direction for Big Data Benchmark Standards", C. Baru, M. Bhandarkar, R. Nambiar, M. Poess, T. Rabl, TPCTC 2012, co-located with VLDB 2012

Page 9

Other Benchmark Organizations

Transaction Processing Performance Council (TPC)
Focus: transaction processing and database benchmarks
Most famous benchmarks: TPC-C (OLTP benchmark), TPC-E (OLTP benchmark), TPC-H (decision support benchmark)

Embedded Microprocessor Benchmark Consortium (EEMBC)
Focus: hardware and software used in embedded systems

Business Applications Performance Corporation (BAPCo)
Focus: performance benchmarks for personal computers based on popular computer applications and industry standard operating systems

Page 10

General Chairs: Chaitan Baru (UC San Diego), Tilmann Rabl (U Toronto), Kai Sachs (SAP)

Local Arrangements: Matthias Uflacker (Hasso Plattner Institute)

Publicity Chair: Henning Schmitz (SAP Innovations Center)

Publication Chair: Meikel Poess (Oracle)

Program Committee

Milind Bhandarkar (Pivotal)

Anja Bog (SAP Labs)

Dhruba Borthakur (Facebook)

Joos-Hendrik Böse (Amazon)

Tobias Bürger (Payback)

Tyson Condie (UCLA)

Kshitij Doshi (Intel)

Pedro Furtado (U Coimbra)

Bhaskar Gowda (Intel)

Goetz Graefe (HP)

Martin Grund (Exascale)

Alfons Kemper (TU München)

Donald Kossmann (ETH Zürich)

Tim Kraska (Brown University)

Wolfgang Lehner (TU Dresden)

Christof Leng (UC Berkeley)

Stefan Manegold (CWI)

Raghu Nambiar (Cisco)

Manoj K. Nambiar (TCS)

Glenn Paulley (Conestoga Col.)

Keynote Speakers: Umesh Dayal, Alexandru Iosup

Scott Pearson (CLDS Industry Fellow)

Andreas Polze (HPI)

Alexander Reinefeld (HU Berlin)

Berni Schiefer (IBM Labs Toronto)

Saptak Sen (Hortonworks)

Florian Stegmaier (University of Passau)

Till Westmann (Oracle Labs)

Jianfeng Zhan (Chinese Academy of Science)

Platinum Sponsor: Gold Sponsors:

Submission: May 30, 2014 (6pm PDT) Short versions of papers (4-8 LNCS pages)

Page 11

Benchmark Engineering

Page 12

Past & Present

Past:
It was common to write a for-loop and call it a benchmark.

Present:
Benchmarks are complex pieces of software and specifications.
Benchmark development has turned into a complex team effort.
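The "for-loop benchmark" of the past can be sketched in a few lines of Python (an illustrative reconstruction, not any specific historical benchmark); the point of the slide is everything this sketch lacks:

```python
import time

def for_loop_benchmark(op, iterations=100_000):
    """The classic 'for-loop benchmark': time N calls, report ops/s.
    Missing here is everything a modern standard benchmark adds:
    no warm-up, no repeated runs, no defined workload or scaling rules,
    no run rules or reporting requirements, and only a single metric."""
    start = time.perf_counter()
    for _ in range(iterations):
        op()
    elapsed = time.perf_counter() - start
    return iterations / elapsed  # operations per second

ops_per_s = for_loop_benchmark(lambda: 1 + 1)
```

A number produced this way is easy to obtain and hard to interpret, which is exactly why benchmark development has become a team effort around workloads, run rules, and metrics.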

Page 13

The Whetstone Benchmark (1974 – 284 lines). Curnow, H.J., Wichmann, B.A., "A Synthetic Benchmark", The Computer Journal, Volume 19, Issue 1, Feb. 1976, pp. 43-49

Page 14

SPEC CPU Benchmark Suite – Lines of Code. Henning, J., "SPEC CPU suite growth: an historical perspective", SIGARCH Computer Architecture News, Volume 35, Issue 1, March 2007

Page 15

Example Components of a Standard Benchmark

Workload | Metrics | Run Rules | Reporter | Implementation & Framework (opt.) | Documentation

Workload specification is the most important part.

"Performance evaluation of message-oriented middleware using the SPECjms2007 benchmark", Kai Sachs, Samuel Kounev, Jean Bacon, Alejandro Buchmann, Performance Evaluation, 2009

"Performance Modeling and Benchmarking of Event-Based Systems", Kai Sachs, PhD Thesis, TU Darmstadt, 2010

Page 16

Workload Requirements

Representativeness | Comprehensiveness | Focus | Scalability | Configurability

"Resilience Benchmarking", Marco Vieira, Henrique Madeira, Kai Sachs, Samuel Kounev, in Resilience Assessment and Evaluation, Springer, 2012

Page 17

Workload Description ‘Level’

"From TPC-C to Big Data Benchmarks: A Functional Workload Model", Yanpei Chen, Francois Raab, and Randy Katz, in Workshop on Big Data Benchmarking, 2012

Page 18

Current Trends & Challenges in Big Data Benchmarking

Page 19

Current Trends & Challenges in Benchmarking

Technology: Virtualization; Cloud; (Big) Data – MapReduce, mixed workload (OLAP/OLTP), data/event streaming, …

Benchmarking methodology: large scale systems

Tools: data/workload generators; power consumption; simulation frameworks; generic benchmarking frameworks

[Diagram: Benchmark at the intersection of Technologies, Tools, and Methodologies]

Page 20 (repeat of page 19)

Page 21

Benchmark Methodology

System Under Test

Past & Present

Single node

Multiple nodes

Isolated systems

Page 22

Benchmark Methodology

System Under Test

http://instagram.com/p/W2FCksR9-e/

St. Peter's Square

2005 vs. 2013

Page 23

Benchmark Methodology – System Under Test

Challenge: Large Scale Systems
Isolation is not guaranteed (or impossible)
High number of nodes
Data amount is very high
Repeatability is an issue

How can we benchmark such systems?

[Diagram: Benchmark at the intersection of Technologies, Tools, and Methodology]

Page 24

“Big Data should be Interesting Data! There are various definitions of Big Data; most center around a number of V’s like volume, velocity, variety, veracity – in short: interesting data (interesting in at least one aspect). However, when you look into research papers on Big Data, in SIGMOD, VLDB, or ICDE, the data that you see here in experimental studies is utterly boring. Performance and scalability experiments are often based on the TPC-H benchmark: completely synthetic data with a synthetic workload that has been beaten to death for the last twenty years. Data quality, data cleaning, and data integration studies are often based on bibliographic data from DBLP, usually old versions with less than a million publications, prolific authors, and curated records. I doubt that this is a real challenge for tasks like entity linkage or data cleaning. So where’s the – interesting – data in Big Data research?”

Gerhard Weikum, "Where’s the Data in the Big Data Wave?", SIGMOD Blog, March 2013

Page 25 (repeat of page 24)

Page 26

Big Data Benchmark: Issues and Challenges

‘Big Data World’
Communities
Benchmark Design: single benchmark vs. benchmark collection; component vs. end-to-end scenario; specification vs. implementation
Metric
System under Test
Workload

Page 27

Abstractions of the Big Data World from WBDB

Enterprise warehouse + agglomeration of other data: a structured enterprise data warehouse, extended to incorporate data from other non-fully-structured data sources (e.g. weblogs, text, streams)

Pool of data with a sequence of processing: enterprise data processing as a pipeline from data ingestion to transformation, extraction, subsetting, machine learning, and predictive analytics; data from multiple structured and non-structured sources

"Introduction to the 4th Workshop on Big Data Benchmarking", Chaitan Baru

Page 28

BigBench: A Big Data Analytics Benchmark – Data Model

Scenario: retail domain

Data:
Structured: based on TPC-DS
Semi-structured: click streams
Unstructured: product reviews

PDGF used to generate data

"BigBench: Towards an Industry Standard Benchmark for Big Data Analytics", A. Ghazal, Minqing Hu, T. Rabl, F. Raab, M. Poess, A. Crolotte, H. Jacobsen, SIGMOD 2013
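As a toy illustration of BigBench's three data categories (this is not PDGF's actual API or BigBench's real schema; every field name and template below is invented for the sketch):

```python
import random

random.seed(42)  # fixed seed: synthetic data generation should be repeatable

def gen_structured(n):
    """Structured part: TPC-DS-style sales rows (columns are illustrative)."""
    return [{"item_id": i,
             "store_id": random.randint(1, 10),
             "price": round(random.uniform(1.0, 500.0), 2)}
            for i in range(n)]

def gen_clickstream(n):
    """Semi-structured part: web click events as loose key-value records."""
    return [{"user": random.randint(1, 100),
             "item_id": random.randint(0, 999),
             "action": random.choice(["view", "cart", "buy"])}
            for _ in range(n)]

def gen_reviews(n):
    """Unstructured part: free-text product reviews from a toy template."""
    moods = ["great", "poor", "average"]
    return [f"The product was {random.choice(moods)}." for _ in range(n)]

sales = gen_structured(100)
clicks = gen_clickstream(100)
reviews = gen_reviews(10)
```

The real benchmark ties the three categories together (e.g. clicks and reviews reference items from the structured schema), which this sketch only hints at via the shared `item_id` field.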

Page 29

BigBench: A Big Data Analytics Benchmark – Data Generation (Unstructured Data)

Extended version of the Parallel Data Generation Framework (PDGF)

Separate review generator

"BigBench: Towards an Industry Standard Benchmark for Big Data Analytics", A. Ghazal, Minqing Hu, T. Rabl, F. Raab, M. Poess, A. Crolotte, H. Jacobsen, SIGMOD 2013

Page 30

An end-to-end data processing pipeline:

Data from multiple sources

Loose, flexible schema

Data requires structuring

Application characteristics

Processing pipelines

Running models with data

Deep Analytics Pipeline

Introduction to the 4th Workshop on Big Data Benchmarking

Chaitan Baru

Page 31

Example of an Application:

Determine User Interest Profile by Mining Activities

Scalable distributed inference of dynamic user interests for behavioral targeting

A. Ahmed, Y. Low, M. Aly, V. Josifovski, A.J. Smola, SIGKDD 2011

Page 32

Composite Benchmark for Transactions and Reporting (CBTR)

OLTP & OLAP benchmark based on a current, real enterprise

Order-to-cash scenario: 18 tables with 5 to 327 columns (2,316 columns in total)

Variable workload mix:
OLTP sub-workload share S_T ∈ [0, 1]
OLAP sub-workload share S_A = 1 − S_T
read-only OLTP share S_rT ∈ [0, 1] (within the OLTP sub-workload)
mixed OLTP share S_mT = 1 − S_rT

(S: share; T: transactional, A: analytical; r: read-only, m: mixed)
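The share arithmetic above can be sketched directly; note that reading S_rT as a fraction within the OLTP sub-workload (rather than of the total mix) is our interpretation of the slide's notation:

```python
def cbtr_mix(s_t, s_rt):
    """Compute all four CBTR workload shares from the two free parameters:
    s_t  = S_T,  the OLTP share of the total mix, in [0, 1]
    s_rt = S_rT, the read-only share within the OLTP sub-workload, in [0, 1]
    The other two shares are fully determined: S_A = 1 - S_T, S_mT = 1 - S_rT."""
    if not (0.0 <= s_t <= 1.0 and 0.0 <= s_rt <= 1.0):
        raise ValueError("shares must lie in [0, 1]")
    return {
        "S_T": s_t,            # OLTP sub-workload
        "S_A": 1.0 - s_t,      # OLAP sub-workload
        "S_rT": s_rt,          # read-only queries within OLTP
        "S_mT": 1.0 - s_rt,    # mixed queries within OLTP
    }

mix = cbtr_mix(0.7, 0.6)  # 70% OLTP overall, of which 60% read-only
```

Varying s_t from 0 to 1 sweeps the benchmark continuously from a pure OLAP to a pure OLTP workload, which is the point of making the mix a parameter.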

"Benchmarking Composite Transaction and Analytical Processing Systems", Anja Bog, PhD Thesis, University of Potsdam, 2012

"Interactive Performance Monitoring of a Composite OLTP & OLAP Workload", Anja Bog, Kai Sachs, Hasso Plattner, SIGMOD 2012 (Demo)

"Normalization in a Mixed OLTP and OLAP Workload Scenario", Anja Bog, Kai Sachs, Alexander Zeier, Hasso Plattner, TPCTC 2011, co-located with VLDB 2011

Page 33

Big Data & Cloud Benchmark

Related Work – Virtualization Benchmarking

Page 34 (repeat of page 33)

Page 35

Other Activities

TPC-BD: TPC announced a Big Data working group (Nov. 2013)

Graph 500: driven by the HPC community; cooperating with the SPEC CPU group; Green Graph 500 list

SPEC OSG: Big Data as part of a cloud benchmark

CloudSuite 2.0, CH-benCHmark, BigDataBench, HiBench, LinkBench, …

Page 36

SPEC RG – Big Data Working Group: Potential Topics

Target group: researchers & developers

Data categories: structured, unstructured and semi-structured; events & streams; graphs; geospatial, retail, astronomy & genomic; …

Benchmark scenario & metrics: realistic use-cases & workload mixes

Big Data classification schema

(Research) standard benchmarks: BigBench, Deep Analytics Pipeline, …

Data generation: real world traces & synthetic data, tooling

Page 37

Conclusions

Page 38 (build-up slide; full content on page 39)

Page 39

Conclusions

Benchmarking is more than throughput
Meaningful workloads are most important
More research is needed:
Benchmarking of large scale systems
"Big Data World": workloads & scenarios
Benchmarks for Big Data

"We Don't Know Enough to Make a Big Data Benchmark Suite", Yanpei Chen, WBDB 2012

Page 40

Thank you

Contact information:

Kai Sachs

Email: [email protected]

Disclaimer: SPEC, the SPEC logo, the SPEC Research Group logo and the tool and benchmark names SERT, SPECjms2007, SPECpower_ssj2008, SPECweb2009 and SPECvirt_sc2010 are registered trademarks of the Standard Performance Evaluation Corporation (SPEC). Reprint with permission.

Page 41 (repeat of page 10)