benchmarking hadoop and big data

@BDOOP_BCN

Benchmarking Hadoop by Nicolas Poggi @ni_po

June 2, 2015

About Nicolas Poggi @ni_po

What is BDOOP about? ● A group to share on Data

● Scalability

● Performance

● Configurations

● Cluster design

● Benchmarking

● …a/couple of beer/s!

• Having sysadmins in mind

● Also POs

● Not a group to learn

• Java

• Mapreduce programming

• Hadoop base concepts

BDOOP Group Objectives ● Create a local community to

● Learn Big Data ● performance and scalability

● Share ● day-to-day problems and solutions

● Present your work and findings

● Have talks from renown experts

● > Your objective here <

Benchmarking Motivation and Intro

Hadoop design

Hadoop designed to solve complex data Structured and non structured

With [close to] linear scalability

Simplifying the programming model From MPI, OpenMP, CUDA, …

Operates as a blackbox for data analysts

Image source: Hadoop, the definitive guide

Hadoop attributes Fault tolerant

from commodity hardware

Built in redundancy

via replication

Automatic scales out / down

With [almost] linear scalability

Move computation to data

minimize communication

Share nothing architecture

Hadoop highly-scalable but… Not a high-performance solution!

Requires Design,

Clusters, topology clusters

Setup, OS, Hadoop config

and tuning required Iterative approach

Time consuming

And extensive benchmarking!

Hadoop parameters > 100+ tunable parameters

mapred.map/reduce.tasks.speculative.execution

obscure and interrelated

io.sort.mb 100 (300)

io.sort.record.percent 5% (15%)

io.sort.spill.percent 80% (95 – 100%)

Number of Mappers and Reducers

Rule of thumb 0.5 - 2 per CPU core

Hadoop ecosystem

Large and spread

Dominated by big players

Custom patches

Default values not ideal

Product claims

Cloud vs. On-premise

IaaS

PaaS

EMR, HDInsight

Needs standardization and auditing!

DATA

Product claims Need auditing!

Workload (jobs) All jobs are different!

Different requirements CPU bound

Memory bound

I/O bound … a bit of all

Different tuning for each Needs benchmarking!

Terasort

K-means

Wordcount

Sample mappers and reducer for 3 popular benchmarks:

One for all config?

Vertical line: Average performance for this workload across configurations Values to the right: above average Values to the left: below average

Is there one software configuration iteration that fits everybody?

Co

nfi

gura

tio

ns

13

Good for Terasort but bad for Wordcount

Good for Terasort but bad for Wordcount

Good for Wordcount but very bad for Terasort

Example of SSD impact to Execution time

Impact of SSDs to running time of Terasort

SSDs

HDDs

Co

nfi

gura

tio

ns

SSD

SATA

Too many choices?

Remote volumes

-

-

Rotational HDDs

JBODs

Large VMs

Small VMs

Gb Ethernet

InfiniBand

RAID

Cost

Performance

On-Premise

Cloud

And where is my system configuration positioned on

each of these axes?

High availability

Replication

+

+

Benchmarks

Why benchmark? Validate assumptions

Reproduce bad behavior

Debugging

Measure performance and scale

Simulate higher load

Find bottlenecks / limits

Plan for growth

Test different

SW and HW

Source: Based on High Performance MySQL, benchmarking MySQL chapter

Benchmarking stakeholders and use cases

End-user / consumer Compare products

Developer Profiling

CI / QA

Sysadmin / architect Cluster sizing

SW and HW vendors Product claims

Marketing

Researcher …

Big Data Vs

Volume

Velocity Variety

Structured, semi, unstructured data Different types of data (genres)

Veracity Value

Sample scale factor from TPCx-HS

Data generation Real vs. Synthetic

Random data vs. repeatable

Data generation time

Paralle

Data distribution

Flat or uniformly distributed

Gaussian (normal distribution, skew)

Issues Benchmarking Big Data Big Scale

Single node vs Multiple nodes 10MB vs 10TB

On-metal vs. virtualized vs. cloud

Non-deterministic / Randomness

Need to average multiple runs

How long to benchmark

System warm-up

Distributed systems

Failures?

Types of benchmarks and Standards Micro benchmarks

HDFSIO

Functional

Terasort, ETL

Genre-specific

Graph 500

Application level

BigBench

TPC (implementation) vs SPEC (reference)

TPC vs. SPEC models

Specification based

Performance, price, energy in one benchmark

End-to-end

Multiple tests (ACID, load)

Independent review

Full disclosure

TPC Technology Conference

Kit based

Performance and energy in separate benchmarks

Server-centric

Single test

Peer review

Summary disclosure

SPEC Research Group, ICPE

Source: From presentation by Meikel Poess, 1stWBDB, May 2012

Data Benchmarks

Classical SQL OLAP DB Big Data

First there was TPC-H Classical SQL OLAP

benchmark

MRBench for M/R

On top of Hive or Impala for Hadoop

Then sorting Terasort

Unofficial standard

Now part of TPCx-HS

Hadoop samples Wordcount, grep, terasort, DFSIO

YCSB From Yahoo! For NoSQL, HBASE implementation

GridMix

CALDA

HiBench

SWIM

BigBench based on TCP-DS + ML 30 queries

BigDataBench 33 workloads

TPCx-HS

Comparison of popular Hadoop benchmarks

Spec[1

] App domains

Workload types

Workloads

Scalable data sets[2]

Diverse implem[3]

Multi- tenancy[4]

Subset[5] Simulator[6]

BigDataBench Y Five Four[7] Thirty-three [8]

Eight[9] Y Y Y Y

BigBench Y One Three Ten Three N N N N

CloudSuite N N/A Two Eight Three N N N Y

HiBench N N/A Two Ten Three N N N N

CALDA Y N/A One Five N/A Y N N N

YCSB Y N/A One Six N/A Y N N N

LinkBench Y N/A One Ten N/A Y N N N

AMP Benchmarks

Y N/A One Four N/A Y N N N

The Differences of BigDataBench from Other Benchmarks Suites. Source: BigDataBench homepage

What to measure and metrics

Job execution time

Throughput Units / time

Framework overhead # of spills

Scalability

Concurrency

Abstract metrics

CPU

MEM

DISK IOPS, latency, bandwidth

NET Latency bandwidth

TPCx-HS performance metric (HSph@SF)

Benchmarking

Project ALOJA online repository Entry point for explore the results collected from the

executions,

Provides insights on the obtained results through continuously evolving data views.

Online results at: http://hadoop.bsc.es

http://hadoop.bsc.es/

ALOJA Platform: Evolution and status Benchmarking, Repository, and Analytics tools for Big Data

Composed of open-source Benchmarking, provisioning and orchestration tools,

high-level system performance metric collection,

low-level Hadoop instrumentation based on BSC Tools

and Web based data analytics tools And recommendations

Online Big Data Benchmark repository of: 20,000+ runs (from HiBench)

Sharable, comparable, repeatable, verifiable executions

Abstracting and leveraging tools for BD benchmarking Not reinventing the wheel but,

most current BD tools designed for production, not for benchmarking

leverages current compatible tools and projects

Dev VM toolset and sandbox via Vagrant

Big Data Benchmarking

Online Repository

Analytics

Workflow in ALOJA Cluster(s) definition

• VM sizes

• # nodes

• OS, disks

• Capabilities

Execution plan

• Start cluster

• Exec Benchmarks

• Gather results

• Cleanup

Import data

• Convert perf metric

• Parse logs

• Import into DB

Evaluate data

• Data views in Vagrant VM

• Or http://hadoop.bsc.es

PA and KD •Predictive

Analytics

•Knowledge Discovery

Historic Repo

34

Benchmarks Execution comparisons You can compare, side by side, all execution parameters:

CPU, Memory, Network, Disk, Hadoop parameters….

Sample: http://hadoop.bsc.es/perfcharts?execs[]=91144

HiBench suite HiBench : A Benchmark Suite for Hadoop

4

HiBench

A Comprehensive & Realistic Benchmark Suite

Enhanced DFSIO

Micro Benchmarks Web Search

Sort

WordCount

TeraSort

Nutch Indexing

Page Rank

Machine Learning

Bayesian Classification

K-Means Clustering

HDFS

Code at: https://github.com/intel-hadoop/HiBench

Job resource requirements 1/2

Source: Intel HiBench

Job resource requirements 2/2

Source: Intel HiBench

Impact of SW configurations in Speedup

Number of mappers Compression algorithm

No comp.

ZLIB

BZIP2

snappy

4m

6m

8m

10m

Speedup (higher is better)

Results using: http://hadoop.bsc.es/configimprovement Details: https://raw.githubusercontent.com/Aloja/aloja/master/publications/BSC-MSR_ALOJA.pdf

http://hadoop.bsc.es/configimprovement



Impact of HW configurations in Speedup

Disks and Network Cloud remote volumes

Local only

1 Remote

2 Remotes

3 Remotes

3 Remotes

/tmp local

2 Remotes /tmp local

1 Remotes

/tmp local

HDD-ETH

HDD-IB

SSD-ETH

SDD-IB

Speedup (higher is better)

Results using: http://hadoop.bsc.es/configimprovement Details: https://raw.githubusercontent.com/Aloja/aloja/master/publications/BSC-MSR_ALOJA.pdf




Cost/Performance Scalability Terasort (100GB)

Sample from: http://hadoop.bsc.es/nodeseval

Execution time Execution cost

InfiniBand + SDD (LOCAL)

GbE SDD + (LOCAL) CLOUD (local disk /tmp and HDFS)

CLOUD (/tmp in Local Disk, HDFS in Blob storage 1-3 devices)

CLOUD (/tmp and HDFS in Blob storage 1-3 devices)

InfiniBand + SATA disks (LOCAL)

GbE+ SATA disks (LOCAL)

Price

Performance

Cost-effectiveness On-premise vs. Cloud)

Details at: https://raw.githubusercontent.com/Aloja/aloja/master/publications/BSC-MSR_ALOJA.pdf

Common Benchmarking pitfalls Scalability

Assuming near scalability

Compare apples to apples if benchmarking HW change HW but leave SW the same

Terasort in v1 != Terasort in v2

Test for Big Data use large data

stress the system

If results are too good to be true, they probably aren't Don’t believe in miracles

Expect vendor lies

Source: adapted from Benchmarking Big Data Systems by YANPEI CHEN and GWEN SHAPIRA at Big Data Spain

Resources ALOJA Benchmarking platform and online repository


Big Data Benchmarking Community (BDBC) mailing list (~200 members from ~80organizations) http://clds.sdsc.edu/bdbc/community

Workshop Big Data Benchmarking (WBDB) Next: http://clds.sdsc.edu/wbdb2015.ca

SPEC Research Big Data working group http://research.spec.org/working-groups/big-data-working-group.html

Slides and video: Michael Frank on Big Data benchmarking

http://www.tele-task.de/archive/podcast/20430/

Tilmann Rabl Big Data Benchmarking Tutorial http://www.slideshare.net/tilmann_rabl/ieee2014-tutorialbarurabl




http://clds.sdsc.edu/bdbc/community



http://clds.sdsc.edu/wbdb2015.ca



http://research.spec.org/working-groups/big-data-working-group.html















http://www.slideshare.net/tilmann_rabl/ieee2014-tutorialbarurabl




@BDOOP_BCN

Benchmarking Hadoop by Nicolas Poggi @ni_po

June 2, 2015

benchmarking hadoop and big data

Technology

hadoop design hadoop

bcn benchmarking hadoop

hadoop config

hadoop parameters

hadoop attributes

big data performance

needs benchmarking

hadoop ecosystem large