Benchmarking Hadoop and Big Data

@BDOOP_BCN Benchmarking Hadoop by Nicolas Poggi @ni_po June 2, 2015

Upload: nico-poggi

Post on 21-Aug-2015


TRANSCRIPT

Page 1: Benchmarking Hadoop and Big Data

@BDOOP_BCN

Benchmarking Hadoop by Nicolas Poggi @ni_po

June 2, 2015

Page 2: Benchmarking Hadoop and Big Data

About Nicolas Poggi @ni_po

Page 3: Benchmarking Hadoop and Big Data

What is BDOOP about?

● A group to share on Data

● Scalability

● Performance

● Configurations

● Cluster design

● Benchmarking

● …a/couple of beer/s!

• Having sysadmins in mind

● Also POs

● Not a group to learn

• Java

• MapReduce programming

• Hadoop base concepts

Page 4: Benchmarking Hadoop and Big Data

BDOOP Group Objectives

● Create a local community to

● Learn Big Data: performance and scalability

● Share: day-to-day problems and solutions

● Present your work and findings

● Have talks from renowned experts

● > Your objective here <

Page 5: Benchmarking Hadoop and Big Data

Benchmarking Motivation and Intro

Page 6: Benchmarking Hadoop and Big Data

Hadoop design

Hadoop is designed to process complex data: structured and unstructured

With [close to] linear scalability

Simplifies the programming model compared with MPI, OpenMP, CUDA, …

Operates as a blackbox for data analysts

Image source: Hadoop, the definitive guide

Page 7: Benchmarking Hadoop and Big Data

Hadoop attributes

Fault tolerant: runs on commodity hardware

Built-in redundancy: via replication

Scales out / down automatically: with [almost] linear scalability

Moves computation to the data: minimizes communication

Shared-nothing architecture

Page 8: Benchmarking Hadoop and Big Data

Hadoop: highly scalable but…

Not a high-performance solution!

Requires

Design: clusters, topology

Setup: OS, Hadoop config

Tuning: iterative approach, time consuming

And extensive benchmarking!

Page 9: Benchmarking Hadoop and Big Data

Hadoop parameters

100+ tunable parameters, obscure and interrelated

mapred.map/reduce.tasks.speculative.execution

io.sort.mb: 100 (300)

io.sort.record.percent: 5% (15%)

io.sort.spill.percent: 80% (95 – 100%)

Number of Mappers and Reducers

Rule of thumb: 0.5 - 2 per CPU core
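As a rough illustration of the rule of thumb, the sketch below turns a core count into a range of task slots per node and renders the sort settings as Hadoop `<property>` elements. The function names are hypothetical, and it assumes that the values shown in parentheses on the slide are tuned suggestions rather than defaults.

```python
import math


def task_slot_range(cpu_cores, low=0.5, high=2.0):
    """Return (min_slots, max_slots) per node from the 0.5-2 per core rule of thumb."""
    if cpu_cores < 1:
        raise ValueError("cpu_cores must be at least 1")
    min_slots = max(1, math.floor(cpu_cores * low))
    max_slots = max(1, math.ceil(cpu_cores * high))
    return min_slots, max_slots


# Assumption: the values in parentheses on the slide are tuned suggestions.
# Percentages are written as fractions, as the Hadoop 1.x properties expect.
TUNED_SORT_SETTINGS = {
    "io.sort.mb": "300",
    "io.sort.record.percent": "0.15",
    "io.sort.spill.percent": "0.95",
}


def to_property_xml(settings):
    """Render settings as <property> elements for a Hadoop *-site.xml file."""
    lines = []
    for name, value in settings.items():
        lines.append("<property>")
        lines.append(f"  <name>{name}</name>")
        lines.append(f"  <value>{value}</value>")
        lines.append("</property>")
    return "\n".join(lines)


if __name__ == "__main__":
    print(task_slot_range(8))
    print(to_property_xml(TUNED_SORT_SETTINGS))
```

Any value picked this way is only a starting point for the iterative tuning and benchmarking the slides call for.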

Page 10: Benchmarking Hadoop and Big Data

Hadoop ecosystem

Large and spread

Dominated by big players

Custom patches

Default values not ideal

Product claims

Cloud vs. On-premise

IaaS

PaaS

EMR, HDInsight

Needs standardization and auditing!


Page 11: Benchmarking Hadoop and Big Data

Product claims

Need auditing!

Page 12: Benchmarking Hadoop and Big Data

Workload (jobs)

All jobs are different!

Different requirements: CPU bound, memory bound, I/O bound … or a bit of all

Different tuning for each

Needs benchmarking!

[Figure: sample mappers and reducers for 3 popular benchmarks: Terasort, K-means, Wordcount]
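The sample code on the original slide is an image and is not part of this transcript. As an illustration only, a Wordcount mapper and reducer can be sketched in plain Python; `run_job` is a hypothetical in-memory stand-in for the shuffle phase, not Hadoop's API.

```python
from collections import defaultdict


def wordcount_map(line):
    """Mapper: emit (word, 1) for every word in an input line."""
    for word in line.split():
        yield word.lower(), 1


def wordcount_reduce(word, counts):
    """Reducer: sum all counts emitted for one word."""
    return word, sum(counts)


def run_job(lines, mapper, reducer):
    """Hypothetical local driver: group mapper output by key, then reduce."""
    groups = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            groups[key].append(value)
    return dict(reducer(key, values) for key, values in sorted(groups.items()))


if __name__ == "__main__":
    print(run_job(["big data big", "Data"], wordcount_map, wordcount_reduce))
```

Wordcount does little work per record, so it stresses a cluster differently from Terasort, where the shuffle and sort dominate.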

Page 13: Benchmarking Hadoop and Big Data

One for all config?

Is there one software configuration that fits everybody?

[Figure: performance of each configuration for each workload; y-axis: Configurations]

Vertical line: average performance for this workload across configurations

Values to the right: above average

Values to the left: below average

Annotations: "Good for Terasort but bad for Wordcount"; "Good for Wordcount but very bad for Terasort"

Page 14: Benchmarking Hadoop and Big Data

Example of SSD impact on execution time

Impact of SSDs on running time of Terasort

[Figure: Terasort running time per configuration, SSDs vs. HDDs (SATA); y-axis: Configurations]

Page 15: Benchmarking Hadoop and Big Data

Too many choices?

[Figure: deployment options placed between − and + on several axes]

Options: remote volumes, rotational HDDs, JBODs, RAID, large VMs, small VMs, Gb Ethernet, InfiniBand

Labels: Cost, Performance, High availability, Replication, On-Premise, Cloud

And where is my system configuration positioned on each of these axes?

Page 16: Benchmarking Hadoop and Big Data

Benchmarks

Page 17: Benchmarking Hadoop and Big Data

Why benchmark?

Validate assumptions

Reproduce bad behavior

Debugging

Measure performance and scale

Simulate higher load

Find bottlenecks / limits

Plan for growth

Test different SW and HW

Source: Based on High Performance MySQL, benchmarking MySQL chapter

Page 18: Benchmarking Hadoop and Big Data

Benchmarking stakeholders and use cases

End-user / consumer: compare products

Developer: profiling, CI / QA

Sysadmin / architect: cluster sizing

SW and HW vendors: product claims, marketing

Researcher: …

Page 19: Benchmarking Hadoop and Big Data

Big Data Vs

Volume

Velocity

Variety: structured, semi-structured, unstructured data; different types of data (genres)

Veracity

Value

Sample scale factor from TPCx-HS

Page 20: Benchmarking Hadoop and Big Data

Data generation

Real vs. synthetic

Random vs. repeatable data

Data generation time

Parallel

Data distribution

Flat or uniformly distributed

Gaussian (normal distribution, skew)
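A minimal sketch of repeatable synthetic data generation, assuming integer keys in a fixed key space. The function name, the default seed, and the Gaussian parameters are illustrative choices, not taken from any benchmark kit.

```python
import random


def generate_keys(n, distribution="uniform", seed=42, key_space=1000):
    """Generate n integer keys in [0, key_space); the same seed gives the same data."""
    rng = random.Random(seed)  # private generator, so runs are repeatable
    keys = []
    for _ in range(n):
        if distribution == "uniform":
            key = rng.randrange(key_space)
        elif distribution == "gaussian":
            # Values cluster around the middle of the key space, producing skew.
            raw = rng.gauss(key_space / 2, key_space / 8)
            key = min(key_space - 1, max(0, int(raw)))
        else:
            raise ValueError(f"unknown distribution: {distribution}")
        keys.append(key)
    return keys


if __name__ == "__main__":
    print(generate_keys(5, "uniform"))
    print(generate_keys(5, "gaussian"))
```

With a Gaussian distribution a few partitions receive most of the keys, which is the kind of skew a uniform generator would hide.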

Page 21: Benchmarking Hadoop and Big Data

Issues Benchmarking Big Data

Big scale

Single node vs. multiple nodes

10MB vs. 10TB

On-metal vs. virtualized vs. cloud

Non-deterministic / randomness

Need to average multiple runs

How long to benchmark

System warm-up

Distributed systems

Failures?
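One way to handle randomness and warm-up is to discard the first runs and report the mean and standard deviation of the rest. The sketch below assumes the workload is any Python callable; the helper name and default run counts are hypothetical.

```python
import statistics
import time


def benchmark(workload, runs=5, warmup=1):
    """Time a callable: run warm-up iterations untimed, then average the timed runs."""
    if runs < 2:
        raise ValueError("need at least 2 timed runs to compute a deviation")
    for _ in range(warmup):
        workload()  # warm-up: results are discarded
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        workload()
        timings.append(time.perf_counter() - start)
    return {
        "runs": runs,
        "mean_s": statistics.mean(timings),
        "stdev_s": statistics.stdev(timings),
    }


if __name__ == "__main__":
    print(benchmark(lambda: sorted(range(100_000, 0, -1)), runs=5, warmup=2))
```

A large standard deviation relative to the mean suggests that more runs, or a longer benchmark, are needed before comparing configurations.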

Page 22: Benchmarking Hadoop and Big Data

Types of benchmarks and Standards

Micro benchmarks: HDFSIO

Functional: Terasort, ETL

Genre-specific: Graph 500

Application level: BigBench

TPC (implementation) vs SPEC (reference)

Page 23: Benchmarking Hadoop and Big Data

TPC vs. SPEC models

TPC

Specification based

Performance, price, energy in one benchmark

End-to-end

Multiple tests (ACID, load)

Independent review

Full disclosure

TPC Technology Conference

SPEC

Kit based

Performance and energy in separate benchmarks

Server-centric

Single test

Peer review

Summary disclosure

SPEC Research Group, ICPE

Source: From presentation by Meikel Poess, 1stWBDB, May 2012

Page 24: Benchmarking Hadoop and Big Data

Data Benchmarks

Classical SQL OLAP DB Big Data

First there was TPC-H: classical SQL OLAP benchmark

MRBench for M/R

On top of Hive or Impala for Hadoop

Then sorting: Terasort

Unofficial standard

Now part of TPCx-HS

Hadoop samples: Wordcount, grep, terasort, DFSIO

YCSB: from Yahoo!, for NoSQL, HBase implementation

GridMix

CALDA

HiBench

SWIM

BigBench: based on TPC-DS + ML, 30 queries

BigDataBench: 33 workloads

TPCx-HS

Page 25: Benchmarking Hadoop and Big Data

Comparison of popular Hadoop benchmarks

Benchmark | Spec[1] | App domains | Workload types | Workloads | Scalable data sets[2] | Diverse implem[3] | Multi-tenancy[4] | Subset[5] | Simulator[6]
BigDataBench | Y | Five | Four[7] | Thirty-three[8] | Eight[9] | Y | Y | Y | Y
BigBench | Y | One | Three | Ten | Three | N | N | N | N
CloudSuite | N | N/A | Two | Eight | Three | N | N | N | Y
HiBench | N | N/A | Two | Ten | Three | N | N | N | N
CALDA | Y | N/A | One | Five | N/A | Y | N | N | N
YCSB | Y | N/A | One | Six | N/A | Y | N | N | N
LinkBench | Y | N/A | One | Ten | N/A | Y | N | N | N
AMP Benchmarks | Y | N/A | One | Four | N/A | Y | N | N | N

The Differences of BigDataBench from Other Benchmarks Suites. Source: BigDataBench homepage

Page 26: Benchmarking Hadoop and Big Data

What to measure and metrics

Job execution time

Throughput: units / time

Framework overhead: # of spills

Scalability

Concurrency

Abstract metrics

CPU

MEM

DISK: IOPS, latency, bandwidth

NET: latency, bandwidth

TPCx-HS performance metric (HSph@SF)
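A small sketch of two of these metrics. Throughput is simply units over time. For HSph@SF, the sketch assumes the TPCx-HS definition of scale factor divided by total elapsed time in hours; the function names are hypothetical, and the official specification should be checked before reporting results.

```python
def throughput(units_processed, elapsed_seconds):
    """Units of work per second (for example records/s or MB/s)."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return units_processed / elapsed_seconds


def hsph_at_sf(scale_factor, elapsed_seconds):
    """HSph@SF, assumed here to be SF / (T / 3600) with T in seconds."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return scale_factor / (elapsed_seconds / 3600.0)


if __name__ == "__main__":
    # Illustrative numbers only, not measured results.
    print(throughput(1_000_000, 250))
    print(hsph_at_sf(1, 1800))
```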

Page 27: Benchmarking Hadoop and Big Data

Benchmarking

Page 28: Benchmarking Hadoop and Big Data

Project ALOJA online repository

Entry point to explore the results collected from the executions

Provides insights on the obtained results through continuously evolving data views

Online results at: http://hadoop.bsc.es

Page 29: Benchmarking Hadoop and Big Data

ALOJA Platform: Evolution and status

Benchmarking, Repository, and Analytics tools for Big Data

Composed of open-source:

Benchmarking, provisioning and orchestration tools

High-level system performance metric collection

Low-level Hadoop instrumentation based on BSC Tools

Web-based data analytics tools and recommendations

Online Big Data Benchmark repository of:

20,000+ runs (from HiBench)

Sharable, comparable, repeatable, verifiable executions

Abstracting and leveraging tools for BD benchmarking

Not reinventing the wheel, but most current BD tools are designed for production, not for benchmarking

Leverages current compatible tools and projects

Dev VM toolset and sandbox via Vagrant

[Figure: Big Data Benchmarking, Online Repository, Analytics]

Page 30: Benchmarking Hadoop and Big Data

Workflow in ALOJA

Cluster(s) definition

• VM sizes

• # nodes

• OS, disks

• Capabilities

Execution plan

• Start cluster

• Exec Benchmarks

• Gather results

• Cleanup

Import data

• Convert perf metrics

• Parse logs

• Import into DB

Evaluate data

• Data views in Vagrant VM

• Or http://hadoop.bsc.es

PA and KD

• Predictive Analytics

• Knowledge Discovery

Historic Repo
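The execution-plan stage (start cluster, execute benchmarks, gather results, cleanup) can be sketched as a small driver. This is not ALOJA's code: every function and parameter name below is hypothetical, and the four steps are passed in as callables so the sketch stays self-contained.

```python
def run_execution_plan(cluster, benchmarks, start, execute, gather, cleanup):
    """Run each benchmark on a cluster and always clean up afterwards."""
    results = []
    start(cluster)
    try:
        for name in benchmarks:
            execute(cluster, name)
            results.append(gather(cluster, name))
    finally:
        # Cleanup runs even if a benchmark fails, so the cluster is not left running.
        cleanup(cluster)
    return results


if __name__ == "__main__":
    log = []
    results = run_execution_plan(
        cluster="test-cluster",
        benchmarks=["terasort", "wordcount"],
        start=lambda c: log.append(f"start {c}"),
        execute=lambda c, b: log.append(f"exec {b}"),
        gather=lambda c, b: {"benchmark": b, "cluster": c},
        cleanup=lambda c: log.append(f"cleanup {c}"),
    )
    print(log)
    print(results)
```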

Page 31: Benchmarking Hadoop and Big Data


Benchmarks Execution comparisons

You can compare, side by side, all execution parameters:

CPU, Memory, Network, Disk, Hadoop parameters…

Sample: http://hadoop.bsc.es/perfcharts?execs[]=91144

Page 32: Benchmarking Hadoop and Big Data

HiBench suite

HiBench: A Benchmark Suite for Hadoop

A Comprehensive & Realistic Benchmark Suite

Micro Benchmarks: Sort, WordCount, TeraSort

Web Search: Nutch Indexing, Page Rank

Machine Learning: Bayesian Classification, K-Means Clustering

HDFS: Enhanced DFSIO

Code at: https://github.com/intel-hadoop/HiBench

Page 33: Benchmarking Hadoop and Big Data

Job resource requirements 1/2

Source: Intel HiBench

Page 34: Benchmarking Hadoop and Big Data

Job resource requirements 2/2

Source: Intel HiBench

Page 35: Benchmarking Hadoop and Big Data

Impact of SW configurations on speedup

[Figure: two panels, "Number of mappers" (4m, 6m, 8m, 10m) and "Compression algorithm" (No comp., ZLIB, BZIP2, snappy); axis: Speedup (higher is better)]

Results using: http://hadoop.bsc.es/configimprovement

Details: https://raw.githubusercontent.com/Aloja/aloja/master/publications/BSC-MSR_ALOJA.pdf
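Speedup in charts like this is commonly computed as the baseline execution time divided by the execution time of each configuration, so values above 1 are faster than the baseline. The sketch below assumes that definition; the function name and the timings are made up for illustration and are not ALOJA results.

```python
def speedups(exec_times, baseline):
    """Return speedup of every configuration relative to the baseline one."""
    base_time = exec_times[baseline]
    return {name: base_time / seconds for name, seconds in exec_times.items()}


if __name__ == "__main__":
    # Hypothetical execution times in seconds, for illustration only.
    times = {"no_compression": 1200.0, "zlib": 1000.0, "snappy": 800.0}
    for name, value in speedups(times, baseline="no_compression").items():
        print(f"{name}: {value:.2f}x")
```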

Page 36: Benchmarking Hadoop and Big Data

Impact of HW configurations on speedup

[Figure: two panels. "Disks and Network": HDD-ETH, HDD-IB, SSD-ETH, SSD-IB. "Cloud remote volumes": Local only; 1 Remote; 2 Remotes; 3 Remotes; 3 Remotes /tmp local; 2 Remotes /tmp local; 1 Remote /tmp local. Axis: Speedup (higher is better)]

Results using: http://hadoop.bsc.es/configimprovement

Details: https://raw.githubusercontent.com/Aloja/aloja/master/publications/BSC-MSR_ALOJA.pdf

Page 37: Benchmarking Hadoop and Big Data

Cost/Performance Scalability Terasort (100GB)

Sample from: http://hadoop.bsc.es/nodeseval

Execution time Execution cost
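One simple way to relate execution time to execution cost is to multiply the run duration by the number of nodes and an hourly price per node. The sketch assumes that pricing model; it is not necessarily how ALOJA computes cost, and the prices used are placeholders.

```python
def execution_cost(elapsed_seconds, nodes, price_per_node_hour):
    """Cost of one run, assuming every node is billed per hour of use."""
    hours = elapsed_seconds / 3600.0
    return hours * nodes * price_per_node_hour


if __name__ == "__main__":
    # Hypothetical cluster sizes, timings, and price, for illustration only.
    for nodes, seconds in [(4, 3600), (8, 2000), (16, 1200)]:
        cost = execution_cost(seconds, nodes, price_per_node_hour=0.5)
        print(f"{nodes} nodes: {seconds} s, cost {cost:.2f}")
```

In this made-up example adding nodes shortens the run but raises its cost, which is why execution time and execution cost are worth plotting side by side.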

Page 38: Benchmarking Hadoop and Big Data

Cost-effectiveness (on-premise vs. cloud)

[Figure: price vs. performance of the setups below]

InfiniBand + SSD (LOCAL)

GbE + SSD (LOCAL)

CLOUD (local disk /tmp and HDFS)

CLOUD (/tmp in Local Disk, HDFS in Blob storage 1-3 devices)

CLOUD (/tmp and HDFS in Blob storage 1-3 devices)

InfiniBand + SATA disks (LOCAL)

GbE + SATA disks (LOCAL)

Details at: https://raw.githubusercontent.com/Aloja/aloja/master/publications/BSC-MSR_ALOJA.pdf

Page 39: Benchmarking Hadoop and Big Data

Common Benchmarking pitfalls

Scalability: assuming near-linear scalability

Compare apples to apples: if benchmarking HW, change HW but leave SW the same

Terasort in v1 != Terasort in v2

Test for Big Data: use large data, stress the system

If results look too good to be true, they probably are

Don’t believe in miracles

Expect vendor lies

Source: adapted from Benchmarking Big Data Systems by YANPEI CHEN and GWEN SHAPIRA at Big Data Spain

Page 40: Benchmarking Hadoop and Big Data

Resources

ALOJA Benchmarking platform and online repository: http://hadoop.bsc.es/

Big Data Benchmarking Community (BDBC) mailing list (~200 members from ~80 organizations): http://clds.sdsc.edu/bdbc/community

Workshop on Big Data Benchmarking (WBDB), next: http://clds.sdsc.edu/wbdb2015.ca

SPEC Research Big Data working group: http://research.spec.org/working-groups/big-data-working-group.html

Slides and video, Michael Frank on Big Data benchmarking: http://www.tele-task.de/archive/podcast/20430/

Tilmann Rabl, Big Data Benchmarking Tutorial: http://www.slideshare.net/tilmann_rabl/ieee2014-tutorialbarurabl
