Download - Benchmarking Hadoop and Big Data
@BDOOP_BCN
Benchmarking Hadoop by Nicolas Poggi @ni_po
June 2, 2015
About Nicolas Poggi @ni_po
What is BDOOP about? ● A group to share on Data
● Scalability
● Performance
● Configurations
● Cluster design
● Benchmarking
● …a/couple of beer/s!
• Having sysadmins in mind
● Also POs
● Not a group to learn
• Java
• Mapreduce programming
• Hadoop base concepts
BDOOP Group Objectives ● Create a local community to
● Learn Big Data ● performance and scalability
● Share ● day-to-day problems and solutions
● Present your work and findings
● Have talks from renown experts
● > Your objective here <
Benchmarking Motivation and Intro
Hadoop design
Hadoop designed to solve complex data Structured and non structured
With [close to] linear scalability
Simplifying the programming model From MPI, OpenMP, CUDA, …
Operates as a blackbox for data analysts
Image source: Hadoop, the definitive guide
Hadoop attributes Fault tolerant
from commodity hardware
Built in redundancy
via replication
Automatic scales out / down
With [almost] linear scalability
Move computation to data
minimize communication
Share nothing architecture
Hadoop highly-scalable but… Not a high-performance solution!
Requires Design,
Clusters, topology clusters
Setup, OS, Hadoop config
and tuning required Iterative approach
Time consuming
And extensive benchmarking!
Hadoop parameters > 100+ tunable parameters
mapred.map/reduce.tasks.speculative.execution
obscure and interrelated
io.sort.mb 100 (300)
io.sort.record.percent 5% (15%)
io.sort.spill.percent 80% (95 – 100%)
Number of Mappers and Reducers
Rule of thumb 0.5 - 2 per CPU core
Hadoop ecosystem
Large and spread
Dominated by big players
Custom patches
Default values not ideal
Product claims
Cloud vs. On-premise
IaaS
PaaS
EMR, HDInsight
Needs standardization and auditing!
DATA
Product claims Need auditing!
Workload (jobs) All jobs are different!
Different requirements CPU bound
Memory bound
I/O bound … a bit of all
Different tuning for each Needs benchmarking!
Terasort
K-means
Wordcount
Sample mappers and reducer for 3 popular benchmarks:
One for all config?
Vertical line: Average performance for this workload across configurations Values to the right: above average Values to the left: below average
Is there one software configuration iteration that fits everybody?
Co
nfi
gura
tio
ns
13
Good for Terasort but bad for Wordcount
Good for Terasort but bad for Wordcount
Good for Wordcount but very bad for Terasort
Example of SSD impact to Execution time
Impact of SSDs to running time of Terasort
SSDs
HDDs
Co
nfi
gura
tio
ns
SSD
SATA
Too many choices?
Remote volumes
-
-
Rotational HDDs
JBODs
Large VMs
Small VMs
Gb Ethernet
InfiniBand
RAID
Cost
Performance
On-Premise
Cloud
And where is my system configuration positioned on
each of these axes?
High availability
Replication
+
+
Benchmarks
Why benchmark? Validate assumptions
Reproduce bad behavior
Debugging
Measure performance and scale
Simulate higher load
Find bottlenecks / limits
Plan for growth
Test different
SW and HW
Source: Based on High Performance MySQL, benchmarking MySQL chapter
Benchmarking stakeholders and use cases
End-user / consumer Compare products
Developer Profiling
CI / QA
Sysadmin / architect Cluster sizing
SW and HW vendors Product claims
Marketing
Researcher …
Big Data Vs
Volume
Velocity Variety
Structured, semi, unstructured data Different types of data (genres)
Veracity Value
Sample scale factor from TPCx-HS
Data generation Real vs. Synthetic
Random data vs. repeatable
Data generation time
Paralle
Data distribution
Flat or uniformly distributed
Gaussian (normal distribution, skew)
Issues Benchmarking Big Data Big Scale
Single node vs Multiple nodes 10MB vs 10TB
On-metal vs. virtualized vs. cloud
Non-deterministic / Randomness
Need to average multiple runs
How long to benchmark
System warm-up
Distributed systems
Failures?
Types of benchmarks and Standards Micro benchmarks
HDFSIO
Functional
Terasort, ETL
Genre-specific
Graph 500
Application level
BigBench
TPC (implementation) vs SPEC (reference)
TPC vs. SPEC models
Specification based
Performance, price, energy in one benchmark
End-to-end
Multiple tests (ACID, load)
Independent review
Full disclosure
TPC Technology Conference
Kit based
Performance and energy in separate benchmarks
Server-centric
Single test
Peer review
Summary disclosure
SPEC Research Group, ICPE
Source: From presentation by Meikel Poess, 1stWBDB, May 2012
Data Benchmarks
Classical SQL OLAP DB Big Data
First there was TPC-H Classical SQL OLAP
benchmark
MRBench for M/R
On top of Hive or Impala for Hadoop
Then sorting Terasort
Unofficial standard
Now part of TPCx-HS
Hadoop samples Wordcount, grep, terasort, DFSIO
YCSB From Yahoo! For NoSQL, HBASE implementation
GridMix
CALDA
HiBench
SWIM
BigBench based on TCP-DS + ML 30 queries
BigDataBench 33 workloads
TPCx-HS
Comparison of popular Hadoop benchmarks
Spec[1
] App domains
Workload types
Workloads
Scalable data sets[2]
Diverse implem[3]
Multi- tenancy[4]
Subset[5] Simulator[6]
BigDataBench Y Five Four[7] Thirty-three [8]
Eight[9] Y Y Y Y
BigBench Y One Three Ten Three N N N N
CloudSuite N N/A Two Eight Three N N N Y
HiBench N N/A Two Ten Three N N N N
CALDA Y N/A One Five N/A Y N N N
YCSB Y N/A One Six N/A Y N N N
LinkBench Y N/A One Ten N/A Y N N N
AMP Benchmarks
Y N/A One Four N/A Y N N N
The Differences of BigDataBench from Other Benchmarks Suites. Source: BigDataBench homepage
What to measure and metrics
Job execution time
Throughput Units / time
Framework overhead # of spills
Scalability
Concurrency
Abstract metrics
CPU
MEM
DISK IOPS, latency, bandwidth
NET Latency bandwidth
TPCx-HS performance metric (HSph@SF)
Benchmarking
Project ALOJA online repository Entry point for explore the results collected from the
executions,
Provides insights on the obtained results through continuously evolving data views.
Online results at: http://hadoop.bsc.es
ALOJA Platform: Evolution and status Benchmarking, Repository, and Analytics tools for Big Data
Composed of open-source Benchmarking, provisioning and orchestration tools,
high-level system performance metric collection,
low-level Hadoop instrumentation based on BSC Tools
and Web based data analytics tools And recommendations
Online Big Data Benchmark repository of: 20,000+ runs (from HiBench)
Sharable, comparable, repeatable, verifiable executions
Abstracting and leveraging tools for BD benchmarking Not reinventing the wheel but,
most current BD tools designed for production, not for benchmarking
leverages current compatible tools and projects
Dev VM toolset and sandbox via Vagrant
Big Data Benchmarking
Online Repository
Analytics
Workflow in ALOJA Cluster(s) definition
• VM sizes
• # nodes
• OS, disks
• Capabilities
Execution plan
• Start cluster
• Exec Benchmarks
• Gather results
• Cleanup
Import data
• Convert perf metric
• Parse logs
• Import into DB
Evaluate data
• Data views in Vagrant VM
• Or http://hadoop.bsc.es
PA and KD •Predictive
Analytics
•Knowledge Discovery
Historic Repo
34
Benchmarks Execution comparisons You can compare, side by side, all execution parameters:
CPU, Memory, Network, Disk, Hadoop parameters….
Sample: http://hadoop.bsc.es/perfcharts?execs[]=91144
HiBench suite HiBench : A Benchmark Suite for Hadoop
4
HiBench
A Comprehensive & Realistic Benchmark Suite
Enhanced DFSIO
Micro Benchmarks Web Search
Sort
WordCount
TeraSort
Nutch Indexing
Page Rank
Machine Learning
Bayesian Classification
K-Means Clustering
HDFS
Code at: https://github.com/intel-hadoop/HiBench
Job resource requirements 1/2
Source: Intel HiBench
Job resource requirements 2/2
Source: Intel HiBench
Impact of SW configurations in Speedup
Number of mappers Compression algorithm
No comp.
ZLIB
BZIP2
snappy
4m
6m
8m
10m
Speedup (higher is better)
Results using: http://hadoop.bsc.es/configimprovement Details: https://raw.githubusercontent.com/Aloja/aloja/master/publications/BSC-MSR_ALOJA.pdf
Impact of HW configurations in Speedup
Disks and Network Cloud remote volumes
Local only
1 Remote
2 Remotes
3 Remotes
3 Remotes
/tmp local
2 Remotes /tmp local
1 Remotes
/tmp local
HDD-ETH
HDD-IB
SSD-ETH
SDD-IB
Speedup (higher is better)
Results using: http://hadoop.bsc.es/configimprovement Details: https://raw.githubusercontent.com/Aloja/aloja/master/publications/BSC-MSR_ALOJA.pdf
Cost/Performance Scalability Terasort (100GB)
Sample from: http://hadoop.bsc.es/nodeseval
Execution time Execution cost
InfiniBand + SDD (LOCAL)
GbE SDD + (LOCAL) CLOUD (local disk /tmp and HDFS)
CLOUD (/tmp in Local Disk, HDFS in Blob storage 1-3 devices)
CLOUD (/tmp and HDFS in Blob storage 1-3 devices)
InfiniBand + SATA disks (LOCAL)
GbE+ SATA disks (LOCAL)
Price
Performance
Cost-effectiveness On-premise vs. Cloud)
Details at: https://raw.githubusercontent.com/Aloja/aloja/master/publications/BSC-MSR_ALOJA.pdf
Common Benchmarking pitfalls Scalability
Assuming near scalability
Compare apples to apples if benchmarking HW change HW but leave SW the same
Terasort in v1 != Terasort in v2
Test for Big Data use large data
stress the system
If results are too good to be true, they probably aren't Don’t believe in miracles
Expect vendor lies
Source: adapted from Benchmarking Big Data Systems by YANPEI CHEN and GWEN SHAPIRA at Big Data Spain
Resources ALOJA Benchmarking platform and online repository
http://hadoop.bsc.es/
Big Data Benchmarking Community (BDBC) mailing list (~200 members from ~80organizations) http://clds.sdsc.edu/bdbc/community
Workshop Big Data Benchmarking (WBDB) Next: http://clds.sdsc.edu/wbdb2015.ca
SPEC Research Big Data working group http://research.spec.org/working-groups/big-data-working-group.html
Slides and video: Michael Frank on Big Data benchmarking
http://www.tele-task.de/archive/podcast/20430/
Tilmann Rabl Big Data Benchmarking Tutorial http://www.slideshare.net/tilmann_rabl/ieee2014-tutorialbarurabl
@BDOOP_BCN
Benchmarking Hadoop by Nicolas Poggi @ni_po
June 2, 2015