
Page 1: The state of SQL-on-Hadoop in the Cloud

The state of SQL-on-Hadoop in the Cloud

By Nicolas Poggi

Lead researcher – Big Data Frameworks

Data Centric Computing (DCC) Research Group

BDOOP Meetup – Dec 2016

Page 2: The state of SQL-on-Hadoop in the Cloud

Agenda

• Intro and motivation

• PaaS services overview
  • Instances comparison
  • SW and HW specs
  • Elasticity
  • Perf metrics

• SQL Benchmark
  • Test methodology
  • Evaluations
    • Execution times
    • Data size scalability
    • Price / Performance

• PaaS evolution over time
  • SW and HW improvements

• Summary
  • Lessons learned
  • Conclusions & future work

Page 3: The state of SQL-on-Hadoop in the Cloud

ALOJA: towards cost-effective Big Data

• Open research project for automating characterization and optimization of Big Data deployments

• Open source Benchmarking-to-Insights platform and tools

• Largest Big Data public repository• Community collaboration with industry and academia

• Preliminary to this study:• Big Data Benchmark Compendium (TPC-TC `15)

• The Benefits of Hadoop as PaaS (Hadoop Summit EU `16)

http://aloja.bsc.es

[Diagram: ALOJA platform components: Big Data Benchmarking, Online Repository, Web / ML, Analytics]

Page 4: The state of SQL-on-Hadoop in the Cloud

Motivation of SQL-on-Hadoop study

• Extend the ALOJA platform to survey popular PaaS SQL Big Data Cloud solutions using Hive [to begin with]

• First approach to the services, from an end-user's perspective
  • Using the public cloud (and pricing), online docs, and resources
  • Medium-size test deployments and data (8 data nodes, up to 1TB)

• Evaluate and compare out-of-the-box behavior (default VMs and config)
  • Architectural differences, readiness, competitive advantages
  • Scalability, price and performance

Disclaimer: snapshot of the out-of-the-box price and performance during March-July 2016. Performance and especially costs change often. We use non-discounted pricing. I/O costs are complex to estimate for a single benchmark.


Page 5: The state of SQL-on-Hadoop in the Cloud

Platform-as-a-Service Big Data

• Cloud-based managed Hadoop services
  • Ready-to-use Hive, Spark, …
  • Simplified management

• Deploys in minutes, on-demand, elastic
  • You select the instance type and the number of processing nodes

• Pay-as-you-go, pay-as-you-process models

• Optimized for general purpose
  • Fine-tuned to the cloud provider's architecture


Page 6: The state of SQL-on-Hadoop in the Cloud

Surveyed Hadoop/Hive PaaS services

• Amazon Elastic MapReduce (EMR)
  • Released: Apr 2009
  • OS: Amazon Linux AMI 4.4 (RHEL-like)
  • SW stack: EMR (custom, 4.7*)
  • Instances: m3.xlarge and m4.xlarge

• Google Cloud Dataproc (CDP)
  • Released: Feb 2016
  • OS: Debian GNU/Linux 8.4
  • SW stack: custom, v1
  • Instances: n1-standard-4 and n1-standard-8

• Azure HDInsight (HDI)
  • Released: Oct 2013
  • OS: Windows Server and Ubuntu 14.04.5 LTS
  • SW stack: HDP based (v2.3 and 2.4**)
  • Instances: A3, D3 v1-2, and D4 v1-2

• Rackspace Cloud Big Data (CBD)
  • Released: Oct 2013
  • OS: CentOS 7
  • SW stack: HDP 2.3
  • API: OpenStack (+ Lava)
  • Instances: hadoop 1-7, 1-15, 1-30, OnMetal 40

We selected defaults and general-purpose VMs; on-premises results serve as a baseline.
* EMR v5 released in Aug 2016. ** HDI offers HDP 2.5 since Sept 2016.


Page 7: The state of SQL-on-Hadoop in the Cloud

Systems-Under-Test (SUTs): VM/instance specs, elasticity, perf characterization

Focus: 8-datanodes, up to 1TB data size


Page 8: The state of SQL-on-Hadoop in the Cloud

SUTs: Tech specs and costs

Notes:
• Default cloud SKUs have 4 cores and ~15GB of RAM in all providers (~4GB of RAM per core)
• Prices vary greatly
• Rackspace defaults to the high-end OnMetal instance

Provider (region)                 | Instance type          | Default? | Cores/Node | RAM/Node (GB) | RAM/core (GB) | Data Nodes | Cost/Hour (cluster) | Shared infra
Amazon EMR (us-east-1)            | m3.xlarge              | Yes      | 4          | 15            | 3.8           | 8          | $3.36               | Yes
Amazon EMR (us-east-1)            | m4.xlarge              |          | 4          | 16            | 4             | 8          | $2.99               | Yes
Google CDP (europe-west1-b)       | n1-standard-4          | Yes      | 4          | 15            | 3.8           | 8          | $1.81               | Yes
Google CDP (europe-west1-b)       | n1-standard-4 + 1 SSD  |          | 4          | 15            | 3.8           | 8          | $1.92               | Yes
Google CDP (europe-west1-b)       | n1-standard-8          |          | 8          | 30            | 7.5           | 8          | $3.61               | Yes
Azure HDI (South Central US)      | A3 (Large)             | Old def. | 4          | 7             | 1.8           | 8          | $2.70               | Yes
Azure HDI (South Central US)      | D3 v1 and v2           | Yes      | 4          | 14            | 3.5           | 8          | $5.25               | Yes
Azure HDI (South Central US)      | D4 v1 and v2           |          | 8          | 28            | 3.5           | 8          | $10.48              | Yes
Rackspace CBD (Northern Virginia) | hadoop1-7              |          | 2          | 7             | 3.5           | 8          | $2.72               | Yes
Rackspace CBD (Northern Virginia) | hadoop1-15             | 2nd      | 4          | 15            | 3.8           | 8          | $5.44               | Yes
Rackspace CBD (Northern Virginia) | hadoop1-30             |          | 8          | 30            | 3.8           | 8          | $10.88              | Yes
Rackspace CBD (Northern Virginia) | OnMetal 40             | Yes      | 40         | 128           | 3.2           | 4          | $11.80              | No
On-premises                       | 2012 (12 cores / 64GB) |          | 12         | 64            | 5.3           | 8          | $3.50*              | No

* Estimate based on a 3-year lifetime including support and maintenance (see refs.)

Page 9: The state of SQL-on-Hadoop in the Cloud

SUTs: Elasticity and I/O

Provider      | Instance type          | Elasticity                    | Storage                           | Includes I/O costs | Cost/5TB/hr*              | Deploy time (min)
Amazon EMR    | m3.xlarge              | Compute (and EBS option)      | 2x40GB local SSD / node           | Yes / No with EBS  | $0.07 (EBS)               | ~10
Amazon EMR    | m4.xlarge              | Compute and EBS (fixed size)  | EBS size defined on deploy        | No                 |                           | ~10
Google CDP    | n1-standard-4          | Compute and GCS (fixed size)  | GCS size defined on deploy        | No                 | $0.18 (GCS)               | ~1
Google CDP    | n1-standard-4 + 1 SSD  | Compute and GCS (fixed size)  | 1x375GB SSD** + GCS               | No                 |                           | ~1
Google CDP    | n1-standard-8          | Compute and GCS (fixed size)  | GCS size defined on deploy        | No                 |                           | ~1
Azure HDI     | A3 (Large)             | Compute and storage           | Elastic (WASB)                    | No                 | $0.17 (WASB)              | ~25
Azure HDI     | D3 v1 and v2           | Compute and storage           | Elastic (WASB) + 200GB local SSD  | No                 |                           | ~25
Azure HDI     | D4 v1 and v2           | Compute and storage           | Elastic (WASB) + 400GB local SSD  | No                 |                           | ~25
Rackspace CBD | hadoop1-7              | Compute (Cloud Files option)  | 1.5TB SATA / node                 | Yes                | $0.00 local / $0.07 cloud | ~25
Rackspace CBD | hadoop1-15             | Compute (Cloud Files option)  | 2.5TB SATA / node                 | Yes                |                           | ~25
Rackspace CBD | hadoop1-30             | Compute (Cloud Files option)  | 5TB SATA / node                   | Yes                |                           | ~25
Rackspace CBD | OnMetal 40             | Compute (Cloud Files option)  | 2x1.5TB SSD / node                | Yes                |                           | ~25
On-premises   | 2012 (12 cores / 64GB) | No                            | 1TB SATA x6 / node                | Yes                | $0.00                     | N/A

* Tests need 5TB of raw HDFS storage, so this cost is used. ** Supports up to 4 SSD drives.

Page 10: The state of SQL-on-Hadoop in the Cloud

SUTs: CPU, MEM, and NET characterization

MEM: Ops/s (sysbench 1-thread, default blk size)

[Chart: MEM ops/s per SUT]

Notes: EMR m3.x, CBD hadoop1-x, and HDI A3 have older-generation CPUs; OnMetal is much higher. Higher is better; defaults in purple. Benchmarked using ALOJA cluster-bench, sysbench, and iperf.

NET: iperf Gb/s (1 thread, 100GB)

[Chart: network throughput per SUT. EMR < 10 Gb/s; CDP < 8 Gb/s; HDI < 6 Gb/s (SKU dependent); CBD < 5 Gb/s; On-prem 1 Gb/s]

Page 11: The state of SQL-on-Hadoop in the Cloud

SUTs: HDFS I/O 1TB

[Chart: DFSIO_read and DFSIO_write throughput in MB/s per SUT]

Notes:
• CBD: OnMetal is the highest; the Cloud instances show very low write throughput
• CDP: SSDs are used as tiered storage
• EMR: SKU dependent, low MB/s
• HDI: throughput is SKU dependent
• Most SUTs stay below 50 MB/s read/write, except for OnMetal and CDP with SSD

Higher is better. Benchmarked using ALOJA Hadoop-Examples.

Page 12: The state of SQL-on-Hadoop in the Cloud

SQL-on-Hadoop benchmarking: methodology and evaluations

Page 13: The state of SQL-on-Hadoop in the Cloud

Benchmark suite: TPC-H (derived)

• DB industry standard for decision support
  • Well understood and accepted benchmark (since '99)
  • Audited results available online

• 22 "real world" business queries (see the example after this list)
  • Complex joins, grouping, nested queries

• Defines scale factors for data

• DDLs and queries from the D2F-Bench project:
  • Includes Hive adaptation with ORC tables
  • Repo: https://github.com/Aloja/D2F-Bench
  • Based on https://github.com/hortonworks/hive-testbench; changes make it HDP agnostic
  • Supports other engines: Spark, Pig, Impala, Drill, …
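To give a feel for the query shapes, this is roughly TPC-H Q6 (a single-table scan with one aggregate) written as HiveQL; the parameter values are the spec defaults and the table name assumes the lineitem table of the schema below, so treat it as an illustration rather than the exact query text used in the study.

-- Approximate TPC-H Q6: one scan over lineitem, a selective filter, one aggregate
SELECT SUM(l_extendedprice * l_discount) AS revenue
FROM   lineitem
WHERE  l_shipdate >= '1994-01-01'
  AND  l_shipdate <  '1995-01-01'
  AND  l_discount BETWEEN 0.05 AND 0.07
  AND  l_quantity < 24;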


TPC-H 8-tables schema

Page 14: The state of SQL-on-Hadoop in the Cloud

Test methodology

• ALOJA-BENCH as the driver
• Test methodology
  • Queries run from 1 to 22, sequentially, to try to avoid caches
  • [At least] 3 repetitions
  • Query ALL (Q ALL) as the full run
  • Power runs (no concurrency)
• Data sizes: 1GB, 10GB, 100GB, 500GB*, 1TB
• Metrics: execution time and cost
• Comparisons
  • Q ALL (full run)
  • Scans: Q1 and Q6
  • Joins: Q2 and Q16
  • Q16 is the most "complete" single query
• Process and settings
  • TPC-H datagen CSVs converted to Hive ORC tables (see the sketch after this list)
  • Each system uses its own hive.settings
  • On-prem built from the source repo

* 500GB is not a standard scale factor, but 300GB is.
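A minimal sketch of the CSV-to-ORC conversion step in HiveQL, assuming '|'-delimited datagen output under a hypothetical HDFS path; the column list follows the TPC-H lineitem definition, but this is not the exact DDL shipped with D2F-Bench.

-- Hypothetical external table over the raw '|'-delimited TPC-H datagen files
CREATE EXTERNAL TABLE lineitem_text (
  l_orderkey BIGINT, l_partkey BIGINT, l_suppkey BIGINT, l_linenumber INT,
  l_quantity DOUBLE, l_extendedprice DOUBLE, l_discount DOUBLE, l_tax DOUBLE,
  l_returnflag STRING, l_linestatus STRING,
  l_shipdate STRING, l_commitdate STRING, l_receiptdate STRING,
  l_shipinstruct STRING, l_shipmode STRING, l_comment STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION '/tpch/1000/lineitem';  -- hypothetical path for one scale factor

-- Rewrite the text data into the ORC-backed table that the queries actually read
CREATE TABLE lineitem STORED AS ORC
AS SELECT * FROM lineitem_text;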

Page 15: The state of SQL-on-Hadoop in the Cloud

SUTs Performance and Scalability

• Execution times
• Scalability to data size
• Query drill-down
• Latency test

Page 16: The state of SQL-on-Hadoop in the Cloud

Exec times by SUT: 8dn 100GB Q ALL

Notes:
• Results show execution times for the full TPC-H run, on SKUs with 8 data nodes at 100GB, except for CBD OnMetal, which has 4 data nodes.
• CBD: OnMetal is fast; the Cloud instances scale with SKU size
• CDP: the SSD variant is only slightly faster than the regular one; n1-std8 is only 30% faster than n1-std4
• EMR: m4.xlarge is 18% faster than m3.xlarge
• HDI: scales with SKU size; the fastest result is D4v2
• On-prem: poor results with M/R
• A3 and the CBD Cloud instances present high variability

[Chart: execution times per SUT, grouped CBD, CDP, EMR, HDI, On-prem. Annotations: OnMetal; SSD version marginal gains; local SSD + EBS; EBS only; D4v2 fastest; D3v2 fastest default]

Page 17: The state of SQL-on-Hadoop in the Cloud

Exec times by SKU: 8dn 1TB Q ALL

Notes:
• Results show execution times for the full TPC-H run, on SKUs with 8 data nodes at 1TB, except for CBD OnMetal, which has 4 data nodes.
• At 1TB, lower-end systems obtain poorer performance.
• CBD: OnMetal is fast; in the Cloud, hadoop1-7 cannot process 1TB, and hadoop1-15 and hadoop1-30 give similar results
• CDP: the SSD variant is slightly slower than the regular one; n1-std8 is 2x faster than n1-std4 (as expected)
• EMR: m4.xlarge is 15% faster than m3.xlarge
• HDI: scales with SKU size; the fastest result is D4v2
• On-prem: improves its results (compared with 100GB)

[Chart: execution times per SUT, grouped CBD, CDP, EMR, HDI. Annotations: lower-end systems similar but poor; OnMetal second fastest; D4v2 fastest]

Page 18: The state of SQL-on-Hadoop in the Cloud

Data size scalability of defaults: up to 1TB (Q ALL)

Notes:
• Chart shows the data scale factor from 100GB to 1TB for the default SUTs with 8 data nodes, except for CBD OnMetal, which has 4.
• Comparing default instances, CDP has the poorest scalability, followed by EMR.
• On-prem scales linearly up to 1TB.
• HDI and OnMetal can scale to larger sizes.

Page 19: The state of SQL-on-Hadoop in the Cloud

Exec times defaults: Scans vs. Joins 1TB
Scans (parallelizable: Q1 CPU-bound, Q6 I/O-bound) vs. Joins (less parallelizable: Q2, Q16)

Notes: Q1 (I/O + CPU) is slow on the CDP and EMR systems; the same holds for Q16. CBD shows inverted times for Q2 and Q16 compared with the other systems. OnMetal is the fastest for I/O and joins, followed by HDI D3v2. Both charts show defaults with 4 cores. A sketch of the Q16 join query is shown below.
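A rough HiveQL rendering of TPC-H Q16, the join query singled out above; the predicate values are the spec defaults, the table and column names come from the TPC-H schema, and the exact text in D2F-Bench may differ.

-- Approximate TPC-H Q16: part/partsupp join plus an anti-join on supplier
SELECT p_brand, p_type, p_size,
       COUNT(DISTINCT ps_suppkey) AS supplier_cnt
FROM   partsupp
JOIN   part ON p_partkey = ps_partkey
WHERE  p_brand <> 'Brand#45'
  AND  p_type NOT LIKE 'MEDIUM POLISHED%'
  AND  p_size IN (49, 14, 23, 45, 19, 3, 36, 9)
  AND  ps_suppkey NOT IN (SELECT s_suppkey FROM supplier
                          WHERE s_comment LIKE '%Customer%Complaints%')
GROUP BY p_brand, p_type, p_size
ORDER BY supplier_cnt DESC, p_brand, p_type, p_size;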

Page 20: The state of SQL-on-Hadoop in the Cloud

CPU utilization of default VMs: Q16 at 1TB

Notes: EMR shows high user and system CPU. HDI shows a different pattern (spikes). CDP shows high iowait. CBD OnMetal shows low usage; hadoop1-15 shows very high iowait.

Page 21: The state of SQL-on-Hadoop in the Cloud

Configurations

Notes: CDP and CBD run Java 1.8; all run OpenJDK. Only HDI tunes Hive, to enable Tez and performance options (see the sketch after the table).

Category | Config                   | EMR               | CDP                | HDI               | CBD (OnMetal)      | On-prem
System   | Java version             | OpenJDK 1.7.0_111 | OpenJDK 1.8.0_91   | OpenJDK 1.7.0_101 | OpenJDK 1.8.0_71   | JDK 1.7
HDFS     | File system              | EBS / S3          | GCS (Hadoop conn.) | WASB              | Local + Swift + S3 | Local
HDFS     | Replication              | 3                 | 2                  | 3                 | 2                  | 3
HDFS     | Block size               | 128MB             | 128MB              | 128MB             | 256MB              | 128MB
HDFS     | File buffer size         | 4KB               | 64KB               | 128KB             | 256KB              | 64KB
M/R      | Output compression       | SNAPPY            | False              | False             | SNAPPY             | False
M/R      | IO factor / MB           | 48 / 200          | 10 / 100           | 100 / 614         | 100 / 358          | 10 / 100
M/R      | Memory MB                | 1536              | 3072               | 1536              | 2048               | 1536 / 4096
Hive     | Engine                   | MR                | MR                 | Tez               | MR                 | MR / Tez
Hive     | ORC config               | Defaults          | Defaults           | Defaults          | Defaults           | Defaults
Hive     | Vectorized exec          | False             | False              | Enabled           | False              | Enabled
Hive     | Cost-based opt.          | False             | Enabled            | Enabled           | Enabled            | Enabled
Hive     | Enforce bucketing        | False             | False              | True              | False              | True
Hive     | Optimize bucket map join | False             | False              | True              | False              | True
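For reference, the Hive knobs in the table correspond to session settings like the ones below; this is only a sketch of what a per-system hive.settings file might contain (the values mirror the HDI column above as an example), not any provider's actual file.

-- Example hive.settings-style options (standard Hive/MapReduce property names)
SET hive.execution.engine=tez;               -- MR on EMR/CDP/CBD, Tez on HDI
SET hive.vectorized.execution.enabled=true;  -- vectorized query execution
SET hive.cbo.enable=true;                    -- cost-based optimizer
SET hive.enforce.bucketing=true;             -- enforce bucketing on insert
SET hive.optimize.bucketmapjoin=true;        -- allow bucket map joins
SET mapreduce.map.output.compress=true;      -- compress intermediate output
SET mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;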

Page 22: The state of SQL-on-Hadoop in the Cloud

Latency test: Exec time by SKU, 8dn 1GB Q16

Notes:
• Results show execution times for query 16 at 1GB, except for CBD OnMetal, which has 4 data nodes.
• HDI D3v2 and D4v2 have the lowest times ("lowest latency")
• Followed by the CDP systems
• On-prem M/R gives the worst results

Page 23: The state of SQL-on-Hadoop in the Cloud

Price / Performance

Price and execution times assume:
• Only the cost of running the benchmark, or full 24/7 utilization
• No provisioning time or idle time
• By-the-second billing (see the arithmetic sketch below)
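Under these assumptions the per-run cost reduces to a simple product; a sketch of the arithmetic, not the study's exact accounting (which may add I/O or storage charges where applicable):

\[ \mathrm{cost}_{\text{run}} \approx \mathrm{price}_{\text{cluster/hour}} \times \frac{t_{\text{exec}}\,[\mathrm{s}]}{3600} \]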


Page 24: The state of SQL-on-Hadoop in the Cloud

Price/Performance 100GB (Q ALL)

Notes:
• Shows the price/performance ratio by SUT
• Lower in price and time is better
• Chart is zoomed to differentiate clusters

Price assumptions: measures only the cost of running the benchmark in seconds; cluster setup time is ignored.

Rank | Cluster           | Best cost | Best time
1    | CDP-n1std4-8      | $6.37     | 3:11:57
2    | CDP-n1std4-1SSD-8 | $6.55     | 3:06:44
3    | EMR-m4.xlarge-8   | $8.18     | 2:40:24
4    | HDI-D3v2-HDP24-8  | $8.74     | 1:36:45
5    | CDP-n1std8-8      | $9.35     | 2:27:57
6    | HDI-D4v2-HDP24-8  | $10.20    | 0:57:29
7    | EMR-m3.xlarge-8   | $10.79    | 3:08:49
8    | HDI-A3-8          | $11.96    | 4:10:04
9    | M100-8n           | $13.10    | 3:32:29
10   | HDI-D4-8          | $15.08    | 1:24:59
11   | CBD-hadoop1-7-8   | $19.16    | 7:02:33
12   | CBD-OnMetal40-4   | $19.31    | 1:38:12
13   | CBD-hadoop1-15-8  | $26.45    | 4:51:41

[Chart highlights the cheapest run, the fastest run, and the most cost-effective SUT]

Page 25: The state of SQL-on-Hadoop in the Cloud

Price/Performance 1TB (Q ALL)

Notes:
• Shows the price/performance ratio by SUT
• Lower in price and time is better
• Chart is zoomed to differentiate clusters

Price assumptions: measures only the cost of running the benchmark in seconds; cluster setup time is ignored.

Rank | Cluster           | Best cost | Best time
1    | HDI-D3v2-HDP24-8  | $39.63    | 7:18:42
2    | HDI-D4v2-HDP24-8  | $42.02    | 3:56:45
3    | M100-8n           | $42.85    | 11:34:50
4    | CDP-n1std8-8      | $44.91    | 11:50:46
5    | CDP-n1std4-8      | $46.49    | 23:21:05
6    | CDP-n1std4-1SSD-8 | $50.53    | 24:00:52
7    | EMR-m4.xlarge-8   | $54.26    | 17:44:01
8    | HDI-D4-8          | $62.75    | 5:53:32
9    | CBD-OnMetal40-4   | $67.77    | 5:44:36
10   | EMR-m3.xlarge-8   | $69.92    | 20:23:01
11   | HDI-A3-8          | $74.83    | 27:42:56
12   | CBD-hadoop1-15-8  | $128.44   | 23:36:37

[Chart highlights the cheapest run, the fastest run, and the most cost-effective SUT]

Page 26: The state of SQL-on-Hadoop in the Cloud

HW and SW improvements: PaaS provider improvements over time (tests on 4 data nodes)

Page 27: The state of SQL-on-Hadoop in the Cloud

HDI default HW improvement: 4 nodes Q ALL

Notes:
• Test compares performance improvements of HDI default VM instances, from A3 to D3 and D3v2 (30% faster CPU, same price), on HDP 2.3.

[Charts: HDI default VM improvement; run time at 1TB; scalability from 1GB to 1TB; variability]

Page 28: The state of SQL-on-Hadoop in the Cloud

SW: HDP version 2.3 to 2.4 improvement on HDI D3 v1, 4 nodes Q ALL 100GB

Notes:
• Test compares the migration to HDP 2.4. The D3s improved: they can now run 1TB without modifications on 4 data nodes, with no more namenode swapping. Larger nodes show smaller improvements.
• D3s show a 35% improvement and can now scale to 1TB.

[Charts: run time at 100GB; scalability from 1GB to 1TB]

Page 29: The state of SQL-on-Hadoop in the Cloud

SW: EMR version 4.7 to 5.0 improvement on m4.xlarge, 4 nodes Q ALL

Notes:
• Test compares performance improvements in EMR 5.0 (Hive 2.1, Tez by default, Spark 2.0).
• EMR 5.0 gets a 2x improvement at 4 nodes.

[Charts: run time at 1TB; scalability from 1GB to 1TB]

Page 30: The state of SQL-on-Hadoop in the Cloud

Summary: lessons learned, findings, conclusions, references

Page 31: The state of SQL-on-Hadoop in the Cloud

Remarks / Findings

• Setting up and fine-tuning Big Data stacks is complex and requires an iterative process
• Cloud services continuously optimize their PaaS for general-purpose use
  • All tune M/R and YARN, and their custom file storage
  • HW (and prices) are updated over time; you might need to re-deploy to get the benefits
• Room for improvement
  • Only HDI fine-tunes Hive; what about other new services? (Spark, Storm, R, HBase)
  • All are updating to Hive and Spark v2 (and enabling Tez, tuning ORC)
  • CDP is upgrading its HDP version
• Beware: commodity VMs != commodity bare metal for Big Data
  • Errors … originally this was to be a 4-node comparison …
  • Variability: an issue for low-end, old-gen VMs; also scalability and reliability, beware; less of an issue on newer VMs
• Scale by adding more nodes, not disks
  • Network throttling: not apparent on an 8-data-node cluster, but for larger clusters …

Page 32: The state of SQL-on-Hadoop in the Cloud

Summary

Similarities
• Similar defaults for cloud-based services: 4 cores, ~16GB RAM, local SSDs
  • ~4GB RAM / core
  • Good enough for Hadoop / Hive
• Elasticity
  • All allow on-demand scaling up
  • Mixed mode of local + remote storage
• Fast networking
  • Especially EMR; HDI depends on VM size
  • Required for networked storage …
• Most deploy in < 25 min

Differences
• CBD offers OnMetal as the default: a high-end, non-shared system
  • What about in-memory systems (Spark, graph engines)?
• Elasticity
  • But not all support down-scaling or stop (only delete)
  • HDI is completely elastic for disks (local only for temp)
• Pricing is very different
  • EMR, CBD, HDI bill per hour; CDP per minute
  • But similar overall price/performance (<30%)
• CDP deploys in 1 minute

Page 33: The state of SQL-on-Hadoop in the Cloud

The state of SQL-on-Hadoop in the Cloud

• Providers have successfully integrated on-demand Big Data services
  • Most are on the path to offering pay-what-you-process models
  • Disaggregating storage from compute completely
  • Giving more elasticity for your data and needs: multiple clusters, pay only for what you use, planning-free, governance
• What about performance and reliability?
  • Providers are upgrading and defaulting to newer-gen VMs
    • Faster CPUs, SSDs (local and remote), the end of rotational disks?, fast networks
  • As well as keeping the SW up to date: newer versions, security and performance patches, tuned for their infrastructure
• Is it price-performant?
  • Yes, at least for medium-sized deployments. The cost is in compute, so you pay for what you use!
• For ALOJA, this work is the basis for future research

Page 34: The state of SQL-on-Hadoop in the Cloud

More info:

• Upcoming publication: The state of SQL-on-Hadoop in the Cloud
  • Data release and more in-depth technical analysis
• ALOJA benchmarking platform and online repository
  • http://aloja.bsc.es
  • http://aloja.bsc.es/publications
• SPEC Research Big Data working group
  • http://research.spec.org/working-groups/big-data-working-group.html

Page 35: The state of SQL-on-Hadoop in the Cloud

Thanks, questions?

Follow up / feedback : [email protected]

Twitter: @ni_po
