in a fast-moving world transforming cloudera analytics agility · 2019-10-04 · (spark-sql |...

23
Transforming Cloudera analytics agility in a fast-moving world Brent Compton, Sr Director Tech Marketing, Red Hat Idan Levi, formerly at GoI, now in Office of CTO, Red Hat May 2019

Upload: others

Post on 26-May-2020

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: in a fast-moving world Transforming Cloudera analytics agility · 2019-10-04 · (Spark-SQL | Impala) transient cluster - batch jobs (X hours nightly) – add/remove nodes workload

Transforming Cloudera analytics agilityin a fast-moving world

Brent Compton, Sr Director Tech Marketing, Red HatIdan Levi, formerly at GoI, now in Office of CTO, Red Hat

May 2019

Page 2: in a fast-moving world Transforming Cloudera analytics agility · 2019-10-04 · (Spark-SQL | Impala) transient cluster - batch jobs (X hours nightly) – add/remove nodes workload

The Story of Unlocking

ComputeData Innovation

Page 3: in a fast-moving world Transforming Cloudera analytics agility · 2019-10-04 · (Spark-SQL | Impala) transient cluster - batch jobs (X hours nightly) – add/remove nodes workload

It’s All About Time to Market

~ 4-5 Months ~ 8-14 Minutes

Page 4: in a fast-moving world Transforming Cloudera analytics agility · 2019-10-04 · (Spark-SQL | Impala) transient cluster - batch jobs (X hours nightly) – add/remove nodes workload

Object Storage (S3-compatible)

Data Sources

Insights

Cloudoop

GaaS

Compute

The Data Factory Vision

Example: Machine learning pipeline

Page 5: in a fast-moving world Transforming Cloudera analytics agility · 2019-10-04 · (Spark-SQL | Impala) transient cluster - batch jobs (X hours nightly) – add/remove nodes workload

Cloudoop

Page 6: in a fast-moving world Transforming Cloudera analytics agility · 2019-10-04 · (Spark-SQL | Impala) transient cluster - batch jobs (X hours nightly) – add/remove nodes workload

Native Hadoop vs. Cloudoop

. . .. . . . . .

. . .

Compute Service

Object storage Service (S3)

Page 7: in a fast-moving world Transforming Cloudera analytics agility · 2019-10-04 · (Spark-SQL | Impala) transient cluster - batch jobs (X hours nightly) – add/remove nodes workload

Cloud Approach - Unlocking

• Elasticity

• Compute-data decoupling

• Scalability

• Dynamic

• Multi tenancy

• SLA

Page 8: in a fast-moving world Transforming Cloudera analytics agility · 2019-10-04 · (Spark-SQL | Impala) transient cluster - batch jobs (X hours nightly) – add/remove nodes workload

Cloud Approach – AWS experience on-prem

persistent cluster – interactive queries(Spark-SQL | Impala)

transient cluster - batch jobs(X hours nightly) – add/remove nodes

workload specific clusters(different sizes, different versions)

S3-compatible object store

Page 9: in a fast-moving world Transforming Cloudera analytics agility · 2019-10-04 · (Spark-SQL | Impala) transient cluster - batch jobs (X hours nightly) – add/remove nodes workload

Cloud Approach Switch Off Clusters

S3 S3 S3

Page 10: in a fast-moving world Transforming Cloudera analytics agility · 2019-10-04 · (Spark-SQL | Impala) transient cluster - batch jobs (X hours nightly) – add/remove nodes workload

Cloud ApproachLogical Separation of Jobs

Hive Sparkjob

Impala Ad-hoc

S3

As you wish…

Page 11: in a fast-moving world Transforming Cloudera analytics agility · 2019-10-04 · (Spark-SQL | Impala) transient cluster - batch jobs (X hours nightly) – add/remove nodes workload

Cloud ApproachLogical Separation of Jobs

Hive SparkJob

Impala Ad-Hoc

As you wish…

Page 12: in a fast-moving world Transforming Cloudera analytics agility · 2019-10-04 · (Spark-SQL | Impala) transient cluster - batch jobs (X hours nightly) – add/remove nodes workload

High Level Architecture

Cloud Management

Hardware

Big DataPlatforms

Low Level API Hardware discovery & control

High Level API

Hyper scale

Software-defined everything

PaaS / IaaS services

Cloudera

Page 13: in a fast-moving world Transforming Cloudera analytics agility · 2019-10-04 · (Spark-SQL | Impala) transient cluster - batch jobs (X hours nightly) – add/remove nodes workload

HDFS

bare metal servers

Red Hat Ceph

Red Hat OpenStack

Page 14: in a fast-moving world Transforming Cloudera analytics agility · 2019-10-04 · (Spark-SQL | Impala) transient cluster - batch jobs (X hours nightly) – add/remove nodes workload

Proof Of Concept

Page 15: in a fast-moving world Transforming Cloudera analytics agility · 2019-10-04 · (Spark-SQL | Impala) transient cluster - batch jobs (X hours nightly) – add/remove nodes workload

Fine tune – Spark Terasort Before & After

* Lower is better

Page 16: in a fast-moving world Transforming Cloudera analytics agility · 2019-10-04 · (Spark-SQL | Impala) transient cluster - batch jobs (X hours nightly) – add/remove nodes workload

Fine tune - TPC-DS Before & After

* Lower is better

Page 17: in a fast-moving world Transforming Cloudera analytics agility · 2019-10-04 · (Spark-SQL | Impala) transient cluster - batch jobs (X hours nightly) – add/remove nodes workload

TPC-DS Benchmarks

Lower is better *

Page 18: in a fast-moving world Transforming Cloudera analytics agility · 2019-10-04 · (Spark-SQL | Impala) transient cluster - batch jobs (X hours nightly) – add/remove nodes workload

Testimonials

Lower is better

**

** Running on local SSDs

Page 19: in a fast-moving world Transforming Cloudera analytics agility · 2019-10-04 · (Spark-SQL | Impala) transient cluster - batch jobs (X hours nightly) – add/remove nodes workload

Cloudoop – Phase 1• Versions

• CDH 5.13

• Spark 2.2

• APIs

• Create cluster

• Delete cluster

• List clusters

• Describe cluster

• Fine tuned cluster with full management access

Page 20: in a fast-moving world Transforming Cloudera analytics agility · 2019-10-04 · (Spark-SQL | Impala) transient cluster - batch jobs (X hours nightly) – add/remove nodes workload

Object Storage (Ceph)

Data Sources

Insights

Cloudoop

GaaS

Compute

The Data Factory Vision

Example: Machine learning pipeline

Page 21: in a fast-moving world Transforming Cloudera analytics agility · 2019-10-04 · (Spark-SQL | Impala) transient cluster - batch jobs (X hours nightly) – add/remove nodes workload

Strategic partnership

Page 22: in a fast-moving world Transforming Cloudera analytics agility · 2019-10-04 · (Spark-SQL | Impala) transient cluster - batch jobs (X hours nightly) – add/remove nodes workload

● At the Storage lockers● At the Red Hat booth● At one of Storage dedicated sessions (red.ht/storageatsummit)● At the Community Happy Hour (Tues 6:30, Harpoon Brewery)● At the Hybrid Cloud Party (Wed, 7:30, “Committee” restaurant)

FIND US AT RED HAT SUMMIT

Red Hat OpenShift Container Storagered.ht/videos-RHOCS

Red Hat data analytics infrastructure solutionred.ht/videos-RHDAIS

Red Hat Hyperconverged Infrastructurered.ht/videos-RHHI

redhat.com/storage

@redhatstorage

redhatstorage.redhat.com

Page 23: in a fast-moving world Transforming Cloudera analytics agility · 2019-10-04 · (Spark-SQL | Impala) transient cluster - batch jobs (X hours nightly) – add/remove nodes workload