apache spark operations

Spark Operations

Kostas Sakellis

• Software Engineer at Cloudera
• Contributor to Apache Spark
• Before that, contributed to Cloudera Manager

Building a proof of concept!

Courtesy of: http://www.nefloridadesign.com/mbimages/6.jpg

Example

sc.textFile("hdfs://data/u.item", 4)
  .map(Movie(_))
  .filter(_.month.equals("Nov"))
  .collect()
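The slides do not show the Movie class, so here is one way it might look. This is a sketch under assumptions: the pipe-delimited layout (id|title|release date as dd-MMM-yyyy), the field positions, and the sample lines are all made up for illustration, and a local Seq stands in for the RDD so the snippet runs without a cluster.

```scala
object MovieExample {
  // Hypothetical stand-in for the Movie class used on the slide.
  final case class Movie(id: Int, title: String, month: String)

  object Movie {
    // Assumed layout: id|title|dd-MMM-yyyy|... (not confirmed by the slides).
    def apply(line: String): Movie = {
      val fields = line.split('|')
      val dateParts =
        if (fields.length > 2) fields(2).split('-') else Array.empty[String]
      // "15-Nov-1995" -> "Nov"; empty string when the date is missing or malformed.
      val month = if (dateParts.length == 3) dateParts(1) else ""
      new Movie(fields(0).toInt, fields(1), month)
    }
  }

  // Made-up sample input, not real data.
  val lines: Seq[String] = Seq(
    "1|Sample Film A|15-Nov-1995",
    "2|Sample Film B|01-Jan-1996",
    "3|Sample Film C|22-Nov-1996"
  )

  // Same map/filter chain as the slide, applied to a local Seq instead of an RDD.
  def novemberMovies(input: Seq[String]): Seq[Movie] =
    input.map(Movie(_)).filter(_.month.equals("Nov"))
}
```

On a real cluster the same chain would start from sc.textFile and end with collect(), which brings the filtered records back to the driver.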

Partitions

[Diagram: the same pipeline, with the input split into Partition 1 through Partition 4.]

RDDs

[Diagram, built up over several slides: a chain of RDDs, each drawn with Partition 1 through Partition 4.]

RDD Lineage

[Diagram, built up over several slides: the chain of RDDs, each with Partition 1 through Partition 4, ending in the Collect action; the chain is labelled "Lineage".]

• A pipelined set of transformations on a single thread
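To make "pipelined" concrete, the sketch below uses plain Scala iterators rather than Spark itself. The names runTask and collect, and the length-based map and filter, are hypothetical; the point is only that each record passes through every transformation before the next record is read, and that one such chain runs per partition.

```scala
import scala.collection.mutable.ListBuffer

object PipelineSketch {
  // One "task": the whole map -> filter chain over a single partition.
  // Iterators are lazy, so a record goes through both steps before the next is read.
  def runTask(partition: Seq[String], trace: ListBuffer[String]): List[Int] =
    partition.iterator
      .map { s => trace += s"map($s)"; s.length }
      .filter { n => trace += s"filter($n)"; n > 3 }
      .toList

  // Stand-in for collect(): run one task per partition and gather the results.
  // Here the tasks run one after another; Spark would run each in its own thread.
  def collect(partitions: Seq[Seq[String]], trace: ListBuffer[String]): Seq[Int] =
    partitions.flatMap(p => runTask(p, trace))
}
```

Running collect over the partitions Seq("a", "bbbb") and Seq("ccccc") records map and filter calls interleaved per record, not all maps followed by all filters.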

Spark Architecture

Spark System Architecture

Deployments

• Spark supports pluggable cluster managers
  • local, Standalone, YARN and Mesos
• In early 2014, CDH 4.x with Spark 0.9 only supported Standalone
• CDH 5.x includes Spark on YARN support

Standalone

[Diagram labels: Master, Client, Worker (x2), AppMaster, Process (x2).]

Standalone

• On cluster
  ./sbin/start-master.sh
  ./sbin/start-slave.sh <master-spark-URL>
• Submit job
  spark-submit --master <master-spark-URL>

YARN Architecture

[Diagram labels: Resource Manager, Client, Node Manager (x2), Container (x4), AppMaster, Process (x2).]

Spark on YARN Architecture

[Diagram, shown in two steps, with labels: Resource Manager, Client, Node Manager (x2), Container, AppMaster, Process.]

Spark on YARN

• Submit job (client mode)
  spark-submit --master yarn-client …
• Cluster mode
  spark-submit --master yarn-cluster …

• Spark shell only works in client mode!

Customers often have shared infrastructure

Courtesy of: https://radioglobalistic.files.wordpress.com/2011/02/lagos-traffic.jpg

Multi-tenancy

• Cluster utilization is the top metric
  • Target: 70-80% utilization
• Mixed workloads from mixed customers
• We recommend YARN
  • Built-in resource manager

Underutilized Clusters

Courtesy of: http://media.nbclosangeles.com/images/1200*675/60-freeway-repair-dec16-2-empty.JPG

Dynamic Allocation

• Spark applications scale the number of executors based on load
• Removes the need for --num-executors
• Idle executors get killed
• First supported in CDH 5.4
• Ideal for:
  • Long ETL jobs with large shuffles
  • Shell applications: Hive and Spark shell
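As a rough illustration of switching this on, the sketch below collects the relevant Spark properties and renders them as spark-submit --conf arguments. The property names are standard Spark settings; the executor bounds are illustrative values, and the helper toSubmitArgs is a hypothetical convenience, not part of Spark.

```scala
object DynamicAllocationConf {
  // Illustrative values only; tune the bounds for the cluster at hand.
  val settings: Seq[(String, String)] = Seq(
    "spark.dynamicAllocation.enabled" -> "true",
    // The external shuffle service keeps shuffle files available
    // after an idle executor has been removed.
    "spark.shuffle.service.enabled" -> "true",
    "spark.dynamicAllocation.minExecutors" -> "1",
    "spark.dynamicAllocation.maxExecutors" -> "20"
    // spark.dynamicAllocation.executorIdleTimeout is left at its default here.
  )

  // Hypothetical helper: render the settings as spark-submit arguments.
  def toSubmitArgs(conf: Seq[(String, String)]): Seq[String] =
    conf.flatMap { case (key, value) => Seq("--conf", s"$key=$value") }
}
```

The resulting arguments can be appended to a spark-submit command in place of --num-executors.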

Dynamic Allocation Limitations

• Still required to specify cores
  • --executor-cores
• Memory
  • --executor-memory
  • Includes JVM overhead
  • Need to do the math yourself

•Our customers still get it wrong!
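The memory math can be sketched as follows. On YARN, the container requested for an executor is larger than --executor-memory because an overhead is added on top. The 384 MB floor and the 0.10 factor below are assumptions modelled on the YARN memory-overhead defaults of several Spark releases; check the documentation for the version in use. The sketch also ignores YARN rounding container sizes up to its allocation increment.

```scala
object ExecutorMemoryMath {
  // Assumed floor for the per-executor overhead, in MB.
  val MinOverheadMb: Long = 384L

  // Container size YARN is asked for: executor memory plus overhead.
  def containerMb(executorMemoryMb: Long, overheadFactor: Double = 0.10): Long = {
    val overhead = math.max(MinOverheadMb, (executorMemoryMb * overheadFactor).toLong)
    executorMemoryMb + overhead
  }

  // How many such containers fit into the memory a node offers to YARN.
  def executorsPerNode(nodeMemoryMb: Long,
                       executorMemoryMb: Long,
                       overheadFactor: Double = 0.10): Long =
    nodeMemoryMb / containerMb(executorMemoryMb, overheadFactor)
}
```

Under these assumptions an 8192 MB executor needs a 9011 MB container, so a node offering 65536 MB to YARN fits 7 executors rather than the 8 a naive division suggests.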

The Future of Dynamic Allocation

• Only “task size” needed: --task-size
• Eliminates
  • --executor-cores
  • --num-executors
  • --executor-memory

• Leads to better cluster utilization

Security, now it’s getting serious.

Courtesy of: https://www.iti.illinois.edu/sites/default/files/Cybersecurity_image.jpg

Authentication

• Kerberos – the necessary evil
• Ubiquitous amongst other services
  • YARN, HDFS, Hive, HBase, etc.

• Spark utilizes delegation tokens

Encryption

• Control plane: SPARK-6028 (replace with netty)
• File distribution: replace with netty
• Block Manager: Spark 1.4
• User UI / REST API: SPARK-2750 (SSL)
• Data-at-rest (shuffle files): SPARK-5682

Authorization

• Enterprises have sensitive data
• Beyond HDFS file permissions
  • Partial access to data
  • Column-level granularity
• Apache Sentry
  • HDFS-Sentry synchronization plugin
• Record Service
  • Column-level security for Spark!

Thank you

We’re Hiring!
