Apache Spark Operations

Post on 17-Jan-2017


© Cloudera, Inc. All rights reserved.

Spark Operations

Kostas Sakellis


Me

• Software Engineer at Cloudera
• Contributor to Apache Spark
• Before that, contributed to Cloudera Manager


Building a proof of concept!

Courtesy of: http://www.nefloridadesign.com/mbimages/6.jpg


Example

sc.textFile("hdfs://data/u.item", 4)
  .map(Movie(_))
  .filter(_.month.equals("Nov"))
  .collect()
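The slide never shows how Movie is defined. As a rough sketch only, the block below assumes each line of u.item is pipe-delimited with the release date (dd-MMM-yyyy) in the third field, and it runs the same map/filter chain on a plain Scala Seq rather than on an RDD, so it needs no cluster. The Movie fields, the MovieExample wrapper, and the sample lines are all hypothetical.

```scala
object MovieExample {
  // Hypothetical record type; the slide does not show the real definition.
  case class Movie(id: Int, title: String, month: String)

  object Movie {
    // Assumed line layout: id|title|dd-MMM-yyyy|...
    def apply(line: String): Movie = {
      val fields = line.split('|')
      val month = fields(2).split('-')(1) // "22-Nov-1996" -> "Nov"
      Movie(fields(0).toInt, fields(1), month)
    }
  }

  // Same map/filter chain as on the slide, applied to a local Seq instead of an RDD.
  def novemberMovies(lines: Seq[String]): Seq[Movie] =
    lines.map(Movie(_)).filter(_.month.equals("Nov"))

  def main(args: Array[String]): Unit = {
    // Made-up sample lines in the assumed layout.
    val lines = Seq(
      "1|Sample A (1995)|01-Jan-1995||http://example.invalid/a",
      "2|Sample B (1996)|22-Nov-1996||http://example.invalid/b"
    )
    novemberMovies(lines).foreach(println)
  }
}
```

With a real SparkContext, the body of novemberMovies would correspond to the slide's chain, with collect() bringing the filtered records back to the driver.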


Partitions

[Diagram: the HDFS input read by sc.textFile("hdfs://data/u.item", 4) is split into Partition 1 through Partition 4.]

RDDs

[Diagram, built up over several slides: the four HDFS partitions feed a chain of three RDDs, one added per step of the example (textFile, map, filter), each with Partitions 1 to 4. Collect then gathers the result.]

RDD Lineage

[Diagram: the same chain, from the HDFS partitions through the three RDDs to Collect, labeled "Lineage".]
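One way to picture lineage: each dataset remembers its parent and the function that derives it, so any single partition can be recomputed from the source on demand. The sketch below is a toy model in plain Scala, not Spark's actual classes; Dataset, Source, Mapped, and Filtered are hypothetical names introduced for illustration.

```scala
object LineageSketch {
  // Toy stand-in for an RDD: it can compute one partition and describe its ancestry.
  sealed trait Dataset[A] {
    def compute(partition: Int): Iterator[A]
    def lineage: List[String]
  }

  // Root of the chain: holds the partitioned input data.
  final case class Source[A](name: String, partitions: Vector[Seq[A]]) extends Dataset[A] {
    def compute(partition: Int): Iterator[A] = partitions(partition).iterator
    def lineage: List[String] = List(s"source($name)")
  }

  // Derived dataset: stores only the parent and the function, not the data.
  final case class Mapped[A, B](parent: Dataset[A], f: A => B) extends Dataset[B] {
    def compute(partition: Int): Iterator[B] = parent.compute(partition).map(f)
    def lineage: List[String] = "map" :: parent.lineage
  }

  final case class Filtered[A](parent: Dataset[A], keep: A => Boolean) extends Dataset[A] {
    def compute(partition: Int): Iterator[A] = parent.compute(partition).filter(keep)
    def lineage: List[String] = "filter" :: parent.lineage
  }

  // Made-up two-partition input.
  val source = Source("u.item", Vector(Seq("a", "bbbb"), Seq("cc", "ddddd")))
  val chain: Dataset[Int] =
    Filtered(Mapped(source, (s: String) => s.length), (n: Int) => n > 3)

  def main(args: Array[String]): Unit = {
    println(chain.lineage.mkString(" <- "))
    // Recompute only partition 1 by walking the lineage back to the source.
    println(chain.compute(1).toList)
  }
}
```

Because nothing but the source holds data in this model, losing a derived partition costs only a recomputation of that one partition along its ancestry.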

Task

[Diagram: the same chain of RDDs and partitions, ending in Collect.]

• A pipelined set of transformations on a single thread
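"Pipelined" can be illustrated with plain Scala iterators: map and filter are chained lazily over one partition, so each record passes through both functions before the next record is read, all on the calling thread. This is a sketch in the spirit of a task, not Spark's scheduler code; runTask and the trace buffer are hypothetical helpers.

```scala
import scala.collection.mutable.ArrayBuffer

object PipelineSketch {
  // Runs map-then-filter over one partition lazily, one record at a time.
  // Returns the surviving records plus a trace of the order in which the functions ran.
  def runTask(partition: Seq[Int]): (List[Int], List[String]) = {
    val trace = ArrayBuffer.empty[String]
    val pipelined = partition.iterator
      .map { x => trace += s"map($x)"; x * 10 }
      .filter { y => trace += s"filter($y)"; y > 10 }
    val out = pipelined.toList // pulling the results drives the whole chain
    (out, trace.toList)
  }

  def main(args: Array[String]): Unit = {
    val (out, trace) = runTask(Seq(1, 2))
    println(trace.mkString(", ")) // map(1), filter(10), map(2), filter(20)
    println(out)                  // List(20)
  }
}
```

The interleaved trace is the point: no intermediate collection is built between the map step and the filter step.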


Spark Architecture


Spark System Architecture


Deployments

• Spark supports pluggable cluster managers
    • local, Standalone, YARN, and Mesos
• In early 2014, CDH 4.x with Spark 0.9 supported only Standalone
• CDH 5.x includes Spark on YARN support

Standalone

[Diagram: a Client, a Master, and two Workers; the Workers host Processes and the AppMaster.]


Standalone

• On cluster
    ./sbin/start-master.sh
    ./sbin/start-slave.sh <master-spark-URL>

• Submit job
    spark-submit --master <master-spark-URL>

YARN Architecture

[Diagram: a Client, the Resource Manager, and two Node Managers; Containers on the Node Managers host the AppMaster and Processes.]

Spark on YARN Architecture

[Diagram, shown in two builds: a Client, the Resource Manager, and two Node Managers; Containers on the Node Managers host the AppMaster and Processes.]


Spark on YARN

• Submit job
    spark-submit --master yarn-client …

• Cluster mode
    spark-submit --master yarn-cluster …

• Spark shell only works in client mode!


Customers often have shared infrastructure

Courtesy of: https://radioglobalistic.files.wordpress.com/2011/02/lagos-traffic.jpg


Multi-tenancy

• Cluster utilization is the top metric
    • Target: 70-80% utilization
• Mixed workloads from mixed customers
• We recommend YARN
    • Built-in resource manager


Underutilized Clusters

Courtesy of: http://media.nbclosangeles.com/images/1200*675/60-freeway-repair-dec16-2-empty.JPG


Dynamic Allocation

• Spark applications scale the number of executors based on load
    • Removes the need for --num-executors
    • Idle executors get killed
• First supported in CDH 5.4
• Ideal for:
    • Long ETL jobs with large shuffles
    • Shell applications: Hive and Spark shell
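As a hedged illustration of what switching this on can look like, the sketch below renders a handful of Spark properties as --conf flags for spark-submit. The property names are standard Spark configuration keys; the executor bounds are made-up values, and DynamicAllocationConf and toSubmitArgs are hypothetical helpers. Dynamic allocation also depends on the external shuffle service running on the cluster nodes, which this sketch only enables on the application side.

```scala
object DynamicAllocationConf {
  // Illustrative values only; sensible bounds depend on the cluster and workload.
  val settings: Seq[(String, String)] = Seq(
    "spark.dynamicAllocation.enabled" -> "true",
    "spark.shuffle.service.enabled" -> "true", // shuffle files must outlive removed executors
    "spark.dynamicAllocation.minExecutors" -> "1",
    "spark.dynamicAllocation.maxExecutors" -> "20"
  )

  // Renders each key/value pair as a --conf flag.
  def toSubmitArgs(conf: Seq[(String, String)]): String =
    conf.map { case (k, v) => s"--conf $k=$v" }.mkString(" ")

  def main(args: Array[String]): Unit =
    println(toSubmitArgs(settings))
}
```

With these flags in place, --num-executors is no longer passed; the executor count floats between the configured minimum and maximum.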


Dynamic Allocation Limitations

• Still required to specify cores
    • --executor-cores
• Memory
    • --executor-memory
    • Includes JVM overhead
    • Need to do the math yourself
• Our customers still get it wrong!
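A worked version of "the math", as a sketch under stated assumptions: on YARN each executor's container is taken here to be the executor heap plus an off-heap overhead of max(384 MB, 10% of the heap). That rule and its constants vary by Spark version and configuration, so treat them as assumptions to check against your own cluster. ExecutorMemoryMath and both functions are hypothetical helpers.

```scala
object ExecutorMemoryMath {
  // Assumed overhead rule: max(minOverheadMb, overheadFraction * heap).
  // Both constants are assumptions; verify them for your Spark version.
  def containerMb(executorMemoryMb: Int,
                  overheadFraction: Double = 0.10,
                  minOverheadMb: Int = 384): Int = {
    val overhead = math.max(minOverheadMb, (executorMemoryMb * overheadFraction).toInt)
    executorMemoryMb + overhead
  }

  // How many executors of a given heap size fit in the memory YARN offers on one node.
  def executorsPerNode(nodeMemoryMb: Int, executorMemoryMb: Int): Int =
    nodeMemoryMb / containerMb(executorMemoryMb)

  def main(args: Array[String]): Unit = {
    println(containerMb(8192))             // 9011: 8192 heap + 819 overhead
    println(executorsPerNode(65536, 8192)) // 7, not the 8 that 65536 / 8192 suggests
  }
}
```

The second number is the kind of mistake the slide warns about: sizing by heap alone predicts eight executors per 64 GB node, while the containers only fit seven under these assumptions.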


The Future of Dynamic Allocation

• Only “task size” needed: --task-size
    • Eliminates
        • --executor-cores
        • --num-executors
        • --executor-memory
• Leads to better cluster utilization


Security, now it’s getting serious.

Courtesy of: https://www.iti.illinois.edu/sites/default/files/Cybersecurity_image.jpg


Authentication

• Kerberos: the necessary evil
    • Ubiquitous amongst other services
    • YARN, HDFS, Hive, HBase, etc.
• Spark utilizes delegation tokens


Encryption

• Control plane: SPARK-6028 (replace with Netty)
• File distribution: replace with Netty
• Block Manager: Spark 1.4
• User UI / REST API: SPARK-2750 (SSL)
• Data-at-rest (shuffle files): SPARK-5682


Authorization

• Enterprises have sensitive data
    • Beyond HDFS file permissions
    • Partial access to data
    • Column-level granularity
• Apache Sentry
    • HDFS-Sentry synchronization plugin
• Record Service
    • Column-level security for Spark!


Thank you

We’re Hiring!
