Hadoop Spark Introduction-20150130

Page 1: Hadoop Spark Introduction-20150130

Spark Introduction Ziv Huang 2015/01/30

Page 2: Hadoop Spark Introduction-20150130

Purpose of this introduction

Help you form an idea of Spark regarding its architecture, data flow, job scheduling, and programming. Just to give you a picture, not all technical details!

Page 3: Hadoop Spark Introduction-20150130

Outline

What is Spark

Architecture

Application Workflow

Job Scheduling

Submitting Applications

Programming Guide

Page 4: Hadoop Spark Introduction-20150130

Outline

What is Spark

Architecture

Application Workflow

Job Scheduling

Submitting Applications

Programming Guide

Page 5: Hadoop Spark Introduction-20150130

What is Spark

A fast and general engine for large-scale data processing (in Hadoop, it is a MapReduce replacement as long as its stability is guaranteed and its APIs are comprehensive enough).

Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk. Spark has an advanced DAG execution engine that supports cyclic data flow and in-memory computing.

Write applications quickly in Java, Scala or Python. Spark offers over 80 high-level operators that make it easy to build parallel apps, and you can use it interactively from the Scala and Python shells.

Page 6: Hadoop Spark Introduction-20150130

What is Spark

Combine SQL, streaming, and complex analytics. Spark powers a stack of high-level tools including Spark SQL, MLlib for machine learning, GraphX, and Spark Streaming. You can combine these libraries seamlessly in the same application.

Spark runs on Hadoop YARN, Mesos, standalone, or in the cloud. It can access diverse data sources including HDFS, Cassandra, HBase, and S3 (Amazon Simple Storage Service).

Page 7: Hadoop Spark Introduction-20150130

Outline

What is Spark

Architecture

Application Workflow

Job Scheduling

Submitting Applications

Programming Guide

Page 8: Hadoop Spark Introduction-20150130

Architecture

[Diagram: a Spark driver and a cluster manager coordinating Spark workers that run on the HDFS data nodes. Arrows: 1. launch job, 2. allocate resources, 3. assign tasks & return results]

Components: Driver (your main program), Cluster manager, and Spark workers

Page 9: Hadoop Spark Introduction-20150130

Architecture

[Diagram: same driver / cluster manager / worker layout, with the SparkContext shown inside the driver]

Spark applications run as independent sets of processes on a cluster, coordinated by the SparkContext object in the driver program (your main program).

Note: the driver and the cluster manager can be on the same or different machines, and can sit inside or outside the HDFS cluster.

Page 10: Hadoop Spark Introduction-20150130

Architecture

[Diagram: same layout; the cluster manager box is the Spark standalone master, the YARN resource manager, or the Mesos master]

To run on a cluster, the SparkContext can connect to several types of cluster managers (either Spark's own standalone cluster manager or Mesos/YARN), which allocate resources across applications.

Page 11: Hadoop Spark Introduction-20150130

Architecture

[Diagram: same layout; each Spark worker now hosts an executor running tasks, with the application jar shipped to every executor]

1. Once connected, Spark acquires executors on nodes in the cluster, which are processes that run computations and store data for your application.
2. Next, it sends your application code (defined by JAR or Python files passed to SparkContext) to the executors.
3. Finally, SparkContext sends tasks for the executors to run.

Page 12: Hadoop Spark Introduction-20150130

Architecture : Notes

Each application gets its own executor processes, which stay up for the duration of the whole application and run tasks in multiple threads.

Pros: isolating applications from each other, on both the scheduling side (each driver schedules its own tasks) and the executor side (tasks from different applications run in different JVMs).

Cons: data cannot be shared across different Spark applications (instances of SparkContext) without writing it to an external storage system.

Page 13: Hadoop Spark Introduction-20150130

Architecture : Notes

Spark is agnostic to the underlying cluster manager. As long as it can acquire executor processes, and these communicate with each other, it is relatively easy to run it even on a cluster manager that also supports other applications (e.g. Mesos/YARN).

Because the driver schedules tasks on the cluster, it should be run close to the worker nodes, preferably on the same local area network. If you'd like to send requests to the cluster remotely, it's better to open an RPC to the driver and have it submit operations from nearby than to run a driver far away from the worker nodes.

Page 14: Hadoop Spark Introduction-20150130

Outline

What is Spark

Architecture

Application Workflow

Job Scheduling

Submitting Applications

Programming Guide

Page 15: Hadoop Spark Introduction-20150130

Application Workflow (an example)

[Diagram: an example workflow from input to output through several stages]

Depending on the driver program, there could be many stages in an application.

Page 16: Hadoop Spark Introduction-20150130

Application Workflow (an example)

The boxes in the diagram are RDDs.

RDD (Resilient Distributed Dataset):
• a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel
• created by starting with a file in the Hadoop file system, or an existing Scala collection in the driver program, and transforming it

Page 17: Hadoop Spark Introduction-20150130

Application Workflow (an example)

The boxes in the diagram are RDDs.

RDD (Resilient Distributed Dataset):
• a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel
• created by starting with a file in the Hadoop file system, or an existing Scala collection in the driver program, and transforming it

Users may also ask Spark to persist an RDD in memory, allowing it to be reused efficiently across parallel operations. RDDs automatically recover from node failures.

Page 18: Hadoop Spark Introduction-20150130

Application Workflow (an example)

RDDs support only two types of operations: Transformations and Actions

Page 19: Hadoop Spark Introduction-20150130

Application Workflow (an example)

RDDs support only two types of operations: Transformations and Actions
• Transformation: creates a new dataset from an existing one
• Action: returns a value to the driver program after running a computation on the dataset

Page 20: Hadoop Spark Introduction-20150130

Application Workflow (an example)

• Laziness: transformations are only computed when an action requires a result to be returned to the driver program. This design enables Spark to run more efficiently.
• You may also persist an RDD in memory using the persist (or cache) method, in which case Spark will keep the elements around on the cluster for much faster access the next time you query it.
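To make laziness concrete, here is a minimal Java sketch (not from the slides; the class name and values are illustrative): the map transformation only defines a new RDD, and nothing is computed until the count action runs.

import java.util.Arrays;
import java.util.List;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class LazyDemo {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("LazyDemo").setMaster("local[2]");
    JavaSparkContext sc = new JavaSparkContext(conf);

    List<Integer> data = Arrays.asList(1, 2, 3, 4, 5);
    JavaRDD<Integer> numbers = sc.parallelize(data);

    // Transformation: defines a new RDD lazily; nothing is computed here
    JavaRDD<Integer> doubled = numbers.map(x -> x * 2);

    // Action: triggers the computation and returns a value to the driver
    long total = doubled.count();
    System.out.println("count = " + total);

    sc.stop();
  }
}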

Page 21: Hadoop Spark Introduction-20150130

Outline

What is Spark

Architecture

Application Workflow

Job Scheduling

Submitting Applications

Programming Guide

Page 22: Hadoop Spark Introduction-20150130

Job Scheduling - Scheduling Across Applications

Standalone mode:
By default, applications run in FIFO (first-in-first-out) order, and each application will try to use all available resources. You can control the amount of resources an application uses by setting the following parameters:

Parameter               Description
spark.cores.max         Max number of cores that an application uses
spark.executor.memory   The amount of memory an executor can use

• YARN provides a fair scheduler to arrange resources among applications; however, there is no such fair scheduler available in Spark standalone mode.
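As a rough sketch, these two properties can be set on the SparkConf before creating the context (the master URL and the values here are illustrative, not recommendations):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class StandaloneResourceDemo {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf()
        .setAppName("StandaloneResourceDemo")
        .setMaster("spark://master-host:7077")  // hypothetical standalone master URL
        .set("spark.cores.max", "4")            // cap the total cores this application may use
        .set("spark.executor.memory", "2g");    // memory per executor
    JavaSparkContext sc = new JavaSparkContext(conf);
    // ... create RDDs and run jobs ...
    sc.stop();
  }
}

The same properties can also be passed on the command line via spark-submit --conf.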

Page 23: Hadoop Spark Introduction-20150130

Job Scheduling - Scheduling Across Applications

Mesos:
To use static partitioning on Mesos, set the following parameters:

Parameter               Description
spark.mesos.coarse      Set to true to use static resource partitioning
spark.cores.max         Limit each application's resource share, as in standalone mode
spark.executor.memory   Control the executor memory

Page 24: Hadoop Spark Introduction-20150130

Job Scheduling - Scheduling Across Applications

YARN:
Set the following parameters:

Parameter           Description
--num-executors     Option to the Spark YARN client; controls how many executors it will allocate on the cluster
--executor-cores    Limit the number of cores an executor can use
--executor-memory   Control the executor memory
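For example, a Spark 1.2 submission to YARN might pass these options as follows (the class name, jar, and values are placeholders):

./bin/spark-submit \
  --class com.example.MyApp \
  --master yarn-cluster \
  --num-executors 10 \
  --executor-cores 2 \
  --executor-memory 4g \
  my-app.jar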

• Note that none of the modes currently provides memory sharing across applications.
• In future releases, in-memory storage systems such as Tachyon will provide another approach to share RDDs.

Page 25: Hadoop Spark Introduction-20150130

Some advantages of using YARN

• YARN allows you to dynamically share and centrally configure the same pool of cluster resources between all frameworks that run on YARN. You can throw your entire cluster at a MapReduce job, then use some of it on an Impala query and the rest on a Spark application, without any changes in configuration.

• You can take advantage of all the features of YARN schedulers for categorizing, isolating, and prioritizing workloads.

• Spark standalone mode requires each application to run an executor on every node in the cluster; with YARN, you choose the number of executors to use.

• YARN is the only cluster manager for Spark that supports security and Kerberized clusters. (Kerberos: a computer network authentication protocol which works on the basis of 'tickets' to allow nodes communicating over a non-secure network to prove their identity to one another in a secure manner.)

Page 26: Hadoop Spark Introduction-20150130

Job Scheduling - Within an Application

By default, Spark's scheduler runs jobs (actions) in FIFO fashion. Each job is divided into "stages" (e.g. map and reduce phases). The first job gets priority on all available resources while its stages have tasks to launch, then the second job gets priority, etc. As a result, later jobs may be delayed significantly!

Spark also provides a fair scheduler, which assigns tasks between jobs in a "round-robin" fashion, so that all jobs get a roughly equal share of cluster resources. To enable the fair scheduler, simply set:

val conf = new SparkConf().setMaster(...).setAppName(...)
conf.set("spark.scheduler.mode", "FAIR")
val sc = new SparkContext(conf)

Note: you need this only if your application runs more than one job at the same time (e.g. jobs submitted from multiple threads).

Page 27: Hadoop Spark Introduction-20150130

Job Scheduling - Scheduler Pools

The fair scheduler supports grouping jobs into pools and setting different scheduling options (e.g. weight) for each pool. This can be used, for example, to allocate resources equally among all users regardless of how many concurrent jobs they have.

Each pool supports three properties:
• schedulingMode
• weight
• minShare

See the next page for detailed descriptions.

Page 28: Hadoop Spark Introduction-20150130

Job Scheduling - Scheduler Pools

schedulingMode: FIFO or FAIR.

weight: Controls the pool's share of the cluster relative to other pools. By default, all pools have a weight of 1. If you give a specific pool a weight of 2, for example, it will get twice the resources of other active pools. Setting a high weight such as 1000 also makes it possible to implement priority between pools: in essence, the weight-1000 pool will always get to launch tasks first whenever it has jobs active.

minShare: Apart from an overall weight, each pool can be given a minimum share (as a number of CPU cores) that the administrator would like it to have. The fair scheduler always attempts to meet all active pools' minimum shares before redistributing extra resources according to the weights. The minShare property can therefore be another way to ensure that a pool always gets a certain amount of resources (e.g. 10 cores) quickly without giving it a high priority for the rest of the cluster. By default, each pool's minShare is 0.

Page 29: Hadoop Spark Introduction-20150130

Job Scheduling - Scheduler Pools

The pool properties can be set by creating an XML file and setting the spark.scheduler.allocation.file property in your SparkConf:

conf.set("spark.scheduler.allocation.file", "/path/to/file")

The format of the XML file, for example:

<allocations>
  <pool name="production">
    <schedulingMode>FAIR</schedulingMode>
    <weight>1</weight>
    <minShare>2</minShare>
  </pool>
  <pool name="test">
    <schedulingMode>FIFO</schedulingMode>
    <weight>2</weight>
    <minShare>3</minShare>
  </pool>
</allocations>
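Jobs are then assigned to a pool by setting a local property in the thread that submits them; a minimal sketch (assuming an existing JavaSparkContext sc configured with the allocation file above, and using the "production" pool it defines):

// Jobs submitted from this thread go to the "production" pool
sc.setLocalProperty("spark.scheduler.pool", "production");
// ... run jobs ...
// Clear the property to fall back to the default pool
sc.setLocalProperty("spark.scheduler.pool", null);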

Page 30: Hadoop Spark Introduction-20150130

Outline

What is Spark

Architecture

Application Workflow

Job Scheduling

Submitting Applications

Programming Guide

Page 31: Hadoop Spark Introduction-20150130

Submitting Applications

Once a user application is bundled, it can be launched using the bin/spark-submit script:

./bin/spark-submit \
  --class <main-class> \
  --master <master-url> \
  --deploy-mode <deploy-mode> \
  --conf <key>=<value> \
  ... # other options
  <application-jar> \
  [application-arguments]

See the next page for a detailed description of each argument.

This is the only way to submit jobs to Spark!

Page 32: Hadoop Spark Introduction-20150130

Submitting Applications

--class: The entry point for your application (e.g. org.apache.spark.examples.SparkPi)

--master: The master URL for the cluster (e.g. spark://23.195.26.187:7077)

--deploy-mode: Whether to deploy your driver on the worker nodes (cluster) or locally as an external client (client). Default: client.

--conf: Arbitrary Spark configuration property in key=value format. For values that contain spaces, wrap "key=value" in quotes.

application-jar: Path to a bundled jar including your application and all dependencies. The URL must be globally visible inside your cluster, for instance an hdfs:// path or a file:// path that is present on all nodes.

application-arguments: Arguments passed to the main method of your main class, if any.
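As a concrete illustration, running the bundled SparkPi example on a standalone cluster could look like the following (the master URL is the one from the table above; the jar path and the argument 100 are placeholders):

./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark://23.195.26.187:7077 \
  --deploy-mode client \
  --conf "spark.executor.memory=2g" \
  /path/to/spark-examples.jar \
  100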

Page 33: Hadoop Spark Introduction-20150130

Submitting Applications

Something to say about deploy mode:

When to use client mode? When you want to submit your application from a gateway machine that is physically co-located with your worker machines (e.g. the master node in a standalone EC2 cluster).

When to use cluster mode? When your application is submitted from a machine far from the worker machines (e.g. locally on your laptop).

In both cases the goal is to minimize network latency between the driver and the executors!

For more information, see

http://spark.apache.org/docs/1.2.0/submitting-applications.html

Page 34: Hadoop Spark Introduction-20150130

Outline

What is Spark

Architecture

Application Workflow

Job Scheduling

Submitting Applications

Programming Guide

Page 35: Hadoop Spark Introduction-20150130

Programming Guide - Linking with Spark

Spark 1.2.0 works with Java 6 and higher. If you are using Java 8, Spark supports lambda expressions for concisely writing functions; otherwise you can use the classes in the org.apache.spark.api.java.function package.

To write a Spark application in Java, you need to add a dependency on Spark. Spark is available through Maven Central at:

groupId = org.apache.spark, artifactId = spark-core_2.10, version = 1.2.0

If you wish to access an HDFS cluster, you also need a dependency on hadoop-client for your version of HDFS:

groupId = org.apache.hadoop, artifactId = hadoop-client, version = <your-hdfs-version>

Page 36: Hadoop Spark Introduction-20150130

Programming Guide - Linking with Spark

Spark 1.2.0 works with Java 6 and higher. If you are using Java 8, Spark supports lambda expressions for concisely writing functions; otherwise you can use the classes in the org.apache.spark.api.java.function package.

To write a Spark application in Java, you need to add a dependency on Spark. Spark is available through Maven Central at:

groupId = org.apache.spark, artifactId = spark-core_2.10, version = 1.2.0

If you wish to access an HDFS cluster, you also need a dependency on hadoop-client for your version of HDFS:

groupId = org.apache.hadoop, artifactId = hadoop-client, version = <your-hdfs-version>

I recommend using Maven for dependency management; it will save you a lot of time.
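For reference, those coordinates correspond roughly to the following pom.xml entries (the hadoop-client version is a placeholder you must fill in for your HDFS version):

<dependencies>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.10</artifactId>
    <version>1.2.0</version>
  </dependency>
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version><!-- your HDFS version --></version>
  </dependency>
</dependencies>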

Page 37: Hadoop Spark Introduction-20150130

Programming Guide

The first thing a Spark program must do is create a JavaSparkContext object, which tells Spark how to access a cluster:

SparkConf conf = new SparkConf().setAppName(appName).setMaster(master);
JavaSparkContext sc = new JavaSparkContext(conf);

Next, create RDDs. This can be done by:
• parallelizing an existing collection in your driver program
• referencing a dataset in a shared filesystem, HDFS, HBase, or any data source offering a Hadoop InputFormat

List<Integer> data = Arrays.asList(1, 2, 3, 4, 5);
JavaRDD<Integer> distData = sc.parallelize(data);

JavaRDD<String> distFile = sc.textFile("data.txt");

textFile takes a URI for the file (either a local path on the machine, or an hdfs://, s3n://, etc. URI) and reads it as a collection of lines.

Page 38: Hadoop Spark Introduction-20150130

Programming Guide

Next, use RDD operations:

JavaRDD<String> lines = sc.textFile("data.txt");
JavaRDD<Integer> lineLengths = lines.map(s -> s.length());
int totalLength = lineLengths.reduce((a, b) -> a + b);

The first line defines a base RDD from an external file. This dataset is not loaded in memory or otherwise acted on: lines is merely a pointer to the file.

The second line defines lineLengths as the result of a map transformation. Again, lineLengths is not immediately computed, due to laziness.

reduce is an action. At this point Spark breaks the computation into tasks to run on separate machines, and each machine runs both its part of the map and a local reduction, returning only its answer to the driver program.

Page 39: Hadoop Spark Introduction-20150130

Programming Guide

If we also want to use lineLengths again later, we could add

lineLengths.persist();

before calling reduce, which would cause lineLengths to be saved in memory after the first time it is computed.

For the full programming guide, see
http://spark.apache.org/docs/1.2.0/programming-guide.html

Page 40: Hadoop Spark Introduction-20150130

Issues I encountered in using Spark

• When there are more than (approx.) 120 input sequence files, the Spark task may hang (forever, or for several minutes before it fails). This is a bug; the workaround is to use coalesce, which circumvents the bug and, in addition, improves computation performance. See the sketch after this list.

• If you want to run Spark on YARN, your native libs must be placed in the right place; see the discussion at
http://apache-spark-user-list.1001560.n3.nabble.com/how-to-run-spark-job-on-yarn-with-jni-lib-td15146.html

• On YARN, I found no way to retrieve job status. In standalone mode, I grab the job status by parsing the HTML of the Web UI; not good, but I have no other way.
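A minimal sketch of the coalesce workaround above, assuming an existing JavaSparkContext sc; the path, the Text key/value types, and the target of 32 partitions are illustrative:

import org.apache.hadoop.io.Text;
import org.apache.spark.api.java.JavaPairRDD;

// Read the many sequence-file parts, then merge them into fewer partitions
JavaPairRDD<Text, Text> parts = sc.sequenceFile("hdfs:///path/to/input", Text.class, Text.class);
JavaPairRDD<Text, Text> merged = parts.coalesce(32);  // fewer partitions, no full shuffle
long n = merged.count();                              // subsequent actions run over the coalesced partitions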

Page 41: Hadoop Spark Introduction-20150130

Issues I encountered in using Spark

• Stability and pressure tests have not been conducted yet. I don't know what will happen if we let Spark stand there for 2 weeks, 3 weeks, or even longer, nor what will happen if we run a task with data >= 100 GB, or run 10 consecutive tasks each handling 10 GB of data.

• HA: you may run many Spark masters (one active and the others standby) using ZooKeeper for failover, but how do you get job status if the active master goes down?

Page 42: Hadoop Spark Introduction-20150130

Thank you!
