Getting Started with Apache Spark

Post on 15-Apr-2017


Apache Spark

Stream Programming and Distributed Data Processing

Habib Ahmed Bhutto

Senior Software Engineer

iConnect360

Outline

• What’s Spark
• Why Spark
• Fundamental concepts
• Cluster Deployment
• Spark Streaming
• Application Development
• Deployment
• Application Monitoring
• Debugging

What’s Spark

• Fast
• General-purpose engine
• For large-scale data processing
• In-memory processing
• Started as a research project at AMPLab, University of California, Berkeley
• Now an Apache project

Why Spark

• Speed
• Ease of use
• Generality
• Runs everywhere (Hadoop, Mesos, standalone or in cloud)
• Fault Tolerance
• Integration
• Deployment

Fundamental Concepts

• What exactly it does

Hadoop execution flow

Spark execution flow

Fundamental Concepts

• How exactly it does it

Fundamental Concepts

• Resilient Distributed Dataset (RDD)
  – Abstraction
  – Immutable
  – Partitioned collection
  – Operated on in parallel
• RDD Operations
  – Actions
  – Transformations
• Spark Context
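
The RDD properties listed above can be illustrated with a toy model in plain Python. This is a sketch, not Spark's API: `ToyRDD` and its internals are hypothetical names invented here to mirror the ideas of immutability, partitioning, lazy transformations, and actions that trigger work. In real PySpark the closest calls would be `sc.parallelize`, `map`, `filter`, `collect`, and `count`.

```python
from typing import Callable, Iterable, List


class ToyRDD:
    """Hypothetical stand-in for an RDD: immutable, partitioned, lazy."""

    def __init__(self, partitions: Iterable[Iterable], pipeline: tuple = ()):
        # Stored as tuples so the data is never mutated after creation.
        self._partitions = tuple(tuple(p) for p in partitions)
        # Recorded transformations; nothing runs until an action is called.
        self._pipeline = pipeline

    @classmethod
    def parallelize(cls, data: Iterable, num_partitions: int) -> "ToyRDD":
        items = list(data)
        # Round-robin split so every partition gets a share of the data.
        parts: List[list] = [items[i::num_partitions] for i in range(num_partitions)]
        return cls(parts)

    # Transformations: return a NEW ToyRDD and do no work yet.
    def map(self, fn: Callable) -> "ToyRDD":
        return ToyRDD(self._partitions, self._pipeline + (("map", fn),))

    def filter(self, pred: Callable) -> "ToyRDD":
        return ToyRDD(self._partitions, self._pipeline + (("filter", pred),))

    def _run_partition(self, part: tuple) -> list:
        out = list(part)
        for kind, fn in self._pipeline:
            if kind == "map":
                out = [fn(x) for x in out]
            else:
                out = [x for x in out if fn(x)]
        return out

    # Actions: run the recorded pipeline on every partition.
    def collect(self) -> list:
        # A real cluster would process partitions in parallel on executors.
        return [x for part in self._partitions for x in self._run_partition(part)]

    def count(self) -> int:
        return len(self.collect())


numbers = ToyRDD.parallelize(range(1, 7), num_partitions=2)
evens_squared = numbers.filter(lambda x: x % 2 == 0).map(lambda x: x * x)
print(evens_squared.collect())  # the action is what triggers the computation
```

Note that `filter` and `map` only record work; `numbers` itself is left unchanged, which is the sense in which the collection is immutable.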

Fundamental Concepts

• Driver Program
• Cluster Manager
• Worker Node
• Executor
• Job
• Stage
• Task
• Application Jar
• Deploy Mode
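
How a job relates to stages and tasks can be sketched with a simplified model in plain Python. It assumes two rules of thumb: a new stage begins wherever an operation needs a shuffle (a "wide" operation), and each stage runs one task per partition. The function names and the example job are hypothetical, and the real scheduler is more involved (for instance, partition counts can differ between stages).

```python
from typing import List, Tuple

# Each step is (operation name, is_wide). "Wide" means the step needs a shuffle.
Step = Tuple[str, bool]


def split_into_stages(steps: List[Step]) -> List[List[str]]:
    """Group consecutive steps into stages, cutting at each wide step."""
    stages: List[List[str]] = [[]]
    for name, is_wide in steps:
        if is_wide and stages[-1]:
            stages.append([])  # shuffle boundary: start a new stage
        stages[-1].append(name)
    return [stage for stage in stages if stage]


def count_tasks(stages: List[List[str]], num_partitions: int) -> int:
    """One task per partition per stage (assumes a fixed partition count)."""
    return len(stages) * num_partitions


# Hypothetical word-count style job.
job = [("flatMap", False), ("map", False), ("reduceByKey", True), ("mapValues", False)]
stages = split_into_stages(job)
total_tasks = count_tasks(stages, num_partitions=4)
print(stages, total_tasks)
```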

Cluster Deployment

• Standalone
• Amazon EC2
• Apache Mesos
• Hadoop Yarn

Cluster Deployment

• Master page to monitor your cluster
  – http://<server-url>:8080

Spark Streaming

• How it works

Spark Streaming

• How it works internally

Spark Streaming

• Discretised Streams (DStreams)
  – Abstraction
  – Continuous stream
  – Input data / processed data
  – Series of RDDs

Spark Streaming

• Any operation applied on a DStream translates to operations on the underlying RDDs
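
The idea that a DStream is a series of RDDs, and that a DStream operation is applied to each underlying RDD, can be sketched in plain Python. This is a toy model, not the Spark Streaming API: `discretize` and `dstream_map` are hypothetical helpers, each inner list stands in for one RDD, and the sample events are made up.

```python
from typing import Callable, List, Tuple

Event = Tuple[float, str]  # (arrival time in seconds, payload)


def discretize(events: List[Event], batch_interval: float) -> List[List[str]]:
    """Cut a continuous stream into consecutive batches, one per interval."""
    if not events:
        return []
    last_batch = int(max(t for t, _ in events) // batch_interval)
    batches: List[List[str]] = [[] for _ in range(last_batch + 1)]
    for t, payload in events:
        batches[int(t // batch_interval)].append(payload)
    return batches


def dstream_map(batches: List[List[str]], fn: Callable[[str], str]) -> List[List[str]]:
    """A 'DStream' operation is just the same operation on every batch."""
    return [[fn(x) for x in batch] for batch in batches]


events = [(0.2, "a"), (0.9, "b"), (1.1, "c"), (3.5, "d")]
batches = discretize(events, batch_interval=1.0)
upper = dstream_map(batches, str.upper)
print(batches)
print(upper)
```

An interval with no events still yields an (empty) batch, which matches the picture of a stream as an unbroken sequence of batches.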

Spark Streaming

• Window Operations
• Output Operations
• DataFrame and SQL Operations
  – DataFrame is an abstraction over which Spark SQL can act as a distributed SQL query engine.
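
A window operation combines several consecutive batches into one result. The sketch below models this in plain Python, assuming the window length and slide interval are both expressed as a number of batches; `windowed` is a hypothetical helper, not a Spark function (the Spark Streaming counterpart is `window(windowLength, slideInterval)` on a DStream).

```python
from typing import List


def windowed(batches: List[list], window_length: int, slide_interval: int) -> List[list]:
    """Merge the last `window_length` batches every `slide_interval` batches."""
    windows = []
    for end in range(slide_interval, len(batches) + 1, slide_interval):
        start = max(0, end - window_length)  # early windows may be shorter
        merged = [x for batch in batches[start:end] for x in batch]
        windows.append(merged)
    return windows


batches = [[1], [2, 3], [4], [5, 6]]
windows = windowed(batches, window_length=3, slide_interval=2)
window_sums = [sum(w) for w in windows]
print(windows)
print(window_sums)
```

Because the window (3 batches) is longer than the slide (2 batches), consecutive windows overlap and some elements are counted in both.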

Application Development

• Spark-Shell
  – Code in Scala with instant execution

Application Development

• Self-Contained Applications
  – Dependencies / Linking Libraries

Application Development

• Self-Contained Applications
  – A simple app

Application Development

• Self-Contained Applications
  – Packaging
  – Don’t forget app dependencies

Deployment

• That’s how you deploy
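
A packaged application is typically handed to the cluster with `spark-submit`. The sketch below only assembles such a command line in Python; it does not run it. The `--class`, `--master`, `--deploy-mode`, and `--conf` flags are real `spark-submit` options, while the jar path, class name, and memory setting are hypothetical values chosen for illustration.

```python
import shlex
from typing import Dict, List, Optional


def build_spark_submit(
    app_jar: str,
    main_class: str,
    master: str,
    deploy_mode: str = "client",
    conf: Optional[Dict[str, str]] = None,
    app_args: Optional[List[str]] = None,
) -> List[str]:
    """Assemble a spark-submit argument list (options first, then jar, then app args)."""
    cmd = [
        "spark-submit",
        "--class", main_class,
        "--master", master,
        "--deploy-mode", deploy_mode,
    ]
    for key, value in (conf or {}).items():
        cmd += ["--conf", f"{key}={value}"]
    cmd.append(app_jar)
    cmd += list(app_args or [])
    return cmd


cmd = build_spark_submit(
    app_jar="target/simple-app-1.0.jar",   # hypothetical jar built in the packaging step
    main_class="com.example.SimpleApp",    # hypothetical main class
    master="yarn",
    deploy_mode="cluster",
    conf={"spark.executor.memory": "2g"},  # example setting only
)
print(shlex.join(cmd))  # shell-ready form of the command
```

The resulting list could be passed to `subprocess.run(cmd)` on a machine where `spark-submit` is on the `PATH`.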

Application Monitoring

• Monitor your app
  – http://<driver-node>:4040

Application Monitoring

• History Server
  – Enable and start the History Server
  – http://<server-url>:18080
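
Enabling the History Server usually means turning on event logging for applications and pointing the server at the same log directory. The sketch below renders those settings as `spark-defaults.conf` lines. The three property names are real Spark settings; the log directory is a hypothetical placeholder, and the helper function is invented here for illustration.

```python
def history_server_conf(log_dir: str) -> str:
    """Render spark-defaults.conf lines that enable event logging and the History Server."""
    settings = {
        "spark.eventLog.enabled": "true",          # applications write event logs
        "spark.eventLog.dir": log_dir,             # where applications write them
        "spark.history.fs.logDirectory": log_dir,  # where the History Server reads them
    }
    return "\n".join(f"{key} {value}" for key, value in settings.items())


conf_text = history_server_conf("hdfs:///spark-logs")  # hypothetical directory
print(conf_text)
```

With these lines in `conf/spark-defaults.conf`, the server is started with `sbin/start-history-server.sh` from the Spark installation directory.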

Debugging

• Remote debugging
  – Enable remote debugging
  – Must be running on local[*]
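
One common way to enable remote debugging is to start the driver JVM with the JDWP agent so that an IDE can attach to it. The sketch below builds the environment for that, assuming the option is passed through the `SPARK_SUBMIT_OPTS` environment variable; port 5005 and the helper names are assumptions made for illustration.

```python
import os
from typing import Dict, Optional


def jdwp_option(port: int = 5005, suspend: bool = True) -> str:
    """JVM flag that makes the process listen for a debugger on `port`."""
    flag = "y" if suspend else "n"  # suspend=y waits for the debugger before running
    return f"-agentlib:jdwp=transport=dt_socket,server=y,suspend={flag},address={port}"


def debug_env(port: int = 5005, base_env: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Copy the environment and add the debug flag for spark-submit or spark-shell."""
    env = dict(os.environ if base_env is None else base_env)
    env["SPARK_SUBMIT_OPTS"] = jdwp_option(port)
    return env


env = debug_env(port=5005, base_env={})
print(env["SPARK_SUBMIT_OPTS"])
```

The returned dictionary could be passed as `env=` to `subprocess.run` when launching the application with `--master local[*]`, after which the IDE attaches to the chosen port.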

Running on Yarn

• Why run on Yarn?
  – Cluster resources
  – Schedulers
  – Security

Running on Yarn

• Standalone

Running on Yarn

• Yarn Architecture
  – Resource Manager
  – Node Manager
  – Application Master
  – Container

Running on Yarn

• Yarn Client Mode

Running on Yarn

• Yarn Cluster Mode

Running on Yarn

• Standalone vs Spark on Yarn

A Big Thank You

Spark it up

You got questions?
