Getting Started with Apache Spark

Apache Spark: Stream Programming and Distributed Data Processing

Habib Ahmed Bhutto, Senior Software Engineer

iConnect360

Outline

• What’s Spark
• Why Spark
• Fundamental Concepts
• Cluster Deployment
• Spark Streaming
• Application Development
• Deployment
• Application Monitoring
• Debugging

What’s Spark

• Fast
• General-purpose engine
• For large-scale data processing
• In-memory processing
• Started as a research project at AMPLab, University of California, Berkeley
• Now a top-level Apache project

Why Spark

• Speed
• Ease of use
• Generality
• Runs everywhere (Hadoop, Mesos, standalone, or in the cloud)
• Fault tolerance
• Integration
• Deployment

Fundamental Concepts

• What exactly it does

Hadoop execution flow

Spark execution flow

• How exactly it does it

• Resilient Distributed Dataset (RDD)
– Abstraction
– Immutable
– Partitioned collection
– Operated on in parallel

• RDD Operations
– Actions
– Transformations
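
The properties listed above (immutable, partitioned, transformations versus actions) can be illustrated with a toy model. This is a minimal sketch in plain Python, not Spark's API: `ToyRDD` is a hypothetical class invented for this example. Only the method names `map`, `filter`, `collect`, `count`, and `reduce` mirror real RDD operations. It assumes transformations are recorded lazily and only run when an action is called.

```python
from functools import reduce as _reduce


class ToyRDD:
    """Toy stand-in for an RDD: immutable, partitioned, lazily evaluated."""

    def __init__(self, partitions, ops=()):
        # Partitions are stored as tuples so they cannot be mutated in place.
        self._partitions = tuple(tuple(p) for p in partitions)
        # Pending transformations, recorded but not yet executed.
        self._ops = tuple(ops)

    # Transformations: return a new ToyRDD and compute nothing.
    def map(self, f):
        return ToyRDD(self._partitions, self._ops + (("map", f),))

    def filter(self, pred):
        return ToyRDD(self._partitions, self._ops + (("filter", pred),))

    def _compute_partition(self, part):
        # Replay the recorded transformations over one partition.
        items = list(part)
        for kind, f in self._ops:
            if kind == "map":
                items = [f(x) for x in items]
            else:
                items = [x for x in items if f(x)]
        return items

    # Actions: run the recorded transformations and return a value.
    def collect(self):
        # A real cluster would process partitions in parallel; here we loop.
        return [x for p in self._partitions for x in self._compute_partition(p)]

    def count(self):
        return len(self.collect())

    def reduce(self, f):
        return _reduce(f, self.collect())


numbers = ToyRDD([[1, 2, 3], [4, 5, 6]])  # two partitions
evens_squared = numbers.filter(lambda x: x % 2 == 0).map(lambda x: x * x)

print(evens_squared.collect())                   # prints [4, 16, 36]
print(evens_squared.reduce(lambda a, b: a + b))  # prints 56
print(numbers.collect())                         # prints [1, 2, 3, 4, 5, 6]
```

Nothing is computed when `filter` and `map` are called. The work happens inside `collect` and `reduce`, and the original `numbers` collection is never modified.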

• Spark Context

• Driver Program
• Cluster Manager
• Worker Node
• Executor
• Job
• Stage
• Task
• Application JAR
• Deploy Mode

Cluster Deployment

• Standalone
• Amazon EC2
• Apache Mesos
• Hadoop YARN

Cluster Deployment

• Master web UI for monitoring your cluster – http://<server-url>:8080

Spark Streaming

• How it works

Spark Streaming

• How it works internally

Spark Streaming

• Discretized Streams (DStreams)
– Abstraction for a continuous stream of data
– Represents either the input data or the processed data
– Internally, a series of RDDs

Spark Streaming

• Any operation applied on a DStream translates to operations on the underlying RDDs
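
As a rough illustration of that point, the sketch below models a stream as a list of batches and shows that mapping over the stream is the same as mapping over each batch. It is plain Python rather than Spark Streaming code. The helpers `discretize` and `dstream_map` are hypothetical names, and the events are assumed to carry non-negative integer timestamps in seconds.

```python
def discretize(events, batch_interval):
    """Group (timestamp, value) events into consecutive batches by time."""
    if not events:
        return []
    last_batch = max(ts for ts, _ in events) // batch_interval
    batches = [[] for _ in range(last_batch + 1)]
    for ts, value in events:
        batches[ts // batch_interval].append(value)
    return batches


def dstream_map(batches, f):
    """Applying f to the stream means applying it to every underlying batch."""
    return [[f(x) for x in batch] for batch in batches]


events = [(0, "a"), (1, "b"), (2, "c"), (5, "d")]  # (seconds, payload)
batches = discretize(events, batch_interval=2)
upper = dstream_map(batches, str.upper)

print(batches)  # prints [['a', 'b'], ['c'], ['d']]
print(upper)    # prints [['A', 'B'], ['C'], ['D']]
```

Each inner list plays the role of one RDD in the series, so a stream-level operation is a per-batch operation repeated once per batch interval.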

Spark Streaming

• Window Operations
• Output Operations
• DataFrame and SQL Operations
– Spark SQL provides the DataFrame abstraction and can also act as a distributed SQL query engine.
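
To make window operations more concrete, here is a small sketch of a sliding-window word count over batches. It is a plain-Python model, not Spark's API, and `windowed_counts` is a hypothetical helper. Window length and slide interval are expressed in number of batches. As a simplification, results are emitted only once a full window is available.

```python
from collections import Counter


def windowed_counts(batches, window_length, slide_interval):
    """Count words over a sliding window measured in batches."""
    results = []
    # Emit one result every slide_interval batches, once a full window exists.
    for end in range(window_length, len(batches) + 1, slide_interval):
        window = batches[end - window_length:end]
        counts = Counter(word for batch in window for word in batch)
        results.append(dict(counts))
    return results


batches = [["spark"], ["spark", "yarn"], ["mesos"], ["yarn"]]
result = windowed_counts(batches, window_length=3, slide_interval=1)

for counts in result:
    print(counts)
```

In Spark Streaming itself, the comparable operation on a DStream of key-value pairs is `reduceByKeyAndWindow`, which likewise takes a window length and a slide interval.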

Application Development

• spark-shell – Write Scala code and run it interactively

• Self-Contained Applications
– Dependencies / linking libraries
– A simple app
– Packaging: don’t forget the app’s dependencies

Deployment

• That’s how you deploy
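
As a rough sketch of what a deployment command can look like, assuming the packaged application is launched through Spark's `spark-submit` script, the snippet below assembles the command line. The flags `--class`, `--master`, and `--deploy-mode` are real `spark-submit` options. The JAR path, class name, and the helper `build_spark_submit` are hypothetical placeholders.

```python
import shlex


def build_spark_submit(app_jar, main_class, master, deploy_mode="client", app_args=()):
    """Return the argv list for a spark-submit invocation."""
    return [
        "spark-submit",
        "--class", main_class,
        "--master", master,
        "--deploy-mode", deploy_mode,
        app_jar,
        *app_args,
    ]


cmd = build_spark_submit(
    app_jar="target/my-app.jar",         # hypothetical path to the application JAR
    main_class="com.example.MyApp",      # hypothetical main class
    master="spark://<server-url>:7077",  # 7077 is the standalone master's default port
)

# Print a shell-quoted version that could be pasted into a terminal.
print(shlex.join(cmd))
```

Switching `master` to `yarn` or `local[*]`, or `deploy_mode` to `cluster`, changes where the driver and executors run without changing the application itself.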

Application Monitoring

• Monitor your app – http://<driver-node>:4040

Application Monitoring

• History Server
– Enable and start the History Server – http://<server-url>:18080
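
Enabling the History Server usually means turning on event logging and pointing the server at the same log directory. The sketch below renders the relevant `spark-defaults.conf` lines. The three property names are real Spark settings, while the directory value and the helper `history_server_conf` are assumptions made for illustration.

```python
def history_server_conf(log_dir):
    """Render spark-defaults.conf lines that turn on event logging."""
    settings = {
        "spark.eventLog.enabled": "true",
        "spark.eventLog.dir": log_dir,             # where applications write events
        "spark.history.fs.logDirectory": log_dir,  # where the History Server reads them
    }
    return "\n".join(f"{key} {value}" for key, value in settings.items())


# Assumed log location; any directory readable by the History Server would do.
conf_text = history_server_conf("file:///tmp/spark-events")
print(conf_text)
```

With those settings in place, the server is typically started with the `sbin/start-history-server.sh` script shipped with Spark.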

Debugging

• Remote debugging
– Enable remote debugging
– Must be running on local[*]
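
One common way to enable remote debugging on a JVM is the JDWP agent flag. The sketch below builds that flag and attaches it to a `spark-submit` command through the real `--driver-java-options` option. Port 5005 and the helper name `jdwp_option` are assumptions, not something the slides specify.

```python
def jdwp_option(port=5005, suspend=True):
    """Build the JVM flag that makes the process listen for a debugger."""
    return (
        "-agentlib:jdwp=transport=dt_socket,server=y,"
        f"suspend={'y' if suspend else 'n'},address={port}"
    )


driver_flag = jdwp_option()

# Run locally so the driver and the tasks share the one JVM being debugged.
submit_args = [
    "spark-submit",
    "--master", "local[*]",
    "--driver-java-options", driver_flag,
]

print(driver_flag)
```

With `suspend=y` the JVM waits until a debugger attaches to the chosen port, which makes it possible to hit breakpoints early in the application's startup.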

Running on YARN

• Why run on YARN?
– Cluster resources
– Schedulers
– Security

Running on YARN

• Standalone

Running on YARN

• YARN Architecture
– Resource Manager
– Node Manager
– Application Master
– Container

Running on YARN

• YARN Client Mode

Running on YARN

• YARN Cluster Mode

Running on YARN

• Standalone vs Spark on YARN

References

[1] Apache Spark official site, http://spark.apache.org/
[2] Introduction to Spark, http://www.slideshare.net/rahuldausa/introduction-to-apache-spark-39638645
[3] Running Spark on Yarn, http://badrit.com/blog/2015/2/29/running-spark-on-yarn#.VnEQub9eeaq
[4] Debugging Apache Spark Jobs, http://danosipov.com/?p=779
[5] Habib’s brain

A Big Thank You

Spark it up

You got questions?
