Introduction to Apache Spark v3 (2015-07-16)
Post on 16-Aug-2015
TRANSCRIPT
Presenter : Andrey Vykhodtsev
Andrey.vykhodtsev@si.ibm.com
*collective work, see slide credits
Two meetup groups
Close, but different topics
Run by me
I don’t have to be a presenter all the time
Propose your agenda
Not a Big Data introduction
Visit our next Big Data Essentials meetup instead http://www.meetup.com/Big-Data-Developers-in-Slovenia/events/223871144/
Not for people without technical background (sorry)
Not a thorough use case discussion
Just a technical overview of technology for beginners
General purpose distributed computing engine suitable for large scale machine learning and data processing tasks
Not so good:
Not the first computing engine (MapReduce, MPI came before)
Not one of a kind (Flink is similar)
Not so old (mature)
Good:
Developing very fast
Rapidly growing community
Backed by major vendors
Innovation
Designed for iterative data analysis on large scale (supersedes MR)
In-Memory Performance
Ease of Development
Combine Workflows
Unlimited Scale
Enterprise Platform
Wide Range of Data Formats
A Big Data/DWH developer
A Data Scientist
An Analytics Architect
A CxO of IT company
Statistician
Business Analyst
Software Engineer
IT words:
Data processing/Transformation
Machine Learning
Social Network Analysis
Streaming/Microbatching
Business words:
Segmentation
Campaign response prediction
Churn avoidance
CTR prediction
Behavioral analysis
Genomics
….
Open Source SystemML
Educate One Million Data Professionals
Establish Spark Technology Center
Founding Member of AMPLab
Contributing to the Core
Port many existing applications onto Spark
Develop applications using Spark
Distributed platform for thousands of nodes
Data storage and computation framework
Open source
Runs on commodity hardware
Flexible – everything is loosely coupled
Driving principles
Files are stored across the entire cluster
Programs are brought to the data, not the data to the program
Distributed file system (DFS) stores blocks across the whole cluster
Blocks of a single file are distributed across the cluster
A given block is typically replicated as well for resiliency
Just like a regular file system, the contents of a file are up to the application
Unlike a regular file system, you can ask it “where does each block of my file live?”
map(String key, String value):
  // key: document name
  // value: document contents
  for each word w in value:
    EmitIntermediate(w, "1");

reduce(String key, Iterator values):
  // key: a word
  // values: a list of counts
  int result = 0;
  for each v in values:
    result += ParseInt(v);
  Emit(AsString(result));
Content of input documents:
Hello World Bye World
Hello IBM
Map 1 emits: < Hello, 1> < World, 1> < Bye, 1> < World, 1>
Map 2 emits: < Hello, 1> < IBM, 1>
Reduce (final output): < Bye, 1> < IBM, 1> < Hello, 2> < World, 2>
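The word-count flow above can be sketched in plain Python (the helper names are illustrative, not Hadoop or Spark APIs; no cluster involved):

```python
from collections import defaultdict

def map_phase(doc):
    # Emit one (word, 1) pair per word, like EmitIntermediate(w, "1").
    return [(w, 1) for w in doc.split()]

def shuffle(pairs):
    # Group intermediate pairs by key, as the framework does between phases.
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return groups

def reduce_phase(groups):
    # Sum the counts for each word, like the reduce() pseudocode above.
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["Hello World Bye World", "Hello IBM"]
intermediate = [pair for d in docs for pair in map_phase(d)]
counts = reduce_phase(shuffle(intermediate))
print(counts)  # {'Hello': 2, 'World': 2, 'Bye': 1, 'IBM': 1}
```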
Spark brings two significant value-adds:
Bring to MapReduce the same added value that databases (and parallel databases) brought to query processing: let the app developer focus on the WHAT (they need to ask) and let the system figure out the HOW (it should be done).
Enable faster, higher-level application development through higher-level constructs and concepts (the RDD concept).
Let the system deal with performance (as part of the HOW):
Leveraging memory (bufferpools in a DBMS, caching RDDs in memory in Spark)
Maintaining sets of dedicated worker processes ready to go (subagents in a DBMS, Executors in Spark)
Enabling interactive processing (CLP, SQL*Plus, spark-shell, etc.)
Be one general-purpose engine for multiple types of workloads (SQL, Streaming, Machine Learning, etc.)
Apache Spark is a fast, general-purpose, easy-to-use cluster computing system for large-scale data processing.
Fast: leverages aggressively cached in-memory distributed computing and dedicated app Executor processes, kept alive even when no jobs are running; faster than MapReduce.
General purpose: covers a wide range of workloads; provides SQL, streaming and complex analytics.
Flexible and easier to use than MapReduce: Spark is written in Scala, an object-oriented, functional programming language; Scala, Python and Java APIs; Scala and Python interactive shells; runs on Hadoop, Mesos, standalone or in the cloud.
Logistic regression in Hadoop and Spark
Spark Stack
val wordCounts = sc.textFile("README.md")
  .flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey((a, b) => a + b)
WordCount
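As a rough plain-Python analogue of the Scala WordCount chain above (the input lines are made up; no Spark required):

```python
from itertools import chain

# `lines` stands in for sc.textFile("README.md") with made-up content.
lines = ["Apache Spark is fast", "Spark is general purpose"]

# flatMap: split each line into words and flatten into one stream
words = chain.from_iterable(line.split(" ") for line in lines)

# map + reduceByKey: pair each word with 1, then sum counts per word
word_counts = {}
for word, one in ((w, 1) for w in words):
    word_counts[word] = word_counts.get(word, 0) + one

print(word_counts["Spark"])  # 2
```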
Spark is versatile and flexible:
Can run on YARN / HDFS but also standalone or on MESOS
Spark engine can be exploited from multiple “entry points”: SQL, Streaming, Machine Learning, Graph Processing
Normally you code stuff up in one of the languages
Scala
Python
Java
I like Python, but in some cases it is slower
With DataFrames, no difference (more later)
One of the shells
Scala shell (spark-shell)
Python shell
Code it in the editor and submit with spark-submit
Use “notebook” (Jupyter, Zeppelin)
My preferred method. More later
Enable your IDE to run Spark
PyCharm
IntelliJ IDEA
Jupyter
Zeppelin (Scala; incubated)
Many others: Spark Notebook, ISpark
DataBricks Cloud
IBM Spark aaS
IBM Data Scientist Workbench
Good stuff:
Full API exposed
Concise language
Documentation is way better
Faster if you use plain RDDs
Build tools and dependency tracking
Not so good stuff:
Not so many additional libraries compared to Python (Pandas, Matplotlib)
Harder to run in a “notebook” (at the moment)
Harder to learn
Scala Crash Course
Holden Karau, DataBricks
http://lintool.github.io/SparkTutorial/slides/day1_Scala_crash_course.pdf
Martin Odersky’s “Functional Programming in Scala” course
Books: Scala for the Impatient; Scala by Example
Good stuff:
Clean and clear language
Easy to learn
Lots of libraries (Pandas, scikit-learn, matplotlib)
Easy to run in a “notebook”
Not so good stuff:
Slower (interpreted language)
Not all API functions exposed (e.g. Streaming)
Sometimes behaves differently
A way to connect to the Spark engine
Initialized with all runtime parameters
For example, memory parameters
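As an illustrative sketch (assuming PySpark is installed; the app name, master URL and memory values are made-up examples), a SparkContext with explicit runtime parameters might be created like this:

```python
from pyspark import SparkConf, SparkContext

# Illustrative configuration; the memory sizes are example values only.
conf = (SparkConf()
        .setAppName("MyApp")              # hypothetical application name
        .setMaster("local[2]")            # run locally with 2 threads
        .set("spark.executor.memory", "2g")
        .set("spark.driver.memory", "1g"))
sc = SparkContext(conf=conf)
```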
Resilient Distributed Dataset (RDD)
An abstraction over a generic data collection (integers, strings, …)
PairRDD: <key, value> pairs (supports additional operations)
A single logical entity, but under the hood a distributed collection
Example: an RDD “Names” split across three partitions:
Partition 1: Mokhtar, Jacques, Dirk
Partition 2: Cindy, Dan, Susan
Partition 3: Dirk, Frank, Jacques
You have to pay attention to what kind of operation you are running:
Transformations
Do not do anything until an action is called
Actions
Kick off computation
Results can be persisted to memory (cache) or to disk (more later)
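Laziness can be illustrated without Spark using plain Python generators (a sketch; `tag` is a hypothetical helper that records when an element is actually processed):

```python
log = []

def tag(x):
    log.append(x)          # record when an element is actually processed
    return x * 2

numbers = range(5)
doubled = (tag(x) for x in numbers)   # "transformation": nothing runs yet
assert log == []                      # no element processed so far

result = sum(doubled)                 # "action": the pipeline executes now
assert log == [0, 1, 2, 3, 4]
print(result)  # 20
```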
Three methods for creation:
1. Distributing a collection of objects from the driver program (using the parallelize method of the SparkContext):
val rddNumbers = sc.parallelize(1 to 10)
val rddLetters = sc.parallelize(List("a", "b", "c", "d"))
2. Loading an external dataset (file):
val quotes = sc.textFile("hdfs:/sparkdata/sparkQuotes.txt")
3. Transformation from another existing RDD:
val rddNumbers2 = rddNumbers.map(x => x + 1)
Transformations are lazy evaluations
Return a pointer to the transformed RDD
Pair RDD (K,V) functions for MapReduce-style transformations:
map
filter
flatMap
reduceByKey
sortByKey
join
See the docs for the full list
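To show what some of these pair transformations do, here are simplified plain-Python stand-ins (illustrative only, not the Spark APIs), operating on lists of (key, value) pairs:

```python
def reduce_by_key(pairs, f):
    # Combine values sharing a key with f, like rdd.reduceByKey(f).
    acc = {}
    for k, v in pairs:
        acc[k] = f(acc[k], v) if k in acc else v
    return list(acc.items())

def sort_by_key(pairs):
    # Order pairs by key, like rdd.sortByKey().
    return sorted(pairs, key=lambda kv: kv[0])

def join(left, right):
    # Inner join: one output pair per matching key combination.
    index = {}
    for k, v in right:
        index.setdefault(k, []).append(v)
    return [(k, (v, w)) for k, v in left for w in index.get(k, [])]

pairs = [("b", 1), ("a", 2), ("b", 3)]
print(reduce_by_key(pairs, lambda x, y: x + y))    # [('b', 4), ('a', 2)]
print(sort_by_key(pairs))                          # [('a', 2), ('b', 1), ('b', 3)]
print(join([("a", 1)], [("a", "x"), ("a", "y")]))  # [('a', (1, 'x')), ('a', (1, 'y'))]
```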
Actions kick off the computation (transformations are lazily evaluated):
collect()
count()
take()
reduce()
first()
saveAsTextFile()
Each node stores any partitions of the cache that it computes in memory
Reuses them in other actions on that dataset (or datasets derived from it)
Future actions are much faster (often by more than 10x)
Two methods for RDD persistence: persist() and cache()
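The benefit of persist()/cache() can be sketched in plain Python (no Spark; `expensive_dataset` is a made-up stand-in for an RDD that gets recomputed on every action):

```python
recomputations = 0

def expensive_dataset():
    # Stand-in for an uncached RDD lineage: recomputed on every use.
    global recomputations
    recomputations += 1
    return [x * x for x in range(100)]

# Without caching: every "action" recomputes the data from scratch.
total = sum(expensive_dataset())
n = len(expensive_dataset())
assert recomputations == 2

# With caching: compute once, reuse for later actions (like rdd.cache()).
cached = expensive_dataset()
total2 = sum(cached)
n2 = len(cached)
assert recomputations == 3   # only one extra computation for both actions
print(total, n)  # 328350 100
```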
How a job runs (scheduling pipeline), for example for:
rdd1.join(rdd2).groupBy(…).filter(…)
RDD Objects: build the operator DAG
DAGScheduler: splits the graph into stages of tasks and submits each stage as ready; agnostic to operators; rewinds to it when a stage fails
TaskScheduler: launches each TaskSet via the cluster manager; retries failed or straggling tasks; doesn’t know about stages
Worker: executes tasks in threads; stores and serves blocks (Block manager)
(Diagram: DataBricks)
Cluster architecture:
The Driver Program (holding the SparkContext) talks to a Cluster Manager, which allocates Executors on Worker Nodes; each Executor runs the app’s Tasks and holds a Cache.
Read the Fine Manual
https://spark.apache.org/docs/latest/index.html
Take the course
BigData University https://bigdatauniversity.com/bdu-wp/bdu-course/spark-fundamentals/
edX – edx.org search for Spark
If you’re stuck
Try the user lists : https://spark.apache.org/community.html