spark for faster batch processing

View Apache Spark and Scala course details at www.edureka.co/apache-spark-scala-training

Spark For Fast Batch Processing

Objectives

Let’s talk about:

What is Big Data?

Associated Challenges

What is Spark?

Why Spark?

Spark Ecosystem

Spark With Hadoop

Spark in Industry

RDDs – A Quick Look

Spark vs. MapReduce Performance – Demo

www.edureka.co/big-data-and-hadoop

What is Big Data?

Lots of Data (Terabytes or Petabytes)

Big data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications.

The challenges include capture, curation, storage, search, sharing, transfer, analysis, and visualization.

[Word cloud: cloud, tools, statistics, NoSQL, compression, storage, support, database, analyze, information, terabytes, processing, mobile, Big Data]

IBM’s Definition – Big Data Characteristics

http://www-01.ibm.com/software/data/bigdata/

VOLUME

Web logs

Images

Videos

Audios

Sensor Data

VARIETY

VELOCITY

VERACITY

Min   Max   Mean   SD
4.3   7.9   5.84   0.83
2.0   4.4   3.05   0.43
0.1   2.5   1.20   0.76

Associated Challenges


What is Spark?

Apache Spark is an open source, parallel data processing framework that complements Apache Hadoop to make it easy to develop fast, unified Big Data applications combining batch, streaming, and interactive analytics.

Developed at UC Berkeley

Written in Scala, a functional programming language that runs on the JVM

It generalizes the MapReduce framework


Why Spark?

Speed

Run programs up to 100x faster than Hadoop Map Reduce in memory, or 10x faster on disk.

Ease of Use

Supports several languages for developing Spark applications, including Scala, Java, and Python

Generality

Combine SQL, streaming, and complex analytics into one platform

Runs Everywhere

Spark runs on Hadoop, Mesos, standalone, or in the cloud.


Why Spark? - MapReduce Limitations

MapReduce is a great solution for one-pass computations, but not very efficient for use cases that require multi-pass computations and algorithms (machine learning, etc.)

To run complicated jobs, you have to string together a series of MapReduce jobs and execute them in sequence

Each of those jobs is high-latency, and none can start until the previous job has finished completely

The job output data between each step has to be stored in the distributed file system before the next step can begin

Hadoop requires the integration of several tools for different big data use cases (like Mahout for machine learning and Storm for streaming data processing)
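
The multi-pass limitation can be illustrated with a small plain-Python sketch. It does not use Hadoop or Spark; `FakeFileSystem`, `one_pass`, `chained_jobs`, and `in_memory` are hypothetical names invented for this illustration. It only counts how often a storage layer is touched when every pass is a separate job, compared with keeping the working set in memory between passes.

```python
import json


class FakeFileSystem:
    """Hypothetical stand-in for a file system; counts reads and writes."""

    def __init__(self):
        self.files = {}
        self.reads = 0
        self.writes = 0

    def write(self, path, records):
        # Serialize the records, as a job writing its output would.
        self.files[path] = json.dumps(records)
        self.writes += 1

    def read(self, path):
        self.reads += 1
        return json.loads(self.files[path])


def one_pass(records):
    """One pass of a made-up iterative algorithm: move each value halfway to the mean."""
    mean = sum(records) / len(records)
    return [(x + mean) / 2 for x in records]


def chained_jobs(fs, passes):
    """Chained-job style: every pass reads its input from storage and writes its output back."""
    path = "input"
    for i in range(passes):
        out = f"pass-{i}"
        fs.write(out, one_pass(fs.read(path)))
        path = out
    return fs.read(path)


def in_memory(fs, passes):
    """In-memory style: read the input once and keep the working set in memory."""
    records = fs.read("input")
    for _ in range(passes):
        records = one_pass(records)
    return records


data = [1.0, 2.0, 3.0, 6.0]

fs_chained = FakeFileSystem()
fs_chained.write("input", data)
result_chained = chained_jobs(fs_chained, passes=5)

fs_memory = FakeFileSystem()
fs_memory.write("input", data)
result_memory = in_memory(fs_memory, passes=5)

print(fs_chained.reads, fs_chained.writes)  # storage traffic grows with the number of passes
print(fs_memory.reads, fs_memory.writes)    # storage traffic stays constant
```

Both approaches compute the same result; the difference is only in how much data moves through storage, which is where the per-job latency described on this slide comes from.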


Spark Ecosystem

Spark Core Engine

Shark (SQL): Used for structured data. Can run unmodified Hive queries on an existing Hadoop deployment.

Spark Streaming (Streaming): Enables analytical and interactive apps for live streaming data.

MLlib (Machine Learning): Machine learning library being built on top of Spark. Provides support for many machine learning algorithms, with speeds up to 100 times faster than MapReduce.

GraphX (Graph Computation): Graph computation engine (similar to Giraph).

SparkR (R on Spark): Package for the R language that lets R users leverage Spark from the R shell.

BlinkDB (Approximate SQL): An approximate query engine that runs over the Spark core engine.

Some components are marked Alpha/Pre-alpha.


Spark Features

Spark takes MapReduce to the next level with less expensive shuffles in data processing and capabilities like in-memory data storage

Spark has an advanced DAG execution engine that supports acyclic data flow and in-memory computing

It’s designed to be an execution engine that works both in-memory and on-disk

Lazy evaluation of big data queries, which helps optimize the overall data processing workflow

Provides concise and consistent APIs in Scala, Java and Python

Offers an interactive shell for Scala and Python; this is not available for Java yet

Spark supports high-level APIs for developing applications (Scala, Java, Python, Clojure, R)
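
Lazy evaluation, mentioned in the list above, can be sketched in plain Python. This is a toy model under stated assumptions, not Spark's implementation or API: `LazyDataset` is a hypothetical class whose `map` and `filter` only record what should happen, and whose `take` action triggers the actual work.

```python
import itertools


class LazyDataset:
    """Toy dataset (hypothetical) that records transformations and runs them only on an action."""

    def __init__(self, source, steps=()):
        self._source = source  # any re-iterable, e.g. a list or range
        self._steps = steps    # tuple of ("map" | "filter", function)

    def map(self, fn):
        # Transformation: nothing is computed here, the step is only recorded.
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, pred):
        # Transformation: also only recorded.
        return LazyDataset(self._source, self._steps + (("filter", pred),))

    def _iterate(self):
        # Run every recorded step on one item at a time.
        for item in self._source:
            keep = True
            for kind, fn in self._steps:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):
                    keep = False
                    break
            if keep:
                yield item

    def take(self, n):
        """Action: evaluate only as much of the pipeline as needed for n results."""
        return list(itertools.islice(self._iterate(), n))


calls = []


def square(x):
    calls.append(x)  # record every real invocation
    return x * x


pipeline = LazyDataset(range(1, 1_000_001)).map(square).filter(lambda v: v % 2 == 0)
calls_before_action = len(calls)  # building the pipeline has not called square yet

first_three = pipeline.take(3)
calls_after_action = len(calls)   # only the items needed for three results were processed

print(first_three, calls_before_action, calls_after_action)
```

Because the whole pipeline is known before anything runs, the engine is free to do only the work an action actually needs, which is the optimization opportunity the slide refers to.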


Spark Architecture

Libraries on top of Spark Core: Spark Streaming, Spark SQL, BlinkDB, MLlib, GraphX, SparkR

Spark Core

Cluster management (native Spark cluster, YARN, Mesos)

Distributed storage (HDFS, Cassandra, S3, HBase)


Spark Advantages

EASE OF DEVELOPMENT: Easier APIs; Python, Scala, Java

IN-MEMORY PERFORMANCE: RDDs; DAGs unify processing

COMBINE WORKFLOWS: Shark, ML, Streaming, GraphX


Hadoop Advantages

UNLIMITED SCALE: Multiple data sources, multiple applications, multiple users

WIDE RANGE OF APPLICATIONS: Files, databases, semi-structured data

ENTERPRISE PLATFORM: Reliability, multi-tenancy, security


Spark + Hadoop

UNLIMITED SCALE

WIDE RANGE OF APPLICATIONS

ENTERPRISE PLATFORM

EASE OF DEVELOPMENT

COMBINE WORKFLOWS

IN-MEMORY PERFORMANCE

Operational Applications Augmented by In-Memory Performance


Spark in Industry


Resilient Distributed Datasets – A Quick Look

RDD (Resilient Distributed Dataset)

Resilient – If data in memory is lost, it can be recreated

Distributed – Stored in memory across the cluster

Dataset – Initial data can come from a file or be created programmatically

RDDs are the fundamental unit of data in Spark
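
The "resilient" property can be sketched with a toy model in plain Python. This is an illustration only, assuming that lost data is rebuilt by re-applying the recorded transformations to the original input; `ToyRDD`, `lose_partition`, and `computed_log` are hypothetical names, not part of Spark.

```python
class ToyRDD:
    """Toy partitioned dataset (hypothetical) that can rebuild lost partitions."""

    def __init__(self, partitions, lineage=()):
        self._source = partitions  # list of lists: the original input, split into partitions
        self._lineage = lineage    # functions applied, in order, to every element
        self._cache = {}           # partition index -> results held "in memory"
        self.computed_log = []     # which partitions were computed, for illustration

    def map(self, fn):
        # Returns a new ToyRDD; the existing one is never modified.
        return ToyRDD(self._source, self._lineage + (fn,))

    def _compute(self, index):
        self.computed_log.append(index)
        rows = self._source[index]
        for fn in self._lineage:
            rows = [fn(row) for row in rows]
        return rows

    def partition(self, index):
        # Recompute from the source only if the partition is not in memory.
        if index not in self._cache:
            self._cache[index] = self._compute(index)
        return self._cache[index]

    def collect(self):
        return [row for i in range(len(self._source)) for row in self.partition(i)]

    def lose_partition(self, index):
        """Simulate a failure that drops one in-memory partition."""
        self._cache.pop(index, None)


rdd = ToyRDD([[1, 2], [3, 4], [5, 6]]).map(lambda x: x * 10).map(lambda x: x + 1)

before = rdd.collect()   # computes all three partitions
rdd.lose_partition(1)    # partition 1 disappears from memory
after = rdd.collect()    # only partition 1 is recomputed

print(before, after, rdd.computed_log)
```

Only the missing partition is rebuilt; the partitions still held in memory are reused as they are.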


Resilient Distributed Datasets

Core concept of Spark framework.

RDDs can store any type of data.

Primitive types: Integer, Character, Boolean, etc.

Files: Text files, SequenceFiles, etc.

RDDs are fault tolerant.

RDDs are immutable.


Resilient Distributed Datasets

RDDs support two types of operations:

Transformation: A transformation doesn't return a single value; it returns a new RDD.

Some of the transformation functions are map, filter, flatMap, groupByKey, reduceByKey, aggregateByKey, pipe, and coalesce.

Action: An action evaluates the RDD and returns a value.

Some of the action operations are reduce, collect, count, first, take, countByKey, and foreach.
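
The difference between the two kinds of operations can be sketched with a word count in plain Python. The helpers `flat_map` and `reduce_by_key` are hypothetical stand-ins that mimic the behavior of the flatMap and reduceByKey transformations named above on ordinary lists; they are not the Spark API, and unlike Spark they run eagerly.

```python
from functools import reduce
from operator import add


def flat_map(fn, records):
    """Transformation-like helper: apply fn to each record and flatten the results."""
    return [item for record in records for item in fn(record)]


def reduce_by_key(fn, pairs):
    """Transformation-like helper: merge the values of each key with fn."""
    merged = {}
    for key, value in pairs:
        merged[key] = fn(merged[key], value) if key in merged else value
    return list(merged.items())


lines = ["spark is fast", "spark is general"]

# Transformations: each step produces a new dataset from the previous one.
words = flat_map(str.split, lines)        # like flatMap
pairs = [(word, 1) for word in words]     # like map
counts = reduce_by_key(add, pairs)        # like reduceByKey

# Actions: each step produces a plain value from a dataset.
distinct_words = len(counts)                       # like count
total_words = reduce(add, (n for _, n in counts))  # like reduce
first_pair = counts[0]                             # like first

print(counts, distinct_words, total_words, first_pair)
```

Each transformation step yields another collection that can feed further steps, while each action step ends the chain with a single result.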


Spark vs. MapReduce Performance – Demo


Course Topics

Module 1 » Introduction to Scala

Module 2 » Scala Essentials

Module 3 » Traits and OOPs in Scala

Module 4 » Functional Programming in Scala

Module 5 » Introduction to Big Data and Spark

Module 6 » Spark Baby Steps

Module 7 » Playing with RDDs

Module 8 » Spark with SQL - When Spark Meets Hive


Course Features

LIVE Online Class

Class Recording in LMS

24/7 Post Class Support

Module Wise Quiz

Project Work

Verifiable Certificate


Questions
