Spark for Faster Batch Processing

View Apache Spark and Scala course details at www.edureka.co/apache-spark-scala-training Spark For Fast Batch Processing

Upload: edureka

Post on 14-Aug-2015

TRANSCRIPT

Page 1: Spark For Faster Batch Processing

View Apache Spark and Scala course details at www.edureka.co/apache-spark-scala-training

Spark For Fast Batch Processing

Page 2: Spark For Faster Batch Processing

Slide 2 www.edureka.co/apache-spark-scala-training

Objectives

Let’s talk about:

What is Big Data?

Associated Challenges

What is Spark?

Why Spark?

Spark Ecosystem

Spark With Hadoop

Spark in Industry

RDDs – A Quick Look

Spark vs. MapReduce Performance - Demo

Page 3: Spark For Faster Batch Processing

Slide 3 www.edureka.co/big-data-and-hadoop

What is Big Data?

Lots of Data (Terabytes or Petabytes)

Big data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications

The challenges include capture, curation, storage, search, sharing, transfer, analysis, and visualization

[Word cloud: Big Data, cloud, tools, statistics, NoSQL, compression, storage, support, database, analyze, information, terabytes, processing, mobile]

Page 4: Spark For Faster Batch Processing

Associated Challenges

IBM’s Definition – Big Data Characteristics
http://www-01.ibm.com/software/data/bigdata/

VOLUME, VARIETY, VELOCITY, VERACITY

Data sources shown: web logs, images, videos, audio, sensor data

Summary statistics shown on the slide (row labels not recovered):

Min   Max   Mean   SD
4.3   7.9   5.84   0.83
2.0   4.4   3.05   0.43
0.1   2.5   1.20   0.76

Page 5: Spark For Faster Batch Processing

What is Spark?

Apache Spark is an open source, parallel data processing framework that complements Apache Hadoop to make it easy to develop fast, unified Big Data applications combining batch, streaming, and interactive analytics.

Developed at UC Berkeley

Written in Scala, a functional programming language that runs on the JVM

It generalizes the MapReduce framework

Page 6: Spark For Faster Batch Processing

Why Spark?

Speed

Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk.

Ease of Use

Supports several languages for developing applications, including Scala, Java, and Python

Generality

Combine SQL, streaming, and complex analytics into one platform

Runs Everywhere

Spark runs on Hadoop, Mesos, standalone, or in the cloud.

Page 7: Spark For Faster Batch Processing

Why Spark? - MapReduce Limitations

MapReduce is a great solution for one-pass computations, but not very efficient for use cases that require multi-pass computations and algorithms (machine learning, etc.)

To run complicated jobs, you would have to string together a series of MapReduce jobs and execute them in sequence

Each of those jobs is high-latency, and none can start until the previous job has finished completely

The job output data between each step has to be stored in the distributed file system before the next step can begin

Hadoop requires the integration of several tools for different big data use cases (like Mahout for machine learning and Storm for streaming data processing)
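
To make the chaining cost concrete, here is a minimal sketch in plain Python. It is not Hadoop or Spark code. The function names, the JSON files, and the temporary directory are all assumptions made for illustration: each "job" must write its full output to a file before the next one can read it, while the in-memory variant simply passes the data along.

```python
import json
import os
import tempfile


def run_chained_on_disk(records, steps, workdir):
    """Run every step as a separate 'job' that persists its whole output
    to a file before the next job is allowed to start."""
    path = os.path.join(workdir, "input.json")
    with open(path, "w") as f:
        json.dump(list(records), f)

    files_written = 0  # counts job outputs only, not the initial input
    for i, step in enumerate(steps, start=1):
        with open(path) as f:              # read the previous job's output
            data = json.load(f)
        output = [step(x) for x in data]
        path = os.path.join(workdir, f"job{i}.json")
        with open(path, "w") as f:         # persist before the next job runs
            json.dump(output, f)
        files_written += 1

    with open(path) as f:
        return json.load(f), files_written


def run_chained_in_memory(records, steps):
    """Apply the same steps while keeping intermediate data in memory."""
    data = list(records)
    for step in steps:
        data = [step(x) for x in data]
    return data


steps = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]

with tempfile.TemporaryDirectory() as workdir:
    disk_result, files_written = run_chained_on_disk([1, 2, 3], steps, workdir)

memory_result = run_chained_in_memory([1, 2, 3], steps)

print(disk_result, files_written)  # [1, 3, 5] 3
print(memory_result)               # [1, 3, 5]
```

Both variants produce the same result. The disk-based one pays one full write and one full read per step, which is the overhead the slide attributes to chained MapReduce jobs.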

Page 8: Spark For Faster Batch Processing

Spark Ecosystem

Spark Core Engine

Shark (SQL): Used for structured data. Can run unmodified Hive queries on an existing Hadoop deployment

Spark Streaming (Streaming): Enables analytical and interactive apps for live streaming data

MLlib (Machine learning): Machine learning library being built on top of Spark. Provides support for many machine learning algorithms, with speeds up to 100 times faster than MapReduce

GraphX (Graph Computation): Graph computation engine (similar to Giraph)

SparkR (R on Spark): Package for the R language that lets R users leverage Spark from the R shell

BlinkDB (Approximate SQL): An approximate query engine that runs over the core Spark engine

Legend: Alpha/Pre-alpha

Page 9: Spark For Faster Batch Processing

Spark Features

Spark takes MapReduce to the next level with less expensive shuffles in the data processing and capabilities like in-memory data storage

Spark has an advanced DAG execution engine that supports acyclic data flow and in-memory computing

It’s designed to be an execution engine that works both in-memory and on-disk

Lazy evaluation of big data queries, which helps optimize the overall data processing workflow

Provides concise and consistent APIs in Scala, Java and Python

Offers an interactive shell for Scala and Python; this is not yet available for Java

Spark supports high-level APIs for developing applications (Scala, Java, Python, Clojure, R)
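
Lazy evaluation can be illustrated with ordinary Python generators. This is only an analogy under stated assumptions, not Spark's API or its scheduler: the `traced` helper and the `evaluated` log are hypothetical names used to show which inputs were actually touched.

```python
from itertools import islice

evaluated = []  # every input element the pipeline actually touches


def traced(x):
    """Record that x was really computed, then pass it through."""
    evaluated.append(x)
    return x


# Describing the pipeline does no work: generators only record the steps.
source = (traced(x) for x in range(1, 1_000_001))
squared = (x * x for x in source)
even_squares = (x for x in squared if x % 2 == 0)

work_before_action = len(evaluated)  # 0, nothing has run yet

# The "action": asking for three results drives the pipeline, and only
# as many inputs as are needed get evaluated.
first_three = list(islice(even_squares, 3))

print(work_before_action)  # 0
print(first_three)         # [4, 16, 36]
print(evaluated)           # [1, 2, 3, 4, 5, 6]
```

Because the whole chain is known before anything runs, the engine can skip work that the final request does not need. Here only six of the one million inputs are ever computed.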

Page 10: Spark For Faster Batch Processing

Spark Architecture

Libraries: Spark Streaming, Spark SQL, BlinkDB, MLlib, GraphX, SparkR

Spark Core

Cluster management (native Spark cluster, YARN, Mesos)

Distributed storage (HDFS, Cassandra, S3, HBase)

Page 11: Spark For Faster Batch Processing

Spark Advantages

EASE OF DEVELOPMENT: Easier APIs; Python, Scala, Java

IN-MEMORY PERFORMANCE: RDDs; DAGs unify processing

COMBINE WORKFLOWS: Shark, ML, Streaming, GraphX

Page 12: Spark For Faster Batch Processing

Hadoop Advantages

UNLIMITED SCALE: Multiple data sources, multiple applications, multiple users

WIDE RANGE OF APPLICATIONS: Files, databases, semi-structured

ENTERPRISE PLATFORM: Reliability, multi-tenancy, security

Page 13: Spark For Faster Batch Processing

Spark + Hadoop

UNLIMITED SCALE

WIDE RANGE OF APPLICATIONS

ENTERPRISE PLATFORM

EASE OF DEVELOPMENT

COMBINE WORKFLOWS

IN-MEMORY PERFORMANCE

Operational Applications Augmented by In-Memory Performance

Page 14: Spark For Faster Batch Processing

Spark in Industry

Page 15: Spark For Faster Batch Processing

Resilient Distributed Datasets – A Quick Look

RDD (Resilient Distributed Dataset)

Resilient – If data in memory is lost, it can be recreated

Distributed – Stored in memory across the cluster

Dataset – Initial data can come from a file or be created programmatically

RDDs are the fundamental unit of data in Spark
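
The "resilient" property can be sketched with a toy class that remembers how its data was derived, so that a lost in-memory copy can be rebuilt. This is a simplified single-machine model, not Spark's implementation. `ToyDataset`, `from_list`, and `lose_memory` are hypothetical names introduced only for this example.

```python
class ToyDataset:
    """Toy stand-in for an RDD: it keeps the recipe (lineage) that produces
    its data, so a lost in-memory copy can be recomputed on demand."""

    def __init__(self, compute):
        self._compute = compute  # function that (re)creates the data
        self._cache = None       # in-memory copy, which may be lost

    @classmethod
    def from_list(cls, values):
        values = tuple(values)   # stands in for a stable source such as a file
        return cls(lambda: list(values))

    def map(self, fn):
        # The child refers back to its parent instead of copying the data.
        return ToyDataset(lambda: [fn(x) for x in self.collect()])

    def collect(self):
        if self._cache is None:  # data missing: rebuild it from the lineage
            self._cache = self._compute()
        return list(self._cache)

    def lose_memory(self):
        """Simulate losing the in-memory copy, for example after a node failure."""
        self._cache = None


base = ToyDataset.from_list([1, 2, 3])
doubled = base.map(lambda x: x * 2)

before = doubled.collect()
doubled.lose_memory()
base.lose_memory()
after = doubled.collect()  # recomputed from the source through the lineage

print(before, after)  # [2, 4, 6] [2, 4, 6]
```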

Page 16: Spark For Faster Batch Processing

Resilient Distributed Datasets

Core concept of Spark framework.

RDDs can store any type of data.

Primitive types: Integer, Character, Boolean, etc.

Files: text files, SequenceFiles, etc.

RDDs are fault tolerant.

RDDs are immutable.

Page 17: Spark For Faster Batch Processing

Resilient Distributed Datasets

RDDs support two types of operations:

Transformation: A transformation does not return a single value; it returns a new RDD.

Some of the transformation functions are map, filter, flatMap, groupByKey, reduceByKey, aggregateByKey, pipe, and coalesce.

Action: An action evaluates the RDD and returns a value.

Some of the action operations are reduce, collect, count, first, take, countByKey, and foreach.
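
The split between transformations and actions can be sketched with a small word count. `MiniRDD` below is a hypothetical, eager, single-machine class written for illustration. Its method names only mirror a few of the operations listed above (in snake_case), and unlike real Spark it computes each step immediately.

```python
from functools import reduce as fold


class MiniRDD:
    """Toy immutable collection: transformations return a new MiniRDD,
    actions return plain Python values."""

    def __init__(self, items):
        self._items = tuple(items)  # immutable snapshot of the data

    # Transformations: each one returns a new MiniRDD.
    def map(self, fn):
        return MiniRDD(fn(x) for x in self._items)

    def filter(self, pred):
        return MiniRDD(x for x in self._items if pred(x))

    def flat_map(self, fn):
        return MiniRDD(y for x in self._items for y in fn(x))

    def reduce_by_key(self, fn):
        merged = {}
        for key, value in self._items:
            merged[key] = fn(merged[key], value) if key in merged else value
        return MiniRDD(merged.items())

    # Actions: each one returns a value, not a MiniRDD.
    def collect(self):
        return list(self._items)

    def count(self):
        return len(self._items)

    def reduce(self, fn):
        return fold(fn, self._items)


lines = MiniRDD(["spark is fast", "spark is general"])
words = lines.flat_map(str.split)                  # transformation
pairs = words.map(lambda w: (w, 1))                # transformation
counts = pairs.reduce_by_key(lambda a, b: a + b)   # transformation

word_counts = dict(counts.collect())               # action
total_words = words.count()                        # action

print(word_counts)  # {'spark': 2, 'is': 2, 'fast': 1, 'general': 1}
print(total_words)  # 6
```

Each transformation leaves its input untouched, so `lines` still holds the two original strings after the pipeline has run.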

Page 18: Spark For Faster Batch Processing

Spark vs. MapReduce Performance - Demo

Page 19: Spark For Faster Batch Processing

Course Topics

Module 1 » Introduction to Scala

Module 2 » Scala Essentials

Module 3 » Traits and OOPs in Scala

Module 4 » Functional Programming in Scala

Module 5 » Introduction to Big Data and Spark

Module 6 » Spark Baby Steps

Module 7 » Playing with RDDs

Module 8 » Spark with SQL - When Spark Meets Hive

Page 20: Spark For Faster Batch Processing

Course Features

LIVE Online Class

Class Recording in LMS

24/7 Post Class Support

Module Wise Quiz

Project Work

Verifiable Certificate

Page 21: Spark For Faster Batch Processing

Questions
