
Page 1: Big data processing with apache spark

BIG DATA PROCESSING WITH APACHE SPARK

December 9, 2015

LBS College of Engineering

www.sarithdivakar.info | www.csegyan.org

Page 2: Big data processing with apache spark

WHAT IS BIG DATA?

Terabytes of Data

Petabytes of Data

Exabytes of Data

Yottabytes of Data

Brontobytes of Data

Geobytes of Data

Page 3: Big data processing with apache spark

WHERE DOES BIG DATA COME FROM?

A huge amount of data is created every day!

It comes from us!

Non-digitized processes are becoming digitized

The Digital India Programme aims to transform India into a digitally empowered society and knowledge economy

Page 4: Big data processing with apache spark

EXAMPLES OF DIGITIZATION

Online banking

Online shopping

E-learning

Emails

Social media

The decreasing cost of storage and data-capture technology has opened up a new era of data revolution

Page 5: Big data processing with apache spark

TRENDS IN BIG DATA

Digitization of virtually everything, e.g. one's personal life

Page 6: Big data processing with apache spark

DATA TYPES

Structured: Databases, data warehouses, enterprise systems

Unstructured: Analog data, GPS tracking, audio/video streams, text files

Semi-structured: XML, email, EDI

Page 7: Big data processing with apache spark

KEY ENABLERS OF BIG DATA

Increase in storage capacities

Increase in processing power

Availability of Data

Page 8: Big data processing with apache spark

FEATURES OF BIG DATA GENERATED

Digitally generated

Passively produced

Automatically collected

Geographically or temporally trackable

Continuously analyzed

Page 9: Big data processing with apache spark
Page 10: Big data processing with apache spark

DIMENSIONS OF BIG DATA

Volume: Every minute, 72 hours of video are uploaded to YouTube

Variety: Excel tables & databases (structured); pure text, photo, audio, video, web, GPS data, sensor data, documents, SMS, etc. New data formats for new applications

Velocity: Batch processing is not always possible because data arrives as streams

Veracity/variability: Uncertainty inherent in some types of data

Value: Economic/business value of different data may vary

Page 11: Big data processing with apache spark

CHALLENGES IN BIG DATA

Capture

Storage

Search

Sharing

Transfer

Analysis

Visualization

Page 12: Big data processing with apache spark

NEED FOR BIG DATA ANALYTICS

Big Data needs to be captured, stored, organized and analyzed

It is large & complex: it cannot be managed with traditional methodologies or data mining tools

THEN:

Data warehousing, data mining & database technologies

Did not analyze email, PDF and video files

Worked with huge amounts of data

Prediction based on data

NOW:

Analyzing semi-structured and unstructured data

Access and store all the huge data created

Page 13: Big data processing with apache spark

BIG DATA ANALYTICS

Big Data analytics refers to tools and methodologies that aim to transform massive quantities of raw data into "data about data" for analytical purposes.

Discovery of meaningful patterns in data

Used for decision making

Page 14: Big data processing with apache spark

EXCAVATING HIDDEN TREASURES FROM BIG DATA

Insights into data can provide business advantage

Some key early indications can mean fortunes to business

More precise analysis with more data

Integrate Big Data with traditional data: Enhance business intelligence analysis

Page 15: Big data processing with apache spark
Page 16: Big data processing with apache spark

UNSTRUCTURED DATA TYPES

Email and other forms of electronic communication

Web-based content (click streams, social media)

Digitized audio and video

Machine-generated data (RFID, GPS, sensor-generated data, log files) and IoT

Page 17: Big data processing with apache spark

APPLICATIONS OF BIG DATA ANALYSIS

Business: Customer personalization, customer needs

Technology: Reduce process time

Health: DNA mining to detect hereditary diseases

Smart cities: Cities with good economic development and high quality of life could be analyzed

Oil and Gas: Analyzing sensor-generated data for production optimization, cost management, risk management, and healthy and safe drilling

Telecommunications: Network analytics and optimization from device, sensor and GPS data to enhance social and promotional opportunities

Page 18: Big data processing with apache spark

OPPORTUNITIES BIG DATA OFFERS

Early warning

Real-time awareness

Real-time feedback

Page 19: Big data processing with apache spark

CHALLENGES IN BIG DATA

Heterogeneity and incompleteness

Scale

Timeliness

Privacy

Human collaboration

Page 20: Big data processing with apache spark

BIG DATA AND CLOUD: CONVERGING TECHNOLOGIES

Big data: Extracting value out of the "variety, velocity and volume" of the unstructured information available

Cloud: On-demand, elastic, scalable, pay-per-use, self-service model

Page 21: Big data processing with apache spark

ANSWER THESE BEFORE MOVING TO BIG DATA ANALYSIS

Do you have an effective big data problem?

Can the business benefit from using Big Data?

Do your data volumes really require these distributed mechanisms?

Page 22: Big data processing with apache spark

TECHNOLOGY TO HANDLE BIG DATA

Google was the first company to effectively use big data

Engineers at Google created massively distributed systems

They collected and analyzed massive collections of web pages and the relationships between them, and created the Google search engine, capable of querying billions of pages

Page 23: Big data processing with apache spark
Page 24: Big data processing with apache spark

FIRST GENERATION OF DISTRIBUTED SYSTEMS

Proprietary

Custom Hardware and software

Centralized data

Hardware based fault recovery

E.g. Teradata, Netezza, etc.

Page 25: Big data processing with apache spark

SECOND GENERATION OF DISTRIBUTED SYSTEMS

Open source

Commodity hardware

Distributed data

Software based fault recovery

E.g. Hadoop, HPCC

Page 26: Big data processing with apache spark

APACHE HADOOP

Apache Hadoop is a framework that allows for the distributed processing of large data sets across clusters of commodity computers using a simple programming model.

Page 27: Big data processing with apache spark

HADOOP – KEY CHARACTERISTICS

Page 28: Big data processing with apache spark

HADOOP CORE COMPONENTS

Page 29: Big data processing with apache spark

HDFS ARCHITECTURE

Page 30: Big data processing with apache spark

SECONDARY NAMENODE

Page 31: Big data processing with apache spark

HADOOP CLUSTER ARCHITECTURE

Page 32: Big data processing with apache spark

HADOOP ECOSYSTEM

Page 33: Big data processing with apache spark

HADOOP CLUSTER MODES

Page 34: Big data processing with apache spark

MAP REDUCE PROGRAMMING

Page 35: Big data processing with apache spark

MAP REDUCE FLOW

Page 36: Big data processing with apache spark

EXISTING HADOOP CUSTOMERS

Page 37: Big data processing with apache spark

HADOOP VERSIONS

Page 38: Big data processing with apache spark

WHY DO WE NEED A NEW GENERATION?

A lot has changed since 2000

Both hardware and software have gone through changes

Big data has become a necessity now

Let's look at what has changed over the decade

Page 39: Big data processing with apache spark

CHANGES IN HARDWARE

State of hardware in 2000:

Disk was cheap, so disk was the primary source of data

Network was costly, so data locality mattered

RAM was very costly

Single-core machines were dominant

State of hardware now:

RAM is the king

RAM is the primary source of data and disk is used for fallback

The network is speedier

Multi-core machines are commonplace

Page 40: Big data processing with apache spark

SHORTCOMINGS OF SECOND GENERATION

Batch processing is the primary objective

Not designed to adapt to different use cases

Tight coupling between the API and the runtime

Do not exploit new hardware capabilities

Too complex

Page 41: Big data processing with apache spark

MAPREDUCE LIMITATIONS

If you wanted to do something complicated, you would have to string together a series of MapReduce jobs and execute them in sequence.

Each of those jobs has high latency, and none can start until the previous job has finished completely.

The job output between steps has to be stored in the distributed file system before the next step can begin.

Hence, this approach tends to be slow due to replication & disk storage.

Page 42: Big data processing with apache spark

HADOOP VS SPARK

Hadoop: Stores data on disk. Spark: Stores data in memory (RAM).

Hadoop: Commodity hardware can be utilized. Spark: Needs high-end systems with more RAM.

Hadoop: Uses replication to achieve fault tolerance. Spark: Uses a different data storage model (e.g. RDD) to achieve fault tolerance.

Hadoop: Processing speed is lower due to disk reads and writes. Spark: Up to 100x faster than Hadoop.

Hadoop: Supports only Java & R. Spark: Supports Java, Python, R, Scala, etc.; ease of programming is high.

Hadoop: Everything is just Map and Reduce. Spark: Supports Map, Reduce, SQL, Streaming, etc.

Hadoop: Data should be in HDFS. Spark: Data can be in HDFS, Cassandra, HBase or S3; runs on Hadoop, in the cloud, on Mesos or standalone.

Page 43: Big data processing with apache spark

THIRD GENERATION DISTRIBUTED SYSTEMS

Handle both batch processing and real-time processing

Exploit RAM as much as disk

Multiple core aware

Do not reinvent the wheel

They use

HDFS for storage

Apache Mesos / YARN for distribution

Plays well with Hadoop

Page 44: Big data processing with apache spark

APACHE SPARK

Open source Big Data processing framework

Apache Spark started as a research project in the AMPLab at UC Berkeley (whose team later founded Databricks), focusing on big data analytics.

Open sourced in early 2010.

Many of the ideas behind the system are presented in various research papers.

Page 45: Big data processing with apache spark

SPARK TIMELINE

Page 46: Big data processing with apache spark

SPARK FEATURES

Spark gives us a comprehensive, unified framework

Manage big data processing requirements with a variety of data sets, diverse both in nature (text data, graph data, etc.) and in source (batch vs. real-time streaming data).

Spark lets you quickly write applications in Java, Scala, or Python.

Page 47: Big data processing with apache spark

DIRECTED ACYCLIC GRAPH (DAG)

Spark allows programmers to develop complex, multi-step data pipelines using the directed acyclic graph (DAG) pattern.

It also supports in-memory data sharing across DAGs, so that different jobs can work with the same data.
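To make the DAG idea concrete, here is a minimal PySpark sketch (assuming an existing SparkContext sc and a hypothetical input file events.txt, neither of which comes from the slides): each transformation only adds a node to the DAG, and caching the intermediate RDD lets two separate jobs share the same in-memory data.

lines = sc.textFile("events.txt")                   # hypothetical input file
fields = lines.map(lambda line: line.split(","))    # transformation: adds a node to the DAG
valid = fields.filter(lambda f: len(f) > 2)         # another transformation, still nothing runs
valid.cache()                                       # mark the shared RDD to be kept in memory

total = valid.count()                                                               # job 1 materializes and caches the RDD
per_key = valid.map(lambda f: (f[0], 1)).reduceByKey(lambda a, b: a + b).collect()  # job 2 reuses the cached data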

Page 48: Big data processing with apache spark

UNIFIED PLATFORM FOR BIG DATA APPS

Page 49: Big data processing with apache spark

WHY DOES UNIFICATION MATTER?

Good for developers: one platform to learn

Good for users: take apps everywhere

Good for distributors: more apps

Page 50: Big data processing with apache spark

UNIFICATION BRINGS ONE ABSTRACTION

All the different processing systems in Spark share the same abstraction, called an RDD

RDD stands for Resilient Distributed Dataset

Because they share the same abstraction, you can mix and match different kinds of processing in the same application

Page 51: Big data processing with apache spark

SPAM DETECTION

Page 52: Big data processing with apache spark

RUNS EVERYWHERE

You can run Spark on top of any distributed system

It can run on

Hadoop 1.x

Hadoop 2.x

Apache Mesos

Its own cluster

It's just a user-space library

Page 53: Big data processing with apache spark

SMALL AND SIMPLE

Apache Spark is highly modular

The original version contained only 1600 lines of Scala code

The Apache Spark API is extremely simple compared to the Java API of MapReduce

API is concise and consistent

Page 54: Big data processing with apache spark

SPARK ARCHITECTURE

Page 55: Big data processing with apache spark

DATA STORAGE

Spark uses the HDFS file system for data storage.

It works with any Hadoop compatible data source including HDFS, HBase, Cassandra, etc.

Page 56: Big data processing with apache spark

API

The API enables application developers to create Spark-based applications using a standard API interface. Spark provides APIs for the Scala, Java, and Python programming languages.

Page 57: Big data processing with apache spark

RESOURCE MANAGEMENT

Spark can be deployed as a stand-alone server, or it can run on a distributed computing framework such as Mesos or YARN
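As a minimal sketch of this choice (host names and ports below are placeholders, not from the slides), the same PySpark application only changes the master URL passed to SparkContext to move between deployment modes, in the Spark 1.x style used throughout this deck.

from pyspark import SparkContext

sc = SparkContext("local[4]", "demo-app")                    # local mode with 4 worker threads
# sc = SparkContext("spark://master-host:7077", "demo-app")  # stand-alone Spark cluster
# sc = SparkContext("mesos://mesos-host:5050", "demo-app")   # Apache Mesos
# sc = SparkContext("yarn-client", "demo-app")               # Hadoop YARN (Spark 1.x client mode)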

Page 58: Big data processing with apache spark

SPARK RUNNING ARCHITECTURE

Page 59: Big data processing with apache spark

SPARK RUNNING ARCHITECTURE

Connects to a cluster manager, which allocates resources across applications

Acquires executors on cluster nodes: worker processes that run computations and store data

Sends application code to the executors

Sends tasks for the executors to run

Page 60: Big data processing with apache spark

SPARK RUNNING ARCHITECTURE

[Diagram: your program (sc = new SparkContext; f = sc.textFile("..."); f.filter(...).count(); ...) runs through the Spark client (app master), which builds the RDD graph and holds the scheduler, block tracker and shuffle tracker; it talks to the cluster manager and to the Spark workers, where task threads and the block manager operate over HDFS, HBase, ...]

Page 61: Big data processing with apache spark

SCHEDULING PROCESS

RDD Objects: form the DAG of the computation

DAGScheduler: splits the DAG into stages of tasks and submits each stage as it is ready; agnostic to operators; resubmits a stage if it fails

TaskScheduler: launches each TaskSet's tasks via the cluster manager and retries failed or straggling tasks; doesn't know about stages

Worker: executes tasks in threads and stores and serves blocks through the block manager

Page 62: Big data processing with apache spark

RDD - RESILIENT DISTRIBUTED DATASET

Resilient Distributed Datasets (RDD) are the primary abstraction in Spark – a fault-tolerant collection of elements that can be operated on in parallel

A big collection of data having the following properties:

Immutable

Lazily evaluated

Cacheable

Type inferred

Page 63: Big data processing with apache spark

RDD - RESILIENT DISTRIBUTED DATASET –TWO TYPES

Parallelized collections – take an existing Scala collection and run functions on it in parallel

Hadoop datasets / files – run functions on each record of a file in Hadoop distributed file system or any other storage system supported by Hadoop

Page 64: Big data processing with apache spark

SPARK COMPONENTS & ECOSYSTEM

Spark driver (context)

Spark DAG scheduler

Cluster management systems: YARN, Apache Mesos

Data sources: in memory, HDFS, NoSQL

Page 65: Big data processing with apache spark

ECOSYSTEM OF HADOOP & SPARK

Page 66: Big data processing with apache spark

CONTRIBUTORS PER MONTH TO SPARK

Page 67: Big data processing with apache spark

SPARK – STACK OVERFLOW ACTIVITY

Page 68: Big data processing with apache spark

IN MEMORY

In Spark, you can cache HDFS data in the main memory of worker nodes

Spark analysis can be executed directly on in-memory data

Shuffling can also be done from memory

Fault tolerant
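A minimal sketch of this, assuming an existing SparkContext sc and a hypothetical HDFS path: the first action pulls the data from HDFS and fills the worker caches, and the second action is served from memory.

logs = sc.textFile("hdfs:///data/logs.txt")                   # hypothetical HDFS path
logs.cache()                                                  # keep the RDD in worker memory once computed

errors = logs.filter(lambda line: "ERROR" in line).count()    # reads HDFS and populates the cache
warnings = logs.filter(lambda line: "WARN" in line).count()   # reuses the cached in-memory data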

Page 69: Big data processing with apache spark

INTEGRATION WITH HADOOP

No separate storage layer

Integrates well with HDFS

Can run on Hadoop 1.0 and Hadoop 2.0 YARN

Excellent integration with ecosystem projects like

Apache Hive, HBase etc

Page 70: Big data processing with apache spark

MULTI LANGUAGE API

Written in Scala, but the API is not limited to it

Offers API in

Scala

Java

Python

You can also run SQL queries using SparkSQL
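A minimal Spark SQL sketch in the Spark 1.x style of this presentation (the names and values are made up for illustration): an RDD of Row objects is turned into a DataFrame, registered as a temporary table, and queried with SQL.

from pyspark.sql import SQLContext, Row

sqlContext = SQLContext(sc)
people = sc.parallelize([Row(name='nithin', amount=25),
                         Row(name='appu', amount=40),
                         Row(name='anil', amount=20)])
df = sqlContext.createDataFrame(people)       # infer the schema from the Row objects
df.registerTempTable("people")                # expose the DataFrame to SQL
sqlContext.sql("SELECT name, SUM(amount) AS total FROM people GROUP BY name").show()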

Page 71: Big data processing with apache spark

SPARK – OPEN SOURCE ECOSYSTEM

Page 72: Big data processing with apache spark

SPARK SORT RECORD

Page 73: Big data processing with apache spark

PYTHON EXAMPLES

Page 74: Big data processing with apache spark

PYTHON EXAMPLES

Page 75: Big data processing with apache spark

PYTHON EXAMPLES

READ

f = open('demo.txt', 'r')
data = f.read()
print(data)

WRITE

f = open('demo.txt', 'a')
f.write('I am trying to write a file')
f.close()

Page 76: Big data processing with apache spark

PYTHON EXAMPLES

Page 77: Big data processing with apache spark

PYTHON EXAMPLES

Page 78: Big data processing with apache spark

RDD CREATION – FROM COLLECTIONS

Creating a collection:

A = range(1, 100000)
print(A)

Creating an RDD from the collection:

raw_data = sc.parallelize(A)

Count the number of elements in the RDD:

raw_data.count()

View the sample data:

raw_data.take(5)

Page 79: Big data processing with apache spark

RDD CREATION – FROM FILES

Getting the data file:

import urllib
f = urllib.urlretrieve("https://sparksarith.azurewebsites.net/Sarith/test.csv", "tv.csv")

Creating an RDD from a file:

data_file = "./tv.csv"
raw_data = sc.textFile(data_file)

Count the number of lines in the loaded file:

raw_data.count()

View the sample data:

raw_data.take(5)

Page 80: Big data processing with apache spark

IMMUTABILITY

Immutability means that once created, the data never changes

Big data is by default immutable in nature

Immutability helps with parallelism and caching

const int a = 0;  // immutable
int b = 0;        // mutable
b++;              // updated in place
c = a + 1;        // copy

Immutability is about the value, not the reference

Page 81: Big data processing with apache spark

IMMUTABILITY IN COLLECTIONS

Mutable:

var collection = [1,2,4,5]
for (i = 0; i < collection.length; i++) {
  collection[i] += 1;
}

Uses a loop for updating; the collection is updated in place.

Immutable:

val collection = [1,2,4,5]
val newCollection = collection.map(value => value + 1)

Uses a transformation for the change; creates a new copy of the collection and leaves the original intact.

Page 82: Big data processing with apache spark

CHALLENGES OF IMMUTABILITY

Immutability is great for parallelism but not good for space

Doing multiple transformations results in multiple copies of the data and multiple passes over the data

In big data, multiple copies and multiple passes will have poor performance characteristics.

Page 83: Big data processing with apache spark

LET’S GET LAZY

Laziness means not computing a transformation until it is needed

Laziness defers evaluation

Laziness allows separating execution from evaluation
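A minimal sketch of lazy evaluation, assuming an existing SparkContext sc: the transformations return immediately and only build up the plan, and nothing is actually computed until an action is called.

nums = sc.parallelize(range(1, 1000000))
squares = nums.map(lambda x: x * x)          # returns instantly, nothing is computed yet
big = squares.filter(lambda x: x > 100)      # still nothing computed, only the plan grows

big.count()                                  # the action triggers the whole computation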

Page 84: Big data processing with apache spark

LAZINESS AND IMMUTABILITY

You can be lazy only if the underlying data is immutable

You cannot combine transformations if a transformation has side effects

So combining laziness and immutability gives better performance and distributed processing.

Page 85: Big data processing with apache spark

CHALLENGES OF LAZINESS

Laziness poses challenges in terms of data type

If laziness defers execution, determining the type of the variable becomes challenging

If we can't determine the right type, semantic issues can slip through until runtime

Running big data programs and getting semantic errors at runtime is not fun.
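A small illustrative sketch of this problem (the data is made up): a transformation that can never succeed is accepted silently, and the error only surfaces much later when an action forces evaluation.

words = sc.parallelize(["1", "2", "three", "4"])
numbers = words.map(lambda s: int(s))        # no error here, map is lazy

numbers.collect()                            # only now does the ValueError for "three" surface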

Page 86: Big data processing with apache spark

TRANSFORMATIONS

Transformations are operations on an RDD that return a new RDD

By using the map transformation in Spark, we can apply a function to every element in our RDD

Collect will get all the elements in the RDD into memory to work with them

csv_data = raw_data.map(lambda x: x.split(","))

all_data = csv_data.collect()
all_data
len(all_data)

Page 87: Big data processing with apache spark

SET OPERATIONS ON RDD

Spark supports many of the operations we have in mathematical sets, such as union and intersection, even when the RDDs themselves are not properly sets

Union of RDDs doesn't remove duplicates

a = [1, 2, 3, 4, 5]
b = [1, 2, 3, 6]
dist_a = sc.parallelize(a)
dist_b = sc.parallelize(b)

subtract_data = dist_a.subtract(dist_b)
subtract_data.take(10)

union_data = dist_a.union(dist_b)
union_data.take(10)        # [1, 2, 3, 4, 5, 1, 2, 3, 6]

distinct_data = union_data.distinct()
distinct_data.take(10)     # [2, 4, 6, 1, 3, 5]

Page 88: Big data processing with apache spark

KEY VALUE PAIRS - RDD

Spark provides specific functions to deal with RDDs whose elements are key/value pairs

They are commonly used for grouping and aggregations

data = ['nithin,25', 'appu,40', 'anil,20', 'nithin,35', 'anil,30', 'anil,50']
raw_data = sc.parallelize(data)
raw_data.collect()

key_value = raw_data.map(lambda line: (line.split(',')[0], int(line.split(',')[1])))
grouped_data = key_value.reduceByKey(lambda x, y: x + y)
grouped_data.collect()
grouped_data.keys().collect()
grouped_data.values().collect()

sorted_data = grouped_data.sortByKey()
sorted_data.collect()

Page 89: Big data processing with apache spark

CACHING

Immutable data allows you to cache data for a long time

Lazy transformations allow data to be recreated on failure

Transformations can also be saved

Caching data improves execution engine performance

raw_data.cache()
raw_data.collect()

Page 90: Big data processing with apache spark

SAVING YOUR DATA

saveAsTextFile(path) is used for storing the RDD on your hard disk

The path is a directory, and Spark will write multiple files under that directory. This allows Spark to write the output from multiple nodes.

raw_data.saveAsTextFile('opt')

Page 91: Big data processing with apache spark

SPARK EXECUTION MODEL

Create DAG of RDDs to represent computation

Create logical execution plan for DAG

Schedule and execute individual tasks
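To see this model from the PySpark shell, a minimal sketch (assuming an existing SparkContext sc): toDebugString() prints the lineage of RDDs that Spark turns into the execution plan, with the shuffle marking the stage boundary.

pairs = sc.parallelize(["a b", "b c", "a c"]) \
          .flatMap(lambda line: line.split(" ")) \
          .map(lambda w: (w, 1)) \
          .reduceByKey(lambda x, y: x + y)

print(pairs.toDebugString())    # shows the DAG of RDDs and where the shuffle splits it into stages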

Page 92: Big data processing with apache spark

STEP 1: CREATE RDDS

Page 93: Big data processing with apache spark

DEPENDENCY TYPES

"Narrow" dependencies (no shuffle): map, filter; union; join with inputs co-partitioned

"Wide" (shuffle) dependencies: groupByKey; join with inputs not co-partitioned
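A minimal sketch contrasting the two kinds of dependencies, assuming an existing SparkContext sc: map and filter are narrow (each output partition depends on a single input partition), while groupByKey is wide and forces a shuffle.

rdd = sc.parallelize([("a", 1), ("b", 2), ("a", 3)], 2)

narrow = rdd.map(lambda kv: (kv[0], kv[1] * 10)) \
            .filter(lambda kv: kv[1] > 10)          # narrow dependencies, no shuffle

wide = rdd.groupByKey()                             # wide dependency, triggers a shuffle
wide.mapValues(list).collect()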

Page 94: Big data processing with apache spark

STEP 2: CREATE EXECUTION PLAN

Pipeline as much as possible

Split into “stages” based on need to reorganize data

Page 95: Big data processing with apache spark

STEP 3: SCHEDULE TASKS

Split each stage into tasks

A task is data + computation

Execute all tasks within a stage before moving on

Page 96: Big data processing with apache spark

SPARK SUBMIT

from pyspark import SparkContext

sc = SparkContext('local', 'pyspark')

raw_data = sc.textFile("./bigdata.txt")

shows = raw_data.map(lambda line: (line.split(',')[4],1))

shows.take(5)
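A brief usage note, assuming the lines above are saved in a script file with a hypothetical name such as shows.py: instead of being typed into the interactive pyspark shell, the script can be handed to Spark with ./bin/spark-submit shows.py, and the SparkContext('local', 'pyspark') line takes the place of the context that the shell would otherwise create automatically.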

Page 97: Big data processing with apache spark

STEP 3: SCHEDULE TASKS

Page 98: Big data processing with apache spark

WHO IS USING SPARK

Page 99: Big data processing with apache spark
Page 100: Big data processing with apache spark

SPARK INSTALLATION

INSTALL JDK

sudo apt-get install openjdk-7-jdk

INSTALL SCALA

sudo apt-get install scala

INSTALLING MAVEN

wget http://mirrors.sonic.net/apache/maven/maven-3/3.3.3/binaries/apache-maven-3.3.3-bin.tar.gz

tar -zxf apache-maven-3.3.3-bin.tar.gz

sudo cp -R apache-maven-3.3.3 /usr/local

sudo ln -s /usr/local/apache-maven-3.3.3/bin/mvn /usr/bin/mvn

mvn -v

Page 101: Big data processing with apache spark

SPARK INSTALLATION

INSTALLING GIT

sudo apt-get install git

CLONE SPARK PROJECT FROM GITHUB

git clone https://github.com/apache/spark.git

cd spark

BUILD SPARK PROJECT

build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Phive -Phive-thriftserver -DskipTests clean package

To start the Spark cluster: ./sbin/start-all.sh

To start the PySpark shell: ./bin/pyspark

Page 102: Big data processing with apache spark

REFERENCES

1. “Data Mining and Data Warehousing”, M.Sudheep Elayidom, SOE, CUSAT

2. "Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing". Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, Ion Stoica. NSDI 2012. April 2012. Best Paper Award.

3. “What is Big Data”, https://www-01.ibm.com/software/in/data/bigdata/

4. “Apache Hadoop”, https://hadoop.apache.org/

5. “Apache Spark”, http://spark.apache.org/

6. "Spark: Cluster Computing with Working Sets". Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, Ion Stoica. HotCloud 2010. June 2010.

Page 103: Big data processing with apache spark

CREDITS

Dr. M Sudheep Elayidom, Associate Professor, Div Of Computer Science & Engg, SOE, CUSAT

Nithink K Anil, Quantiph, Mumbai, Maharashtra, India

Lija Mohan, Div Of Computer Science & Engg, SOE, CUSAT