Introduction to Spark
DESCRIPTION
This is a short introduction to Spark presented at Exebit 2014 at IIT Madras.
TRANSCRIPT
Introduction to Spark
Sriram and Amritendu
DOS Lab, IIT Madras
“Introduction to Spark” by Sriram and Amritendu is licensed under a Creative Commons Attribution 4.0 International License.
Motivation
• In Hadoop, the programmer writes a job using the MapReduce abstraction
• The runtime distributes the work and handles fault tolerance
• This makes analysis of large data sets easy and reliable
Emerging Class of Applications
• Machine learning, e.g. K-means clustering
• Graph algorithms, e.g. PageRank
• Intermediate results are reused across multiple computations
Nature of the emerging class of applications
Iterative Computation
Problem with Hadoop MapReduce
[Figure: each iteration reads (R) its input from HDFS and writes (W) its results back to HDFS before the next iteration starts]
• Results are written to HDFS after every iteration
• A new job is launched for each iteration
• This incurs substantial storage and job-launch overheads
Can we do away with these overheads?
• Persist intermediate results in memory
• Memory is 10-100X faster than disk/network
[Figure: iteration 1 loads (L) data from HDFS once; subsequent iterations read (R) from and write (W) to memory; an X marks a failed node]
• What if a node fails?
• Challenge: how to handle faults efficiently?
Approaches to handle faults
• Replication
  – With 'r' replicas, can tolerate 'r-1' failures
  – Issues: requires more storage, causes more network traffic
• Using lineage
  – Log the operation, not the data
  – Re-compute lost partitions using lineage information
  – Issues: recovery time can be high if re-computation is very costly (high iteration time, wide dependencies)
[Figure: replication scheme with a master writing through to Replica 1 and Replica 2; lineage scheme where a lost partition C2 is re-computed from its inputs D2 and D3, illustrating wide dependencies]
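The lineage idea above can be sketched in a few lines of Scala: instead of keeping a second copy of a partition's data, each partition records a recipe for recomputing itself from its parents. The `Partition` type here is purely illustrative, not Spark's API:

```scala
// Hypothetical sketch of lineage-based recovery: a partition stores a
// recipe (compute) rather than a replica of its data.
case class Partition[A](compute: () => Seq[A])

val d2 = Partition(() => Seq(1, 2)) // base partitions, re-readable from stable storage
val d3 = Partition(() => Seq(3, 4))

// c2 depends on d2 and d3; if its cached result is lost,
// rerunning compute() rebuilds it from its parents
val c2 = Partition(() => (d2.compute() ++ d3.compute()).map(_ * 10))

println(c2.compute()) // List(10, 20, 30, 40)
```

Nothing is stored twice: recovery trades extra computation for the storage and network cost of replication, which is exactly the trade-off the slide describes.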
Spark
• RDD – Resilient Distributed Datasets
  – Read-only, partitioned collection of records
  – Supports only coarse-grained operations, e.g. map and group-by transformations, reduce actions
  – Uses the lineage graph to recover from faults
[Figure: an RDD split into 3 partitions D11, D12, D13]
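The coarse-grained operations listed above have direct analogues in Scala's collections API. A minimal sketch, using a plain `List` as a stand-in for an RDD since no Spark runtime is assumed here:

```scala
// A List stands in for a partitioned RDD. Coarse-grained operations apply
// to the whole collection at once, never to individual records in place.
val records = List(("a", 1), ("b", 2), ("a", 3))

// map transformation: applied uniformly to every record
val doubled = records.map { case (k, v) => (k, v * 2) }

// group-by transformation: groups records by key
val grouped = doubled.groupBy(_._1)

// reduce action: folds all values down to a single result
val total = doubled.map(_._2).sum

println(doubled) // List((a,2), (b,4), (a,6))
println(total)   // 12
```

Because every operation is a pure function of the whole dataset, it is cheap to log ("map with f", "group by key"), which is what makes lineage-based recovery practical.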
Spark contd.
• Control placement of the partitions of an RDD
  – can specify the number of partitions
  – can partition based on a key in each record (useful in joins)
• In-memory storage
  – up to 100X speedup over Hadoop for iterative applications
• Spark can run on Hadoop YARN and read files from HDFS
• Spark is written in Scala
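Key-based partitioning can be illustrated in a few lines of plain Scala; Spark's built-in hash partitioner follows the same mod-the-hash-code idea. The `partitionByKey` helper below is hypothetical, for illustration only:

```scala
// Sketch of hash partitioning by key (plain Scala, not Spark's API).
// Records with the same key always land in the same partition, so a join
// on that key needs no cross-partition data movement.
def partitionByKey[K, V](records: Seq[(K, V)], numPartitions: Int): Map[Int, Seq[(K, V)]] =
  records.groupBy { case (k, _) =>
    // non-negative modulo of the key's hash code picks the partition
    ((k.hashCode % numPartitions) + numPartitions) % numPartitions
  }

val parts = partitionByKey(Seq(("x", 1), ("y", 2), ("x", 3)), 2)
// both ("x", _) records end up in the same partition
```

If two RDDs are partitioned with the same function and the same partition count, matching keys are guaranteed to be co-located, which is why this matters for joins.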
Scala overview
• Functional programming meets object orientation
• “No side effects” aids concurrent programming
• Every variable is an object
• Every function is a value
Variables and Functions

var obj: java.lang.String = "Hello"
var x = new A()

def square(x: Int): Int = {  // the ": Int" after the parameter list is the return type
  x * x
}
Execution of a function

scala> square(2)
res0: Int = 4

scala> square(square(6))
res1: Int = 1296

def square(x: Int): Int = {
  x * x
}
Nested Functions

def factorial(i: Int): Int = {
  def fact(i: Int, acc: Int): Int = {
    if (i <= 1)
      acc
    else
      fact(i - 1, i * acc)
  }
  fact(i, 1)
}
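The inner `fact` above is tail-recursive: the recursive call is the last thing it does, so the compiler can turn it into a loop. A sketch with an explicit `@tailrec` check (an addition of this example, not shown on the slide) and a usage example:

```scala
import scala.annotation.tailrec

def factorial(i: Int): Int = {
  @tailrec // compile-time check that fact really is tail-recursive
  def fact(i: Int, acc: Int): Int =
    if (i <= 1) acc
    else fact(i - 1, i * acc)
  fact(i, 1)
}

println(factorial(5)) // 120
```

If the recursion were not in tail position (e.g. `i * fact(i - 1)`), the `@tailrec` annotation would make the compiler reject the code rather than silently risk a stack overflow.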
Higher order map functions

val add = (x: Int) => x + 1
val lst = List(1, 2, 3)

lst.map(add)        // List(2, 3, 4)
lst.map(x => x + 1) // List(2, 3, 4)
lst.map(_ + 1)      // List(2, 3, 4)
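`map` is only one of the higher-order functions on Scala lists; `filter` and `foldLeft` take function arguments in exactly the same three styles. A short sketch:

```scala
val lst = List(1, 2, 3)

// filter keeps only the elements satisfying a predicate
val evens = lst.filter(_ % 2 == 0) // List(2)

// foldLeft threads an accumulator through the list,
// much like the acc parameter in the factorial example
val sum = lst.foldLeft(0)(_ + _)   // 6
```

These same names reappear as RDD operations in the Spark examples that follow, which is why the Scala detour is worthwhile.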
Defining Objects

object Example {
  def main(args: Array[String]) {
    val logData = sc.textFile(logFile, 2).cache()
    --------------
  }
}

Example.main(Array("master", "noOfMap", "noOfReducer"))
Spark: Filter transformation in RDD

val logData = sc.textFile(logFile, 2).cache()
val numAs = logData.filter(line => line.contains("a"))

"Give me those lines which contain 'a'"

Here is an example of the filter transformation: notice that the filter predicate is applied to each line, and a new RDD is returned.
Count

val logData = sc.textFile(logFile, 2).cache()
val numAs = logData.filter(line => line.contains("a"))
numAs.count()  // 5
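The distinction the two slides above rely on is that `filter` is a lazy transformation while `count` is an action that forces evaluation. A rough plain-Scala analogy using a lazy view (illustrative only; Spark implements its laziness differently, via the lineage graph):

```scala
val lines = List("spark is fast", "hadoop writes to disk", "spark caches in memory")

// transformation: lazily describes the filtered dataset; nothing is computed yet
val withSpark = lines.view.filter(_.contains("spark"))

// action: forces evaluation and returns a concrete value to the caller
val n = withSpark.size
println(n) // 2
```

Deferring work until an action runs lets Spark see the whole chain of transformations before executing anything, so it can plan and pipeline the computation.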
FlatMap

val logData = sc.textFile(logFile, 2).cache()
val words = logData.flatMap(line => line.split(" "))

Take each line, split it on spaces, and give back the array of words:
"Here is a example of filter map" → (Here, is, a, example, of, filter, map)
Wordcount Example in Spark

val spark = new SparkContext(master, appName, [sparkHome], [jars])
val file = spark.textFile("hdfs://[input_path_to_textfile]")
val counts = file.flatMap(line => line.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://[output_path]")
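The same word-count pipeline can be checked locally with ordinary Scala collections, where `groupBy` followed by a per-key sum plays the role of `reduceByKey`. A sketch, no Spark required:

```scala
val text = List("to be or not to be")

// flatMap + map as in the Spark version; reduceByKey is emulated
// by grouping on the key and summing the ones
val counts = text
  .flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .groupBy(_._1)
  .map { case (word, pairs) => (word, pairs.map(_._2).sum) }

println(counts("to")) // 2
```

The difference in Spark is that `reduceByKey` combines values within each partition before shuffling, so far less data crosses the network than a naive group-then-sum would move.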
Limitations
• RDDs are not suitable for applications that require fine-grained updates
  – e.g. a web storage system
References

• http://www.slideshare.net/tpunder/a-brief-intro-to-scala
• "Scala in Depth" by Joshua D. Suereth
• Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, and Ion Stoica, "Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing", in Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation (NSDI'12), USENIX Association, Berkeley, CA, USA, 2012.
• Pictures:
  – http://www.xbitlabs.com/images/news/2011-04/hard_disk_drive.jpg
  – http://www.thecomputercoach.net/assets/images/256_MB_DDR_333_Cl2_5_Pc2700_RAM_Chip_Brand_New_Chip.jpg