Introduction to Spark
DESCRIPTION
This is a short introduction to Spark presented at Exebit 2014 at IIT Madras.
TRANSCRIPT
Introduction to Spark
Sriram and Amritendu
DOS Lab, IIT Madras
“Introduction to Spark” by Sriram and Amritendu is licensed under a Creative Commons Attribution 4.0 International License.
Motivation
• In Hadoop, the programmer writes a job using the MapReduce abstraction
• The runtime distributes the work and handles fault tolerance
• This makes analysis of large data sets easy and reliable
Emerging Class of Applications
• Machine learning, e.g. K-means clustering
• Graph algorithms, e.g. PageRank
• Intermediate results are reused across multiple computations
Nature of the emerging class of applications
Iterative Computation
Problem with Hadoop MapReduce
[Figure: each iteration reads (R) its input from HDFS and writes (W) its results back to HDFS before the next iteration starts]
• Results are written to HDFS after every iteration
• A new job is launched for each iteration
• This incurs substantial storage and job-launch overheads
Can we do away with these overheads?
• Persist intermediate results in memory
• Memory is 10-100X faster than disk/network
[Figure: iteration 1 loads (L) data from HDFS once; subsequent iterations read (R) from and write (W) to memory; an X marks a failed node]
• What if a node fails?
• Challenge: how to handle faults efficiently?
Approaches to handle faults
• Replication
  – With 'r' replicas, can tolerate 'r-1' failures
  – Issues: requires more storage, causes more network traffic
• Using lineage
  – Log the operation, not the data
  – Re-compute lost partitions using lineage information
  – Issues: recovery time can be high if re-computation is very costly (high iteration time, wide dependencies)
[Figure: replication scheme with a master writing through to Replica 1 and Replica 2; lineage scheme where a lost partition C2 is re-computed from its inputs D2 and D3, illustrating wide dependencies]
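The lineage idea above can be sketched in a few lines of Scala: instead of keeping a second copy of a partition's data, each partition records a recipe for recomputing itself from its parents. The `Partition` type here is purely illustrative, not Spark's API:

```scala
// Hypothetical sketch of lineage-based recovery: a partition stores a
// recipe (compute) rather than a replica of its data.
case class Partition[A](compute: () => Seq[A])

val d2 = Partition(() => Seq(1, 2)) // base partitions, re-readable from stable storage
val d3 = Partition(() => Seq(3, 4))

// c2 depends on d2 and d3; if its cached result is lost,
// rerunning compute() rebuilds it from its parents
val c2 = Partition(() => (d2.compute() ++ d3.compute()).map(_ * 10))

println(c2.compute()) // List(10, 20, 30, 40)
```

Nothing is stored twice: recovery trades extra computation for the storage and network cost of replication, which is exactly the trade-off the slide describes.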
Spark
• RDD – Resilient Distributed Datasets
  – Read-only, partitioned collection of records
  – Supports only coarse-grained operations, e.g. map and group-by transformations, reduce actions
  – Uses the lineage graph to recover from faults
[Figure: an RDD split into 3 partitions D11, D12, D13]
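The coarse-grained operations listed above have direct analogues in Scala's collections API. A minimal sketch, using a plain `List` as a stand-in for an RDD since no Spark runtime is assumed here:

```scala
// A List stands in for a partitioned RDD. Coarse-grained operations apply
// to the whole collection at once, never to individual records in place.
val records = List(("a", 1), ("b", 2), ("a", 3))

// map transformation: applied uniformly to every record
val doubled = records.map { case (k, v) => (k, v * 2) }

// group-by transformation: groups records by key
val grouped = doubled.groupBy(_._1)

// reduce action: folds all values down to a single result
val total = doubled.map(_._2).sum

println(doubled) // List((a,2), (b,4), (a,6))
println(total)   // 12
```

Because every operation is a pure function of the whole dataset, it is cheap to log ("map with f", "group by key"), which is what makes lineage-based recovery practical.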
Spark contd.
• Control placement of the partitions of an RDD
  – can specify the number of partitions
  – can partition based on a key in each record (useful in joins)
• In-memory storage
  – up to 100X speedup over Hadoop for iterative applications
• Spark can run on Hadoop YARN and read files from HDFS
• Spark is written in Scala
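Key-based partitioning can be illustrated in a few lines of plain Scala; Spark's built-in hash partitioner follows the same mod-the-hash-code idea. The `partitionByKey` helper below is hypothetical, for illustration only:

```scala
// Sketch of hash partitioning by key (plain Scala, not Spark's API).
// Records with the same key always land in the same partition, so a join
// on that key needs no cross-partition data movement.
def partitionByKey[K, V](records: Seq[(K, V)], numPartitions: Int): Map[Int, Seq[(K, V)]] =
  records.groupBy { case (k, _) =>
    // non-negative modulo of the key's hash code picks the partition
    ((k.hashCode % numPartitions) + numPartitions) % numPartitions
  }

val parts = partitionByKey(Seq(("x", 1), ("y", 2), ("x", 3)), 2)
// both ("x", _) records end up in the same partition
```

If two RDDs are partitioned with the same function and the same partition count, matching keys are guaranteed to be co-located, which is why this matters for joins.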
Scala overview
• Functional programming meets object orientation
• “No side effects” aids concurrent programming
• Every variable is an object
• Every function is a value
Variables and Functions

var obj: java.lang.String = "Hello"
var x = new A()

def square(x: Int): Int = {  // the ": Int" after the parameter list is the return type
  x * x
}
Execution of a function

scala> square(2)
res0: Int = 4

scala> square(square(6))
res1: Int = 1296

def square(x: Int): Int = {
  x * x
}
Nested Functions

def factorial(i: Int): Int = {
  def fact(i: Int, acc: Int): Int = {
    if (i <= 1)
      acc
    else
      fact(i - 1, i * acc)
  }
  fact(i, 1)
}
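The inner `fact` above is tail-recursive: the recursive call is the last thing it does, so the compiler can turn it into a loop. A sketch with an explicit `@tailrec` check (an addition of this example, not shown on the slide) and a usage example:

```scala
import scala.annotation.tailrec

def factorial(i: Int): Int = {
  @tailrec // compile-time check that fact really is tail-recursive
  def fact(i: Int, acc: Int): Int =
    if (i <= 1) acc
    else fact(i - 1, i * acc)
  fact(i, 1)
}

println(factorial(5)) // 120
```

If the recursion were not in tail position (e.g. `i * fact(i - 1)`), the `@tailrec` annotation would make the compiler reject the code rather than silently risk a stack overflow.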
Higher order map functions

val add = (x: Int) => x + 1
val lst = List(1, 2, 3)

lst.map(add)        // List(2, 3, 4)
lst.map(x => x + 1) // List(2, 3, 4)
lst.map(_ + 1)      // List(2, 3, 4)
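`map` is only one of the higher-order functions on Scala lists; `filter` and `foldLeft` take function arguments in exactly the same three styles. A short sketch:

```scala
val lst = List(1, 2, 3)

// filter keeps only the elements satisfying a predicate
val evens = lst.filter(_ % 2 == 0) // List(2)

// foldLeft threads an accumulator through the list,
// much like the acc parameter in the factorial example
val sum = lst.foldLeft(0)(_ + _)   // 6
```

These same names reappear as RDD operations in the Spark examples that follow, which is why the Scala detour is worthwhile.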
Defining Objects

object Example {
  def main(args: Array[String]) {
    val logData = sc.textFile(logFile, 2).cache()
    --------------
  }
}

Example.main(Array("master", "noOfMap", "noOfReducer"))
Spark: Filter transformation in RDD

val logData = sc.textFile(logFile, 2).cache()
val numAs = logData.filter(line => line.contains("a"))

"Give me those lines which contain 'a'"

Here is an example of the filter transformation: notice that the filter predicate is applied to each line, and a new RDD is returned.
Count

val logData = sc.textFile(logFile, 2).cache()
val numAs = logData.filter(line => line.contains("a"))
numAs.count()  // 5
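The distinction the two slides above rely on is that `filter` is a lazy transformation while `count` is an action that forces evaluation. A rough plain-Scala analogy using a lazy view (illustrative only; Spark implements its laziness differently, via the lineage graph):

```scala
val lines = List("spark is fast", "hadoop writes to disk", "spark caches in memory")

// transformation: lazily describes the filtered dataset; nothing is computed yet
val withSpark = lines.view.filter(_.contains("spark"))

// action: forces evaluation and returns a concrete value to the caller
val n = withSpark.size
println(n) // 2
```

Deferring work until an action runs lets Spark see the whole chain of transformations before executing anything, so it can plan and pipeline the computation.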
FlatMap

val logData = sc.textFile(logFile, 2).cache()
val words = logData.flatMap(line => line.split(" "))

Take each line, split it on spaces, and give back the array of words:
"Here is a example of filter map" → (Here, is, a, example, of, filter, map)
Wordcount Example in Spark

val spark = new SparkContext(master, appName, [sparkHome], [jars])
val file = spark.textFile("hdfs://[input_path_to_textfile]")
val counts = file.flatMap(line => line.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://[output_path]")
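The same word-count pipeline can be checked locally with ordinary Scala collections, where `groupBy` followed by a per-key sum plays the role of `reduceByKey`. A sketch, no Spark required:

```scala
val text = List("to be or not to be")

// flatMap + map as in the Spark version; reduceByKey is emulated
// by grouping on the key and summing the ones
val counts = text
  .flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .groupBy(_._1)
  .map { case (word, pairs) => (word, pairs.map(_._2).sum) }

println(counts("to")) // 2
```

The difference in Spark is that `reduceByKey` combines values within each partition before shuffling, so far less data crosses the network than a naive group-then-sum would move.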
Limitations
• RDDs are not suitable for applications that require fine-grained updates
  – e.g. a web storage system
References

• http://www.slideshare.net/tpunder/a-brief-intro-to-scala
• "Scala in Depth" by Joshua D. Suereth
• Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, and Ion Stoica, "Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing", in Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation (NSDI'12), USENIX Association, Berkeley, CA, USA, 2012.
• Pictures:
  – http://www.xbitlabs.com/images/news/2011-04/hard_disk_drive.jpg
  – http://www.thecomputercoach.net/assets/images/256_MB_DDR_333_Cl2_5_Pc2700_RAM_Chip_Brand_New_Chip.jpg