Using Apache Spark
PRESENTER: SNEHA CHALLA
VENUE: GOOGLE, PITTSBURGH
DATE: AUGUST 25TH, 2015
What is Apache Spark?
Spark is an open-source computation engine that can run on top of the popular Hadoop Distributed File System (HDFS): a fast, general cluster computing engine for large-scale data processing.

Efficient:
• In-memory computing
• DAG execution engine
• Up to 10× faster than MapReduce on disk, 100× in memory

Usable:
• 2-5× less code
• Rich APIs in Java, Scala, Python
• Interactive shell
Why Spark? Runs Everywhere
• Standalone mode (private cluster)
• Apache Mesos – cluster manager
• Hadoop YARN
• Amazon EC2 (prepared deployment)
Spark Community
Spark was initially developed at UC Berkeley's AMPLab and is now used and developed at a wide variety of companies.
One of the most active open source projects in big data:
• More than 150 contributors in the past year
• 25+ companies contributing, and growing
Spark was designed both to make traditional MapReduce programming easier and to support new types of applications, with one of the earliest focus areas being machine learning. Spark can be used to build fast end-to-end machine learning workflows.
SPARK COMMUNITY
Elephant in the room: MapReduce
Why Apache Spark?
Spark vs. MapReduce
Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk.
Spark has an advanced DAG execution engine that supports cyclic data flow and in-memory computing.
Why Spark?
Spark Installation
• Extract the compressed folder spark-1.3.0-bin-hadoop2.4
• From a terminal, go to spark-1.3.0-bin-hadoop2.4/bin
• Run pyspark
• Run rdd = sc.parallelize([0, 1, 2]); rdd.map(lambda x: x*x).collect()
• Get the result [0, 1, 4]
• It’s that easy!
• Windows users might need to download and install winutils.exe for applications to run smoothly. Download winutils from http://www.srccodes.com/p/article/39/error-util-shell-failed-locate-winutils-binary-hadoop-binary-path and add it to $HADOOP_HOME/bin
• Download a bigger zip (1.9 GB) from http://bit.ly/1FpZAXH
Interactive Shell & Spark Context
Interactive shell:
• The fastest way to learn Spark
• Available in Python and Scala
• Runs as an application on an existing Spark cluster, or can run locally

Spark Context:
• Main entry point to Spark functionality
• Available in the shell as the variable sc
• In standalone programs, you create your own by constructing a SparkContext object (see the sketch below)
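The slide notes that standalone programs must create their own SparkContext. Here is a minimal sketch of what that looks like; the application name "MyApp" and the local[2] master URL are illustrative choices, not from the slides.

```python
from pyspark import SparkConf, SparkContext

# Build a configuration: app name and master URL are illustrative
conf = SparkConf().setAppName("MyApp").setMaster("local[2]")
sc = SparkContext(conf=conf)

rdd = sc.parallelize([1, 2, 3])
print(rdd.count())  # 3

sc.stop()  # release resources when the program is done
```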
Key Concepts – The RDD Model
RDD = Resilient Distributed Dataset
Programs are written in terms of transformations on these distributed datasets.
• Transformations convert one RDD into another; no actual calculation happens yet
• Actions force calculation of a result
• Lazy evaluation
RDDs – Motivation
RDDs are motivated by two types of applications that current data flow systems handle inefficiently:
1) Iterative algorithms - Common in graph applications and Machine Learning
2) Interactive Data Mining Tools
To achieve fault tolerance efficiently - RDDs provide a highly restricted form of shared memory: they are read-only datasets that can only be constructed through bulk operations on other RDDs.
However, RDDs are expressive enough to capture a wide class of computations, including MapReduce and specialized computations.
Spark Architecture
Basic abstraction: RDDs
• Immutability
• Laziness
• Transformations
Programming Model
Two types of operations on an RDD:
• transformations
• actions
Transformations are lazily evaluated – they are not executed when you issue the command. The result is computed only when an action is executed, as the sketch below shows.
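A small illustration of the lazy evaluation described above, assuming the shell variable sc is available. The map() call only records the transformation; nothing runs until collect() is called.

```python
# Transformation: builds up lineage, but does no work yet
nums = sc.parallelize([1, 2, 3, 4])
squares = nums.map(lambda x: x * x)

# Action: triggers the actual computation
print(squares.collect())  # [1, 4, 9, 16]
```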
Data Distribution / Partitioning
An RDD is a read-only collection of objects that is partitioned across a set of machines, as illustrated below.
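A quick sketch of how to observe partitioning from the shell (assumes sc exists). glom() gathers the elements of each partition into a list, making the partition boundaries visible; the exact split shown is typical of parallelize but not guaranteed.

```python
rdd = sc.parallelize(range(6), 3)   # ask for 3 partitions
print(rdd.getNumPartitions())       # 3
print(rdd.glom().collect())         # e.g. [[0, 1], [2, 3], [4, 5]]
```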
RDD Computational Model
• Operators on RDDs form a directed acyclic graph (DAG).
• If any partition on a dead worker is lost, it can be recomputed by retracing the operator DAG, as sketched below.
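One way to see the operator DAG (lineage) that Spark would retrace is toDebugString(); a minimal sketch, assuming sc is available:

```python
rdd = sc.parallelize(range(10)) \
        .map(lambda x: x * 2) \
        .filter(lambda x: x > 5)

# Prints the chain of transformations (the lineage) used for recomputation
print(rdd.toDebugString())
```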
First Step in Data Analysis: Create an RDD
Read data into an RDD from a text file on the local machine, S3, or HDFS.
Give life to an RDD using the SparkContext:

# Turn a Python collection into an RDD
> sc.parallelize([7, 8, 9])

# Load a text file from the local FS, HDFS, or S3
> sc.textFile("textfile.txt")
> sc.textFile("directory/*.txt")
> sc.textFile("hdfs://namenode:9000/path/file")
Transformations – RDD: map
Pass each element of an RDD through a function.
>>> rdd = sc.parallelize(range(1, 8))
>>> result_rdd = rdd.map(lambda x: x % 3)
>>> result_rdd.collect()
[1, 2, 0, 1, 2, 0, 1]
Transformations – RDD: reduce
(Strictly speaking, reduce is an action: it returns a value, not an RDD.)
>>> from operator import add
>>> rdd = sc.parallelize(range(1, 8))
>>> rdd.reduce(add)
28
RDD Transformations: reduceByKey
>>> rdd = sc.parallelize([('Alice', 23), ('Bob', 17), ('Alice', 27)])
>>> result_rdd = rdd.reduceByKey(add)   # requires: from operator import add
>>> result_rdd.collect()
[('Alice', 50), ('Bob', 17)]   # order may vary
Some more RDD Transformations
rdd.flatMap(f): Return a new RDD by first applying a function to all elements of this RDD, and then flattening the results.
>>> rdd = sc.parallelize([2, 3, 4])
>>> sorted(rdd.flatMap(lambda x: range(1, x)).collect())   # collect() is the action that triggers the transformation
[1, 1, 1, 2, 2, 3]

rdd.filter(f): Return a new RDD containing only the elements that satisfy a predicate.
>>> rdd = sc.parallelize([1, 2, 3, 4, 5])
>>> rdd.filter(lambda x: x % 2 == 0).collect()
[2, 4]
Some more RDD Transformations
sortBy(keyfunc, ascending=True, numPartitions=None): Sorts this RDD by the given keyfunc.
>>> data = [('a', 1), ('b', 2), ('1', 3), ('d', 4), ('2', 5)]
>>> rdd = sc.parallelize(data).sortBy(lambda x: x[0])

rdd.cache(): Cache the RDD in memory for repeated use.

countByKey(): Count the number of elements for each key.
>>> rdd = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
>>> rdd.countByKey().items()
[('a', 2), ('b', 1)]

join(other, numPartitions=None): Return an RDD containing all pairs of elements with matching keys in self and other, as sketched below.
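A short illustrative sketch of join() with made-up data (assumes sc exists); each matching key yields a pair of the two values.

```python
ages = sc.parallelize([("Alice", 23), ("Bob", 17)])
cities = sc.parallelize([("Alice", "Pittsburgh"), ("Bob", "Boston")])

print(ages.join(cities).collect())
# [('Alice', (23, 'Pittsburgh')), ('Bob', (17, 'Boston'))]  (order may vary)
```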
Setting the Level of Parallelism
All the pair-RDD operations take an optional second parameter for the number of tasks, as shown below.
> rdd.reduceByKey(lambda x, y: x + y, 5)
> rdd.groupByKey(5)
> rdd.join(pageViews, 5)
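A small check, under the assumption that sc is available, that the second argument really controls how many partitions (and hence tasks) the shuffled result has:

```python
pairs = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])

counts = pairs.reduceByKey(lambda x, y: x + y, 5)
print(counts.getNumPartitions())  # 5
print(counts.collect())           # [('a', 2), ('b', 1)]  (order may vary)
```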
Some RDD Actions
RDD transformations are lazily evaluated; actions kick off the computation on the transformations, e.g. collect(), glom(), etc.

rdd.collect(): Return the RDD content as a list.
rdd = sc.parallelize([1, 2, 3], 3)
rdd2 = rdd.map(lambda x: x*x)
rdd2.collect(): [1, 4, 9]

rdd.glom().collect(): Return one sub-list of elements per partition.
rdd = sc.parallelize([0, 1, 2], 3)
rdd2 = rdd.map(lambda x: x*x)
rdd2.glom().collect(): [[0], [1], [4]]

saveAsTextFile(path): Write the elements of the dataset as a text file (or set of text files) in a given directory in the local filesystem, HDFS, or any other Hadoop-supported file system.
take(n): Return a list with the first n elements of the dataset.
first(): Return the first element of the dataset; similar to take(1).
(A short sketch of these actions follows below.)
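A brief sketch of the actions just listed, assuming sc exists; "out_dir" is a hypothetical, writable output path.

```python
rdd = sc.parallelize([5, 3, 1, 4, 2])

print(rdd.collect())   # [5, 3, 1, 4, 2]
print(rdd.take(3))     # [5, 3, 1]
print(rdd.first())     # 5

rdd.saveAsTextFile("out_dir")  # writes one part-* file per partition
```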
In MapReduce you get only two operators – map and reduce – whereas Spark offers 80+ operations!
Spark automatically parallelizes workflows: a whole series of individual tasks is expressed as a single program flow that is lazily evaluated, so the system has a complete picture of the execution graph.
Word Count
from pyspark import SparkContext

logFile = "hdfs://localhost:9000/user/bigdatavm/input"
sc = SparkContext("spark://bigdata-vm:7077", "WordCount")

textFile = sc.textFile(logFile)
wordCounts = textFile.flatMap(lambda line: line.split()) \
                     .map(lambda word: (word, 1)) \
                     .reduceByKey(lambda a, b: a + b)
wordCounts.saveAsTextFile("hdfs://localhost:9000/user/bigdatavm/output")
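As a possible follow-up (not in the original slides), instead of writing to HDFS you can pull a small sample of the result back to the driver; takeOrdered() is an action, so it triggers the computation.

```python
# Ten most frequent words, sorted by descending count
top10 = wordCounts.takeOrdered(10, key=lambda pair: -pair[1])
print(top10)
```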
Fault Tolerance and Persistence
RDDs track lineage information that can be used to efficiently recompute lost data.

msgs = textFile.filter(lambda s: s.startswith("ERROR")) \
               .map(lambda s: s.split("\t")[2])

Spark will persist or cache RDD partitions in memory on each node during operations. You can mark an RDD to be persisted with persist() (optionally passing a storage level) or with cache() for the default in-memory level, as sketched below.
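A minimal sketch of persisting an RDD for reuse, assuming sc exists and "log.txt" is a hypothetical input file:

```python
from pyspark import StorageLevel

errors = sc.textFile("log.txt").filter(lambda line: line.startswith("ERROR"))
errors.persist(StorageLevel.MEMORY_AND_DISK)   # or errors.cache() for the default MEMORY_ONLY

print(errors.count())   # first action: reads the file and materializes the cache
print(errors.count())   # later actions reuse the cached partitions
```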
Spark Libraries
MLlib
Spark subproject providing machine learning primitives. Initial contribution from the AMPLab @ UC Berkeley.
Shipped with Spark since version 0.8; 35 contributors.
Highlights include:
• Basic statistics – summary statistics, correlation, and stratified sampling
• Hypothesis testing, random data generation
• Linear models (logistic and linear regression, SVMs)
• Naive Bayes and decision tree classifiers, ensembles of trees
• Collaborative filtering with ALS
• K-Means clustering and Gaussian mixture models
• Stochastic gradient descent
• SVD (singular value decomposition) and PCA
MLlib is Spark’s scalable machine learning library, consisting of common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, and dimensionality reduction, as well as underlying optimization primitives.
Running a Spark Application
Command: spark-submit <python_file_path>
Let’s see the implementation of
1) K-Means
2) Logistic Regression
K-Means

# Import the required pyspark functions
from pyspark.mllib.clustering import KMeans
from numpy import array
from math import sqrt
from pyspark import SparkContext

sc = SparkContext()

# Load the data (a raw string so the Windows backslashes are not treated as escapes)
data = sc.textFile(r"C:\Users\snehachalla\Downloads\spark-1.4.1-bin-hadoop2.4\spark-1.4.1-bin-hadoop2.4\bin\kmeans_data.txt")

# Parse each line into a numpy array of floats and cache the result
parsedData = data.map(lambda line: array([float(x) for x in line.split(' ')])).cache()
K-Means (cont.)

# Build the model (cluster the data into 2 clusters)
clusters = KMeans.train(parsedData, 2, maxIterations=10, runs=1, initializationMode="k-means||")

# Evaluate clustering by computing the sum of squared errors
def error(point):
    center = clusters.centers[clusters.predict(point)]
    return sqrt(sum([x**2 for x in (point - center)]))

cost = parsedData.map(lambda point: error(point)).reduce(lambda x, y: x + y)
print("Sum of squared error = " + str(cost))
Logistic Regression
from pyspark.mllib.classification import LogisticRegressionWithSGD
from numpy import array

# Load and parse the data (assumes a SparkContext `sc` is already available)
data = sc.textFile("mllib/data/sample_svm_data.txt")
parsedData = data.map(lambda line: array([float(x) for x in line.split(' ')]))

# Build the model (the first column is the label, the remaining columns are features)
model = LogisticRegressionWithSGD.train(parsedData)

# Evaluate the model on the training data
labelsAndPreds = parsedData.map(lambda point: (int(point.item(0)),
                                               model.predict(point.take(range(1, point.size)))))
# (Python 2 tuple-unpacking lambda, as in the original slide)
trainErr = labelsAndPreds.filter(lambda (v, p): v != p).count() / float(parsedData.count())
print("Training Error = " + str(trainErr))
Spark UI
Run Spark in local mode (pyspark).
The Spark UI is at http://localhost:4040.
There you can see RDD sizes and identify slow-running tasks.
MOVING TO A CLUSTER – EC2
Setting up an EMR Cluster
If your data is too large to compute on your local machine, then you’re in the right place. An easy way to get Spark running is with EC2.
Create an account on aws.amazon.com
Get a key pair from the AWS console (this is the security credential for your instance):
https://console.aws.amazon.com/ec2/v2/home?region=us-east-1#KeyPairs:sort=keyName
Create EMR instance and configure the nodes:
https://console.aws.amazon.com/console/home?region=us-east-1#
Launch EMR instance
https://console.aws.amazon.com/ec2/v2/home?region=us-east-1
EMR Cluster
https://console.aws.amazon.com/s3/home?region=us-east-1
Data can be uploaded to a bucket on Amazon S3.
For more info on Spark:
Website: http://spark.apache.org
Tutorials: http://ampcamp.berkeley.edu
Spark Summit: http://spark-summit.org
GitHub: https://github.com/apache/spark
Mailing lists: user@spark.apache.org, dev@spark.apache.org
Python API documentation: http://spark.apache.org/docs/latest/api/python/
Questions?
THANK YOU
References
http://www.eecs.berkeley.edu/Pubs/TechRpts/2011/EECS-2011-82.pdf
http://www.andrew.cmu.edu/user/amaurya/docs/spark_talk/presentation.pdf
http://www.slideshare.net/BenjaminBengfort/fast-data-analytics-with-spark-and-python