5 Things one must know about Spark!
www.edureka.co/apache-spark-scala-training
What will you learn today?
Spark In-Memory Processing
Streaming Support
Machine Learning and Graph
Spark DataFrame API
Spark's Integration with Hadoop
Spark In-Memory Processing
Spark Cuts Down Read/Write I/O to Disk
Spark tries to keep data in the memory of its distributed workers, which allows significantly faster, lower-latency computation, whereas MapReduce keeps shuffling data to and from disk.
Spark is blazingly Fast
Isn’t Spark In-Memory Only?
But I have heard Spark is good only for in-memory processing?
Spark: Best of Both Worlds
It’s a common misconception that Spark is only for in-memory processing. From its inception, Spark was designed to be a general execution engine that works both in-memory and on-disk. Almost all Spark operators perform external (disk-backed) operations when data does not fit in memory.
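What an "external operation" means can be sketched in plain Python. This is a toy external sort, not Spark's implementation: `external_sort`, `max_in_memory`, and the temp-file layout are hypothetical names chosen for illustration. When the input fits in the buffer no disk I/O happens; otherwise sorted runs are spilled to disk and merged.

```python
import heapq
import os
import tempfile


def external_sort(values, max_in_memory):
    """Sort ints, buffering at most max_in_memory items before spilling to disk.

    Returns (sorted_list, number_of_runs_spilled). The merged result is
    collected into a list only to keep the demo short; a real engine would
    stream it to the next operator.
    """
    run_paths = []
    buffer = []

    def spill():
        # Sort the in-memory buffer and write it out as one sorted run.
        buffer.sort()
        fd, path = tempfile.mkstemp(suffix=".run")
        with os.fdopen(fd, "w") as f:
            f.writelines(f"{v}\n" for v in buffer)
        run_paths.append(path)
        buffer.clear()

    for v in values:
        buffer.append(v)
        if len(buffer) >= max_in_memory:
            spill()

    if not run_paths:
        # Everything fit in memory: plain in-memory sort, no disk I/O.
        return sorted(buffer), 0

    if buffer:
        spill()

    def read_run(path):
        with open(path) as f:
            for line in f:
                yield int(line)

    try:
        # heapq.merge lazily merges the already-sorted runs.
        merged = list(heapq.merge(*(read_run(p) for p in run_paths)))
    finally:
        for p in run_paths:
            os.remove(p)
    return merged, len(run_paths)


data = [5, 3, 9, 1, 7, 2, 8]
spilled_result, spilled_runs = external_sort(data, max_in_memory=3)
in_memory_result, in_memory_runs = external_sort(data, max_in_memory=100)
```

Both calls return the same sorted data; only the second avoids disk entirely, which is the "best of both worlds" behaviour described above.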
Streaming Support
Spark Streaming
Used for processing real-time streaming data.
It uses the DStream, which is a series of RDDs, to process continuous real-time data.
The Spark Streaming API closely matches that of Spark Core.
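The "series of RDDs" idea can be illustrated with a small pure-Python model of micro-batching. This is a sketch under stated assumptions, not the Spark Streaming API: `to_micro_batches`, `word_count`, and the sample events are hypothetical. A stream of timestamped events is cut into fixed-width batches, and the same batch function is applied to each one.

```python
from collections import defaultdict


def to_micro_batches(events, batch_interval):
    """Group (timestamp, value) events into consecutive fixed-width batches."""
    batches = defaultdict(list)
    for timestamp, value in events:
        batches[int(timestamp // batch_interval)].append(value)
    if not batches:
        return []
    first, last = min(batches), max(batches)
    # Emit empty batches too, so every interval yields exactly one batch.
    return [batches.get(i, []) for i in range(first, last + 1)]


def word_count(batch):
    """Ordinary batch logic; it knows nothing about streaming."""
    counts = {}
    for word in batch:
        counts[word] = counts.get(word, 0) + 1
    return counts


events = [(0.2, "spark"), (0.9, "hadoop"), (1.1, "spark"),
          (1.5, "spark"), (3.4, "yarn")]
per_batch_counts = [word_count(b)
                    for b in to_micro_batches(events, batch_interval=1.0)]
```

Because `word_count` is plain batch code, the same function could run on a static dataset, which mirrors why the streaming API can stay close to the core one.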
Machine Learning and Graph Implementation with DAG
Machine Learning
MLlib, a machine learning library
Classification
Regression
Clustering
Collaborative filtering
Some of the algorithms also work with streaming data, such as linear regression using ordinary least squares, or k-means clustering.
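How a clustering algorithm can consume data batch by batch is sketched below in plain Python. It uses a simple running-mean update on 1-D points and is an assumption-laden illustration, not MLlib's code: `update_centroids`, the initial centroids, and the sample batches are all hypothetical.

```python
def update_centroids(centroids, counts, batch):
    """Apply one mini-batch update to 1-D centroids, in place.

    counts[i] is the number of points centroid i has absorbed so far,
    so each centroid stays the running mean of everything assigned to it.
    """
    sums = [0.0] * len(centroids)
    sizes = [0] * len(centroids)
    for x in batch:
        # Assign the point to its nearest centroid.
        nearest = min(range(len(centroids)),
                      key=lambda i: abs(x - centroids[i]))
        sums[nearest] += x
        sizes[nearest] += 1
    for i, n in enumerate(sizes):
        if n:
            total = counts[i] + n
            centroids[i] = (centroids[i] * counts[i] + sums[i]) / total
            counts[i] = total
    return centroids, counts


# Initial guesses are treated as one observation each (an assumption).
centroids = [0.0, 10.0]
counts = [1, 1]
for batch in ([1.0, 2.0, 9.0], [0.0, 11.0, 12.0], [1.0, 10.0]):
    update_centroids(centroids, counts, batch)
```

The model is never retrained from scratch: each arriving batch only nudges the existing centroids, which is what makes the approach usable on a stream.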
Cyclic Data Flows
All jobs in Spark comprise a series of operators and run on a set of data.
All the operators in a job are used to construct a DAG (Directed Acyclic Graph).
The DAG is optimized by rearranging and combining operators where possible.
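"Combining operators where possible" can be pictured with a toy pipeline in plain Python. This is a sketch, not Spark's scheduler: the `("map", fn)` / `("filter", fn)` operator encoding and the functions `run_unfused`, `fuse`, and `run_fused` are hypothetical. Running each operator separately makes one pass over the data per operator; fusing the chain makes a single pass with no intermediate lists.

```python
def run_unfused(ops, data):
    """Run each operator as its own pass, materialising every intermediate."""
    passes = 0
    for kind, fn in ops:
        if kind == "map":
            data = [fn(x) for x in data]
        elif kind == "filter":
            data = [x for x in data if fn(x)]
        else:
            raise ValueError(f"unknown operator: {kind}")
        passes += 1
    return data, passes


def fuse(ops):
    """Combine a chain of map/filter operators into one per-element function."""
    def fused(x):
        for kind, fn in ops:
            if kind == "map":
                x = fn(x)
            elif kind == "filter":
                if not fn(x):
                    return False, None  # element dropped by a filter
            else:
                raise ValueError(f"unknown operator: {kind}")
        return True, x
    return fused


def run_fused(ops, data):
    """Single pass: every element flows through the whole fused chain."""
    fused = fuse(ops)
    out = []
    for x in data:
        keep, value = fused(x)
        if keep:
            out.append(value)
    return out, 1


ops = [("map", lambda x: x * 2),
       ("filter", lambda x: x % 3 == 0),
       ("map", lambda x: x + 1)]
data = list(range(1, 7))
unfused_result = run_unfused(ops, data)
fused_result = run_fused(ops, data)
```

Both strategies produce the same output; the fused version simply does it in one pass instead of three.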
GraphX
Graph Algorithms
PageRank
Connected Components
Triangle Counting
Component for graphs and graph-parallel computation
Extends the Spark RDD by introducing a new Graph abstraction
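As an illustration of one of the listed algorithms, connected components can be sketched in plain Python by repeatedly propagating the smallest vertex id along edges. This is a single-machine toy, not GraphX code: `connected_components` and the sample graph are hypothetical.

```python
def connected_components(vertices, edges):
    """Label every vertex with the smallest vertex id in its component."""
    labels = {v: v for v in vertices}
    changed = True
    while changed:
        changed = False
        for src, dst in edges:
            # Edges are treated as undirected: both ends take the lower label.
            low = min(labels[src], labels[dst])
            for v in (src, dst):
                if labels[v] != low:
                    labels[v] = low
                    changed = True
    return labels


vertices = [1, 2, 3, 4, 5, 6]
edges = [(2, 3), (1, 2), (4, 5)]
labels = connected_components(vertices, edges)
```

Labels only ever decrease, so the loop terminates once no edge can lower either endpoint; vertex 6 has no edges and keeps its own id.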
Support for DataFrames
DataFrame
Inspired by DataFrames in R and Python (Pandas).
The DataFrame API is designed to make big data processing on tabular data easier.
A DataFrame is a distributed collection of data organized into named columns.
Provides operations to filter, group, or compute aggregates, and can be used with Spark SQL.
Can be constructed from structured data files, existing RDDs, tables in Hive, or external databases.
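What "filter, group, or compute aggregates" over named columns amounts to can be shown with a small pure-Python model. This is not the DataFrame API: the rows, column names, and the helpers `filter_rows` and `group_avg` are hypothetical, and nothing here is distributed. In PySpark, a comparable query would be written along the lines of `df.filter(...).groupBy("dept").avg("salary")`.

```python
from collections import defaultdict

# Each row is a record with named columns, like a row of a table.
rows = [
    {"name": "Ann", "dept": "eng", "salary": 120},
    {"name": "Bob", "dept": "eng", "salary": 100},
    {"name": "Cid", "dept": "ops", "salary": 80},
    {"name": "Dee", "dept": "ops", "salary": 40},
]


def filter_rows(rows, predicate):
    """Keep only the rows for which predicate(row) is true."""
    return [r for r in rows if predicate(r)]


def group_avg(rows, key, column):
    """Group rows by `key` and average `column` within each group."""
    totals = defaultdict(lambda: [0, 0])  # key -> [sum, count]
    for r in rows:
        acc = totals[r[key]]
        acc[0] += r[column]
        acc[1] += 1
    return {k: s / n for k, (s, n) in totals.items()}


well_paid = filter_rows(rows, lambda r: r["salary"] >= 80)
avg_by_dept = group_avg(well_paid, "dept", "salary")
```

Referring to columns by name rather than by position is what lets the query read like a description of the result.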
DataFrame features
Ability to scale from KBs to PBs
Support for a wide array of data formats and storage systems
State-of-the-art optimization and code generation through the Spark SQL Catalyst optimizer
Seamless integration with all big data tooling and infrastructure via Spark
APIs for Python, Java, Scala, and R
Spark’s Integration with Hadoop
Spark Execution Platforms
Spark can leverage YARN, the resource negotiator of the Hadoop framework.
Spark workloads can make use of Symphony scheduling policies and execute via YARN.
Spark execution modes
Standalone, Mesos, and YARN, with HDFS as a common storage layer
Spark in one Snapshot
Spark Use Cases
Companies use Spark to solve a variety of problems, e.g. recommendation systems, business intelligence, and fraud detection.
Who is using Spark?
A complete list of companies using Spark can be found here: https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark
References
IBM backs Apache Spark for Big Data Analytics:
http://www.forbes.com/sites/paulmiller/2015/06/15/ibm-backs-apache-spark-for-big-data-analytics/
Why Cloudera is saying 'Goodbye, MapReduce' and 'Hello, Spark':
http://fortune.com/2015/09/09/cloudera-spark-mapreduce/
5 reasons to turn to Spark for Big Data Analytics:
http://www.infoworld.com/article/2897287/big-data/5-reasons-to-turn-to-spark-for-big-data-analytics.html
Spark sets a new record for large-scale sorting:
https://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html
How eBay uses Spark to ignite Data Analytics:
http://www.ebaytechblog.com/2014/05/28/using-spark-to-ignite-data-analytics/
Spark is fast on disk too:
https://gigaom.com/2014/10/10/databricks-demolishes-big-data-benchmark-to-prove-spark-is-fast-on-disk-too/
Thank You …
Questions/Queries/Feedback
Recording and presentation will be made available to you within 24 hours