5 things one must know about spark!


Upload: edureka

Post on 15-Apr-2017

TRANSCRIPT

Page 1: 5 things one must know about spark!

www.edureka.co/apache-spark-scala-training

5 Things one must know about Spark!

What will you learn today?

Spark In-Memory Processing

Streaming Support

Machine Learning and Graph

Spark DataFrame API

Spark's Integration with Hadoop

Spark In-Memory Processing

Spark Cuts Down Read/Write I/O to Disk

Spark tries to keep data in the memory of its distributed workers, allowing for significantly faster, lower-latency computations, whereas MapReduce keeps shuffling data to and from disk.
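The effect can be sketched in plain Python. This is not Spark's API, and every name below (DiskBackedDataset, CachedDataset, iterate) is hypothetical. An iterative job that re-reads its input on every pass pays the disk cost each time, while one that keeps the data in memory pays it once.

```python
# Conceptual sketch only: plain Python, not Spark's API.
# All names here (DiskBackedDataset, CachedDataset, iterate) are hypothetical.

class DiskBackedDataset:
    """Simulates a dataset that is re-read from disk on every access."""

    def __init__(self, records):
        self._records = list(records)
        self.disk_reads = 0

    def read(self):
        self.disk_reads += 1          # each call stands for one full disk scan
        return list(self._records)


class CachedDataset:
    """Wraps a dataset and keeps the result of the first read in memory."""

    def __init__(self, source):
        self._source = source
        self._cache = None

    def read(self):
        if self._cache is None:       # only the first access touches "disk"
            self._cache = self._source.read()
        return self._cache


def iterate(dataset, passes):
    """Run several passes over the data, as an iterative algorithm would."""
    total = 0
    for _ in range(passes):
        total += sum(dataset.read())
    return total


uncached = DiskBackedDataset(range(5))
iterate(uncached, passes=3)

source = DiskBackedDataset(range(5))
iterate(CachedDataset(source), passes=3)

print(uncached.disk_reads, source.disk_reads)  # prints: 3 1
```

Both runs compute the same totals; only the number of simulated disk scans differs.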

Spark is blazingly Fast

Isn’t Spark In-Memory Only?

But I have heard Spark is good only for in-memory processing?

Spark: Best of Both Worlds

It’s a common misconception that Spark is only for in-memory processing. From its inception, Spark was designed to be a general execution engine that works both in memory and on disk. Almost all Spark operators perform external operations when data does not fit in memory.
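An "external" operation is one that spills to disk when its input is too large for memory. As a rough illustration, and not Spark's own implementation, here is an external sort in plain Python: sorted runs are written to temporary files and then merged lazily. The function name external_sort and the max_in_memory parameter are hypothetical.

```python
# Conceptual sketch only: plain Python, not Spark's implementation.
# `external_sort` and `max_in_memory` are hypothetical names.
import heapq
import os
import tempfile


def _read_run(path):
    """Stream integers back from one spilled, already-sorted run."""
    with open(path) as f:
        for line in f:
            yield int(line)


def external_sort(values, max_in_memory):
    """Sort integers while buffering at most `max_in_memory` of them
    at a time during the run-building phase."""
    run_paths = []
    buffer = []

    def spill():
        # Sort what fits in memory and write it out as one run.
        buffer.sort()
        fd, path = tempfile.mkstemp(suffix=".run")
        with os.fdopen(fd, "w") as f:
            f.writelines(f"{v}\n" for v in buffer)
        run_paths.append(path)
        buffer.clear()

    for v in values:
        buffer.append(v)
        if len(buffer) >= max_in_memory:
            spill()
    if buffer:
        spill()

    try:
        # heapq.merge reads one item per run at a time; the result is
        # collected into a list here only to keep the example short.
        return list(heapq.merge(*(_read_run(p) for p in run_paths)))
    finally:
        for p in run_paths:
            os.remove(p)


print(external_sort([5, 3, 9, 1, 7, 2, 8], max_in_memory=3))
# prints: [1, 2, 3, 5, 7, 8, 9]
```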

Streaming Support

Spark Streaming

Used for processing real-time streaming data.

It uses the DStream, which is a series of RDDs, to process continuous real-time data.

The Spark Streaming API closely matches that of Spark Core.
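The idea of a stream as a series of small batches can be sketched in plain Python. This is not the Spark Streaming API; micro_batches, batch_interval, and word_count are hypothetical names. Each batch plays the role of one RDD in a DStream, and the same batch function is applied to every one of them.

```python
# Conceptual sketch only: plain Python, not the Spark Streaming API.
# `micro_batches`, `batch_interval`, and `word_count` are hypothetical names.
from collections import Counter, defaultdict


def micro_batches(events, batch_interval):
    """Group (timestamp, value) events into fixed-width time batches.

    Returns a list of (batch_start_time, [values]) ordered by time.
    Intervals that received no events are simply omitted here.
    """
    batches = defaultdict(list)
    for timestamp, value in events:
        batch_start = (timestamp // batch_interval) * batch_interval
        batches[batch_start].append(value)
    return sorted(batches.items())


def word_count(batch):
    """An ordinary batch computation, reused unchanged for every micro-batch."""
    return dict(Counter(batch))


events = [(0, "a"), (1, "b"), (2, "a"), (5, "a"), (6, "c"), (11, "b")]
for start, batch in micro_batches(events, batch_interval=5):
    print(start, word_count(batch))
```

Because the per-batch function is plain batch code, the streaming program looks much like its batch counterpart, which is the point the slide makes about the two APIs.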

Machine Learning and Graph Implementation with DAG

Machine Learning

MLlib, a machine learning library

Classification

Regression

Clustering

Collaborative filtering

Some of the algorithms also work with streaming data, such as linear regression using ordinary least squares or k-means clustering
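To show how a clustering model can be updated as batches arrive, here is a plain-Python sketch of an online k-means update on one-dimensional points. It is not MLlib's streaming k-means; the class name OnlineKMeans1D and the running-mean update rule are assumptions made for illustration.

```python
# Conceptual sketch only: plain Python, not MLlib's streaming k-means.
# `OnlineKMeans1D` is a hypothetical name; it uses a plain running-mean
# update on one-dimensional points.

class OnlineKMeans1D:
    def __init__(self, initial_centers):
        # Initial centers act only as seeds for the first assignments.
        self.centers = list(initial_centers)
        self.counts = [0] * len(self.centers)

    def update(self, batch):
        """Fold one batch of points into the current centers."""
        for x in batch:
            # Assign the point to its nearest center.
            i = min(range(len(self.centers)),
                    key=lambda j: abs(x - self.centers[j]))
            # Move that center towards the point (incremental mean).
            self.counts[i] += 1
            self.centers[i] += (x - self.centers[i]) / self.counts[i]
        return self.centers


model = OnlineKMeans1D(initial_centers=[0.0, 10.0])
for batch in ([1.0, 9.0], [2.0, 11.0], [0.0, 10.0]):
    model.update(batch)
print(model.centers)
```

The model never revisits old batches; each new batch only nudges the current centers, which is what makes the approach usable on a stream.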

Cyclic Data Flows

All jobs in Spark comprise a series of operators and run on a set of data.

All the operators in a job are used to construct a DAG (Directed Acyclic Graph).

The DAG is optimized by rearranging and combining operators where possible.
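One way to picture "combining operators" is a lazy pipeline that only records its operators and then runs them together in a single pass. The sketch below is plain Python, not Spark's scheduler, and it handles only a linear chain of map and filter steps (the simplest possible DAG). LazyPipeline is a hypothetical name.

```python
# Conceptual sketch only: plain Python, not Spark's scheduler.
# `LazyPipeline` is a hypothetical name.

class LazyPipeline:
    """Records operators lazily, then runs them fused in a single pass."""

    def __init__(self, data, ops=()):
        self._data = data
        self._ops = tuple(ops)        # the recorded chain of operators

    def map(self, fn):
        return LazyPipeline(self._data, self._ops + (("map", fn),))

    def filter(self, pred):
        return LazyPipeline(self._data, self._ops + (("filter", pred),))

    def plan(self):
        """Names of the recorded operators, in order."""
        return [kind for kind, _ in self._ops]

    def collect(self):
        """Run all recorded operators element by element, without building
        an intermediate collection between them."""
        out = []
        for item in self._data:
            keep = True
            for kind, fn in self._ops:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):    # a filter rejected this element
                    keep = False
                    break
            if keep:
                out.append(item)
        return out


pipeline = (LazyPipeline(range(10))
            .map(lambda x: x * x)
            .filter(lambda x: x % 2 == 0)
            .map(lambda x: x + 1))
print(pipeline.plan())     # prints: ['map', 'filter', 'map']
print(pipeline.collect())  # prints: [1, 5, 17, 37, 65]
```

Nothing runs until collect() is called, so the whole chain is known before execution starts; that is what gives an engine room to rearrange or combine steps.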

GraphX

Graph Algorithms

PageRank

Connected Components

Triangle Counting

Component for graphs and graph-parallel computation

Extends the Spark RDD by introducing a new Graph abstraction
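As an example of the kind of graph algorithm listed above, here is a minimal iterative PageRank in plain Python. It is not the GraphX API; the function name pagerank, its parameters, and the sample edges are hypothetical, and the sketch assumes every vertex has at least one outgoing edge.

```python
# Conceptual sketch only: plain Python, not the GraphX API.
# `pagerank`, its parameters, and the sample edges are hypothetical.

def pagerank(edges, damping=0.85, iterations=20):
    """Iterative PageRank over a list of directed (src, dst) edges.

    Assumes every vertex has at least one outgoing edge, so no rank is lost.
    """
    vertices = sorted({v for edge in edges for v in edge})
    out_degree = {v: 0 for v in vertices}
    for src, _ in edges:
        out_degree[src] += 1

    n = len(vertices)
    ranks = {v: 1.0 / n for v in vertices}
    for _ in range(iterations):
        # Each vertex splits its current rank evenly across its out-edges.
        incoming = {v: 0.0 for v in vertices}
        for src, dst in edges:
            incoming[dst] += ranks[src] / out_degree[src]
        ranks = {v: (1 - damping) / n + damping * incoming[v]
                 for v in vertices}
    return ranks


edges = [("a", "b"), ("b", "c"), ("c", "a"), ("a", "c")]
ranks = pagerank(edges)
for vertex in sorted(ranks):
    print(vertex, round(ranks[vertex], 3))
```

Each iteration is a "send rank along edges, then sum per vertex" step, which is the graph-parallel pattern that a Graph abstraction on top of RDDs is meant to express.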

Support for DataFrames

DataFrame

Inspired by DataFrames in R and Python (Pandas).

The DataFrame API is designed to make big data processing on tabular data easier.

A DataFrame is a distributed collection of data organized into named columns.

Provides operations to filter, group, or compute aggregates, and can be used with Spark SQL.

Can be constructed from structured data files, existing RDDs, tables in Hive, or external databases.
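The operations named above (filter, group, aggregate) can be pictured on a small, single-machine table. The sketch below is plain Python over a list of dictionaries, not the Spark DataFrame API; the rows, column names, and helper functions are made up for illustration.

```python
# Conceptual sketch only: plain Python, not the Spark DataFrame API.
# The rows, column names, and helper functions are made up for illustration.
from collections import defaultdict

rows = [
    {"dept": "eng", "name": "Ann", "salary": 120},
    {"dept": "eng", "name": "Bob", "salary": 100},
    {"dept": "ops", "name": "Cid", "salary": 90},
    {"dept": "ops", "name": "Dee", "salary": 70},
]


def where(table, predicate):
    """Keep only the rows for which predicate(row) is true."""
    return [row for row in table if predicate(row)]


def group_by_avg(table, key, column):
    """Group rows by `key` and average `column` within each group."""
    groups = defaultdict(list)
    for row in table:
        groups[row[key]].append(row[column])
    return {k: sum(vals) / len(vals) for k, vals in groups.items()}


well_paid = where(rows, lambda r: r["salary"] >= 90)
print(group_by_avg(well_paid, key="dept", column="salary"))
# prints: {'eng': 110.0, 'ops': 90.0}
```

Because every row shares the same named columns, operations can refer to columns by name rather than by position, which is the property the DataFrame abstraction builds on.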

DataFrame features

Ability to scale from KBs to PBs

Support for a wide array of data formats and storage systems

State-of-the-art optimization and code generation through the Spark SQL Catalyst optimizer

Seamless integration with all big data tooling and infrastructure via Spark

APIs for Python, Java, Scala, and R

Spark’s Integration with Hadoop

Spark Execution Platforms

Spark can use YARN, the resource negotiator of the Hadoop framework.

Spark workloads can make use of Symphony scheduling policies and execute via YARN.

Spark execution modes

Standalone, Mesos, and YARN, typically with HDFS as the storage layer
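In practice, the cluster manager is selected by the master URL that an application is started with. The prefixes in the sketch below follow Spark's documented master URL formats, but the helper function cluster_manager and the host names are hypothetical.

```python
# Conceptual sketch only. The master URL prefixes follow Spark's documented
# formats; `cluster_manager` and the host names are hypothetical.

def cluster_manager(master):
    """Name the cluster manager that a Spark master URL selects."""
    if master.startswith("local"):
        return "local (single machine, no cluster manager)"
    if master.startswith("spark://"):
        return "standalone"
    if master.startswith("mesos://"):
        return "mesos"
    if master == "yarn":
        return "yarn"
    raise ValueError(f"unrecognized master URL: {master!r}")


for master in ("local[*]", "spark://master-host:7077",
               "mesos://mesos-host:5050", "yarn"):
    print(master, "->", cluster_manager(master))
```

Note that the YARN form carries no host; the cluster location is taken from the Hadoop configuration instead.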

Spark in one Snapshot

Spark Use Cases

Companies are using Spark to solve a variety of problems, such as recommendation systems, business intelligence, and fraud detection.

Who is using Spark?

A complete list of companies using Spark can be found here: https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark

References

IBM backs Apache Spark for Big Data Analytics: http://www.forbes.com/sites/paulmiller/2015/06/15/ibm-backs-apache-spark-for-big-data-analytics/

Why Cloudera is saying 'Goodbye, MapReduce' and 'Hello, Spark': http://fortune.com/2015/09/09/cloudera-spark-mapreduce/

5 reasons to turn to Spark for Big Data Analytics: http://www.infoworld.com/article/2897287/big-data/5-reasons-to-turn-to-spark-for-big-data-analytics.html

Spark sets a new record for large-scale sorting: https://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html

How eBay uses Spark to ignite Data Analytics: http://www.ebaytechblog.com/2014/05/28/using-spark-to-ignite-data-analytics/

Spark is fast on disk too: https://gigaom.com/2014/10/10/databricks-demolishes-big-data-benchmark-to-prove-spark-is-fast-on-disk-too/

Thank You …

Questions/Queries/Feedback
