big data processing with spark and scala

19
Big Data Processing With Spark and Scala http:// www.edureka.co/apache-spark-scala-training

Upload: edureka

Post on 10-Aug-2015

533 views

Category:

Technology


4 download

TRANSCRIPT

Page 1: Big Data Processing with Spark and Scala

Big Data Processing With Spark and Scala

http://www.edureka.co/apache-spark-scala-training

Page 2: Big Data Processing with Spark and Scala

Slide 2Slide 2 http://www.edureka.co/apache-spark-scala-training

What is Big Data?

What is Spark?

Why Spark?

Spark Ecosystem

A note about Scala

Why Scala?

MapReduce vs Spark

Hello Spark!

Objectives of this Session

Page 3: Big Data Processing with Spark and Scala

Slide 3Slide 3 http://www.edureka.co/apache-spark-scala-training

Big Data

Lots of Data (Terabytes or Petabytes)

Big data is the term for a collection of data setsso large and complex that it becomes difficult toprocess using on-hand database managementtools or traditional data processing applications

The challenges include capture, curation,storage, search, sharing, transfer, analysis, andvisualization

cloud

tools

statistics

No SQL

compression

storage

support

database

analyze

information

terabytes

processing

mobile

Big Data

Page 4: Big Data Processing with Spark and Scala

Slide 4Slide 4 http://www.edureka.co/apache-spark-scala-training

What is Spark?

Apache Spark is a general-purpose cluster in-memory computing system

Provides high-level APIs in Java, Scala and Python, and an optimized engine that supports general execution graphs

Provides various high level tools like Spark SQL for structured data processing, Mlib for Machine Learning and more..

High Level APIs

High Level Tools

More…

Page 5: Big Data Processing with Spark and Scala

Slide 5Slide 5 http://www.edureka.co/apache-spark-scala-training

Why Spark?

Cluster Manager

Deployment

via YARN

The Spark framework can be deployed through Apache Mesos, Apache Hadoop via Yarn, or Spark’s own cluster manager.

Page 6: Big Data Processing with Spark and Scala

Slide 6Slide 6 http://www.edureka.co/apache-spark-scala-training

Why Spark?

Polyglot Scala

Spark framework is polyglot – Can be programmed in several programming languages (Currently Scala, Java and Python supported).

Page 7: Big Data Processing with Spark and Scala

Slide 7Slide 7 http://www.edureka.co/apache-spark-scala-training

Why Spark?

A fully Apache Hive compatible data warehousing system that can run 100x faster than Hive.

100x faster than for certain applications.

Page 8: Big Data Processing with Spark and Scala

Slide 8Slide 8 http://www.edureka.co/apache-spark-scala-training

Why Spark?

Provides powerful caching and disk persistence capabilities

Interactive Data Analysis

Faster Batch

Iterative Algorithms

Real-Time Stream Processing

Faster Decision-Making

Page 9: Big Data Processing with Spark and Scala

Slide 9Slide 9 http://www.edureka.co/apache-spark-scala-training

Spark Community is Super Active!

Page 10: Big Data Processing with Spark and Scala

Slide 10Slide 10 http://www.edureka.co/apache-spark-scala-training

Spark Ecosystem

Spark Core Engine

Aplha/Pre-alpha

Shark (SQL)

SparkStreaming(Streaming)

MLLib(Machine learning)

GraphX(Graph

Computation)

SparkR(R on Spark)

BlindDB(Approximate

SQL)

Page 11: Big Data Processing with Spark and Scala

Slide 11Slide 11 http://www.edureka.co/apache-spark-scala-training

Spark Ecosystem (Contd.)

Used for structured data. Can run unmodified hive queries on existing Hadoop deployment.

Spark Core Engine

Aplha/Pre-alpha

Shark (SQL)

SparkStreaming(Streaming)

MLLib(Machine learning)

GraphX(Graph

Computation)

SparkR(R on Spark)

BlindDB(Approximate

SQL)

Enables analytical and interactive apps for live streaming data.

An approximate query engine. To run over Core Spark Engine.

Graph Computation engine.(Similar to Giraph)

Package for R language to enable R-users to leverage Spark power from R shell.

Machine learning library being built on top of Spark. Provision for support to many machine learning algorithms with speeds upto 100 times faster than Map-Reduce.

Page 12: Big Data Processing with Spark and Scala

Slide 12Slide 12 http://www.edureka.co/apache-spark-scala-training

A Note on Scala

Scala is a general-purpose programming language designed to express common programming patterns in a concise, elegant, and type-safe way

Scala supports both Object Oriented Programming and Functional Programming

Scala is very much in fabric of present and Future Big Data frameworks like Scalding, Spark, Akka

» All examples of Spark in class will be covered in Scala

» Scala would be covered before Spark coverage as part of course!

Page 13: Big Data Processing with Spark and Scala

Slide 13Slide 13 http://www.edureka.co/apache-spark-scala-training

Why Scala?

Scala is a pure object-oriented language. Conceptually, every value is an object and every operation is a method-call. The language supports advanced component architectures through classes and traits

Scala is also a functional language. Supports functions, immutable data structures and preference for immutability over mutation

Seamlessly integrated with Java

Being used heavily for future Big data and developments frameworks like Spark, Akka, Scalding, Play etc

Page 14: Big Data Processing with Spark and Scala

Slide 14Slide 14 http://www.edureka.co/apache-spark-scala-trainingSlide 14

If you want to do some Real Time Analytics, where you are expecting result quickly, Hadoop should not be used directly

Hadoop works on Batch processing, hence response time is high

Day 1 Day 2 Day 3 Day 4 ......... ………. ………. Day n

Day 1 Day 2 Day 3 Day 4 ......... ………. ………. Day n

InputData

ProcessingData

InputData

ProcessingData

InputData

ProcessingData

Input Data

Processing Data using MR

Time Lag

Real Time Analytics

Page 15: Big Data Processing with Spark and Scala

Slide 15Slide 15 http://www.edureka.co/apache-spark-scala-trainingSlide 15

Real Time Analytics – Accepted Way

Streaming Data

Storing

Page 16: Big Data Processing with Spark and Scala

Slide 16Slide 16 http://www.edureka.co/apache-spark-scala-trainingSlide 16

14 sec

0.6 sec

MapReduce vs Spark

Page 17: Big Data Processing with Spark and Scala

Slide 17 http://www.edureka.co/apache-spark-scala-training

Spark Demo!

Spark Demo!

Page 18: Big Data Processing with Spark and Scala

Slide 18 http://www.edureka.co/apache-spark-scala-training

Questions?

Page 19: Big Data Processing with Spark and Scala