Apache Spark: killer or savior of Apache Hadoop?

Roman Shaposhnik, Director of Open Source @Pivotal

(Twitter: @rhatr)

Who’s this guy?

•  Director of Open Source (building a team of OS contributors)

•  Apache Software Foundation guy (Member, VP of Apache Incubator, committer on Hadoop, Giraph, Sqoop, etc.)

•  Used to be root@Cloudera

•  Used to be PHB@Yahoo! (original Hadoop team)

•  Used to be a hacker at Sun Microsystems (Sun Studio compilers and tools)

Shameless plug

http://manning.com/martella

Dearly beloved…

40 minutes to figure out

Hadoop vs. Spark

Hadoop++ == Spark

Hadoop + Spark

Long, long time ago…

ASF Projects FLOSS Projects Pivotal Products

MapReduce

In a blink of an eye

Sqoop Flume

Coordination and workflow management

Zookeeper

Command Center

GemFire XD

MapReduce

Giraph

Hadoop UI

SolrCloud

Phoenix

Crunch Mahout

Streaming

GraphX

Impala

SpringXD

MADlib

PivotalR

Tachyon

A Spark view?

Sqoop Flume

management

Zookeeper

Command Center

GemFire XD

Hadoop UI

SolrCloud

Phoenix

HBase Spark

Streaming

GraphX

SpringXD

Tachyon

Principle #1

HDFS is the datalake

Your datacenter

server 1

server N

Hadoop’s view

MapReduce

server 1

server N

HDFS: decoupled storage

… MR

Anatomy of MapReduce

HDFS → mappers → reducers → HDFS

Mapper output: (a 1, b 1, c 1) and (a 1, c 1, a 1)

Reducer input: a [1 1 1], b [1], c [1 1]

Reducer output: a 3, b 1, c 2
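
The word-count flow pictured on this slide can be sketched in plain Scala. This is only an illustration on local collections, not Hadoop code: the object name `WordCountPhases` and its three helper functions are hypothetical, and the in-memory `Seq` values stand in for HDFS splits.

```scala
object WordCountPhases {
  // Map phase: each input split is turned into (word, 1) pairs independently.
  def mapPhase(split: Seq[String]): Seq[(String, Int)] =
    split.map(word => (word, 1))

  // Shuffle phase: group the emitted values by key across all mapper outputs.
  def shufflePhase(pairs: Seq[(String, Int)]): Map[String, Seq[Int]] =
    pairs.groupBy(_._1).map { case (word, kvs) => (word, kvs.map(_._2)) }

  // Reduce phase: collapse each key's list of values into a single count.
  def reducePhase(grouped: Map[String, Seq[Int]]): Map[String, Int] =
    grouped.map { case (word, ones) => (word, ones.sum) }

  // The whole job: map every split, shuffle, then reduce.
  def wordCount(splits: Seq[Seq[String]]): Map[String, Int] =
    reducePhase(shufflePhase(splits.flatMap(mapPhase)))
}
```

Calling `WordCountPhases.wordCount(Seq(Seq("a", "b", "c"), Seq("a", "c", "a")))` reproduces the slide's result of a 3, b 1, c 2. In real MapReduce every phase boundary goes through disk or the network, which is where the cost discussed later in the talk comes from.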

Principle #2

MR is assembly language

MapReduce 1.0

Job Tracker

Task Tracker (HDFS)

task1 task1 task1 task1 task1

task1 task1 task1 task1 taskN

YARN (AKA MR2.0)

Resource Manager

Job Tracker

task1 task1 task1 task1 task1 Task Tracker

Principle #3

MR: YARN + library

What’s wrong with MR?

Source: UC Berkeley Spark project (just the image)

Principle #4

$ grep -R | awk | sort …

Spark philosophy

• Make life easy for Data Scientists

• Provide well documented and expressive APIs

• Powerful Domain Specific Libraries

• Easy integration with storage systems

• Caching to avoid data movement

• Well defined releases, stable API

Spark innovations

• Resilient Distributed Datasets (RDDs)

• Distributed on a cluster

• Manipulated via parallel operators (map, etc.)

• Automatically rebuilt on failure

• A parallel ecosystem

• A solution to iterative and multi-stage apps

warnings = textFile(…).filter(_.contains("warning")).map(_.split(' ')(1))

HadoopRDD path = hdfs://

FilteredRDD contains…

MappedRDD split…
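
The chain above (a source, then a filter, then a map) is what lets a lost partition be rebuilt: each dataset records its parent and the function applied to it, not the computed data. A minimal sketch of that idea in plain Scala follows. It is a toy model, not Spark's implementation; `LineageSketch`, `Dataset`, `Source`, `Filtered` and `Mapped` are hypothetical names chosen to mirror the slide.

```scala
object LineageSketch {
  // A dataset knows how to compute itself from its parent; nothing is stored eagerly.
  sealed trait Dataset[A] {
    def compute(): Seq[A]
    def lineage: List[String]
    def filter(p: A => Boolean): Dataset[A] = Filtered(this, p)
    def map[B](f: A => B): Dataset[B] = Mapped(this, f)
  }

  // Root of the lineage: stands in for reading lines from storage.
  final case class Source(lines: () => Seq[String]) extends Dataset[String] {
    def compute(): Seq[String] = lines()
    def lineage: List[String] = List("Source")
  }

  // Keeps only the parent's elements that satisfy the predicate.
  final case class Filtered[A](parent: Dataset[A], p: A => Boolean) extends Dataset[A] {
    def compute(): Seq[A] = parent.compute().filter(p)
    def lineage: List[String] = parent.lineage :+ "Filtered"
  }

  // Applies a function to every element of the parent.
  final case class Mapped[A, B](parent: Dataset[A], f: A => B) extends Dataset[B] {
    def compute(): Seq[B] = parent.compute().map(f)
    def lineage: List[String] = parent.lineage :+ "Mapped"
  }
}
```

With `val log = Seq("INFO boot ok", "warning disk low", "warning cpu hot")`, the expression `LineageSketch.Source(() => log).filter(_.contains("warning")).map(_.split(' ')(1))` has the lineage Source, Filtered, Mapped, and every call to `compute()` replays that chain from the source. Replaying the chain is the recovery path in this sketch: nothing has to be replicated up front.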

Parallel operators

• map, reduce

• sample, filter

• groupBy, reduceByKey

• join, leftOuterJoin, rightOuterJoin

• union, cross
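
To make the key-based operators in this list concrete, here is a sketch of what `reduceByKey`, `join` and `leftOuterJoin` compute, written against local Scala collections. It assumes nothing about how Spark distributes the work; the `PairOps` object and its signatures are illustrative only and are not the RDD API.

```scala
object PairOps {
  // reduceByKey: merge all values that share a key using an associative function.
  def reduceByKey[K, V](pairs: Seq[(K, V)])(f: (V, V) => V): Map[K, V] =
    pairs.groupBy(_._1).map { case (k, kvs) => (k, kvs.map(_._2).reduce(f)) }

  // join: one output row per matching (left, right) combination of a key.
  def join[K, V, W](left: Seq[(K, V)], right: Seq[(K, W)]): Seq[(K, (V, W))] =
    for ((k, v) <- left; (k2, w) <- right if k == k2) yield (k, (v, w))

  // leftOuterJoin: keep every left row; the right side is None when no key matches.
  def leftOuterJoin[K, V, W](left: Seq[(K, V)], right: Seq[(K, W)]): Seq[(K, (V, Option[W]))] =
    left.flatMap { case (k, v) =>
      val matches: Seq[Option[W]] = right.filter(_._1 == k).map(kv => Some(kv._2))
      val padded = if (matches.isEmpty) Seq(Option.empty[W]) else matches
      padded.map(w => (k, (v, w)))
    }
}
```

For example, with `val sales = Seq(("a", 1), ("b", 2), ("a", 3))`, the call `PairOps.reduceByKey(sales)(_ + _)` sums the values per key, and `PairOps.leftOuterJoin(sales, Seq(("a", "apple")))` keeps the `"b"` row with `None` on the right.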

How do I use it?

// here spark stands for a SparkContext
val file = spark.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")

Principle #5

Memory is the new disk

RDDs are the foundation

• SQL

• Graph

• ML

• Streaming

Spark SQL

• Library in Spark Core that models RDDs as relations

• SchemaRDD

• Replaces Shark

• Lightweight with no code from Hive

• Import/Export into different storage formats

• Columnar storage (as in Shark)

Spark Streaming

• Extend Spark to do large scale stream processing

• Simple, batch like API with RDDs

• Single semantics for both real time and high latency

D-Streams

Streaming from Twitter

TwitterUtils.createStream(...)
            .filter(_.getText.contains("Spark"))
            .countByWindow(Seconds(5))
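
The D-Stream idea behind this snippet is that a stream is treated as a sequence of small batches, so a windowed count is just a count over the last few batches. The sketch below models that on local collections. It is an assumption-laden illustration, not the Spark Streaming API: `WindowCountSketch` is a hypothetical name and the window is measured in batches instead of seconds.

```scala
object WindowCountSketch {
  // Each inner Seq is one micro-batch of records (for example, one second of tweets).
  // For every batch boundary, return how many records fell in the last `window` batches.
  def countByWindow[A](batches: Seq[Seq[A]], window: Int): Seq[Int] =
    batches.indices.map { i =>
      val from = math.max(0, i - window + 1) // the window is shorter at the start of the stream
      batches.slice(from, i + 1).map(_.size).sum
    }
}
```

Filtering each batch first (for instance `batches.map(_.filter(_.contains("Spark")))`) and then calling `countByWindow` mirrors the filter-then-count shape of the Twitter example, using the same batch-style operators as the non-streaming code.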

Spark GraphX

• Pregel (BSP) (formerly known as Bagel)

• Graph-centric modeling

• Unification of processing

• No more MR trickery
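
As a rough picture of the Pregel (BSP) model named above: computation proceeds in supersteps, where active vertices send messages along their edges, a barrier collects the messages, and each vertex then decides whether to update and stay active. The sketch below propagates the maximum vertex value through a graph. It is plain Scala and not the GraphX API; `PregelSketch` and `propagateMax` are hypothetical names, and it assumes every vertex appears as a key in `initial`.

```scala
object PregelSketch {
  // `edges` maps a vertex to its out-neighbours; `initial` holds each vertex's starting value.
  def propagateMax(edges: Map[Int, Seq[Int]], initial: Map[Int, Int]): Map[Int, Int] = {
    @annotation.tailrec
    def superstep(values: Map[Int, Int], active: Set[Int]): Map[Int, Int] =
      if (active.isEmpty) values // every vertex has voted to halt
      else {
        // 1. Active vertices send their current value along each outgoing edge.
        val messages: Seq[(Int, Int)] =
          active.toSeq.flatMap(v => edges.getOrElse(v, Seq.empty).map(dst => (dst, values(v))))
        // 2. Barrier: messages are combined per destination before any vertex computes.
        val inbox: Map[Int, Int] =
          messages.groupBy(_._1).map { case (dst, ms) => (dst, ms.map(_._2).max) }
        // 3. A vertex updates, and stays active, only if it received a larger value.
        val updated = inbox.filter { case (dst, m) => m > values(dst) }
        superstep(values ++ updated, updated.keySet)
      }
    superstep(initial, initial.keySet)
  }
}
```

On the three-vertex cycle `Map(1 -> Seq(2), 2 -> Seq(3), 3 -> Seq(1))` with starting values `Map(1 -> 5, 2 -> 1, 3 -> 9)`, every vertex ends up holding 9 once no vertex receives a larger value. Expressing the same loop as chained MapReduce jobs would mean one job per superstep, which is the "MR trickery" the slide refers to.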

You killed Apache Giraph?

MLbase

• Machine Learning toolset

• MATLAB for scale-out computing

• Built on Spark MLlib

• Classification, Regression, Collaborative Filtering, etc.

What is really happening?

Sqoop Flume

management

Zookeeper

Command Center

GemFire XD

MapReduce

Giraph

Hadoop UI

SolrCloud

Phoenix

Crunch Mahout

Streaming

GraphX

Impala

SpringXD

MADlib

PivotalR

Tachyon

Principle #6

Spark: the ecosystem

Maybe it’s not so bad

server 1

server N

But HDFS/YARN are safe?

HDFS, Ceph, S3, NAS, etc.

New HDFS

New YARN

What is *really* going on?

• 2009 Research at UCB, written in Scala

• 2010 Open Sourced

• 2013 Accepted into Apache Incubator

• 2013 Databricks formed ($14M funding)

• 2014 Becomes TLP with ASF

• 2014 Spark 1.0 is out

• 2014 Databricks gets an extra $33M

Bigdata: brought to U by ASF

• >50% ML traffic

• 100-200 contributors across 25-35 companies

• More active than Hadoop

• Cross-pollination with other TLPs

Principle #7

Where Hadoop was in ’09

This is how hardening looks

What is Hadoop?

Hadoop != MR + HDFS

The ecosystem

• Apache HBase

• Apache Crunch, Pig, Hive and Phoenix

• Apache Giraph

• Apache Oozie

• Apache Mahout

• Apache Sqoop and Flume

Principle #8

Spark: an alternative backend

Spark is best for cloud

Principle #9

Memory is expensive

What’s new?

• True elasticity

• Resource partitioning

• Security

• Data marketplace

• Multi datacenter deployments

Hadoop Maturity

ETL Offload: Accommodate massive data growth with existing EDW investments

Data Lakes: Unify Unstructured and Structured Data Access

Big Data Apps: Build analytic-led applications impacting top line revenue

Data-Driven Enterprise

App Dev and Operational Management on HDFS

Data Architecture

Pivotal HD on Pivotal CF

• Enterprise PaaS Management System

• Flexible multi-language ‘buildpack’ architecture

• Deployed applications enjoy built-in services

• On-Premise Hadoop as a Service

• Single cluster deployment of Pivotal HD

• Developers instantly bind to shared Hadoop Clusters

• Speeds up time-to-value

Pivotal’s view

Data Science Platform

Tachyon/Gem

Cluster Manager

Application

Stream Server

MPP SQL

Data Lake / HDFS / Virtual Storage

GemFireXD

...ETC

Hadoop HDFS Isilon

App Dev / Ops

MLbase Streaming

Legacy Systems

Legacy

Data Scientists Data Sources End Users

SparkSQL

Principle #10

The rumors of my death…

It will be called Hadoop

Sqoop Flume

management

Zookeeper

Command Center

GemFire with Tachyon

MapReduce

Giraph

Hadoop UI

SolrCloud

Phoenix

Crunch Mahout

Streaming

GraphX

Impala

SpringXD

MADlib

PivotalR

Spark recap

• Is it “Big Data”? (Yes)

• Is it “Hadoop”? (No)

• It’s one of those “in memory” things, right? (Yes)

• JVM, Java, Scala (All)

• Is it real or just another shiny technology with a long, but ultimately small tail? (Yes and ?)

A NEW PLATFORM FOR A NEW ERA

Credits

• Wikipedia and Dilbert.com

• Apache Software Foundation

• Scott Deeg

• Milind Bhandarkar

• Susheel Kaushik

• Mak Gokhale

Questions?
