Spark Summit EU 2015: Revolutionizing Big Data in the Enterprise with Spark

Revolutionizing Big Data in the Enterprise with Spark

Ion Stoica

October 28, 2015

We Have Seen a Lot

Worked with 100s of companies to run Spark in production over five years

Collaborate with all major Hadoop and Big Data vendors

How Does Spark Change Enterprise Big Data?

• Unifying data sources

• Unifying data processing

Unifying Data Sources

Need to process data from
• Multiple sources
• Different data stores and locations
• Different formats

Traditional solutions: ETL data into data warehouse, …

Traditional Data Warehouses

ETL

Slow to access and combine data

Data Warehouse

Just-In-Time (JIT) Data Warehouse

Process data in place or stream it
• No need to wait for data to be ETLed

JIT Data Warehouse

ETL

Data Warehouse

Process data in place or stream it
• No need to wait for data to be ETLed

Cache data in memory or SSDs

JIT Data Warehouse

Low latency and easy to combine data: value!
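
A minimal sketch of the just-in-time idea, written in plain Python rather than against any Spark API. All names here (`JitSource`, `spend_by_name`, the sample CSV and JSON strings) are hypothetical and only illustrate the pattern: each source is read in place on first use, kept in memory, and combined at query time instead of being ETLed into a warehouse first.

```python
import csv
import io
import json

# Hypothetical stand-ins for two sources in different formats.
# In practice these would be files, tables, or streams left where they live.
TRANSACTIONS_CSV = "user_id,amount\n1,30\n2,15\n1,5\n"
PROFILES_JSON = '[{"user_id": 1, "name": "Ana"}, {"user_id": 2, "name": "Bo"}]'


class JitSource:
    """Parses a source only when first queried, then serves it from memory."""

    def __init__(self, loader):
        self._loader = loader
        self._cache = None
        self.loads = 0  # how many times the underlying source was read

    def rows(self):
        if self._cache is None:
            self._cache = self._loader()
            self.loads += 1
        return self._cache


def load_transactions():
    reader = csv.DictReader(io.StringIO(TRANSACTIONS_CSV))
    return [{"user_id": int(r["user_id"]), "amount": int(r["amount"])} for r in reader]


def load_profiles():
    return json.loads(PROFILES_JSON)


def spend_by_name(transactions, profiles):
    """Joins the two sources on user_id and totals the amount per user name."""
    names = {p["user_id"]: p["name"] for p in profiles.rows()}
    totals = {}
    for t in transactions.rows():
        name = names[t["user_id"]]
        totals[name] = totals.get(name, 0) + t["amount"]
    return totals


transactions = JitSource(load_transactions)
profiles = JitSource(load_profiles)
first = spend_by_name(transactions, profiles)   # reads both sources in place
second = spend_by_name(transactions, profiles)  # served from the in-memory cache
print(first)
```

The second query never touches the underlying sources again, which is the low-latency half of the argument; the join across a CSV and a JSON source is the easy-to-combine half.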

Analogy

Stream/cache & Play

Download & Play

Analogy

ETL & Query: Data Source A, Data Source B → ETL → Data Warehouse

Stream/Cache + Query: Data Source A, Data Source B

Top-3 Media Company

Data sources
• Traditional data warehouse: Customer transaction and profile data
• S3: Clickstream and historical logs
• Elasticsearch: User-submitted reviews and comments
• Kafka: Streaming online event data

Build Spark-based JIT Data Warehouse to perform real-time analytics

Unifying Data Processing

Unified support for
• Batch
• Streaming
• ML/Graphs
• …

Spark: Unified Engine

GraphX MLlib

Core

Spark Streaming

Spark SQL SparkR

Easy to manage, learn, and combine functionality
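
One way to picture "combine functionality" is a single piece of logic reused for data at rest and for data arriving in small batches. The sketch below is plain Python, not Spark code; `count_by_type`, `micro_batches`, and the sample events are hypothetical names chosen for illustration.

```python
from collections import Counter
from itertools import islice


def count_by_type(events):
    """Shared logic: counts events per type for any iterable of events."""
    return Counter(e["type"] for e in events)


def micro_batches(stream, size):
    """Cuts a (possibly unbounded) iterator into lists of at most `size` items."""
    it = iter(stream)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


# Hypothetical sample data: a historical log and a live feed.
historical = [{"type": "click"}, {"type": "view"}, {"type": "click"}]
live = iter([{"type": "view"}, {"type": "click"}, {"type": "view"}])

# Batch: one pass over data at rest.
batch_counts = count_by_type(historical)

# Streaming: the same function applied to each micro-batch,
# folded into running totals that start from the batch result.
running = Counter(batch_counts)
for batch in micro_batches(live, size=2):
    running += count_by_type(batch)

print(dict(running))
```

Because both paths call the same function, there is only one implementation to learn and maintain, and the batch result can seed the streaming totals directly.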

Analogy

First cellular phones

Unified device (smartphone)

Specialized devices

Better Games, Better GPS, Better Phone

Analogy

Batch processing

Unified system

Specialized systems

Real-time analytics

Instant fraud detection

Better Apps

Large On-line Service Company

Leverages
• Interactive query processing
• ML

and combines data from S3, Redshift, and HBase to provide
• data analytics for product management team
• advanced predictive analytics to deliver new services (e.g., customized inventory displays tailored to each user)

Demo

Demo Setting

MLlib

Core

Spark Streaming

Spark SQL

HDFS Redshift
