what's new in spark 2?

28
What’s New in Spark 2? Eyal Ben Ivri

Upload: eyal-ben-ivri

Post on 06-Jan-2017

101 views

Category:

Technology


0 download

TRANSCRIPT

Page 1: What's New in Spark 2?

What’s New in Spark 2?

Eyal Ben Ivri

Page 2: What's New in Spark 2?

In just two words…

Not Much

Page 3: What's New in Spark 2?

In just OVER two words…

Not Much, but let’s talk about it.

Page 4: What's New in Spark 2?

Let’s start from the beginning

• What is Apache Spark?• An open source cluster computing framework. • Originally developed at the University of California, Berkeley's AMPLab.• Aimed and designed to be a Big Data computational framework.

Page 5: What's New in Spark 2?

Spark Components

Data Sources

Spark Core (Batch Processing, RDD, SparkContext)

SparkSQL(DataFrame)

Spark Streaming

Spark MLlib(ML Pipelines) Spark GraphX Spark

Packages

Page 6: What's New in Spark 2?

Spark Components (v2)

Data Sources

Spark Core (Batch Processing, RDD, SparkContext)

Spark Streaming

Spark MLlib(ML Pipelines) Spark GraphX Spark

PackagesSpark SQL (SparkSession, DataSet)

Page 7: What's New in Spark 2?

Timeline

UC Berkeley’s AMPLab (2009)

Open Sourced (2010)

Apache Foundation

(2013)

Top-Level Apache

Project (Feb 2014)

Version 1.0 (May 2014)

World record in

large scale sorting (Nov

214)

Version 1.6 (Jan 2016)

Version 2.0.0 (Jul

2016)

Version 2.0.1 (Oct

2016, Current)

Page 8: What's New in Spark 2?

Version History (major changes)

1.0 – SparkSQL

(formally Shark project)

1.1 – Streaming support for

python

1.2 – Core engine

improvements. GraphX

graduates.

1.3 – DataFrame.

Python engine improvements.

1.4 – SparkR

1.5 – Bugs and Performance

1.6 – Dataset (experimental)

Page 9: What's New in Spark 2?

Spark 2.0.x

• Major notables:• Scala 2.11• SparkSession• Performance• API Stability• SQL:2003 support• Structured Streaming• R UDFs and Mllib algorithms implementation

Page 10: What's New in Spark 2?

API

• Spark doesn't like API changes• The good news:

• To migrate, you’ll have to perform little to no changes to your code.

• The (not so) bad news:• To benefit from all the

performance improvements, some old code might need more refactoring.

Page 11: What's New in Spark 2?

Programming API• Unifying DataFrame and Dataset:

• Dataset[Row] = DataFrame• SparkSession replaces

SqlContext/HiveContext• Both kept for backwards compatibility.

• Simpler, more performant accumulator API

• A new, improved Aggregator API for typed aggregation in Datasets

Page 12: What's New in Spark 2?

SQL Language

• Improved SQL functionalities (SQL 2003 support)

• Can now run all 99 TPC-DS queries

• The parser support ANSI-SQL as well as HiveQL

• Subquery support

Page 13: What's New in Spark 2?

SparkSQL new features

• Native CSV data source (Based on Databricks’ spark-csv package)

• Better off-heap memory management

• Bucketing support (Hive implementation)

• Performance performance performance

Page 14: What's New in Spark 2?

Demo Spark API

SparkSQL

Page 15: What's New in Spark 2?

Judges Ruling

Dataset was supposed to be the future like a 6

months ago

SQL 2003 is so 2003

The API lives on!

SQL 2003 is cool

Page 16: What's New in Spark 2?

Structured Streaming (Alpha)

Structured Streaming is a scalable and fault-tolerant stream processing engine built on the Spark SQL engine.

You can express your streaming computation the same way you would express a batch computation on static data.

Page 17: What's New in Spark 2?

Structured Streaming (cont.)

You can use the Dataset / DataFrame API in Scala, Java or Python to express streaming aggregations, event-time windows, stream-to-batch joins, etc.

The computation is executed on the same optimized Spark SQL engine.

Exactly-once fault-tolerance guarantees through checkpointing and Write Ahead Logs.

Page 18: What's New in Spark 2?

Windowing streams

How many vehicles entered each toll booth every 5 minutes?

val windowedCounts = cars.groupBy( window($"timestamp", ”5 minutes", ”5 minutes"), $”car").count()

Page 19: What's New in Spark 2?
Page 20: What's New in Spark 2?
Page 21: What's New in Spark 2?

Demo Structured Streaming

Page 22: What's New in Spark 2?

Judges Ruling

Still a micro-batching process SparkSQL is the future

Page 23: What's New in Spark 2?

Tungsten Project - Phase 2

Performance improvements

Heap Memory management

Connectors optimizations

Let’s Pop the hood

Page 24: What's New in Spark 2?

Project Tungsten

Bring Spark performance closer to the bare metal, through:

• Native memory Management

• Runtime code generation

Started @ Version 1.4

The cornerstone that enabled the Catalyst engine

Page 25: What's New in Spark 2?

Project Tungsten - Phase 2

Whole stage code generationa. A technique that blends state-of-the-art from modern compilers and MPP databases.

b. Gives a performance boost of up to x9 faster

c. Emit optimized bytecode at runtime that collapses the entire query into a single function

d. Eliminating virtual function calls and leveraging CPU registers for intermediate data

Page 26: What's New in Spark 2?

Project Tungsten - Phase 2

Optimized input / outputa. Caching for Dataframes is based on Parquet

b. Faster Parquet reader

c. Google Gueva is OUT

d. Smarter HadoopFS connector

you have to be running on DataFrame / Dataset

Page 27: What's New in Spark 2?

Overall Judges Ruling

I Want to complain but i don’t know what about!!

Internal performance improvements aside, this feels more like Spark 1.7

I like flink...

All is good

SparkSQL is for sure the future of spark

The competition has done well for Spark

Page 28: What's New in Spark 2?

Thank you (Questions?)Long live Spark (and Flink)Eyal Ben Ivri

https://github.com/eyalbenivri/spark2demo