TRANSCRIPT
A look ahead at Spark 2.0
Reynold Xin (@rxin), 2016-03-30, Strata Conference
About Databricks
Founded by creators of Spark in 2013
Cloud enterprise data platform
• Managed Spark clusters
• Interactive data science
• Production pipelines
• Data governance, security, …
Today’s Talk
Looking back at the last 12 months
Looking forward to Spark 2.0
• Project Tungsten, Phase 2
• Structured Streaming
• Unifying DataFrame & Dataset
Best resource for learning Spark
A slide from 2013 …
Programmability
WordCount in 50+ lines of Java MapReduce
WordCount in 3 lines of Spark
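For reference, those 3 lines in Scala; a minimal sketch assuming a SparkContext named sc and input/output paths of our choosing (both placeholders here):

// Split lines into words, pair each word with 1, and sum the pairs by key
val counts = sc.textFile("input.txt")
               .flatMap(line => line.split(" "))
               .map(word => (word, 1))
               .reduceByKey(_ + _)
counts.saveAsTextFile("counts")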
What is Spark?
Unified engine across data workloads and platforms
SQL, Streaming, ML, Graph, Batch, …
2015: A Great Year for Spark
Most active open source project in (big) data
• 1000+ code contributors
New language: R
Widespread industry support & adoption
“Spark is the Taylor Swift of big data software.”
- Derrick Harris, Fortune
Top Applications
• Business Intelligence: 68%
• Data Warehousing: 52%
• Recommendation: 44%
• Log Processing: 40%
• User-Facing Services: 36%
• Fraud Detection / Security: 29%
Diverse Runtime Environments

How respondents are running Spark: 51% on a public cloud

Most common Spark deployment environments (cluster managers):
• Standalone mode: 48%
• YARN: 40%
• Mesos: 11%
Spark 2.0
Next major release, coming in May
Builds on all we learned in the past 2 years
Versioning in Spark
Version numbers follow major.minor.patch, e.g. 1.6.0:
• Major version (1): may change APIs
• Minor version (6): adds APIs / features
• Patch version (0): only bug fixes

In reality, we hate breaking APIs! We will not do so except for dependency conflicts (e.g. Guava) and experimental APIs.
Major Features in 2.0
• Tungsten Phase 2: speedups of 5-10x
• Structured Streaming: real-time engine on SQL/DataFrames
• Unifying Datasets and DataFrames
Datasets & DataFrames
API foundation for the future
Datasets and DataFrames
In 2015, we added DataFrames & Datasets as structured data APIs
• DataFrames are collections of rows with a schema
• Datasets add static types, e.g. Dataset[Person]
• Both run on Tungsten
Spark 2.0 will merge these APIs: DataFrame = Dataset[Row]
Example:

case class User(name: String, id: Int)
case class Message(user: User, text: String)

// assumes import sqlContext.implicits._ for the encoders
val dataframe = sqlContext.read.json("log.json")  // DataFrame, i.e. Dataset[Row]
val messages = dataframe.as[Message]              // Dataset[Message]
val users = messages.filter(m => m.text.contains("Spark"))
                    .map(m => m.user)             // Dataset[User]

pipeline.train(users)  // MLlib takes either DataFrames or Datasets
Benefits
Simpler to understand
• Only kept Dataset separate to keep binary compatibility in 1.x
Libraries can take data of both forms (see the sketch below)
With Streaming, same API will also work on streams
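To illustrate "data of both forms", a hypothetical sketch (rowCount is our example function, not a Spark API):

import org.apache.spark.sql.Dataset

// Because DataFrame = Dataset[Row] in 2.0, one signature serves both forms
def rowCount(data: Dataset[_]): Long = data.count()

// rowCount(dataframe)   // a DataFrame, i.e. Dataset[Row]
// rowCount(messages)    // a typed Dataset[Message]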
Long-Term
RDD will remain the low-level API in Spark
Datasets & DataFrames give richer semantics and optimizations
• New libraries will increasingly use these as the interchange format
• Examples: Structured Streaming, MLlib, GraphFrames
Structured Streaming
How do we simplify streaming?
Integration Example
A streaming engine reads a stream of page-view events:

(home.html, 10:08)
(product.html, 10:09)
(home.html, 10:10)
. . .

and continuously updates a table of per-minute visit counts in MySQL:

Page    | Minute | Visits
home    | 10:09  | 21
pricing | 10:10  | 30
...     | ...    | ...

What can go wrong?
• Late events
• Partial outputs to MySQL
• State recovery on failure
• Distributed reads/writes
• ...
Complex Programming Models
• Processing: business logic change & new ops (windows, sessions)
• Output: how do we define output over time & correctness?
• Data: late arrival, varying distribution over time, …
The simplest way to perform streaming analyticsis not having to reason about streaming.
Spark 1.3: static DataFrames → Spark 2.0: infinite DataFrames
Single API!
Structured Streaming
High-level streaming API built on the Spark SQL engine
• Runs the same queries on DataFrames
• Event time, windowing, sessions, sources & sinks

Unifies streaming, interactive and batch queries (see the sketch after this list)
• Aggregate data in a stream, then serve using JDBC
• Change queries at runtime
• Build and apply ML models
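A minimal sketch of what such a query might look like with the 2.0 readStream/writeStream API; the schema, input path, and console sink are hypothetical placeholders:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder.appName("pageviews").getOrCreate()

// Hypothetical schema for (page, time) events arriving as JSON files
val schema = new StructType()
  .add("page", StringType)
  .add("time", TimestampType)

// An "infinite DataFrame": new files under /logs keep appending rows
val events = spark.readStream.schema(schema).json("/logs")

// The same groupBy/count you would write on a static DataFrame
val counts = events.groupBy("page").count()

// Continuously maintain the aggregate; console sink as a placeholder
val query = counts.writeStream
  .outputMode("complete")
  .format("console")
  .start()
query.awaitTermination()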
See Michael/TD’s talks tomorrow for a deep dive!
Tungsten Phase 2
Can we speed up Spark by 10X?
Demo
Run a join on a large table with 1 billion records and a small table with 1000 records
In Spark 1.6, took 60+ seconds.
In Spark 2.0, took ~3 seconds.
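The demo code itself isn't shown in the slides; a hypothetical reconstruction along those lines, assuming a SparkSession named spark:

// Join a large table of 1 billion ids against a small table of 1000 ids.
// The small side is tiny enough to be broadcast automatically.
val large = spark.range(1000L * 1000 * 1000)  // 1 billion records
val small = spark.range(1000)                 // 1000 records
large.join(small, "id").count()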
Consider this query and its operator pipeline (Scan → Filter → Project → Aggregate):

select count(*) from store_sales
where ss_item_sk = 1000
Volcano Iterator Model
Standard for 30 years: almost all databases do it
Each operator is an “iterator” that consumes records from its input operator
class Filter(child: Operator, predicate: InternalRow => Boolean) {
  // Pull rows from the child operator until one satisfies the predicate
  def next(): Boolean = {
    var found = false
    while (!found && child.next()) {
      found = predicate(child.fetch())
    }
    found
  }

  // Return the row the child is currently positioned on
  def fetch(): InternalRow = child.fetch()
  …
}
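For context, a minimal sketch of the iterator interface these operators share, and a count(*) aggregate driving it; the Operator trait and CountAggregate class are our scaffolding, not Spark's actual internals:

// Hypothetical Volcano-style iterator interface
trait Operator {
  def next(): Boolean       // advance to the next row; false when exhausted
  def fetch(): InternalRow  // return the current row
}

// A count(*) aggregate pulls rows one at a time from its child,
// paying virtual function calls per row at every operator boundary
class CountAggregate(child: Operator) {
  def result(): Long = {
    var count = 0L
    while (child.next()) count += 1
    count
  }
}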
What if we hire a college freshman to implement this query in Java in 10 mins?
select count(*) from store_sales
where ss_item_sk = 1000
var count = 0
for (ss_item_sk in store_sales) {
  if (ss_item_sk == 1000) {
    count += 1
  }
}
Volcano model (30+ years of database research)
vs
College freshman (hand-written code in 10 mins)

Throughput:
• Volcano: 13.95 million rows/sec
• College freshman: 125 million rows/sec

Note: End-to-end, single thread, single column, and data originated in Parquet on disk
How does a student beat 30 years of research?
Volcano:
1. Many virtual function calls
2. Data in memory (or cache)
3. No loop unrolling, SIMD, pipelining

Hand-written code:
1. No virtual function calls
2. Data in CPU registers
3. Compiler loop unrolling, SIMD, pipelining
Take advantage of all the information that is known after query compilation
The same pipeline (Scan → Filter → Project → Aggregate) collapses into a single generated loop:

long count = 0;
for (ss_item_sk in store_sales) {
  if (ss_item_sk == 1000) {
    count += 1;
  }
}
Tungsten Phase 2: Spark as a “Compiler”
Functionality of a general-purpose execution engine; performance as if the system had been hand-built just to run your query
Databricks Community Edition
Best place to try & learn Spark.
Today’s talk
Spark has been growing explosively
Spark 2.0 doubles down on what made Spark attractive:
• elegant APIs
• cutting-edge performance
Learn Spark on Databricks Community Edition
• join the beta waitlist at https://databricks.com/
Thank you. @rxin