New Directions for Spark in 2015
Matei Zaharia
March 18, 2015
2014: an Amazing Year for Spark
Total contributors: 150 => 500
Lines of code: 190K => 370K
500+ active production deployments
[Chart: Contributors per Month to Spark, 2011 to 2015 (y-axis 0 to 140)]
Most active project in big data
On-Disk Sort Record: Time to sort 100 TB
2013 Record: Hadoop, 2100 machines, 72 minutes
2014 Record: Spark, 207 machines, 23 minutes
Source: Daytona GraySort benchmark, sortbenchmark.org
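The two records differ in both cluster size and running time, so a single comparison can mislead. The short calculation below is a sketch that uses only the four figures on the slide; the machine-minutes metric and all variable names are illustrative choices, not part of the benchmark rules.

```python
# Figures as stated on the slide for the 100 TB on-disk sort.
records = {
    "Hadoop (2013)": {"machines": 2100, "minutes": 72},
    "Spark (2014)": {"machines": 207, "minutes": 23},
}

def machine_minutes(entry):
    """Total machine time: number of machines times wall-clock minutes."""
    return entry["machines"] * entry["minutes"]

hadoop = records["Hadoop (2013)"]
spark = records["Spark (2014)"]

# Three ways to compare the two runs.
wall_clock_speedup = hadoop["minutes"] / spark["minutes"]
machine_ratio = hadoop["machines"] / spark["machines"]
resource_ratio = machine_minutes(hadoop) / machine_minutes(spark)

print(f"Wall-clock speedup: {wall_clock_speedup:.1f}x")
print(f"Machines used:      {machine_ratio:.1f}x fewer")
print(f"Machine-minutes:    {resource_ratio:.1f}x fewer")
```

On these numbers the Spark run finished about 3.1x faster on roughly a tenth of the machines, which works out to about 32x fewer machine-minutes.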
Major Additions in 2014
Spark SQL
Java 8 syntax
Python streaming
GraphX
Random forests
Streaming MLlib
…
New Directions in 2015
Data Science: high-level interfaces similar to single-machine tools
Platform Interfaces: plug in data sources and algorithms
DataFrames
Similar API to data frames in R and Pandas
Automatically optimized via Spark SQL
Out in Spark 1.3
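# sqlContext is an existing SQLContext, as provided by the Spark 1.3 shell
df = sqlContext.jsonFile("tweets.json")
# Keep matei's tweets, then total retweets per date
(df[df["user"] == "matei"]
    .groupBy("date")
    .sum("retweets"))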
[Chart: Running Time, axis 0 to 10]
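To make the semantics of the DataFrame query concrete, here is a plain-Python sketch of the same filter, group, and sum steps. It does not use Spark and says nothing about how Spark SQL optimizes the query; the sample records and the function name `retweets_per_day` are made up for illustration.

```python
from collections import defaultdict

# Hypothetical sample records standing in for the contents of tweets.json.
tweets = [
    {"user": "matei", "date": "2015-03-16", "retweets": 12},
    {"user": "matei", "date": "2015-03-16", "retweets": 5},
    {"user": "matei", "date": "2015-03-17", "retweets": 7},
    {"user": "other", "date": "2015-03-16", "retweets": 40},
]

def retweets_per_day(records, user):
    """Filter by user, group by date, and sum retweets within each group."""
    totals = defaultdict(int)
    for row in records:
        if row["user"] == user:                      # df[df["user"] == user]
            totals[row["date"]] += row["retweets"]   # groupBy + sum
    return dict(totals)

print(retweets_per_day(tweets, "matei"))
```

The DataFrame version expresses the same three steps declaratively, which is what lets an optimizer reorder or push them down instead of running a fixed loop.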
Machine Learning Pipelines
High-level API inspired by scikit-learn
Featurization, evaluation, parameter search
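from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import HashingTF, Tokenizer
# Column names are assumed here; the slide omits them
tokenizer = Tokenizer(inputCol="text", outputCol="words")
tf = HashingTF(inputCol="words", outputCol="features", numFeatures=1000)
lr = LogisticRegression()
pipe = Pipeline(stages=[tokenizer, tf, lr])
model = pipe.fit(df)  # df needs "text" and "label" columns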
[Diagram: a DataFrame passes through the tokenizer, TF, and LR stages to produce a model]
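The pipeline idea itself, a chain of transformers feeding a final estimator, can be sketched without Spark. The toy classes below are hypothetical stand-ins written for illustration; they are not Spark's implementations, and the tiny training set is invented. The hashing step uses `zlib.crc32` only because it is stable across runs.

```python
import math
import zlib

class ToyTokenizer:
    """Transformer: lowercases text and splits it on whitespace."""
    def transform(self, rows):
        return [text.lower().split() for text in rows]

class ToyHashingTF:
    """Transformer: hashes tokens into a fixed-size term-frequency vector."""
    def __init__(self, num_features):
        self.num_features = num_features

    def transform(self, rows):
        vectors = []
        for tokens in rows:
            vec = [0.0] * self.num_features
            for tok in tokens:
                vec[zlib.crc32(tok.encode("utf-8")) % self.num_features] += 1.0
            vectors.append(vec)
        return vectors

class ToyLogisticRegression:
    """Estimator: fits weights with simple stochastic gradient descent."""
    def __init__(self, epochs=200, step=0.5):
        self.epochs, self.step = epochs, step
        self.weights, self.bias = [], 0.0

    def _prob(self, x):
        z = self.bias + sum(w * v for w, v in zip(self.weights, x))
        z = max(-30.0, min(30.0, z))  # clamp to keep exp() well behaved
        return 1.0 / (1.0 + math.exp(-z))

    def fit(self, X, y):
        self.weights = [0.0] * len(X[0])
        for _ in range(self.epochs):
            for xi, yi in zip(X, y):
                err = self._prob(xi) - yi
                self.weights = [w - self.step * err * v
                                for w, v in zip(self.weights, xi)]
                self.bias -= self.step * err
        return self

    def predict(self, X):
        return [1 if self._prob(x) >= 0.5 else 0 for x in X]

class ToyPipeline:
    """Runs every stage but the last as a transformer, then fits the last."""
    def __init__(self, stages):
        self.transformers = stages[:-1]
        self.estimator = stages[-1]

    def _features(self, rows):
        for stage in self.transformers:
            rows = stage.transform(rows)
        return rows

    def fit(self, texts, labels):
        self.estimator.fit(self._features(texts), labels)
        return self

    def predict(self, texts):
        return self.estimator.predict(self._features(texts))

# Invented training data: 1 = positive, 0 = negative.
texts = ["spark is fast", "spark is great", "slow and buggy", "buggy and slow job"]
labels = [1, 1, 0, 0]

pipe = ToyPipeline([ToyTokenizer(), ToyHashingTF(1000), ToyLogisticRegression()])
pipe.fit(texts, labels)
print(pipe.predict(texts))
```

Because the stages share one small interface, featurization and the model are fitted and applied together. That is the property that makes evaluation and parameter search over the whole pipeline practical.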
R Interface (SparkR)
Targeting Spark 1.4 (June)
Exposes DataFrames, RDDs, and ML library in R
df = jsonFile("tweets.json")
summarize(
  group_by(
    df[df$user == "matei",],
    "date"),
  sum("retweets"))
External Data Sources
Platform API to plug smart data sources into Spark
Returns DataFrames usable in Spark apps or SQL
Pushes logic into sources
Query submitted to Spark:
SELECT * FROM mysql_users u JOIN hive_logs h
WHERE u.lang = 'en'
Query pushed into the MySQL source:
SELECT * FROM users WHERE lang = 'en'
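The pushdown shown above can be sketched in plain Python, with an in-memory SQLite database standing in for MySQL. The `SqlSource` class and its `scan` method are hypothetical and are not Spark's Data Sources API. The point is only that the filter travels to the database as a WHERE clause, so rows are not fetched and then filtered afterwards.

```python
import sqlite3

class SqlSource:
    """Hypothetical data source that accepts pushed-down equality filters."""

    def __init__(self, conn, table):
        self.conn = conn
        self.table = table      # trusted identifier in this sketch
        self.last_query = None  # recorded only so the pushdown is observable

    def scan(self, filters=None):
        """Return rows as dicts, applying the filters inside the database."""
        filters = filters or {}
        query = f"SELECT * FROM {self.table}"
        if filters:
            # Column names are trusted here; values go in as bound parameters.
            clause = " AND ".join(f"{col} = ?" for col in filters)
            query += f" WHERE {clause}"
        self.last_query = query
        cursor = self.conn.execute(query, list(filters.values()))
        names = [col[0] for col in cursor.description]
        return [dict(zip(names, row)) for row in cursor]

# In-memory SQLite table with invented rows, standing in for the MySQL users table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, lang TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [("ana", "en"), ("bo", "fr"), ("cy", "en")],
)

users = SqlSource(conn, "users")
rows = users.scan({"lang": "en"})

print(users.last_query)
print(rows)
```

An engine built this way still performs the join itself, but it only receives the rows that survive the filter at the source.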
Spark Packages
Community index of third-party packages
bin/spark-shell --packages databricks/spark-csv:0.2
spark-packages.org
[Diagram: the Spark stack, built up over several slides]
Starting point: Spark Core, Spark Streaming, Spark SQL, MLlib, GraphX
Added: DataFrames, ML Pipelines
Added: Data Sources (e.g. {JSON})
Added: Packages
Goal: unified engine across data sources, workloads and environments
Enjoy Spark Summit East!