BDM26: Spark Summit 2014 Debriefing
![Page 1: BDM26: Spark Summit 2014 Debriefing](https://reader030.vdocuments.us/reader030/viewer/2022032616/55a684911a28ab3c498b47d0/html5/thumbnails/1.jpg)
Spark Summit 2014
Debriefing
David Lauzon
Presented at Big Data Montreal #26 on July 8th, 2014
![Page 2: BDM26: Spark Summit 2014 Debriefing](https://reader030.vdocuments.us/reader030/viewer/2022032616/55a684911a28ab3c498b47d0/html5/thumbnails/2.jpg)
Plan
● Spark Summit 2014 summary
● Tachyon
● BlinkDB
● Databricks Cloud
![Page 3: BDM26: Spark Summit 2014 Debriefing](https://reader030.vdocuments.us/reader030/viewer/2022032616/55a684911a28ab3c498b47d0/html5/thumbnails/3.jpg)
Disclaimer
I haven’t used Spark yet
I haven’t validated all the info gathered in this
presentation
Try it out for yourself :-)
![Page 4: BDM26: Spark Summit 2014 Debriefing](https://reader030.vdocuments.us/reader030/viewer/2022032616/55a684911a28ab3c498b47d0/html5/thumbnails/4.jpg)
Spark’s Role in the Big
Data Ecosystem
Matei Zaharia (CTO, Databricks)
![Page 5: BDM26: Spark Summit 2014 Debriefing](https://reader030.vdocuments.us/reader030/viewer/2022032616/55a684911a28ab3c498b47d0/html5/thumbnails/5.jpg)
“Spark is now the most active
project in the Hadoop ecosystem”
![Page 6: BDM26: Spark Summit 2014 Debriefing](https://reader030.vdocuments.us/reader030/viewer/2022032616/55a684911a28ab3c498b47d0/html5/thumbnails/6.jpg)
![Page 7: BDM26: Spark Summit 2014 Debriefing](https://reader030.vdocuments.us/reader030/viewer/2022032616/55a684911a28ab3c498b47d0/html5/thumbnails/7.jpg)
“The goal of Spark is to be a unified
platform and standard library for big
data apps”
![Page 8: BDM26: Spark Summit 2014 Debriefing](https://reader030.vdocuments.us/reader030/viewer/2022032616/55a684911a28ab3c498b47d0/html5/thumbnails/8.jpg)
native driver
![Page 9: BDM26: Spark Summit 2014 Debriefing](https://reader030.vdocuments.us/reader030/viewer/2022032616/55a684911a28ab3c498b47d0/html5/thumbnails/9.jpg)
What’s Next for BDAS?
Mike Franklin
(Director, UC Berkeley AMPLab)
![Page 10: BDM26: Spark Summit 2014 Debriefing](https://reader030.vdocuments.us/reader030/viewer/2022032616/55a684911a28ab3c498b47d0/html5/thumbnails/10.jpg)
LAYERS
Application
Data Processing
Resource Management
Data Management
![Page 11: BDM26: Spark Summit 2014 Debriefing](https://reader030.vdocuments.us/reader030/viewer/2022032616/55a684911a28ab3c498b47d0/html5/thumbnails/11.jpg)
BDAS Summary (1/2)
Spark Core: General-purpose, low-level, low-latency processing engine. Supports the HDFS API, the Amazon S3 API, and Hive metadata.
Shark: Replaces Hive’s MapReduce execution engine with Spark.
Spark Streaming: Competitor to Storm. Inputs from Kafka, Flume, Twitter, and TCP sockets.
MLlib: Low-level machine learning library running on Spark.
MLbase (in dev): Competitor to Mahout; runs on top of MLlib.
GraphX (in dev): Enables users to interactively build, transform, and reason about graph-structured data at scale.
![Page 12: BDM26: Spark Summit 2014 Debriefing](https://reader030.vdocuments.us/reader030/viewer/2022032616/55a684911a28ab3c498b47d0/html5/thumbnails/12.jpg)
BDAS Summary (2/2)
BlinkDB (alpha): SQL queries with bounded errors and bounded response times on very large data.
SparkR (alpha): Run R on top of Spark.
Tachyon: A reliable in-memory distributed file system providing an HDFS-compatible API. Can persist data to HDFS, Amazon S3, LocalFS, etc.
Mesos: Cluster resource manager, multi-tenancy.
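To make the BlinkDB line ("bounded errors and bounded response times") more concrete, here is a minimal sketch of the general idea of answering an aggregate query from a random sample and reporting an error bound with it. This is not BlinkDB's API or implementation; the function name `approximate_mean`, the synthetic data, and the 95% normal-approximation interval are all assumptions made for illustration.

```python
import math
import random
import statistics


def approximate_mean(values, sample_size, z=1.96, seed=0):
    """Estimate the mean of `values` from a uniform random sample.

    Returns (estimate, margin), where margin is the half-width of an
    approximate 95% confidence interval (normal approximation, z = 1.96).
    Hypothetical helper for illustration only; not part of BlinkDB.
    """
    rng = random.Random(seed)
    sample = rng.sample(values, sample_size)
    estimate = statistics.fmean(sample)
    # Standard error of the mean, from the sample standard deviation.
    std_err = statistics.stdev(sample) / math.sqrt(sample_size)
    return estimate, z * std_err


# Synthetic stand-in for a large column (assumption: 200,000 values).
data_rng = random.Random(42)
column = [data_rng.gauss(100.0, 15.0) for _ in range(200_000)]

# Scan only 5% of the rows, and report how far off the answer may be.
estimate, margin = approximate_mean(column, sample_size=10_000)
exact = statistics.fmean(column)
print(f"approx mean = {estimate:.2f} +/- {margin:.2f} (exact = {exact:.2f})")
```

A smaller sample gives a faster answer with a wider margin, and a larger sample gives a slower answer with a tighter one. That is the error versus response time trade-off the slide refers to.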
![Page 13: BDM26: Spark Summit 2014 Debriefing](https://reader030.vdocuments.us/reader030/viewer/2022032616/55a684911a28ab3c498b47d0/html5/thumbnails/13.jpg)
![Page 14: BDM26: Spark Summit 2014 Debriefing](https://reader030.vdocuments.us/reader030/viewer/2022032616/55a684911a28ab3c498b47d0/html5/thumbnails/14.jpg)
![Page 15: BDM26: Spark Summit 2014 Debriefing](https://reader030.vdocuments.us/reader030/viewer/2022032616/55a684911a28ab3c498b47d0/html5/thumbnails/15.jpg)
![Page 16: BDM26: Spark Summit 2014 Debriefing](https://reader030.vdocuments.us/reader030/viewer/2022032616/55a684911a28ab3c498b47d0/html5/thumbnails/16.jpg)
Spark and the future of
big data applications
Eric Baldeschwieler (Tech Advisor)
![Page 17: BDM26: Spark Summit 2014 Debriefing](https://reader030.vdocuments.us/reader030/viewer/2022032616/55a684911a28ab3c498b47d0/html5/thumbnails/17.jpg)
Big Data Application Model
![Page 18: BDM26: Spark Summit 2014 Debriefing](https://reader030.vdocuments.us/reader030/viewer/2022032616/55a684911a28ab3c498b47d0/html5/thumbnails/18.jpg)
Spark’s current (v1.0) challenges
Better job scheduling tools
Increase focus on ETL
R bindings
Extend SparkSQL to run on more data stores
Add more machine learning algorithms
Basics: stability, profiling & debugging, error
reporting, logging, etc.
![Page 19: BDM26: Spark Summit 2014 Debriefing](https://reader030.vdocuments.us/reader030/viewer/2022032616/55a684911a28ab3c498b47d0/html5/thumbnails/19.jpg)
Spark’s current (v1.0) challenges
Better stability
Profiling & debugging
Error reporting
Logging
![Page 20: BDM26: Spark Summit 2014 Debriefing](https://reader030.vdocuments.us/reader030/viewer/2022032616/55a684911a28ab3c498b47d0/html5/thumbnails/20.jpg)
The Future of Spark
Patrick Wendell (Databricks)
![Page 21: BDM26: Spark Summit 2014 Debriefing](https://reader030.vdocuments.us/reader030/viewer/2022032616/55a684911a28ab3c498b47d0/html5/thumbnails/21.jpg)
Timeline
and:
● join optimisations
● MLlib: from 15 to 30 algorithms
● Core internal API for pluggable implementations
![Page 22: BDM26: Spark Summit 2014 Debriefing](https://reader030.vdocuments.us/reader030/viewer/2022032616/55a684911a28ab3c498b47d0/html5/thumbnails/22.jpg)
The Emergence of the
Enterprise Data Hub
Mike Olson (Chief Strategy Officer,
Cloudera)
![Page 23: BDM26: Spark Summit 2014 Debriefing](https://reader030.vdocuments.us/reader030/viewer/2022032616/55a684911a28ab3c498b47d0/html5/thumbnails/23.jpg)
![Page 24: BDM26: Spark Summit 2014 Debriefing](https://reader030.vdocuments.us/reader030/viewer/2022032616/55a684911a28ab3c498b47d0/html5/thumbnails/24.jpg)
(a vision
of the future)
![Page 25: BDM26: Spark Summit 2014 Debriefing](https://reader030.vdocuments.us/reader030/viewer/2022032616/55a684911a28ab3c498b47d0/html5/thumbnails/25.jpg)
This means that sooner or later ...
Hadoop
MapReduce
![Page 26: BDM26: Spark Summit 2014 Debriefing](https://reader030.vdocuments.us/reader030/viewer/2022032616/55a684911a28ab3c498b47d0/html5/thumbnails/26.jpg)
![Page 27: BDM26: Spark Summit 2014 Debriefing](https://reader030.vdocuments.us/reader030/viewer/2022032616/55a684911a28ab3c498b47d0/html5/thumbnails/27.jpg)
Spark meets Genomics:
Helping Fight the Big C
with the Big D
David Patterson (AMP Lab, UC Berkeley)
![Page 28: BDM26: Spark Summit 2014 Debriefing](https://reader030.vdocuments.us/reader030/viewer/2022032616/55a684911a28ab3c498b47d0/html5/thumbnails/28.jpg)
SNAP: Scalable Nucleotide
Alignment Program
=> A new genome aligner based on Spark that
is 10-100X faster and simultaneously more
accurate than existing tools based on
MapReduce or other algorithms [1]
[1] https://amplab.cs.berkeley.edu/projects/snap/
![Page 29: BDM26: Spark Summit 2014 Debriefing](https://reader030.vdocuments.us/reader030/viewer/2022032616/55a684911a28ab3c498b47d0/html5/thumbnails/29.jpg)
SNAP helps save a life [1]
A teenager was hospitalized for 5 weeks
without successful diagnosis
He developed brain seizures and was placed in
a medically induced coma
With a sample of his spinal fluid and the use of
SNAP, a rare infectious bacterium was found
The boy was treated and discharged 4 weeks later
[1] https://amplab.cs.berkeley.edu/2014/06/04/snap-helps-save-a-life/
![Page 30: BDM26: Spark Summit 2014 Debriefing](https://reader030.vdocuments.us/reader030/viewer/2022032616/55a684911a28ab3c498b47d0/html5/thumbnails/30.jpg)
Databricks Update and
Announcing Databricks
Cloud
Ion Stoica (CEO, Databricks)
![Page 31: BDM26: Spark Summit 2014 Debriefing](https://reader030.vdocuments.us/reader030/viewer/2022032616/55a684911a28ab3c498b47d0/html5/thumbnails/31.jpg)
even Red Hat Fedora
![Page 32: BDM26: Spark Summit 2014 Debriefing](https://reader030.vdocuments.us/reader030/viewer/2022032616/55a684911a28ab3c498b47d0/html5/thumbnails/32.jpg)
New: Databricks Cloud Platform
![Page 33: BDM26: Spark Summit 2014 Debriefing](https://reader030.vdocuments.us/reader030/viewer/2022032616/55a684911a28ab3c498b47d0/html5/thumbnails/33.jpg)
Databricks Platform
![Page 34: BDM26: Spark Summit 2014 Debriefing](https://reader030.vdocuments.us/reader030/viewer/2022032616/55a684911a28ab3c498b47d0/html5/thumbnails/34.jpg)
Databricks Workspace: Notebooks
![Page 35: BDM26: Spark Summit 2014 Debriefing](https://reader030.vdocuments.us/reader030/viewer/2022032616/55a684911a28ab3c498b47d0/html5/thumbnails/35.jpg)
Databricks Workspace: Dashboards
![Page 36: BDM26: Spark Summit 2014 Debriefing](https://reader030.vdocuments.us/reader030/viewer/2022032616/55a684911a28ab3c498b47d0/html5/thumbnails/36.jpg)
Databricks Cloud Demo
The following video extract integrates:
● Databricks Workspace
● Databricks Platform
● Spark Streaming
● Spark SQL
● Spark MLlib
![Page 37: BDM26: Spark Summit 2014 Debriefing](https://reader030.vdocuments.us/reader030/viewer/2022032616/55a684911a28ab3c498b47d0/html5/thumbnails/37.jpg)
Databricks Cloud Demo
14 min extract: http://youtu.be/dJQ5lV5Tldw?t=26m57s
Full video: https://www.youtube.com/watch?v=dJQ5lV5Tldw
![Page 38: BDM26: Spark Summit 2014 Debriefing](https://reader030.vdocuments.us/reader030/viewer/2022032616/55a684911a28ab3c498b47d0/html5/thumbnails/38.jpg)
Databricks Cloud
Great tool for data scientists
![Page 39: BDM26: Spark Summit 2014 Debriefing](https://reader030.vdocuments.us/reader030/viewer/2022032616/55a684911a28ab3c498b47d0/html5/thumbnails/39.jpg)
Conclusion
![Page 40: BDM26: Spark Summit 2014 Debriefing](https://reader030.vdocuments.us/reader030/viewer/2022032616/55a684911a28ab3c498b47d0/html5/thumbnails/40.jpg)
Conclusion
Most interesting Spark-related projects:
● SparkSQL
● BlinkDB
● Tachyon
● Databricks Cloud