bdm26: spark summit 2014 debriefing

Spark Summit 2014

Debriefing

David Lauzon

Presented at Big Data Montreal #26 on July 8th 2014

Plan

● Spark Summit 2014 summary

● Tachyon

● BlinkDB

● Databricks Cloud

Disclaimer

I haven’t used Spark yet

I haven’t validated all the info gathered in this presentation

Try it out for yourself :-)

Spark’s Role in the Big Data Ecosystem

Matei Zaharia (CTO, Databricks)

“Spark is now the most active project in the Hadoop ecosystem”

“The goal of Spark is to be a unified platform and standard library for big data apps”


What’s Next for BDAS?

Mike Franklin (Director, UC Berkeley AMPLab)

Layers

● Application

● Data Processing

● Resource Management

● Data Management

BDAS Summary (1/2)

● Spark Core: general-purpose, low-level, low-latency processing engine. Supports the HDFS API, the Amazon S3 API, and Hive metadata

● Shark: replaces Hive’s MapReduce execution engine with Spark

● Spark Streaming: competitor to Storm. Inputs from Kafka, Flume, Twitter, TCP sockets

● MLlib: low-level machine learning library running on Spark

● MLbase (in dev): competitor to Mahout, runs on top of MLlib

● GraphX (in dev): enables users to interactively build, transform, and reason about graph-structured data at scale

BDAS Summary (2/2)

● BlinkDB (alpha): SQL queries with bounded errors and bounded response times on very large data

● SparkR (alpha): run R on top of Spark

● Tachyon: a reliable in-memory distributed file system providing an HDFS-compatible API. Can persist data to HDFS, Amazon S3, LocalFS, etc.

● Mesos: cluster resource manager, multi-tenancy
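
To make the BlinkDB idea of "bounded errors and bounded response times" more concrete, here is a minimal sketch in plain Python. It is not BlinkDB code and does not use any BlinkDB or Spark API. It only assumes the general principle of answering an aggregate query from a random sample and reporting an error bar. The function name `approx_mean`, the synthetic data, and the sample sizes are all hypothetical.

```python
import math
import random
import statistics


def approx_mean(values, sample_size, z=1.96, seed=0):
    """Estimate the mean of `values` from a uniform random sample.

    Returns (estimate, margin). With z=1.96, estimate +/- margin is an
    approximate 95% confidence interval (normal approximation).
    This is an illustrative assumption, not how BlinkDB is implemented.
    """
    rng = random.Random(seed)
    sample = rng.sample(values, sample_size)  # sampling without replacement
    estimate = statistics.fmean(sample)
    std_err = statistics.stdev(sample) / math.sqrt(sample_size)
    return estimate, z * std_err


# Synthetic column standing in for a very large table (hypothetical data).
data_rng = random.Random(42)
data = [data_rng.uniform(0, 100) for _ in range(200_000)]

exact = statistics.fmean(data)        # full scan: slow on real big data
small = approx_mean(data, 1_000)      # fast answer, wider error bar
large = approx_mean(data, 20_000)     # slower answer, tighter error bar

print("exact mean:", exact)
print("1k-row sample: %.2f +/- %.2f" % small)
print("20k-row sample: %.2f +/- %.2f" % large)
```

The trade-off is the point: a smaller sample answers sooner but with a wider error bar, and a larger sample does the opposite. In this sketch, the sample size is the knob between response time and error.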

Spark and the future of big data applications

Eric Baldeschwieler (Tech Advisor)

Big Data Application Model

Spark’s current (v1.0) challenges

Better job scheduling tools

Increase focus on ETL

R bindings

Extend SparkSQL to run on more data stores

Add more machine learning algorithms

Basics: stability, profiling & debugging, error reporting, logging, etc.

Spark’s current (v1.0) challenges

Better stability

Profiling & debugging

Error reporting

Logging

The Future of Spark

Patrick Wendell (Databricks)

Timeline

and:

● join optimisations

● MLlib: from 15 to 30 algorithms

● Core internal API for pluggable implementations

The Emergence of the Enterprise Data Hub

Mike Olson (Chief Strategy Officer, Cloudera)

(a vision of the future)

This means that sooner or later ...

Hadoop MapReduce

Spark meets Genomics: Helping Fight the Big C with the Big D

David Patterson (AMPLab, UC Berkeley)

SNAP: Scalable Nucleotide Alignment Program

=> A new genome aligner based on Spark that is 10-100X faster and simultaneously more accurate than existing tools based on MapReduce or other algorithms [1]

[1] https://amplab.cs.berkeley.edu/projects/snap/

SNAP helps save a life [1]

A teenager was hospitalized for 5 weeks without a successful diagnosis

He developed brain seizures and was placed in a medically induced coma

With a sample of his spinal fluid and the use of SNAP, a rare infectious bacterium was found

The boy was treated and discharged 4 weeks later

[1] https://amplab.cs.berkeley.edu/2014/06/04/snap-helps-save-a-life/

Databricks Update and Announcing Databricks Cloud

Ion Stoica (CEO, Databricks)

even RedHat Fedora

New: Databricks Cloud Platform

Databricks Platform

Databricks Workspace: Notebooks

Databricks Workspace: Dashboards

Databricks Cloud Demo

The following video extract integrates:

● Databricks Workspace

● Databricks Platform

● Spark Streaming

● Spark SQL

● Spark MLlib


14min extract: http://youtu.be/dJQ5lV5Tldw?t=26m57s

Full video: https://www.youtube.com/watch?v=dJQ5lV5Tldw


Databricks Cloud

Great tool for data scientists

Conclusion


Most interesting Spark-related projects:

● SparkSQL

● BlinkDB

● Tachyon

● Databricks Cloud
