Introduction to Spark (Intern Event Presentation)


Page 1: Introduction to Spark (Intern Event Presentation)

Introduction to Spark

Matei Zaharia
Databricks Intern Event, August 2015

Page 2: Introduction to Spark (Intern Event Presentation)

What is Apache Spark?

Fast and general computing engine for clusters

Makes it easy and fast to process large datasets
• APIs in Java, Scala, Python, R
• Libraries for SQL, streaming, machine learning, …
• 100x faster than Hadoop MapReduce for some apps

Page 3: Introduction to Spark (Intern Event Presentation)

About Databricks

Founded by creators of Spark in 2013

Offers a hosted cloud service built on Spark
• Interactive workspace with notebooks, dashboards, jobs

Page 4: Introduction to Spark (Intern Event Presentation)

Community Growth

Most active open source project in big data

[Chart: Contributors / Month to Spark, 2010-2015]

Page 5: Introduction to Spark (Intern Event Presentation)

Spark Programming Model

Write programs in terms of transformations on distributed datasets

Resilient Distributed Datasets (RDDs)
• Collections of objects stored in memory or on disk across a cluster
• Built via parallel transformations (map, filter, …)
• Automatically rebuilt on failure
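To make the model concrete (not from the original slides), here is a minimal PySpark sketch of building an RDD, applying transformations, and caching the result; the local SparkContext and the toy data are assumptions for illustration:

from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-sketch")      # assumed local context, just for the example

nums = sc.parallelize(range(1, 1001))            # base RDD built from a local collection
squares = nums.map(lambda x: x * x)              # transformation: defined lazily, not computed yet
evens = squares.filter(lambda x: x % 2 == 0)     # another lazy transformation
evens.cache()                                    # keep the computed partitions in memory

print(evens.count())                             # action: triggers the computation
print(evens.take(5))                             # reuses the cached partitions

If a node fails, Spark recomputes only the lost partitions from this chain of transformations, which is the "automatically rebuilt on failure" property above.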

Page 6: Introduction to Spark (Intern Event Presentation)

Example: Log Mining

Load error messages from a log into memory, then interactively search for various patterns

lines = spark.textFile("hdfs://...")                      # base RDD
errors = lines.filter(lambda s: s.startswith("ERROR"))    # transformed RDD
messages = errors.map(lambda s: s.split('\t')[2])
messages.cache()

messages.filter(lambda s: "MySQL" in s).count()           # action
messages.filter(lambda s: "Redis" in s).count()
. . .

[Diagram: the driver sends tasks to workers; each worker reads its block of the file, and the cached messages partitions (Cache 1-3) serve later queries from memory]

Result: full-text search of Wikipedia in 0.5 sec (vs 20s for on-disk data)

Page 7: Introduction to Spark (Intern Event Presentation)

Example: Logistic Regression

Iterative algorithm used in machine learning

[Chart: Running Time (s) vs Number of Iterations (1, 5, 10, 20, 30), Hadoop vs Spark]

Hadoop: 110 s / iteration
Spark: 80 s first iteration, 1 s further iterations
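As a concrete sketch of why caching matters here (not the benchmark code; the data file, dimensionality, and parsing below are assumptions), logistic regression makes one full pass over the same dataset per iteration, so keeping the points in memory makes every pass after the first cheap:

import numpy as np
from pyspark import SparkContext

sc = SparkContext("local[*]", "logreg-sketch")

def parse(line):
    # Assumed format: space-separated features followed by a label in {-1, +1}
    parts = [float(x) for x in line.split()]
    return (np.array(parts[:-1]), parts[-1])

points = sc.textFile("/tmp/points.txt").map(parse).cache()   # cached: later passes read from memory
D = 10                                                        # assumed feature dimension
w = np.zeros(D)

for i in range(20):
    # One full scan per iteration; gradient of the logistic loss, summed across the cluster
    gradient = points.map(
        lambda p: p[0] * ((1.0 / (1.0 + np.exp(-p[1] * w.dot(p[0])))) - 1.0) * p[1]
    ).reduce(lambda a, b: a + b)
    w -= gradient

print(w)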

Page 8: Introduction to Spark (Intern Event Presentation)

On-Disk Performance: Time to sort 100 TB

2013 record (Hadoop): 72 minutes on 2100 machines
2014 record (Spark): 23 minutes on 207 machines

Source: Daytona GraySort benchmark, sortbenchmark.org

Page 9: Introduction to Spark (Intern Event Presentation)

Higher-Level Libraries

Libraries built on the Spark core engine:
• Spark Streaming: real-time
• Spark SQL: structured data
• MLlib: machine learning
• GraphX: graph processing

Page 10: Introduction to Spark (Intern Event Presentation)

Higher-Level Libraries

// Load data using SQL
points = ctx.sql("select latitude, longitude from tweets")

// Train a machine learning model
model = KMeans.train(points, 10)

// Apply it to a stream
sc.twitterStream(...)
  .map(lambda t: (model.predict(t.location), 1))
  .reduceByWindow("5s", lambda a, b: a + b)
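The slide code above is illustrative pseudocode that mixes the three libraries. A self-contained sketch of the SQL and MLlib steps in PySpark is below; the tiny in-memory tweets table is a stand-in, and the streaming step is left out because twitterStream is not a built-in Spark API:

from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.mllib.clustering import KMeans

sc = SparkContext("local[*]", "libs-sketch")
sqlContext = SQLContext(sc)

# Assumption: a toy tweets table, registered so the SQL query has something to read
df = sqlContext.createDataFrame(
    [(37.77, -122.42), (40.71, -74.01), (37.78, -122.41)],
    ["latitude", "longitude"])
df.registerTempTable("tweets")

# Load data using SQL, then turn rows into feature vectors for MLlib
points = sqlContext.sql("select latitude, longitude from tweets") \
                   .rdd.map(lambda row: [row.latitude, row.longitude])

# Train a k-means model (2 clusters for the toy data; the slide uses 10)
model = KMeans.train(points, 2)
print(model.predict([37.77, -122.42]))   # cluster id for one (latitude, longitude) point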

Page 11: Introduction to Spark (Intern Event Presentation)

Demo

Page 12: Introduction to Spark (Intern Event Presentation)

Spark Community

Over 1000 production users, clusters up to 8000 nodes

Many talks online at spark-summit.org

Page 13: Introduction to Spark (Intern Event Presentation)
Page 14: Introduction to Spark (Intern Event Presentation)

Ongoing Work

Speeding up Spark through code generation and binary processing (Project Tungsten)

R interface to Spark (SparkR)

Real-time machine learning library

Frontend and backend work in Databricks (visualization, collaboration, auto-scaling, …)

Page 15: Introduction to Spark (Intern Event Presentation)

Thank you. We’re hiring!