Introduction to Spark (Intern Event Presentation)


Page 1: Introduction to Spark (Intern Event Presentation)

Introduction to Spark

Matei Zaharia
Databricks Intern Event, August 2015

Page 2: Introduction to Spark (Intern Event Presentation)

What is Apache Spark?

Fast and general computing engine for clusters

Makes it easy and fast to process large datasets
• APIs in Java, Scala, Python, R
• Libraries for SQL, streaming, machine learning, …
• 100x faster than Hadoop MapReduce for some apps

Page 3: Introduction to Spark (Intern Event Presentation)

About Databricks

Founded by creators of Spark in 2013

Offers a hosted cloud service built on Spark
• Interactive workspace with notebooks, dashboards, jobs

Page 4: Introduction to Spark (Intern Event Presentation)

Community Growth

Most active open source project in big data

[Chart: Contributors / Month to Spark, 2010-2015]

Page 5: Introduction to Spark (Intern Event Presentation)

Spark Programming Model

Write programs in terms of transformations on distributed datasets

Resilient Distributed Datasets (RDDs)
• Collections of objects stored in memory or on disk across a cluster
• Built via parallel transformations (map, filter, …)
• Automatically rebuilt on failure
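To make the model concrete (not from the original slides), here is a minimal PySpark sketch of building an RDD, applying transformations, and caching the result; the local SparkContext and the toy data are assumptions for illustration:

from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-sketch")      # assumed local context, just for the example

nums = sc.parallelize(range(1, 1001))            # base RDD built from a local collection
squares = nums.map(lambda x: x * x)              # transformation: defined lazily, not computed yet
evens = squares.filter(lambda x: x % 2 == 0)     # another lazy transformation
evens.cache()                                    # keep the computed partitions in memory

print(evens.count())                             # action: triggers the computation
print(evens.take(5))                             # reuses the cached partitions

If a node fails, Spark recomputes only the lost partitions from this chain of transformations, which is the "automatically rebuilt on failure" property above.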

Page 6: Introduction to Spark (Intern Event Presentation)

Example: Log Mining

Load error messages from a log into memory, then interactively search for various patterns

lines = spark.textFile("hdfs://...")                      # base RDD
errors = lines.filter(lambda s: s.startswith("ERROR"))    # transformed RDD
messages = errors.map(lambda s: s.split('\t')[2])
messages.cache()

messages.filter(lambda s: "MySQL" in s).count()           # action
messages.filter(lambda s: "Redis" in s).count()
. . .

[Diagram: the driver sends tasks to workers; each worker reads its block of the file, and the cached messages partitions (Cache 1-3) serve later queries from memory]

Result: full-text search of Wikipedia in 0.5 sec (vs 20s for on-disk data)

Page 7: Introduction to Spark (Intern Event Presentation)

Example: Logistic Regression

Iterative algorithm used in machine learning

[Chart: Running Time (s) vs Number of Iterations (1, 5, 10, 20, 30), Hadoop vs Spark]

Hadoop: 110 s / iteration
Spark: 80 s first iteration, 1 s further iterations
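As a concrete sketch of why caching matters here (not the benchmark code; the data file, dimensionality, and parsing below are assumptions), logistic regression makes one full pass over the same dataset per iteration, so keeping the points in memory makes every pass after the first cheap:

import numpy as np
from pyspark import SparkContext

sc = SparkContext("local[*]", "logreg-sketch")

def parse(line):
    # Assumed format: space-separated features followed by a label in {-1, +1}
    parts = [float(x) for x in line.split()]
    return (np.array(parts[:-1]), parts[-1])

points = sc.textFile("/tmp/points.txt").map(parse).cache()   # cached: later passes read from memory
D = 10                                                        # assumed feature dimension
w = np.zeros(D)

for i in range(20):
    # One full scan per iteration; gradient of the logistic loss, summed across the cluster
    gradient = points.map(
        lambda p: p[0] * ((1.0 / (1.0 + np.exp(-p[1] * w.dot(p[0])))) - 1.0) * p[1]
    ).reduce(lambda a, b: a + b)
    w -= gradient

print(w)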

Page 8: Introduction to Spark (Intern Event Presentation)

On-Disk Performance: Time to sort 100 TB

2013 record (Hadoop): 72 minutes on 2100 machines
2014 record (Spark): 23 minutes on 207 machines

Source: Daytona GraySort benchmark, sortbenchmark.org

Page 9: Introduction to Spark (Intern Event Presentation)

Higher-Level Libraries

Libraries built on the Spark core engine:
• Spark Streaming: real-time
• Spark SQL: structured data
• MLlib: machine learning
• GraphX: graph processing

Page 10: Introduction to Spark (Intern Event Presentation)

Higher-Level Libraries

// Load data using SQL
points = ctx.sql("select latitude, longitude from tweets")

// Train a machine learning model
model = KMeans.train(points, 10)

// Apply it to a stream
sc.twitterStream(...)
  .map(lambda t: (model.predict(t.location), 1))
  .reduceByWindow("5s", lambda a, b: a + b)
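The slide code above is illustrative pseudocode that mixes the three libraries. A self-contained sketch of the SQL and MLlib steps in PySpark is below; the tiny in-memory tweets table is a stand-in, and the streaming step is left out because twitterStream is not a built-in Spark API:

from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.mllib.clustering import KMeans

sc = SparkContext("local[*]", "libs-sketch")
sqlContext = SQLContext(sc)

# Assumption: a toy tweets table, registered so the SQL query has something to read
df = sqlContext.createDataFrame(
    [(37.77, -122.42), (40.71, -74.01), (37.78, -122.41)],
    ["latitude", "longitude"])
df.registerTempTable("tweets")

# Load data using SQL, then turn rows into feature vectors for MLlib
points = sqlContext.sql("select latitude, longitude from tweets") \
                   .rdd.map(lambda row: [row.latitude, row.longitude])

# Train a k-means model (2 clusters for the toy data; the slide uses 10)
model = KMeans.train(points, 2)
print(model.predict([37.77, -122.42]))   # cluster id for one (latitude, longitude) point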

Page 11: Introduction to Spark (Intern Event Presentation)

Demo

Page 12: Introduction to Spark (Intern Event Presentation)

Spark Community

Over 1000 production users, clusters up to 8000 nodes

Many talks online at spark-summit.org

Page 13: Introduction to Spark (Intern Event Presentation)
Page 14: Introduction to Spark (Intern Event Presentation)

Ongoing Work

Speeding up Spark through code generation and binary processing (Project Tungsten)

R interface to Spark (SparkR)

Real-time machine learning library

Frontend and backend work in Databricks (visualization, collaboration, auto-scaling, …)

Page 15: Introduction to Spark (Intern Event Presentation)

Thank you. We’re hiring!