Big Data Processing using Apache Spark and Clojure


Page 1: Big Data Processing using Apache Spark and Clojure

Big Data Processing using Apache Spark and Clojure

Dr. Paulus Esterhazy and Dr. Christian Betz

January 2015

https://github.com/pesterhazy/, @pesterhazy https://github.com/chrisbetz/, @chris_betz

Page 2: Big Data Processing using Apache Spark and Clojure

Who uses Clojure?

Page 3: Big Data Processing using Apache Spark and Clojure

Who's getting paid to use Clojure?

Page 4: Big Data Processing using Apache Spark and Clojure

Who uses BigData?

Page 5: Big Data Processing using Apache Spark and Clojure

Who uses Hadoop?

Page 6: Big Data Processing using Apache Spark and Clojure

Who uses Spark?

Page 7: Big Data Processing using Apache Spark and Clojure

About us

Page 8: Big Data Processing using Apache Spark and Clojure

Paulus, red pinapple media GmbH

Page 9: Big Data Processing using Apache Spark and Clojure

Chris

Page 10: Big Data Processing using Apache Spark and Clojure

WTF is Spark?

These slides build on material by:

• Patrick Wendell (Databricks), "Spark Performance: Common Patterns and Pitfalls for Implementing Algorithms in Spark"

• Hossein Falaki (@mhfalaki, [email protected])

• Reynold Xin, "Advanced Spark", July 2, 2014 @ Spark Summit Training

Disclaimer: We reuse stuff

Page 11: Big Data Processing using Apache Spark and Clojure

Apache Spark - an Overview

"Apache Spark™ is a fast and general engine for large-scale data processing."

Value proposition?

Spark keeps stuff in memory where possible, so intermediate results do not need I/O.

Spark allows a quicker development cycle, with proper unit tests (see later).

Spark allows you to define your own data sources (JDBC in our case).

Spark allows you to work with any data structures (though some are better than others).

Page 12: Big Data Processing using Apache Spark and Clojure

Two Questions

“I like Clojure, why might I be interested in Spark?”

“Granted that Spark is useful, why program it in Clojure?”

Page 13: Big Data Processing using Apache Spark and Clojure

Two Questions

“I like Clojure, why might I be interested in Spark?”

“Granted that Spark is useful, why program it in Clojure?”

That's you!

Page 14: Big Data Processing using Apache Spark and Clojure

How Big Data is processed today

large amounts of data to process

Hadoop is the de-facto standard

Hadoop = MapReduce + HDFS

Page 15: Big Data Processing using Apache Spark and Clojure

However, Hadoop has some limitations

Pain point: performance

Writing to disk after each map/reduce step

That's especially bad for chains of map/reduce steps and for iterative algorithms (machine learning, PageRank)

Identified Bottleneck: HDD I/O

Page 16: Big Data Processing using Apache Spark and Clojure

Spark's Answer

Major innovation: data sharing between processing steps

In-memory processing

Page 17: Big Data Processing using Apache Spark and Clojure

Resilient Distributed Datasets (RDDs)

Datasets: collections of elements

Distributed: could live on any node in the cluster.

Resilient: could get lost (or partially lost); doesn't matter, Spark will recompute it.

Page 18: Big Data Processing using Apache Spark and Clojure

Different types of RDDs, all the same interface

Scientific Answer: RDD is an Interface!

1. Set of partitions ("splits" in Hadoop)
2. List of dependencies on parent RDDs
3. Function to compute a partition (as an Iterator) given its parent(s)
4. (Optional) partitioner (hash, range)
5. (Optional) preferred location(s) for each partition

Items 1-3 describe the RDD's "lineage"; items 4-5 enable optimized execution.

Page 19: Big Data Processing using Apache Spark and Clojure

Different types of RDDs, all the same interface (RDD interface recap, as on the previous slide)

Example: HadoopRDD

partitions = one per HDFS block
dependencies = none
compute(part) = read corresponding block
preferredLocations(part) = HDFS block location
partitioner = none

Page 20: Big Data Processing using Apache Spark and Clojure

Different types of RDDs, all the same interface (recap; HadoopRDD example as above)

Example: Filtered RDD

partitions = same as parent RDD
dependencies = "one-to-one" on parent
compute(part) = compute parent and filter it
preferredLocations(part) = none (ask parent)
partitioner = none

Page 21: Big Data Processing using Apache Spark and Clojure

How are RDDs handled?

You create an RDD from a data source, e.g. an HDFS file, a Cassandra DB query, or a JDBC query.

You transform RDDs (with map, filter, ...), which gives you new RDDs.

You perform an action on one RDD to get the results from that RDD into your "driver" (like first, take, collect, count, ...). See the sketch below.
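
To make the three steps concrete, here is a minimal sketch, assuming the s alias for sparkling.core and an existing SparkContext sc as set up later in these slides; the HDFS path is made up, and the argument order follows the thread-first style used on the later code slides.

;; Sketch only: s = sparkling.core, sc = an existing SparkContext
;; (see the new-spark-context slide later on); the path is a made-up example.
(def lines-rdd (s/text-file sc "hdfs:///logs/access.log"))            ;; 1. create an RDD from a source
(def errors-rdd (-> lines-rdd
                    (s/filter (fn [line]
                                (boolean (re-find #" 500 " line)))))) ;; 2. transform it into a new RDD
(s/take errors-rdd 5)   ;; 3. action: ship the first five matching lines to the driver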

Page 22: Big Data Processing using Apache Spark and Clojure

Basic Building Blocks: RDDs (Resilient Distributed Datasets)

Spark follows a functional approach: you define collections (RDDs) and functions on collections (see the sketch below).

Sources for RDDs:

• Local collections, parallelized

• HDFS files

• Your own (e.g. a JDBC RDD)

Transformations (only a selection):

• map

• filter

Actions (only a selection):

• reduce (fn)

• count
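
A hedged end-to-end sketch of these building blocks, using a parallelized local collection as the source (same assumptions and aliases as the sketch above; s/reduce is assumed to follow the same thread-first calling convention as the other slides).

;; Sketch only: source -> transformations -> actions.
(def numbers (s/parallelize sc [1 2 3 4 5 6]))        ;; source: local collection, parallelized
(def doubled-evens (-> numbers
                       (s/filter (fn [x] (even? x)))  ;; transformation: filter
                       (s/map (fn [x] (* 2 x)))))     ;; transformation: map
(-> doubled-evens (s/count))                          ;; action: 3
(-> doubled-evens (s/reduce +))                       ;; action: 4 + 8 + 12 = 24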

Page 23: Big Data Processing using Apache Spark and Clojure

Basic Building Blocks: RDDs, continued

(Diagram build: sources such as a JdbcRDD (query) or an HDFS file (path) define the basic RDDs you're working on.)

Page 24: Big Data Processing using Apache Spark and Clojure

Basic Building Blocks: RDDs, continued

(Diagram build: transformations such as map and filter create new RDDs from the source RDDs.)

Page 25: Big Data Processing using Apache Spark and Clojure

Basic Building Blocks: RDDs, continued

(Diagram build: transformations like join combine RDDs into new ones.)

Page 26: Big Data Processing using Apache Spark and Clojure

Basic Building Blocks: RDDs, continued

(Diagram build: you provide your own functions inside the transformations, e.g. the predicate for filter.)

Page 27: Big Data Processing using Apache Spark and Clojure

Basic Building Blocks: RDDs, continued

(Diagram build: actions like reduce spit a result back to the driver.)

Page 28: Big Data Processing using Apache Spark and Clojure

RDDs in Practice

Example code: https://github.com/gorillalabs/ClojureD

Page 29: Big Data Processing using Apache Spark and Clojure

In Practice 1: line count

(defn line-count [lines]
  (->> lines
       count))

(defn process [f]
  (with-open [rdr (clojure.java.io/reader "in.log")]
    (let [result (f (line-seq rdr))]
      (if (seq? result)
        (doall result)
        result))))

(process line-count)

Page 30: Big Data Processing using Apache Spark and Clojure

In Practice 2: line count cont'd

(defn line-count* [lines]
  (->> lines
       s/count))

(defn new-spark-context []
  (let [c (-> (s-conf/spark-conf)
              (s-conf/master "local[*]")
              (s-conf/app-name "sparkling")
              (s-conf/set "spark.akka.timeout" "300")
              (s-conf/set-executor-env {"spark.executor.memory" "4G",
                                        "spark.files.overwrite" "true"}))]
    (s/spark-context c)))

(defonce sc (delay (new-spark-context)))

(defn process* [f]
  (let [lines-rdd (s/text-file @sc "in.log")]
    (f lines-rdd)))

(For comparison, the slide repeats the plain Clojure line-count and process from the previous page.)

Page 31: Big Data Processing using Apache Spark and Clojure

Only go on when your tests are green!

(deftest test-line-count*
  (let [conf (test-conf)]
    (spark/with-context sc conf
      (testing "no lines return 0"
        (is (= 0 (line-count* (spark/parallelize sc [])))))
      (testing "a single line returns 1"
        (is (= 1 (line-count* (spark/parallelize sc ["this is a single line"])))))
      (testing "multiple lines count correctly"
        (is (= 10 (line-count* (spark/parallelize sc (repeat 10 "this is a single line")))))))))

Page 32: Big Data Processing using Apache Spark and Clojure

What's an RDD? What's in it?

Take e.g. a JdbcRDD (we all know relational databases...):

Page 33: Big Data Processing using Apache Spark and Clojure

What's an RDD? What's in it?

Take e.g. a JdbcRDD (we all know relational databases...):

campaign_id   from         to           active
123           2014-01-01   2014-01-31   true
234           2014-01-06   2014-01-14   true
345           2014-02-01   2014-03-31   false
456           2014-02-10   2014-03-09   true

Page 34: Big Data Processing using Apache Spark and Clojure

What's an RDD? What's in it?

(Table as above.) That's your table.

Page 35: Big Data Processing using Apache Spark and Clojure

What's an RDD? What's in it?

(Table as above.) That's your table. In Clojure terms it becomes a vector of maps:

[{:campaign-id 123 :active true}
 {:campaign-id 234 :active true}
 {:campaign-id 345 :active false}
 {:campaign-id 456 :active true}]

Page 36: Big Data Processing using Apache Spark and Clojure

What's an RDD? What's in it?

(Table and vector of maps as above.) RDDs are lists of objects.

Page 37: Big Data Processing using Apache Spark and Clojure

What's an RDD? What's in it?

(Table and maps as above.) As key-value tuples, here shown split across two partitions:

[#t[123 {:campaign-id 123 :active true}]
 #t[234 {:campaign-id 234 :active true}]]

[#t[345 {:campaign-id 345 :active false}]
 #t[456 {:campaign-id 456 :active true}]]

Page 38: Big Data Processing using Apache Spark and Clojure

What's an RDD? What's in it?

(Table, maps and tuples as above.)

PairRDDs handle key-value pairs and may have partitioners assigned; keys are not necessarily unique! (See the sketch below.)
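
A small, hedged sketch of that last point (aliases and calling convention as in the surrounding slides; :clicks is a made-up field for illustration): duplicate keys are fine, and reduce-by-key merges them.

;; Sketch only: keys in a PairRDD need not be unique.
(def clicks-by-campaign
  (-> (s/parallelize sc [{:campaign-id 123 :clicks 2}   ;; :clicks is a made-up field
                         {:campaign-id 123 :clicks 5}
                         {:campaign-id 234 :clicks 1}])
      (s/map-to-pair (fn [e] (s/tuple (:campaign-id e) (:clicks e))))))
(-> clicks-by-campaign
    (s/reduce-by-key +)    ;; merges the two tuples with key 123
    (s/collect))           ;; => tuples for 123 -> 7 and 234 -> 1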

Page 39: Big Data Processing using Apache Spark and Clojure

In Practice 3: status codes

(defn parse-line [line]
  (some->> line
           (re-matches common-log-regex)
           rest
           (zipmap [:ip :timestamp :request :status :length :referer :ua :duration])
           transform-log-entry))

(defn group-by-status-code [lines]
  (->> lines
       (map parse-line)
       (map (fn [entry] [(:status entry) 1]))
       (reduce (fn [a [k v]] (update-in a [k] #((fnil + 0) % v))) {})
       (map identity)))

Page 40: Big Data Processing using Apache Spark and Clojure

In Practice 4: status codes cont'd

(parse-line and group-by-status-code as on the previous slide.)

(defn group-by-status-code* [lines]
  (-> lines
      (s/map parse-line)
      (s/map-to-pair (fn [entry] (s/tuple (:status entry) 1)))
      (s/reduce-by-key +)
      (s/map (sd/key-value-fn vector))
      (s/collect)))

Page 41: Big Data Processing using Apache Spark and Clojure

In Practice 5: details RDD

• Lazy evaluation is explicitly forced (see the sketch below)

• Transformations vs. actions

• Serialization of Clojure functions
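
A hedged sketch of the first two points, reusing parse-line from the status-code slides above: transformations only extend the lineage; nothing is read or parsed until an action forces evaluation.

;; Sketch only: transformations are lazy; actions force evaluation.
(def parsed (-> (s/text-file sc "in.log")
                (s/map parse-line)))   ;; lazy: no file has been read yet
(s/take parsed 1)                      ;; action: now the job actually runs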

Page 42: Big Data Processing using Apache Spark and Clojure

In Practice 6: data sources and destinations

• Writing to HDFS

• Reading from HDFS

• HDFS is versatile: text files, S3, Cassandra

• Parallelizing regular Clojure collections (see the sketch below)
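
A hedged sketch of these sources and sinks: reading text from HDFS, parallelizing a local collection, and writing back out. Sparkling's own wrapper for writing isn't shown on these slides, so the write goes through Java interop on the underlying RDD; the paths are made up.

;; Sketch only: reading, parallelizing, and writing out again.
(def from-hdfs  (s/text-file sc "hdfs:///data/in/part-*"))   ;; also works with s3n://... or file://...
(def from-local (s/parallelize sc ["one" "two" "three"]))    ;; regular Clojure collection -> RDD
(.saveAsTextFile from-local "hdfs:///data/out/words")        ;; interop: JavaRDD.saveAsTextFile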

Page 43: Big Data Processing using Apache Spark and Clojure

In Practice 7: top errors

(defn top-errors [lines] (->> lines (map parse-line) (filter (fn [entry] (not= "200" (:status entry)))) (map (fn [entry] [(:uri entry) 1])) (reduce (fn [a [k v]] (update-in a [k] #((fnil + 0) % v))) {}) (sort-by val >) (take 10)))

Page 44: Big Data Processing using Apache Spark and Clojure

In Practice 8: top errors cont'd

(defn top-errors* [lines] (-> lines (s/map parse-line) (s/filter (fn [entry] (not= "200" (:status entry)))) s/cache (s/map-to-pair (fn [entry] (s/tuple (:uri entry) 1))) (s/reduce-by-key +) ;; flip (s/map-to-pair (sd/key-value-fn (fn [a b] (s/tuple b a)))) (s/sort-by-key false) ;; descending order ;; flip (s/map-to-pair (sd/key-value-fn (fn [a b] (s/tuple b a)))) (s/map (sd/key-value-fn vector)) (s/take 10)))

Page 45: Big Data Processing using Apache Spark and Clojure

In Practice 9: caching

• enables data sharing

• avoids data (de)serialization

• performance degrades gracefully (see the sketch below)
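
A hedged sketch of why caching enables data sharing, reusing parse-line from the slides above: the parsed RDD is computed once, and later actions reuse the cached partitions (and if a cached partition is evicted, Spark recomputes it from the lineage).

;; Sketch only: cache an RDD that several actions reuse.
(def parsed (-> (s/text-file sc "in.log")
                (s/map parse-line)
                s/cache))                      ;; mark for caching
(s/count parsed)                               ;; first action computes and caches
(-> parsed
    (s/filter (fn [e] (= "200" (:status e))))
    (s/count))                                 ;; second action reuses the cached data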

Page 46: Big Data Processing using Apache Spark and Clojure

Why Use Clojure to Write Spark Jobs?

Page 47: Big Data Processing using Apache Spark and Clojure

Spark and Functional Programming

• Spark is inspired by FP

• Not surprising – Scala is a functional programming language

• RDDs are immutable values

• Resilience: caches can be discarded

• DAG of transformations

• Philosophically close to Clojure

Page 48: Big Data Processing using Apache Spark and Clojure

Processing RDDs

So your application

• defines (source) RDDs,

• transforms them (which creates new RDDs with dependencies on the source RDDs)

• and runs actions on them to get results back to the driver.

This defines a Directed Acyclic Graph (DAG) of operators.

Spark compiles this DAG of operators into a set of stages, where the boundary between two stages is a shuffle phase.

Each stage contains tasks, working on one partition each.

Page 49: Big Data Processing using Apache Spark and Clojure

Processing RDDs (prose as on the previous slide)

Example:

sc.textFile("/some-hdfs-data")                  // RDD[String]
  .map(line => line.split("\t"))                // RDD[List[String]]
  .map(parts => (parts[0], int(parts[1])))      // RDD[(String, Int)]
  .reduceByKey(_ + _, 3)                        // RDD[(String, Int)]
  .collect()                                    // Array[(String, Int)]

(Diagram: textFile → map → map → reduceByKey → collect.)

Page 50: Big Data Processing using Apache Spark and Clojure

Processing RDDs (prose and example as above)

Execution graph: textFile → map → map → reduceByKey → collect, compiled into two stages with the shuffle as the boundary between Stage 1 and Stage 2.

Page 51: Big Data Processing using Apache Spark and Clojure

Processing RDDs (prose and example as above)

Execution graph: textFile → map → map → reduceByKey → collect, split into two stages. (A Sparkling version of the same pipeline follows below.)

Stage 1: read the HDFS split, apply both maps, partial reduce, write shuffle data.

Stage 2: read shuffle data, final reduce, send result to driver.
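
For comparison, a hedged Sparkling sketch of the same pipeline (aliases and thread-first convention as in the earlier slides; clojure.string is assumed to be required, and reduceByKey's partition-count argument is left out).

;; Sketch only: textFile -> map -> map -> reduceByKey -> collect in Sparkling.
(-> (s/text-file sc "/some-hdfs-data")
    (s/map (fn [line] (clojure.string/split line #"\t")))
    (s/map-to-pair (fn [parts] (s/tuple (first parts)
                                        (Long/parseLong (second parts)))))
    (s/reduce-by-key +)
    (s/collect))   ;; only collect triggers execution of the two stages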

Page 52: Big Data Processing using Apache Spark and Clojure

Dynamic Types for Data Processing

• Clojure's strength: developer-friendly wrapper for a complex interior

• Static types everywhere

• Imperfect data

• For this use case, static typing can get in the way

• Jobs naturally represented as transformations of Clojure data structures

Page 53: Big Data Processing using Apache Spark and Clojure

Data Exploration

• Working in real time with big datasets

• Great for data mining

• Clojure's powerful REPL

• Gorilla REPL for live plotting?

Page 54: Big Data Processing using Apache Spark and Clojure

Summary: Why Spark(ling)

Data sharing: Hadoop is built for a single map-reduce pass; it needs to write intermediate results out to HDFS.

Interactive data exploration: Spark keeps data in memory, opening the possibility of interactively working with TBs of data.

Hadoop (and Hive and Pig) lacks an (easy) way to implement unit tests, so writing your own code is error-prone and the development cycle is slooooow.

Page 55: Big Data Processing using Apache Spark and Clojure

Practical tips

Page 56: Big Data Processing using Apache Spark and Clojure

Running your spark code

Run locally, e.g. inside tests: use "local" or "local[*]" as the Spark master.

Run on a cluster: either directly against Spark standalone or (our case) on top of YARN.

Both open a web interface on http://host:4040/.

Using the REPL: open a SparkContext, define RDDs and store them in vars, perform transformations on them. Develop stuff in the REPL, then transfer it into tests.

Run inside of tests: open a local SparkContext, feed it mock data, run the jobs. Therefore: design for testability!

Submit a Spark job using "spark-submit" with the proper arguments (see upload.sh, run.sh).

Page 57: Big Data Processing using Apache Spark and Clojure

Best Practices / Dos and Don'ts

Shuffling is very expensive, so try to avoid it:

• Never, ever let go of your Partitioner: this has a huuuuuuge performance impact. Use map-values instead of map, keep the partitioner when re-keying for a join, etc. (see the sketch below)

• Which boils down to: keep your execution plan slim.

There are some tricks for this, all boiling down to proper design of your data models.

Use broadcasting where necessary.

You need to monitor memory usage: the inability to keep stuff in memory causes spills to disk (e.g. while shuffling). This will kill you. Tune total memory and/or the cache/shuffle ratios.
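
A hedged sketch of the map-values tip (logs-rdd is a hypothetical RDD of log-entry maps; aliases as in the earlier slides): map-values keeps the pair RDD's partitioner, while rebuilding the tuples with map-to-pair throws it away.

;; Sketch only: prefer map-values when only the values change.
(def counts (-> logs-rdd                                        ;; hypothetical RDD of log-entry maps
                (s/map-to-pair (fn [e] (s/tuple (:campaign-id e) 1)))
                (s/reduce-by-key +)))                           ;; shuffle: counts now has a partitioner
;; keeps the partitioner (only the values change):
(def scaled   (-> counts (s/map-values (fn [n] (* 100 n)))))
;; loses the partitioner (keys are rebuilt, so a later join re-shuffles):
(def scaled-2 (-> counts
                  (s/map-to-pair (sd/key-value-fn (fn [k n] (s/tuple k (* 100 n)))))))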

Page 58: Big Data Processing using Apache Spark and Clojure

Example

Page 59: Big Data Processing using Apache Spark and Clojure

Example: Matrix Multiplication

• Repeatedly multiply sparse matrix and vector

(Diagram: Links (url, neighbors) and Ranks (url, rank) across iterations 1, 2, 3; the same file is read over and over.)

Page 60: Big Data Processing using Apache Spark and Clojure

Example: Matrix Multiplication (continued)

Spark can do much better:

• Using cache(), keep the neighbors in memory

• Do not write intermediate results to disk

(Diagram: Links (url, neighbors) joined with Ranks (url, rank) in every iteration; the same RDD is grouped over and over.)

Page 61: Big Data Processing using Apache Spark and Clojure

Example: Matrix Multiplication (continued)

Spark can do much better:

• Do not partition the neighbors every time (see the sketch below)

(Diagram: partitionBy the Links once; every subsequent join of Links and Ranks then happens on the same node.)
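
A hedged sketch of the last two builds (links-rdd and ranks-rdd are hypothetical pair RDDs of (url, neighbors) and (url, rank); the partition count of 8 is arbitrary). partitionBy, cache and join are called through Java interop on the underlying JavaPairRDD, since Sparkling's own wrappers for them aren't shown on these slides.

;; Sketch only: partition the links once, cache them, and reuse that layout
;; in every iteration so the big links RDD is not re-shuffled for each join.
(import 'org.apache.spark.HashPartitioner)

(def links (-> links-rdd                              ;; hypothetical (url, neighbors) pair RDD
               (.partitionBy (HashPartitioner. 8))    ;; partition once, up front
               (.cache)))                             ;; and keep it in memory
(def ranks-next (.join links ranks-rdd))              ;; per iteration: the join reuses the links' partitioner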

Page 62: Big Data Processing using Apache Spark and Clojure

Some anecdotes

Page 63: Big Data Processing using Apache Spark and Clojure

Why did I start gorillalabs/sparkling?

First, there was clj-spark from The Climate Corporation. Very basic, not maintained anymore.

Then I found out about flambo from yieldbot. It looked promising at first: fresh release, maybe used in production at yieldbot.

Small jobs were developed fast with Spark.

I ran into sooooo many problems (running on Spark Standalone, moving to YARN, fighting with low memory). Nothing to do with flambo, but with understanding the nuts and bolts of Spark, YARN and the other elements of my infrastructure. Ok, some with serializing my Clojure data structures.

Scaling up the amount of data led me directly into hell. My system was way slower than our existing solution. Was Spark the wrong way to go? I felt exactly like this guy: http://blog.explainmydata.com/2014/05/spark-should-be-better-than-mapreduce.html: "Spark should be better than MapReduce (if only it worked)"

After some thinking, I found out what happened: flambo promises to keep you in Clojure-land. Therefore, it uses a map operation to convert Scala Tuple2 to Clojure vectors and back again where necessary. But map loses your Partitioner information. Remember my point? So flambo broke Einstein's "as simple as possible, but no simpler".

I fixed the library and incorporated a different take on serializing functions (without reflection). That's when I released gorillalabs/sparkling.

I needed to tweak the data model to have the same partitioner all over the place, or use hand-crafted data structures and broadcasts for those parts not fitting my model. I ended up with code generating an index structure from an RDD, sorted tree sets for date-ranged data, and so forth. And everything is fully unit-tested, because that's the only way to go.

Now my system outperforms a much bigger MySQL-based system on a local master and scales almost linearly with the number of cores on a cluster. HURRAY!

Page 64: Big Data Processing using Apache Spark and Clojure

Having nrepl / GorillaREPL is so nice!

Having an nrepl open on my cluster is so nice, since I can inspect stuff in my computation. Ever wondered what that intermediate RDD contains? Just (spark/take rdd 10) it.

Using GorillaREPL, it’s like a visual workbench for big data analysis. See for yourself: http://bit.ly/1C7sSK4

Page 65: Big Data Processing using Apache Spark and Clojure

References

Page 66: Big Data Processing using Apache Spark and Clojure

Online

Sparkling: https://github.com/gorillalabs/sparkling

Flambo: https://github.com/yieldbot/flambo

flambo-example: https://github.com/pesterhazy/flambo-example

Page 67: Big Data Processing using Apache Spark and Clojure

References

http://lintool.github.io/SparkTutorial/ (where you can find the slides used in this presentation)

https://speakerdeck.com/ecepoi/apache-spark-at-viadeo

https://speakerdeck.com/ecepoi/viadeos-segmentation-platform-with-spark-on-mesos

https://speakerdeck.com/rxin/advanced-spark-at-spark-summit-2014

Page 68: Big Data Processing using Apache Spark and Clojure

Sources

Dean, J., & Ghemawat, S. (2008). MapReduce: simplified data processing on large clusters. Communications of the ACM, 51(1), 107-113.

Zaharia, M., Chowdhury, M., Das, T., Dave, A., Ma, J., McCauley, M., ... & Stoica, I. (2012, April). Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation (pp. 2-2). USENIX Association.

(Both available as PDFs)

Page 69: Big Data Processing using Apache Spark and Clojure

Questions?