Harnessing the Power of Spark + Cassandra with Groovy

Steve Pember, CTO, ThirdChannel

Gr8Conf US, 2017

@svpember

Relational Databases are Fantastic

SQL makes you Strong

Agenda

• Spark

• Cassandra

• Spark + Cassandra

• Working with Spark + Cassandra

• Demo

Apache Spark

• Distributed Execution Engine

• What about Hadoop?

Hadoop

• Map / Reduce

• Storage via HDFS

• Each calculation step written to disk

Spark

• More than Map/Reduce

• No dependent storage mechanism

• Clustered calculations, each step in memory

• Creation was a Happy Accident

• Architecture

Your Groovy App

• Programmatic structure

The SparkContext submits Jobs to the Cluster

Operations are performed against RDDs

Resilient Distributed Dataset (RDD)

• Immutable

• Partitioned

• Parallel operations

• Created by performing operations on other RDDs

• Reusable & Composable

• APIs

More Than Map/Reduce

RDD operations

• map

• reduce

• filter

• flatMap

• zip

• groupBy

• … plus many more
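
To make the operation names concrete, here is a minimal sketch that uses `java.util.stream` as a local stand-in for an RDD. This is not the Spark API: it only mirrors the shape of flatMap, map, filter, reduce, and groupBy on an in-memory list. The class name `RddOpsSketch` and the sample data are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Local stand-in for RDD-style operations using Java streams (NOT the Spark API).
final class RddOpsSketch {

    // flatMap: split each line into words; map: lower-case each word;
    // filter: drop empty strings.
    static List<String> words(List<String> lines) {
        return lines.stream()
                .flatMap(line -> Arrays.stream(line.split("\\s+")))
                .map(String::toLowerCase)
                .filter(w -> !w.isEmpty())
                .collect(Collectors.toList());
    }

    // map + reduce: total number of characters across all words.
    static int totalLength(List<String> words) {
        return words.stream().map(String::length).reduce(0, Integer::sum);
    }

    // groupBy: count how often each word occurs.
    static Map<String, Long> counts(List<String> words) {
        return words.stream()
                .collect(Collectors.groupingBy(w -> w, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("Spark and Cassandra", "spark with Groovy");
        List<String> ws = words(lines);
        System.out.println(ws);
        System.out.println(totalLength(ws));
        System.out.println(counts(ws));
    }
}
```

Each step returns a new collection rather than modifying its input, which loosely echoes how operations on an RDD produce new RDDs.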

• Additional Modules

Spark SQL…!

Spark Streaming!

Agenda

• Spark

• Cassandra

Apache Cassandra (C*)

• NoSQL Datastore

• Distributed

Deterministic Distribution
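
Deterministic distribution means the node that owns a row can be computed from its partition key alone. Below is a minimal sketch of that idea, assuming a simplified token ring: the `TokenRingSketch` class, the node names, and the token values are hypothetical, and `String.hashCode()` stands in for the hash a real cluster's partitioner would compute.

```java
import java.util.Map;
import java.util.TreeMap;

// Simplified token ring: illustrates the idea only, not Cassandra's implementation.
final class TokenRingSketch {
    // token (upper bound of a range) -> owning node
    private final TreeMap<Integer, String> ring = new TreeMap<>();

    void addNode(String node, int token) {
        ring.put(token, node);
    }

    // Hash the partition key to a token, then pick the first node whose token
    // is >= the key's token; wrap around to the first node if there is none.
    String ownerOf(String partitionKey) {
        if (ring.isEmpty()) {
            throw new IllegalStateException("ring has no nodes");
        }
        int token = partitionKey.hashCode(); // stand-in for the partitioner hash
        Map.Entry<Integer, String> entry = ring.ceilingEntry(token);
        return (entry != null ? entry : ring.firstEntry()).getValue();
    }

    public static void main(String[] args) {
        TokenRingSketch ring = new TokenRingSketch();
        ring.addNode("node-a", Integer.MIN_VALUE / 2);
        ring.addNode("node-b", 0);
        ring.addNode("node-c", Integer.MAX_VALUE / 2);
        // The same key always resolves to the same node.
        System.out.println(ring.ownerOf("user-42"));
        System.out.println(ring.ownerOf("user-42"));
    }
}
```

Because the lookup depends only on the key and the ring layout, any client or node can work out where a row lives without consulting a central directory.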

• High Replication

• High Durability

• Linear Scalability

Each new node results in increased storage with no loss in performance

• Data Model (CQL)

Wide-Column (Column Family) Database

But it’s SQL-like!

Querying

C* Querying

• select * from

• all queries must include the partition key(s) in the where clause

• order by is limited to clustering keys

• keys cannot be altered; queries must always use the same keys

• Designing your Data Model

Agenda

• Spark

• Cassandra


Spark + Cassandra

• Offset each other’s weaknesses

• Filter on the server side (with C*)

• Join tables, filter results (with Spark)

Companies have been formed around this pairing

Cluster Design

Data Locality!

Pipeline architecture

Agenda

• Spark

• Cassandra



Coding Spark + C*

Terminology

• SparkConf

• JavaSparkContext

• JavaFunctions

• Mappers

Spark Conf

• spark.master -> URL of the master node

• spark.app.name -> the name your client shows up under in the Spark UI

• spark.executor.memory -> memory limit per executor on the workers

• spark.executor.cores -> cores per executor on each worker (need to share with C*!)

• spark.submit.deployMode -> ‘client’ or ‘cluster’

• spark.jars.packages -> Maven coordinates (groupId:artifactId:version) of extra jars

• spark.jars.ivy -> specify custom repos for packages

• more at: http://spark.apache.org/docs/latest/configuration.html#available-properties
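
The properties listed above are plain key/value strings. As a rough sketch, they can be collected in a map before being handed to Spark; in a real application each pair would be passed to `SparkConf.set(key, value)`. The class name `SparkSettingsSketch` and every value shown are placeholder assumptions, not recommendations.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Collects Spark property names and values; does not depend on Spark itself.
final class SparkSettingsSketch {

    static Map<String, String> settings(String masterUrl, String appName) {
        Map<String, String> conf = new LinkedHashMap<>();
        conf.put("spark.master", masterUrl);            // where the cluster master lives
        conf.put("spark.app.name", appName);            // label shown in the Spark UI
        conf.put("spark.executor.memory", "2g");        // placeholder value
        conf.put("spark.executor.cores", "2");          // placeholder: leave cores for C*
        conf.put("spark.submit.deployMode", "client");  // 'client' or 'cluster'
        return conf;
    }

    public static void main(String[] args) {
        // In a Spark application: settings.forEach(sparkConf::set)
        settings("local[*]", "demo-app")
                .forEach((key, value) -> System.out.println(key + "=" + value));
    }
}
```

Keeping the settings in one place like this makes it easier to swap the master URL between a local run and a cluster run.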



Master URL Overloading

• “local” -> run Spark locally, in-process, with one thread

• “local[<K>]” -> run Spark locally with K threads

• “local[*]” -> run Spark locally with ALL YOUR THREADS (one per logical core)!

• “spark://<host string>:<port>” -> URL of a Spark cluster master node, using Spark’s standalone cluster manager

• also options for Mesos and YARN
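
The overloaded master string can be read as a small grammar. The sketch below classifies the forms listed above; it is an illustration only, not Spark's own parsing, and Spark accepts more forms than it handles. The class name `MasterUrlSketch` and the returned descriptions are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Classifies the master URL forms from the slide; not Spark's actual parser.
final class MasterUrlSketch {
    private static final Pattern LOCAL_N = Pattern.compile("local\\[(\\d+|\\*)\\]");
    private static final Pattern STANDALONE = Pattern.compile("spark://([^:/]+):(\\d+)");

    static String describe(String master) {
        if (master.equals("local")) {
            return "local mode, 1 thread";
        }
        Matcher local = LOCAL_N.matcher(master);
        if (local.matches()) {
            String k = local.group(1);
            return k.equals("*")
                    ? "local mode, one thread per available core"
                    : "local mode, " + k + " threads";
        }
        Matcher cluster = STANDALONE.matcher(master);
        if (cluster.matches()) {
            return "cluster master at " + cluster.group(1) + ":" + cluster.group(2);
        }
        // Mesos, YARN, and anything else fall through here.
        return "other cluster manager or unrecognized: " + master;
    }

    public static void main(String[] args) {
        for (String m : new String[] {"local", "local[4]", "local[*]", "spark://host1:7077"}) {
            System.out.println(m + " -> " + describe(m));
        }
    }
}
```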

However, a Warning

But where does my code live?

CLASS_PATH: org.apache.spark, com.fasterxml.jackson, com.yourco.yourapp.pojos.*


com.fasterxml.jackson


com.fasterxml.jackson

Agenda

• Spark

• Cassandra



• Demo

Thank You!

Links

• Cassandra on AWS official whitepaper: https://d0.awsstatic.com/whitepapers/Cassandra_on_AWS.pdf

• Demo code: https://github.com/spember/ratpack-spark-cassandra-demo

Images

• Database Sharding: https://dzone.com/articles/ebay-secret-database-scaling

• Indiana Jones Warehouse: http://logisticalfictions.tumblr.com/page/9

• Strong (Spongebob): http://www.reactiongifs.com/strongbob/?utm_source=rss&utm_medium=rss&utm_campaign=strongbob

• Cheetah: http://www.livescience.com/21944-usain-bolt-vs-cheetah-animal-olympics.html

• Big Data Cartoon: http://www.kdnuggets.com/2016/08/cartoon-make-data-great-again.html

• Spark Streaming: http://velvia.github.io/presentations/2015-filodb-spark-streaming/#/

• Picard + Riker: http://www.douxreviews.com/2015/09/star-trek-next-generation-matter-of.html

• Software Engineers: http://pyxurz.blogspot.com/2011/10/office-space-page-2-of-6.html