Harnessing the Power of Spark + Cassandra with Groovy
Steve Pember, CTO, ThirdChannel
Gr8Conf US, 2017
@svpember
Relational Databases are Fantastic
SQL makes you Strong
Agenda
• Spark
• Cassandra
• Spark + Cassandra
• Working with Spark + Cassandra
• Demo
Apache Spark
• Distributed Execution Engine
Apache Spark
• Distributed Execution Engine
• What about Hadoop?
Hadoop
• Map / Reduce
• Storage via HDFS
• Each calculation step written to disk
Spark
• More than Map/Reduce
• No dependent storage mechanism
• Clustered Calculations, each step in memory
Apache Spark
• Distributed Execution Engine
• What about Hadoop?
• Creation was a Happy Accident
Apache Spark
• Distributed Execution Engine
• What about Hadoop?
• Creation was a Happy Accident
• Architecture
Your Groovy App
Apache Spark
• Distributed Execution Engine
• What about Hadoop?
• Creation was a Happy Accident
• Architecture
• Programmatic structure
The SparkContext submits Jobs to the Cluster
Operations are performed against RDDs
Resilient Distributed Dataset
• Immutable
• Partitioned
• Parallel operations
• Created by performing operations on other RDDs
• Reusable & Composable
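The properties above can be made concrete with a toy, single-JVM model. This is only a sketch: `MiniRdd` and its methods are hypothetical names invented for illustration, not Spark classes, and the "partitions" are plain in-memory lists rather than slices held by cluster workers. It shows that every operation returns a new dataset and leaves its parent untouched, and that work is done per partition.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BinaryOperator;
import java.util.function.Function;
import java.util.function.Predicate;
import java.util.stream.Collectors;

/** Toy, single-JVM stand-in for an RDD. Hypothetical class, not part of Spark. */
final class MiniRdd<T> {
    // Partitioned: each inner list stands for the slice one worker would hold.
    private final List<List<T>> partitions;

    private MiniRdd(List<List<T>> partitions) {
        this.partitions = partitions;
    }

    /** Copies the input into contiguous slices, at most numPartitions of them. */
    static <T> MiniRdd<T> parallelize(List<T> data, int numPartitions) {
        if (numPartitions < 1) throw new IllegalArgumentException("numPartitions must be >= 1");
        int chunk = (data.size() + numPartitions - 1) / numPartitions;
        List<List<T>> parts = new ArrayList<>();
        for (int start = 0; start < data.size(); start += chunk) {
            int end = Math.min(start + chunk, data.size());
            parts.add(new ArrayList<>(data.subList(start, end)));
        }
        return new MiniRdd<>(parts);
    }

    /** Immutable: builds a new MiniRdd, processing partitions in parallel. */
    <R> MiniRdd<R> map(Function<T, R> f) {
        List<List<R>> out = partitions.parallelStream()
            .map(p -> p.stream().map(f).collect(Collectors.toList()))
            .collect(Collectors.toList());
        return new MiniRdd<>(out);
    }

    /** Also returns a new MiniRdd; the receiver is not modified. */
    MiniRdd<T> filter(Predicate<T> keep) {
        List<List<T>> out = partitions.parallelStream()
            .map(p -> p.stream().filter(keep).collect(Collectors.toList()))
            .collect(Collectors.toList());
        return new MiniRdd<>(out);
    }

    /** Reduces each partition, then combines the partial results. */
    T reduce(T identity, BinaryOperator<T> op) {
        return partitions.parallelStream()
            .map(p -> p.stream().reduce(identity, op))
            .reduce(identity, op);
    }

    /** Gathers all partitions back into one list, in partition order. */
    List<T> collect() {
        return partitions.stream().flatMap(List::stream).collect(Collectors.toList());
    }
}
```

Chaining `filter` and `map` on one `MiniRdd` yields a second, independent dataset, which is the "created by performing operations on other RDDs" and "reusable & composable" idea in miniature.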
Apache Spark
• Distributed Execution Engine
• What about Hadoop?
• Creation was a Happy Accident
• Architecture
• Programmatic structure
• APIs
More Than Map/Reduce
RDD operations
• map
• reduce
• filter
• flatMap
• zip
• groupBy
• … plus many more
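For a feel of how these operations compose, here is a local analogy written with `java.util.stream` rather than Spark itself, so it runs with nothing but a JDK. Spark's Java API exposes similarly named methods on `JavaRDD` (`map`, `filter`, `flatMap`, `reduce`, `groupBy`, `zip`), but the class and method names below (`OperationsSketch`, `wordCounts`, `totalLength`) are made up for this sketch.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

/** Local analogy only: the same operation names applied to an in-memory list. */
final class OperationsSketch {

    /** flatMap + filter + map + groupBy: count how often each word occurs. */
    static Map<String, Long> wordCounts(List<String> lines) {
        return lines.stream()
            .flatMap(line -> Arrays.stream(line.split("\\s+"))) // one line -> many words
            .filter(word -> !word.isEmpty())                    // drop empty tokens
            .map(String::toLowerCase)                           // normalise each word
            .collect(Collectors.groupingBy(                     // group equal words, count each group
                word -> word, TreeMap::new, Collectors.counting()));
    }

    /** map + reduce: total number of characters across all lines. */
    static int totalLength(List<String> lines) {
        return lines.stream()
            .map(String::length)
            .reduce(0, Integer::sum);
    }
}
```

As with streams, Spark transformations such as `map` and `filter` are lazy and only run when an action such as `reduce` asks for a result.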
Apache Spark
• Distributed Execution Engine
• What about Hadoop?
• Creation was a Happy Accident
• Architecture
• Programmatic structure
• APIs
• Additional Modules
Spark SQL…!
JDBC?
Spark Streaming!
Agenda
• Spark
• Cassandra
Apache Cassandra (C*)
• NoSQL Datastore
Apache Cassandra (C*)
• NoSQL Datastore
• Distributed
Deterministic Distribution
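"Deterministic" here means that the node owning a row can be computed from its partition key alone: the key is hashed to a token, and the token falls in a range owned by one node on a ring. A minimal sketch of that idea follows. The class `TokenRingSketch`, the node names, and the token values are all hypothetical, and CRC32 is used only as a convenient stand-in hash (Cassandra's default partitioner uses Murmur3).

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.TreeMap;
import java.util.zip.CRC32;

/** Toy token ring. Hypothetical class; not Cassandra's implementation. */
final class TokenRingSketch {
    // token -> node name; a node owns every token up to and including its own
    private final TreeMap<Long, String> ring = new TreeMap<>();

    void addNode(String name, long token) {
        ring.put(token, name);
    }

    /** Stand-in hash: CRC32 of the key bytes, a value in [0, 2^32 - 1]. */
    static long token(String partitionKey) {
        CRC32 crc = new CRC32();
        crc.update(partitionKey.getBytes(StandardCharsets.UTF_8));
        return crc.getValue();
    }

    /**
     * First node whose token is >= the key's token, wrapping around to the
     * lowest token. Assumes at least one node has been added.
     */
    String nodeFor(String partitionKey) {
        Map.Entry<Long, String> owner = ring.ceilingEntry(token(partitionKey));
        return (owner != null ? owner : ring.firstEntry()).getValue();
    }
}
```

Because the lookup is a pure function of the key and the ring, any client or node can work out where a row lives without asking a central coordinator.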
Apache Cassandra (C*)
• NoSQL Datastore
• Distributed
• High Replication
Apache Cassandra (C*)
• NoSQL Datastore
• Distributed
• High Replication
• High Durability
Apache Cassandra (C*)
• NoSQL Datastore
• Distributed
• High Replication
• High Durability
• Linear Scalability
Each new Node results in increased Storage with no loss in performance
Apache Cassandra (C*)
• NoSQL Datastore
• Distributed
• High Replication
• High Durability
• Linear Scalability
• Data Model (CQL)
Column Oriented Database
But it’s SQL-like!
Querying
C* Querying
• select * from
• all queries must include the partition key(s) in the where clause
• order by is limited to clustering keys
• keys cannot be altered after table creation; queries must always be by the same keys
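These restrictions follow from how rows are laid out. A toy model, assuming a hypothetical table of sensor readings with the sensor id as partition key and a timestamp as the key the rows are sorted by within a partition (the clustering key, in CQL terms): the partition key locates the data, and ordering is only cheap along the key the rows are already sorted by. `TableSketch` is an invented name, not a driver class.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

/** Toy model of one table: partition key -> rows sorted by clustering key. */
final class TableSketch {
    private final Map<String, NavigableMap<Long, String>> partitions = new HashMap<>();

    /** Rows are kept sorted by clustering key inside their partition. */
    void insert(String partitionKey, long clusteringKey, String value) {
        partitions.computeIfAbsent(partitionKey, k -> new TreeMap<>())
                  .put(clusteringKey, value);
    }

    /**
     * A read has to name the partition, since that is the only index there is.
     * Results can be returned in clustering-key order, ascending or descending,
     * because that is how they are stored; no other ordering is available.
     */
    NavigableMap<Long, String> select(String partitionKey, boolean descending) {
        NavigableMap<Long, String> rows =
            partitions.getOrDefault(partitionKey, new TreeMap<>());
        return descending ? rows.descendingMap() : rows;
    }
}
```

Note that the model has no method for "all rows where value = x": answering that would mean scanning every partition on every node, which is the kind of query the bullets above rule out.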
Apache Cassandra (C*)
• NoSQL Datastore
• Distributed
• High Replication
• High Durability
• Linear Scalability
• Data Model (CQL)
• Designing your Data Model
Agenda
• Spark
• Cassandra
• Spark + Cassandra
Spark + Cassandra
• Reduce each other’s weaknesses
• Filter on the server side (with C*)
• Join tables, filter results (with Spark)
Companies have been formed
Cluster Design
Data Locality!
Pipeline architecture
Agenda
• Spark
• Cassandra
• Spark + Cassandra
• Working with Spark + Cassandra
Coding Spark + C*
Terminology
• SparkConf
• JavaSparkContext
• JavaFunctions
• Mappers
Spark Conf
• spark.master -> URL of the master node
• spark.app.name -> want to see your client show up in the Spark UI?
• spark.executor.memory -> limits memory per executor on workers
• spark.executor.cores -> limits cores per executor on each worker (need to share with C*!)
• spark.submit.deployMode -> ‘client’ or ‘cluster’
• spark.jars.packages -> Maven coordinates (group:artifact:version), as used in Maven / Gradle
• spark.jars.ivy -> Ivy user directory, used as the local cache for packages (custom repos are set with spark.jars.repositories in newer Spark versions)
• more at: http://spark.apache.org/docs/latest/configuration.html#available-properties
Master URL Overloading
• “local” -> run Spark locally (no cluster), with one thread
• “local[<K>]” -> run Spark locally, with K threads
• “local[*]” -> run Spark locally, with ALL YOUR THREADS (one per logical core)!
• “spark://<host string>:<port>” -> URL of a Spark cluster master node, using Spark’s own (standalone) cluster management
• also options for Mesos and YARN
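To make the "overloading" concrete, the sketch below interprets a master string the way the list above describes it. It is illustrative only: `MasterUrlSketch.describe` is a hypothetical helper, not Spark's own parsing code, and the host name in the usage note is made up.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative reading of master URL strings; not Spark's actual parser. */
final class MasterUrlSketch {
    // Matches local[K] where K is a number, or local[*]
    private static final Pattern LOCAL_N = Pattern.compile("local\\[(\\d+|\\*)\\]");

    static String describe(String master) {
        if (master.equals("local")) {
            return "local, 1 thread";
        }
        Matcher m = LOCAL_N.matcher(master);
        if (m.matches()) {
            String k = m.group(1);
            // "*" means one thread per logical core on this machine
            int threads = k.equals("*")
                ? Runtime.getRuntime().availableProcessors()
                : Integer.parseInt(k);
            return "local, " + threads + " threads";
        }
        if (master.startsWith("spark://")) {
            return "cluster master at " + master.substring("spark://".length());
        }
        // e.g. Mesos or YARN masters
        return "other cluster manager: " + master;
    }
}
```

For example, `describe("local[4]")` returns `"local, 4 threads"` and `describe("spark://host1:7077")` returns `"cluster master at host1:7077"` (7077 is the default port of a standalone master).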
However, a Warning
But where does my code live?
CLASS_PATH: org.apache.spark, com.fasterxml.jackson, com.yourco.yourapp.pojos.*
CLASS_PATH: org.apache.spark, com.fasterxml.jackson
CLASS_PATH: org.apache.spark, com.fasterxml.jackson
Agenda
• Spark
• Cassandra
• Spark + Cassandra
• Working with Spark + Cassandra
• Demo
Thank You!
Links
• Cassandra on AWS official Whitepaper: https://d0.awsstatic.com/whitepapers/Cassandra_on_AWS.pdf
• Demo code: https://github.com/spember/ratpack-spark-cassandra-demo
Images
• Database Sharding: https://dzone.com/articles/ebay-secret-database-scaling
• Indiana Jones Warehouse: http://logisticalfictions.tumblr.com/page/9
• Strong (Spongebob): www.reactiongifs.com/strongbob/?utm_source=rss&utm_medium=rss&utm_campaign=strongbob
• Cheetah: www.livescience.com/21944-usain-bolt-vs-cheetah-animal-olympics.html
• Big Data Cartoon: http://www.kdnuggets.com/2016/08/cartoon-make-data-great-again.html
• Spark Streaming: http://velvia.github.io/presentations/2015-filodb-spark-streaming/#/
• Picard + Riker: http://www.douxreviews.com/2015/09/star-trek-next-generation-matter-of.html
• Software Engineers: http://pyxurz.blogspot.com/2011/10/office-space-page-2-of-6.html