Introduction to Spark - Durham LUG, 2015-09-16
TRANSCRIPT
Introduction to Apache Spark
www.mammothdata.com | @mammothdataco
The Leader in Big Data Consulting
● BI/Data Strategy
○ Development of a business intelligence / data architecture strategy.
● Installation
○ Installation of Hadoop or other relevant technology.
● Data Consolidation
○ Load data from diverse sources into a single scalable repository.
● Streaming
○ Mammoth will write ingestion and/or analytics which operate on the data as it comes in, as well as design dashboards, feeds, or computer-driven decision-making processes to derive insights and make decisions.
● Visualization Tools
○ Mammoth will set up visualization tools (e.g. Tableau, Pentaho, etc.). We will also create initial reports and provide training to the employees who will analyze the data.
Mammoth Data, based in downtown Durham (right above Toast)
● Lead Consultant on all things DevOps and Spark
● @carsondial
Me!
● Apache Spark™ is a fast and general engine for large-scale data processing
● Not all that helpful, is it?
What Is Apache Spark?!
● Framework for massively parallel (cluster) computing
● Harnesses the power of cheap memory
● Directed Acyclic Graph (DAG) computing engine
● It goes very fast!
● Apache Project (spark.apache.org)
What Is Apache Spark?! No, But Really…
● Performance
● Developer productivity
Why Spark?
● Graysort benchmark (100TB)
● Hadoop - 72 minutes / 2100 nodes / datacentre
● Spark - 23 minutes / 206 nodes / AWS
● HDFS versus Memory
Performance!
● First-class support for Scala, Java, Python, and R!
● Data Science friendly
Developers!
Word Count: Hadoop
from pyspark import SparkContext

logFile = "hdfs:///input"
sc = SparkContext("spark://spark-m:7077", "WordCount")

textFile = sc.textFile(logFile)

# Split each line into words, pair each word with a count of 1,
# then sum the counts for each word.
wordCounts = textFile.flatMap(lambda line: line.split()) \
                     .map(lambda word: (word, 1)) \
                     .reduceByKey(lambda a, b: a + b)

wordCounts.saveAsTextFile("hdfs:///output")
Word Count: Spark
● Spark Streaming
● GraphX (graph algorithms)
● MLlib (machine learning)
● Dataframes (data access)
Spark: Batteries Included
● Analytics (batch / streaming)
● Machine Learning
● ETL (Extract - Transform - Load)
● …and many more!
Applications
● RDD = Resilient Distributed Dataset
● Immutable, Fault-tolerant
● Operated on in parallel
● Can be created manually or from external sources (sketch below)
RDDs – The Building Block
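As a minimal sketch (local master and paths are illustrative), here are both ways to create an RDD:

from pyspark import SparkContext

sc = SparkContext("local", "RDDCreation")

# Created manually, from an in-memory collection
numbers = sc.parallelize([1, 2, 3, 4, 5])

# Created from an external source (any Hadoop-compatible path)
lines = sc.textFile("hdfs:///input")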
● Transformations
● Actions
● Transformations are lazy
● Actions evaluate the pipeline of transformations and then perform the action (see the sketch below)
RDDs – The Building Block
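A minimal sketch of lazy evaluation (the data is illustrative): the transformations return immediately, and only the final action triggers computation.

from pyspark import SparkContext

sc = SparkContext("local", "LazyDemo")
rdd = sc.parallelize(range(1, 1001))

# Transformations: nothing runs yet; Spark only records the lineage.
evens = rdd.filter(lambda x: x % 2 == 0)
squares = evens.map(lambda x: x * x)

# Action: the whole filter -> map -> sum pipeline executes now.
print(squares.sum())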
● map()
● filter()
● pipe()
● sample()
● …and more!
RDDs – Example Transformations
● reduce()
● count()
● take()
● saveAsTextFile()
● …and yes, more
RDDs – Example Actions
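A small sketch (data illustrative) exercising a few of the transformations and actions listed above:

from pyspark import SparkContext

sc = SparkContext("local", "OpsDemo")
rdd = sc.parallelize(range(10))

doubled = rdd.map(lambda x: x * 2)                # transformation
multiples = doubled.filter(lambda x: x % 4 == 0)  # transformation
sampled = rdd.sample(False, 0.5)                  # transformation: ~50% sample, no replacement

print(multiples.count())                          # action
print(multiples.take(3))                          # action
print(rdd.reduce(lambda a, b: a + b))             # action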
from pyspark import SparkContext

logFile = "hdfs:///input"
sc = SparkContext("spark://spark-m:7077", "WordCount")

textFile = sc.textFile(logFile)

# Split each line into words, pair each word with a count of 1,
# then sum the counts for each word.
wordCounts = textFile.flatMap(lambda line: line.split()) \
                     .map(lambda word: (word, 1)) \
                     .reduceByKey(lambda a, b: a + b)

wordCounts.saveAsTextFile("hdfs:///output")
Word Count: Spark
● cache() / persist()
● When an action is performed for the first time, the result is kept in memory
● Different levels of persistence available
RDDs – cache()
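A hedged sketch of caching (input path illustrative): the second count() reuses the in-memory result instead of re-reading the file.

from pyspark import SparkContext, StorageLevel

sc = SparkContext("local", "CacheDemo")
words = sc.textFile("hdfs:///input").flatMap(lambda line: line.split())

words.cache()                                   # default level keeps the RDD in memory
# words.persist(StorageLevel.MEMORY_AND_DISK)  # one of the other persistence levels

print(words.count())   # first action computes and caches the RDD
print(words.count())   # second action reads from the cache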
● Micro-batches (DStreams of RDDs)
● Access to other parts of Spark (MLlib, GraphX, Dataframes)
● Fault-tolerant
● Connectors to Kafka, Flume, Kinesis, ZeroMQ
● (we’ll come back to this)
Streaming
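A minimal Spark Streaming sketch, assuming text arriving on a local socket (host, port, and batch interval are illustrative):

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "StreamingDemo")
ssc = StreamingContext(sc, 1)   # 1-second micro-batches

lines = ssc.socketTextStream("localhost", 9999)
counts = lines.flatMap(lambda line: line.split()) \
              .map(lambda word: (word, 1)) \
              .reduceByKey(lambda a, b: a + b)
counts.pprint()   # print each micro-batch's counts

ssc.start()
ssc.awaitTermination()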
● Spark SQL
● Support for JSON, Cassandra, SQL databases, etc.
● Easier syntax than RDDs
● Dataframes ‘borrowed’ from Python/R
● Catalyst query planner
Dataframes
val sc = new SparkContext()
val sqlContext = new org.apache.spark.sql.SQLContext(sc)

// Load a Dataframe from a JSON file
val df = sqlContext.read.json("people.json")

df.show()
df.filter(df("age") >= 35).show()
df.groupBy("age").count().show()
Dataframes: Example
● Optimizing query planning for Spark
● Takes Dataframe operations and ‘compiles’ them down to RDD operations
● Often faster than writing RDD code manually
● Use Dataframes whenever possible (v1.4+)
Dataframes: Catalyst
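To see what Catalyst produces, you can ask a Dataframe query for its plan. A hedged sketch (file name illustrative):

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext("local", "CatalystDemo")
sqlContext = SQLContext(sc)
df = sqlContext.read.json("people.json")

# explain(True) prints the logical and physical plans Catalyst compiled
df.filter(df["age"] >= 35).explain(True)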
● Standalone
● YARN (Hadoop ecosystem)
● Mesos (Hipster ecosystem)
Deploying Spark
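As a hedged sketch (URLs illustrative, Spark 1.x syntax), the master URL passed to SparkContext is what selects the cluster manager:

from pyspark import SparkContext

# Standalone cluster manager
sc = SparkContext("spark://spark-m:7077", "App")

# YARN:  SparkContext("yarn-client", "App")
# Mesos: SparkContext("mesos://mesos-master:5050", "App")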
● Spark-Shell
● Zeppelin
Demos
● Spark Streaming is not ‘pure’ streaming
● For low-latency requirements, use Storm
● Still immature in some ways
● Come to my All Things Open talk to learn more!
Spark for Everything?
● http://www.meetup.com/Triangle-Apache-Spark-Meetup/
● Next meeting likely to be in late October
Triangle Apache Spark Meetup Group
● spark.apache.org
● databricks.com
● zeppelin.incubator.apache.org
● mammothdata.com/white-papers/spark-a-modern-tool-for-big-data-applications
Links
● Questions for you! (for a $15 Digital Ocean voucher)
1. What is an RDD?
2. What’s the difference between a transformation and an action?
3. When wouldn’t you use Spark Streaming?
Questions?