Introduction to Spark - Durham LUG, 2015-09-16
TRANSCRIPT
Introduction to Apache Spark
www.mammothdata.com | @mammothdataco
The Leader in Big Data Consulting
● BI/Data Strategy
○ Development of a business intelligence / data architecture strategy.
● Installation
○ Installation of Hadoop or other relevant technology.
● Data Consolidation
○ Load data from diverse sources into a single scalable repository.
● Streaming
○ Mammoth will write ingestion and/or analytics which operate on the data as it comes in, as well as design dashboards, feeds, or computer-driven decision-making processes to derive insights and make decisions.
● Visualization Tools
○ Mammoth will set up visualization tools (e.g. Tableau, Pentaho, etc.). We will also create initial reports and provide training to the employees who will analyze the data.
Mammoth Data, based in downtown Durham (right above Toast)
● Lead Consultant on all things DevOps and Spark
● @carsondial
Me!
● Apache Spark™ is a fast and general engine for large-scale data processing
● Not all that helpful, is it?
What Is Apache Spark?!
● Framework for massively parallel (cluster) computing
● Harnesses the power of cheap memory
● Directed Acyclic Graph (DAG) computing engine
● It goes very fast!
● Apache Project (spark.apache.org)
What Is Apache Spark?! No, But Really…
● Performance
● Developer productivity
Why Spark?
● Graysort benchmark (100TB)
● Hadoop - 72 minutes / 2100 nodes / datacentre
● Spark - 23 minutes / 206 nodes / AWS
● HDFS versus Memory
Performance!
● First-class support for Scala, Java, Python, and R!
● Data Science friendly
Developers!
Word Count: Hadoop
from pyspark import SparkContext

logFile = "hdfs:///input"
sc = SparkContext("spark://spark-m:7077", "WordCount")

textFile = sc.textFile(logFile)

# Split each line into words, pair each word with a count of 1,
# then sum the counts for each word.
wordCounts = textFile.flatMap(lambda line: line.split()) \
                     .map(lambda word: (word, 1)) \
                     .reduceByKey(lambda a, b: a + b)

wordCounts.saveAsTextFile("hdfs:///output")
Word Count: Spark
● Spark Streaming
● GraphX (graph algorithms)
● MLlib (machine learning)
● Dataframes (data access)
Spark: Batteries Included
● Analytics (batch / streaming)
● Machine Learning
● ETL (Extract - Transform - Load)
● …and many more!
Applications
● RDD = Resilient Distributed Dataset
● Immutable, Fault-tolerant
● Operated on in parallel
● Can be created manually or from external sources (sketch below)
RDDs – The Building Block
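As a minimal sketch (local master and paths are illustrative), here are both ways to create an RDD:

from pyspark import SparkContext

sc = SparkContext("local", "RDDCreation")

# Created manually, from an in-memory collection
numbers = sc.parallelize([1, 2, 3, 4, 5])

# Created from an external source (any Hadoop-compatible path)
lines = sc.textFile("hdfs:///input")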
● Transformations
● Actions
● Transformations are lazy
● Actions evaluate the pipeline of transformations and then perform the action (see the sketch below)
RDDs – The Building Block
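A minimal sketch of lazy evaluation (the data is illustrative): the transformations return immediately, and only the final action triggers computation.

from pyspark import SparkContext

sc = SparkContext("local", "LazyDemo")
rdd = sc.parallelize(range(1, 1001))

# Transformations: nothing runs yet; Spark only records the lineage.
evens = rdd.filter(lambda x: x % 2 == 0)
squares = evens.map(lambda x: x * x)

# Action: the whole filter -> map -> sum pipeline executes now.
print(squares.sum())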
● map()
● filter()
● pipe()
● sample()
● …and more!
RDDs – Example Transformations
● reduce()
● count()
● take()
● saveAsTextFile()
● …and yes, more
RDDs – Example Actions
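A small sketch (data illustrative) exercising a few of the transformations and actions listed above:

from pyspark import SparkContext

sc = SparkContext("local", "OpsDemo")
rdd = sc.parallelize(range(10))

doubled = rdd.map(lambda x: x * 2)                # transformation
multiples = doubled.filter(lambda x: x % 4 == 0)  # transformation
sampled = rdd.sample(False, 0.5)                  # transformation: ~50% sample, no replacement

print(multiples.count())                          # action
print(multiples.take(3))                          # action
print(rdd.reduce(lambda a, b: a + b))             # action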
from pyspark import SparkContext

logFile = "hdfs:///input"
sc = SparkContext("spark://spark-m:7077", "WordCount")

textFile = sc.textFile(logFile)

# Split each line into words, pair each word with a count of 1,
# then sum the counts for each word.
wordCounts = textFile.flatMap(lambda line: line.split()) \
                     .map(lambda word: (word, 1)) \
                     .reduceByKey(lambda a, b: a + b)

wordCounts.saveAsTextFile("hdfs:///output")
Word Count: Spark
● cache() / persist()
● When an action is performed for the first time, the result is kept in memory
● Different levels of persistence available
RDDs – cache()
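A hedged sketch of caching (input path illustrative): the second count() reuses the in-memory result instead of re-reading the file.

from pyspark import SparkContext, StorageLevel

sc = SparkContext("local", "CacheDemo")
words = sc.textFile("hdfs:///input").flatMap(lambda line: line.split())

words.cache()                                   # default level keeps the RDD in memory
# words.persist(StorageLevel.MEMORY_AND_DISK)  # one of the other persistence levels

print(words.count())   # first action computes and caches the RDD
print(words.count())   # second action reads from the cache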
● Micro-batches (DStreams of RDDs)
● Access to other parts of Spark (MLlib, GraphX, Dataframes)
● Fault-tolerant
● Connectors to Kafka, Flume, Kinesis, ZeroMQ
● (we’ll come back to this)
Streaming
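A minimal Spark Streaming sketch, assuming text arriving on a local socket (host, port, and batch interval are illustrative):

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "StreamingDemo")
ssc = StreamingContext(sc, 1)   # 1-second micro-batches

lines = ssc.socketTextStream("localhost", 9999)
counts = lines.flatMap(lambda line: line.split()) \
              .map(lambda word: (word, 1)) \
              .reduceByKey(lambda a, b: a + b)
counts.pprint()   # print each micro-batch's counts

ssc.start()
ssc.awaitTermination()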
● Spark SQL
● Support for JSON, Cassandra, SQL databases, etc.
● Easier syntax than RDDs
● Dataframes ‘borrowed’ from Python/R
● Catalyst query planner
Dataframes
val sc = new SparkContext()
val sqlContext = new org.apache.spark.sql.SQLContext(sc)

// Load a Dataframe from a JSON file
val df = sqlContext.read.json("people.json")

df.show()
df.filter(df("age") >= 35).show()
df.groupBy("age").count().show()
Dataframes: Example
● Optimizing query planning for Spark
● Takes Dataframe operations and ‘compiles’ them down to RDD operations
● Often faster than writing RDD code manually
● Use Dataframes whenever possible (v1.4+)
Dataframes: Catalyst
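To see what Catalyst produces, you can ask a Dataframe query for its plan. A hedged sketch (file name illustrative):

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext("local", "CatalystDemo")
sqlContext = SQLContext(sc)
df = sqlContext.read.json("people.json")

# explain(True) prints the logical and physical plans Catalyst compiled
df.filter(df["age"] >= 35).explain(True)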
● Standalone
● YARN (Hadoop ecosystem)
● Mesos (Hipster ecosystem)
Deploying Spark
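As a hedged sketch (URLs illustrative, Spark 1.x syntax), the master URL passed to SparkContext is what selects the cluster manager:

from pyspark import SparkContext

# Standalone cluster manager
sc = SparkContext("spark://spark-m:7077", "App")

# YARN:  SparkContext("yarn-client", "App")
# Mesos: SparkContext("mesos://mesos-master:5050", "App")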
● Spark-Shell
● Zeppelin
Demos
● Spark Streaming is not ‘pure’ streaming
● For low-latency requirements, use Storm
● Still immature in some ways
● Come to my All Things Open talk to learn more!
Spark for Everything?
● http://www.meetup.com/Triangle-Apache-Spark-Meetup/
● Next meeting likely to be in late October
Triangle Apache Spark Meetup Group
● spark.apache.org
● databricks.com
● zeppelin.incubator.apache.org
● mammothdata.com/white-papers/spark-a-modern-tool-for-big-data-applications
Links
● Questions for you! (for a $15 Digital Ocean voucher)
1. What is an RDD?
2. What’s the difference between a transformation and an action?
3. When wouldn’t you use Spark Streaming?
Questions?