Spark Introduction - GRMDS
SPARK INTRODUCTION
Josh Choi
Presenter
Software engineer
13 years at NASA/Jet Propulsion Laboratory
Developing a complex event processing (CEP)
system for Deep Space Network operations
What is it?
Apache Spark™
What is Apache Spark™?
A fast and general engine for large-scale data processing
Fast: Cluster computing framework using in-memory dataset caching, computation, and data-sharing
General: Programming model does not constrain users to MapReduce
Since 2009
Matei Zaharia, AMPLab at UC Berkeley
Open-source
Donated to Apache Software Foundation in 2013
Stack
Evolving stack
Where Spark fits
Spark core concepts
RDD
Resilient distributed datasets
Spark’s core abstraction for working with data
Immutable
All work in Spark expressed as:
◼ Creating new RDDs
◼ Transforming existing RDDs
◼ Calling operations on RDDs to compute a result
Spark automatically distributes the data contained in RDDs across your cluster and parallelizes the operations you perform on them
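To make the three kinds of work concrete, here is a minimal single-process sketch in plain Python. The MiniRDD class and its method names are hypothetical stand-ins chosen to resemble the ideas on this slide; this is not Spark's API, and nothing here is actually distributed across a cluster.

```python
from functools import reduce as _reduce


class MiniRDD:
    """Toy, single-process stand-in for an RDD (hypothetical; not Spark's API)."""

    def __init__(self, partitions):
        # Store partitions as tuples so the data cannot be modified in place.
        self._partitions = tuple(tuple(p) for p in partitions)

    @classmethod
    def parallelize(cls, items, num_partitions=2):
        # Creating a new dataset: split the input into roughly equal partitions.
        items = list(items)
        size = max(1, -(-len(items) // num_partitions))  # ceiling division
        return cls(items[i:i + size] for i in range(0, len(items), size))

    def map(self, fn):
        # Transformation: returns a NEW MiniRDD; the original is untouched.
        return MiniRDD([fn(x) for x in p] for p in self._partitions)

    def filter(self, pred):
        # Transformation: keeps only the elements for which pred is true.
        return MiniRDD([x for x in p if pred(x)] for p in self._partitions)

    def collect(self):
        # Action: gather every element into one local list.
        return [x for p in self._partitions for x in p]

    def reduce(self, fn):
        # Action: combine within each partition, then across partitions.
        partials = [_reduce(fn, p) for p in self._partitions if p]
        return _reduce(fn, partials)


numbers = MiniRDD.parallelize(range(1, 11), num_partitions=3)             # create
even_squares = numbers.map(lambda x: x * x).filter(lambda x: x % 2 == 0)  # transform
total = even_squares.reduce(lambda a, b: a + b)                           # compute a result
```

Because every transformation returns a new object, `numbers` still holds 1 through 10 after `even_squares` is built. In a real cluster each partition would live on a different machine and the per-partition work would run in parallel.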
Quick demo
Lazy evaluation
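A rough way to picture lazy evaluation: transformations are only recorded, and nothing runs until an action asks for a result. The sketch below uses a hypothetical LazyDataset class in plain Python to show that behaviour; it is an illustration of the idea, not how Spark is implemented.

```python
class LazyDataset:
    """Toy illustration of lazy evaluation (hypothetical; not Spark's API)."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = steps  # recorded transformations, not yet executed

    def map(self, fn):
        # Record the step and return a new dataset; fn is not called here.
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, pred):
        # Record the step and return a new dataset; pred is not called here.
        return LazyDataset(self._source, self._steps + (("filter", pred),))

    def collect(self):
        # Action: only now are the recorded steps applied, one element at a time.
        out = []
        for item in self._source:
            keep = True
            for kind, fn in self._steps:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):
                    keep = False
                    break
            if keep:
                out.append(item)
        return out


calls = []


def traced_double(x):
    calls.append(x)  # record each time the function really runs
    return 2 * x


pipeline = LazyDataset([1, 2, 3, 4]).map(traced_double).filter(lambda x: x > 4)
calls_before_action = len(calls)  # still 0: building the pipeline ran nothing
result = pipeline.collect()       # the action triggers the work
```

Deferring the work lets the engine see the whole chain of steps before running any of them, so it can apply the map and the filter in a single pass over the data.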
Why be interested?
Apache Spark™
Simple
Simpler Java
Fast
Spark, running on 206 EC2 machines, sorted 100 TB of data on disk in 23 minutes
The previous world record, set by Hadoop MapReduce, used 2100 machines and took 72 minutes
Spark sorted the same data 3X faster using 10X fewer machines.
All the sorting took place on disk (HDFS)
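The "3X" and "10X" figures follow directly from the numbers on this slide. The short calculation below reproduces them. The machine-minutes comparison at the end is an added, crude measure that ignores any hardware differences between the two clusters.

```python
# Figures taken from the slide.
spark_machines, spark_minutes = 206, 23
hadoop_machines, hadoop_minutes = 2100, 72

speedup = hadoop_minutes / spark_minutes          # about 3.1, the "3X faster"
machine_ratio = hadoop_machines / spark_machines  # about 10.2, the "10X fewer machines"

# Machine-minutes as a rough cost measure (an assumption, not from the slide).
spark_cost = spark_machines * spark_minutes       # 4738 machine-minutes
hadoop_cost = hadoop_machines * hadoop_minutes    # 151200 machine-minutes
cost_ratio = hadoop_cost / spark_cost             # roughly 32

print(f"{speedup:.1f}x faster on {machine_ratio:.1f}x fewer machines")
```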
Flexible
Spark extends and generalizes the MapReduce execution model to support more types of computation, more efficiently
The MapReduce model was initially designed for batch jobs at web companies that needed to run once a night
Single, general programming model that covers:
Interacting with data using SQL
Processing data streams
More complex applications
You only have to learn one system and you can easily make an application that combines these
Only one thing to manage
Plays nice with rest of Hadoop
Part of Hadoop ecosystem
Search using Solr
Stream using Storm or Kafka
SQL using Hive
Script using Pig
Batch using MapReduce
…
In-memory: use Spark
Examples
Spotify music recommendation
20 million songs: How do they recommend music to users?
Discover and Radio
Recommendation model is iterative in nature
Collaborative filtering by alternating least squares (ALS)
Suffered from Hadoop’s I/O overhead
Converted Hadoop-based computations into Spark
Scaled them up to handle hundreds of billions of data points
http://www.slideshare.net/MrChrisJohnson/collaborative-filtering-with-spark
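To show why this kind of model is iterative, here is a heavily simplified sketch of alternating least squares in plain Python. It assumes a single latent factor per user and per song and a tiny made-up ratings table. The function name, the data, and the regularization value are all illustrative assumptions; this is not Spotify's implementation or Spark MLlib's.

```python
def als_rank1(ratings, n_users, n_items, lam=0.1, iterations=10):
    """Rank-1 alternating least squares over observed (user, item) -> rating entries."""
    u = [1.0] * n_users  # one latent factor per user
    v = [1.0] * n_items  # one latent factor per item
    history = []         # regularized objective after each full sweep

    def objective():
        fit = sum((r - u[i] * v[j]) ** 2 for (i, j), r in ratings.items())
        reg = lam * (sum(x * x for x in u) + sum(x * x for x in v))
        return fit + reg

    for _ in range(iterations):
        # Hold item factors fixed; each user's factor has a closed-form solution.
        for i in range(n_users):
            num = sum(r * v[j] for (ui, j), r in ratings.items() if ui == i)
            den = sum(v[j] ** 2 for (ui, j), _r in ratings.items() if ui == i) + lam
            u[i] = num / den
        # Hold user factors fixed; solve each item's factor the same way.
        for j in range(n_items):
            num = sum(r * u[i] for (i, ij), r in ratings.items() if ij == j)
            den = sum(u[i] ** 2 for (i, ij), _r in ratings.items() if ij == j) + lam
            v[j] = num / den
        history.append(objective())
    return u, v, history


# Hypothetical data: user 2 has not rated item 0.
ratings = {(0, 0): 5.0, (0, 1): 4.0, (1, 0): 4.0, (1, 1): 3.0, (2, 1): 2.0}
u, v, history = als_rank1(ratings, n_users=3, n_items=2)
predicted = u[2] * v[0]  # estimated rating for the missing entry
```

Every sweep rereads the whole ratings table. On Hadoop MapReduce each sweep means another round trip to disk, which is the I/O overhead mentioned above; keeping the data cached in memory between sweeps is where Spark helps.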
Fighting cancer through genomics
Today, it takes more than a week to process a genome
Analysis time is often a matter of life and death
ADAM
HDFS (genome files are huge)
Transform records using Spark
SQL queries
Graph processing with GraphX
Machine learning
Target: Higher accuracy in one hour
http://bdgenomics.org
Complex event processing
Event processing that combines data from multiple sources to infer events or patterns that suggest more complicated circumstances
Employ techniques such as detection of complex patterns of many events, event correlation and abstraction, event hierarchies, and relationships between events such as causality, membership, timing, and event-driven processes
CEP examples
Credit card companies
‘When 2 transactions happen on an account from radically different geographic locations within a certain time window then report as potential fraud’ (detection)
Stocks
‘When the average price of a stock falls below $25 over any 5 minute period, then immediately sell’ (prediction, avoidance)
Department of Homeland Security
“Free this week, for quick gossip/prep before I go and destroy America.”
Space missions
Spacecraft commands + telemetry → Lights-out operations
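The credit card and stock rules above can be sketched as small stateful checks over an event stream. The version below is plain Python and rests on several assumptions: events arrive in time order, "radically different" means more than 1000 km apart, the fraud window is one hour, and the stock rule is evaluated on each new tick over the trailing five minutes. Class names and thresholds are hypothetical.

```python
import math
from collections import deque


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


class FraudRule:
    """Flag two transactions on one account that are far apart but close in time."""

    def __init__(self, max_km=1000.0, window_s=3600.0):  # assumed thresholds
        self.max_km, self.window_s = max_km, window_s
        self._last = {}  # account -> (timestamp, location) of the previous transaction

    def on_event(self, account, ts, location):
        prev = self._last.get(account)
        self._last[account] = (ts, location)
        if prev is None:
            return False
        prev_ts, prev_loc = prev
        return (ts - prev_ts) <= self.window_s and haversine_km(prev_loc, location) > self.max_km


class PriceRule:
    """Signal when the average price over the trailing window drops below a threshold."""

    def __init__(self, threshold=25.0, window_s=300.0):
        self.threshold, self.window_s = threshold, window_s
        self._ticks = deque()  # (timestamp, price) pairs inside the window

    def on_event(self, ts, price):
        self._ticks.append((ts, price))
        # Evict ticks that have fallen out of the trailing window.
        while self._ticks[0][0] <= ts - self.window_s:
            self._ticks.popleft()
        average = sum(p for _, p in self._ticks) / len(self._ticks)
        return average < self.threshold


fraud = FraudRule()
first = fraud.on_event("A", 0, (34.05, -118.24))  # Los Angeles
second = fraud.on_event("A", 600, (51.51, -0.13))  # London, ten minutes later

price = PriceRule()
signals = [price.on_event(ts, p) for ts, p in [(0, 26.0), (60, 25.0), (120, 23.0)]]
```

Both rules keep only a small amount of state per key (the last transaction, or the ticks inside the window), which is what lets this style of check keep pace with a fast stream.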
CEP solutions
StreamBase
SAP Sybase Event Stream Processor
Informatica RulePoint
Apama Real-Time Analytics
TIBCO BusinessEvents
Esper
Drools
…
Using Spark as CEP engine
Real-time data generated by all subsystems is large and fast-moving
Transient
Perfect for Spark Streaming
Both streaming data and historical data have large contextual, meta-information
Discovering what is relevant as we experiment with use cases
Flexibility of Spark SQL shines
Rules not locked into a domain-specific language
System developers can use Java
Operators can use Python or Scala
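As a rough picture of what "rules not locked into a domain-specific language" can look like, the sketch below writes rules as ordinary Python functions and applies them to small batches of events, loosely mirroring the micro-batch model that Spark Streaming uses. It is plain Python, not Spark code. The event fields, subsystem names, and rule names are hypothetical, and batches are grouped by count rather than by time to keep the example short.

```python
from itertools import islice


def micro_batches(events, batch_size):
    """Group an event iterator into fixed-size lists (a stand-in for time-based batches)."""
    it = iter(events)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


# Rules are ordinary functions over a batch; no special rule language is needed.
def high_error_rate(batch):
    errors = sum(1 for e in batch if e["level"] == "ERROR")
    return errors / len(batch) > 0.5


def antenna_offline(batch):
    return any(e["subsystem"] == "antenna" and e["status"] == "offline" for e in batch)


RULES = {"high_error_rate": high_error_rate, "antenna_offline": antenna_offline}


def run(events, batch_size=3):
    """Apply every rule to every batch and return (batch_index, rule_name) alerts."""
    alerts = []
    for n, batch in enumerate(micro_batches(events, batch_size)):
        for name, rule in RULES.items():
            if rule(batch):
                alerts.append((n, name))
    return alerts


events = [
    {"subsystem": "receiver", "level": "INFO", "status": "ok"},
    {"subsystem": "receiver", "level": "ERROR", "status": "ok"},
    {"subsystem": "antenna", "level": "INFO", "status": "ok"},
    {"subsystem": "antenna", "level": "ERROR", "status": "offline"},
    {"subsystem": "receiver", "level": "ERROR", "status": "ok"},
    {"subsystem": "receiver", "level": "INFO", "status": "ok"},
]
alerts = run(events)
```

Adding a rule means writing one more function and registering it in `RULES`, which is the kind of flexibility a general-purpose language gives over a fixed rule syntax.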
Going forward
Apache Spark™ official website:
https://spark.apache.org/
Latest developments on Spark:
https://databricks.com/blog
Reach me: [email protected]