Intro to Spark Development
TRANSCRIPT
![Page 1: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/1.jpg)
INTRO TO SPARK DEVELOPMENT
June 2015: Spark Summit West / San Francisco
http://training.databricks.com/intro.pdf
https://www.linkedin.com/in/bclapper
![Page 2: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/2.jpg)
making big data simple
Databricks Cloud: “A unified platform for building Big Data pipelines – from ETL to Exploration and Dashboards, to Advanced Analytics and Data Products.”
• Founded in late 2013
• By the creators of Apache Spark
• Original team from UC Berkeley AMPLab
• Raised $47 million in 2 rounds
• ~55 employees
• We’re hiring! (http://databricks.workable.com)
• Level 2/3 support partnerships with:
  • Hortonworks
  • MapR
  • DataStax
![Page 3: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/3.jpg)
The Databricks team contributed more than 75% of the code added to Spark in the past year
![Page 4: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/4.jpg)
AGENDA

Before Lunch:
• History of Big Data & Spark
• RDD fundamentals
• Databricks UI demo
• Lab: DevOps 101
• Transformations & Actions

After Lunch:
• Transformations & Actions (continued)
• Lab: Transformations & Actions
• DataFrames
• Lab: DataFrames
• Spark UIs
• Resource Managers: Local & Standalone
• Memory and Persistence
• Spark Streaming
• Lab: misc labs
![Page 5: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/5.jpg)
Some slides will be skipped
Please keep Q&A to a minimum during class
(5pm – 5:30pm for Q&A with instructor)
2 anonymous surveys: Pre and Post class
Lunch: noon – 1pm
2 breaks (sometime before lunch and after lunch)
![Page 6: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/6.jpg)
INSTRUCTOR: BRIAN CLAPPER

Homepage: http://www.ardentex.com/
LinkedIn: https://www.linkedin.com/in/bclapper
@brianclapper

- 30 years of experience building & maintaining software systems
- Scala, Python, Ruby, Java, C, C#
- Founder of the Philadelphia-area Scala user group (PHASE)
- Spark instructor for Databricks
![Page 7: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/7.jpg)
YOUR JOB? (Survey completed by 58 out of 115 students)

[Bar chart of roles: Developer, Administrator / Ops, Management / Exec, Data Scientist, Sales / Marketing]
![Page 8: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/8.jpg)
TRAVELED FROM? (Survey completed by 58 out of 115 students)

- SF Bay Area: 41%
- CA: 12%
- West US: 5%
- East US: 24%
- Europe: 4%
- Asia: 10%
- International – other: 3%
![Page 9: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/9.jpg)
WHICH INDUSTRY? (Survey completed by 58 out of 115 students)

[Bar chart of industries: IT / Systems, Banking / Finance, Science & Tech, Telecom, Academia / University, Healthcare / Medical, Retail / Distributor]
![Page 10: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/10.jpg)
PRIOR SPARK TRAINING? (Survey completed by 58 out of 115 students)

[Bar chart: None, AmpCamp, SparkCamp, Vendor Training]
![Page 11: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/11.jpg)
HANDS-ON EXPERIENCE WITH SPARK? (Survey completed by 58 out of 115 students)

- Zero: 47%
- < 1 week: 26%
- < 1 month: 22%
- 1+ months: 4%
![Page 12: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/12.jpg)
SPARK USAGE LIFECYCLE? (Survey completed by 58 out of 115 students)

- Reading: 58%
- 1-node VM: 19%
- POC / prototype: 21%
- Production: 2%
![Page 13: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/13.jpg)
PROGRAMMING EXPERIENCE (Survey completed by 58 out of 115 students)
![Page 14: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/14.jpg)
PROGRAMMING EXPERIENCE (Survey completed by 58 out of 115 students)
![Page 15: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/15.jpg)
PROGRAMMING EXPERIENCE (Survey completed by 58 out of 115 students)
![Page 16: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/16.jpg)
BIG DATA EXPERIENCE (Survey completed by 58 out of 115 students)
![Page 17: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/17.jpg)
FOCUS OF CLASS? (Survey completed by 58 out of 115 students)

[Bar chart: Development, Administrator / Ops, Architecture, Use Cases]
![Page 18: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/18.jpg)
STORAGE VS PROCESSING WARS

NoSQL battles (then):
- HBase vs Cassandra
- Relational vs NoSQL
- Redis vs Memcached vs Riak
- MongoDB vs CouchDB vs Couchbase
- Neo4j vs Titan vs Giraph vs OrientDB
- Solr vs Elasticsearch

Compute battles (now):
- MapReduce vs Spark
- Spark Streaming vs Storm
- Hive vs Spark SQL vs Impala
- Mahout vs MLlib vs H2O
![Page 19: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/19.jpg)
![Page 20: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/20.jpg)
NOSQL POPULARITY WINNERS

Key -> Value: Redis (95), Memcached (33), DynamoDB (16), Riak (13)
Key -> Doc: MongoDB (279), CouchDB (28), Couchbase (24), DynamoDB (15), MarkLogic (11)
Column Family: Cassandra (109), HBase (62)
Graph: Neo4j (30), OrientDB (4), Titan (3), Giraph (1)
Search: Solr (81), Elasticsearch (70), Splunk (41)
![Page 21: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/21.jpg)
General Batch Processing (2004 – 2013)

Specialized Systems (2007 – 2015?): iterative, interactive, ML, streaming, graph, SQL, etc.
- Pregel, Dremel, Impala, GraphLab, Giraph, Drill, Tez, S4, Storm, Mahout

General Unified Engine (2014 – ?)
![Page 22: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/22.jpg)
Scheduling Monitoring Distributing
![Page 23: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/23.jpg)
[Diagram: the Spark stack — Streaming, SQL, GraphX, MLlib, DataFrames API; sources: Hadoop Input Format, RDBMS, Tachyon; Apps on top. Distributions: CDH, HDP, MapR, DSE]
![Page 24: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/24.jpg)
![Page 25: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/25.jpg)
- Developers from 50+ companies
- 400+ developers
- Apache Committers from 16+ organizations
![Page 27: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/27.jpg)
10x – 100x
![Page 28: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/28.jpg)
[Chart: Spark codebase growth, Aug 2009 … June 2013 — source: openhub.net]
![Page 29: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/29.jpg)
DISTRIBUTORS / APPLICATIONS
![Page 30: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/30.jpg)
Data-movement speeds on a single node:

- RAM to CPUs: 10 GB/s
- SSD: 600 MB/s sequential, 0.1 ms random access, $0.45 per GB
- Spinning disk: 100 MB/s sequential, 3-12 ms random access, $0.05 per GB
- Network: 1 Gb/s or 125 MB/s between nodes in the same rack; 0.1 Gb/s between nodes in another rack
![Page 31: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/31.jpg)
June 2010
http://www.cs.berkeley.edu/~matei/papers/2010/hotcloud_spark.pdf
“The main abstraction in Spark is that of a resilient distributed dataset (RDD), which represents a read-only collection of objects partitioned across a set of machines that can be rebuilt if a partition is lost.
Users can explicitly cache an RDD in memory across machines and reuse it in multiple MapReduce-like parallel operations.
RDDs achieve fault tolerance through a notion of lineage: if a partition of an RDD is lost, the RDD has enough information about how it was derived from other RDDs to be able to rebuild just that partition.”
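As a quick illustration of the lineage idea — a minimal PySpark sketch, not from the deck (the log path is hypothetical; sc is the shell's SparkContext) — every RDD can print the chain of RDDs it was derived from:

lines = sc.textFile("/path/to/log.txt")                       # hypothetical path
errors = lines.filter(lambda line: line.startswith("Error"))  # derived RDD
errors.cache()                                                # cached in memory across machines
print(errors.toDebugString())                                 # the lineage used to rebuild lost partitions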
![Page 32: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/32.jpg)
April 2012
http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
“We present Resilient Distributed Datasets (RDDs), a distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner.
RDDs are motivated by two types of applications that current computing frameworks handle inefficiently: iterative algorithms and interactive data mining tools.
In both cases, keeping data in memory can improve performance by an order of magnitude.”
“Best Paper Award and Honorable Mention for Community Award” - NSDI 2012
- Cited 400+ times!
![Page 33: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/33.jpg)
STREAMING
TwitterUtils.createStream(...)
  .filter(_.getText.contains("Spark"))
  .countByWindow(Seconds(5))
- The 2 Spark Streaming papers have been cited 138 times

Analyze real-time streams of data in ½-second intervals
![Page 34: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/34.jpg)
sqlCtx = new HiveContext(sc)
results = sqlCtx.sql("SELECT * FROM people")
names = results.map(lambda p: p.name)

SQL — Seamlessly mix SQL queries with Spark programs.
![Page 35: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/35.jpg)
GRAPHX
graph = Graph(vertices, edges)
messages = spark.textFile("hdfs://...")
graph2 = graph.joinVertices(messages) { (id, vertex, msg) => ... }
Analyze networks of nodes and edges using graph processing
https://amplab.cs.berkeley.edu/wp-content/uploads/2013/05/grades-graphx_with_fonts.pdf
![Page 36: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/36.jpg)
SQL queries with Bounded Errors and Bounded Response Times
https://www.cs.berkeley.edu/~sameerag/blinkdb_eurosys13.pdf
![Page 37: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/37.jpg)
[Chart: estimate vs. number of data points, converging toward the true answer]

How do you know when to stop?
![Page 38: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/38.jpg)
[Chart: estimate vs. number of data points, converging toward the true answer]

Error bars on every answer!
![Page 39: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/39.jpg)
[Chart: estimate vs. time, converging toward the true answer]

Stop when the error is smaller than a given threshold.
![Page 40: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/40.jpg)
http://shop.oreilly.com/product/0636920028512.do

eBook: $33.99 / Print: $39.99 (PDF, ePub, Mobi, DAISY)

Shipping now!

$30 @ Amazon: http://www.amazon.com/Learning-Spark-Lightning-Fast-Data-Analysis/dp/1449358624
![Page 41: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/41.jpg)
![Page 42: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/42.jpg)
![Page 43: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/43.jpg)
100 TB Daytona Sort Competition 2014

Spark sorted the same data 3X faster, using 10X fewer machines, than Hadoop MR did in 2013.

All the sorting took place on disk (HDFS), without using Spark’s in-memory cache!

Work by Databricks engineers: Reynold Xin, Parviz Deyhim, Xiangrui Meng, Ali Ghodsi, Matei Zaharia

More info:
http://sortbenchmark.org
http://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html
![Page 44: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/44.jpg)
![Page 45: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/45.jpg)
WHY SORTING?

- Stresses “shuffle”, which underpins everything from SQL to MLlib
- Sorting is challenging because there is no reduction in data
- Sorting 100 TB = 500 TB of disk I/O and 200 TB of network I/O
Engineering Investment in Spark:
- Sort-based shuffle (SPARK-2045)
- Netty native network transport (SPARK-2468)
- External shuffle service (SPARK-3796)

Clever Application-level Techniques:
- GC- and cache-friendly memory layout
- Pipelining
Reference
![Page 46: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/46.jpg)
TECHNIQUE USED FOR 100 TB SORT

EC2: i2.8xlarge (206 workers)
- Intel Xeon CPU E5-2670 @ 2.5 GHz w/ 32 cores
- 244 GB of RAM
- 8 x 800 GB SSD in RAID 0, formatted with ext4
- ~9.5 Gbps (1.1 GB/s) bandwidth between 2 random nodes
- Each record: 100 bytes (10-byte key & 90-byte value)
- OpenJDK 1.7
- HDFS 2.4.1 w/ short-circuit local reads enabled
- Apache Spark 1.2.0
- Speculative execution off
- Increased locality wait to infinite
- Compression turned off for input, output & network
- Used Unsafe to put all the data off-heap and manage it manually (i.e., never triggered the GC)
- 32 slots per machine, 6,592 slots total
Reference
![Page 47: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/47.jpg)
![Page 48: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/48.jpg)
RDD FUNDAMENTALS
![Page 49: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/49.jpg)
INTERACTIVE SHELL
(Scala & Python only)
![Page 50: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/50.jpg)
[Diagram: a Driver Program connected to two Worker Machines, each running an Executor (Ex) that holds RDD partitions and runs Tasks (T)]
![Page 51: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/51.jpg)
[Diagram: an RDD of 25 items (item-1 … item-25) split into 5 partitions of 5 items each, spread across 3 Executors]

more partitions = more parallelism
![Page 52: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/52.jpg)
logLinesRDD (an RDD with 4 partitions):

Partition 1: Error, ts, msg1 / Warn, ts, msg2 / Error, ts, msg1
Partition 2: Info, ts, msg8 / Warn, ts, msg2 / Info, ts, msg8
Partition 3: Error, ts, msg3 / Info, ts, msg5 / Info, ts, msg5
Partition 4: Error, ts, msg4 / Warn, ts, msg9 / Error, ts, msg1

An RDD can be created 2 ways:
- Parallelize a collection
- Read data from an external source (S3, C*, HDFS, etc.)
![Page 53: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/53.jpg)
PARALLELIZE

# Parallelize in Python
wordsRDD = sc.parallelize(["fish", "cats", "dogs"])

// Parallelize in Scala
val wordsRDD = sc.parallelize(List("fish", "cats", "dogs"))

// Parallelize in Java
JavaRDD<String> wordsRDD = sc.parallelize(Arrays.asList("fish", "cats", "dogs"));

- Take an existing in-memory collection and pass it to SparkContext’s parallelize method
- Not generally used outside of prototyping and testing, since it requires the entire dataset in memory on one machine
![Page 54: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/54.jpg)
READ FROM TEXT FILE

# Read a local txt file in Python
linesRDD = sc.textFile("/path/to/README.md")

// Read a local txt file in Scala
val linesRDD = sc.textFile("/path/to/README.md")

// Read a local txt file in Java
JavaRDD<String> lines = sc.textFile("/path/to/README.md");

- There are other methods to read data from HDFS, C*, S3, HBase, etc.
![Page 55: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/55.jpg)
[Diagram: logLinesRDD (the input/base RDD, 4 partitions) → .filter( ) → errorsRDD, keeping only the Error records: Error, ts, msg1 / Error, ts, msg1 / Error, ts, msg3 / Error, ts, msg4 / Error, ts, msg1]
![Page 56: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/56.jpg)
[Diagram: errorsRDD → .coalesce(2) → cleanedRDD with 2 partitions ({Error, ts, msg1 / Error, ts, msg3 / Error, ts, msg1} and {Error, ts, msg4 / Error, ts, msg1}) → .collect( ) → results at the Driver]
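The pipeline on these slides, as a minimal PySpark sketch (not from the deck; the log path and partition counts are illustrative):

logLinesRDD = sc.textFile("/path/to/log.txt", 4)   # base RDD with 4 partitions
errorsRDD = logLinesRDD.filter(lambda line: line.startswith("Error"))
cleanedRDD = errorsRDD.coalesce(2)                 # shrink to 2 partitions without a shuffle
print(cleanedRDD.collect())                        # action: results are shipped to the Driver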
![Page 57: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/57.jpg)
.collect( )
Execute DAG!
Driver
![Page 58: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/58.jpg)
.collect( )
Driver
logLinesRDD
![Page 59: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/59.jpg)
[Diagram: the lineage DAG — logLinesRDD → .filter( ) → errorsRDD → .coalesce(2) → cleanedRDD (the Error records in 2 partitions) → .collect( ) → Driver]
![Page 60: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/60.jpg)
[Diagram: the same chain executed as a single pipelined stage (Pipelined Stage 1) — logLinesRDD → .filter( ) → errorsRDD → .coalesce(2, shuffle=False) → cleanedRDD → .collect( ) → data at the Driver]
![Page 61: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/61.jpg)
[Diagram: the Driver and the lineage logLinesRDD → errorsRDD → cleanedRDD]
![Page 62: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/62.jpg)
[Diagram: after the job, only the collected data remains, at the Driver]
![Page 63: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/63.jpg)
[Diagram: logLinesRDD → .filter( ) → errorsRDD → cleanedRDD (Error, ts, msg1 / Error, ts, msg3 / Error, ts, msg1 | Error, ts, msg4 / Error, ts, msg1), which feeds three actions — .count( ) → 5, .saveToCassandra( ), and a further .filter( ) → errorMsg1RDD (Error, ts, msg1 × 3) → .collect( )]
![Page 64: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/64.jpg)
![Page 65: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/65.jpg)
LIFECYCLE OF A SPARK PROGRAM
1) Create some input RDDs from external data, or parallelize a collection in your driver program.

2) Lazily transform them to define new RDDs using transformations like filter() or map().

3) Ask Spark to cache() any intermediate RDDs that will need to be reused.

4) Launch actions such as count() and collect() to kick off a parallel computation, which is then optimized and executed by Spark.
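A minimal PySpark sketch of those 4 steps (the file path is hypothetical):

linesRDD = sc.textFile("/path/to/app.log")                  # 1) create an input RDD
errorsRDD = linesRDD.filter(lambda line: "Error" in line)   # 2) lazily define a new RDD
errorsRDD.cache()                                           # 3) cache the RDD that gets reused
print(errorsRDD.count())                                    # 4) first action runs the job
print(errorsRDD.take(3))                                    #    second action reuses the cache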
![Page 66: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/66.jpg)
TRANSFORMATIONS (lazy)

map()                        intersection()    cartesian()
flatMap()                    distinct()        pipe()
filter()                     groupByKey()      coalesce()
mapPartitions()              reduceByKey()     repartition()
mapPartitionsWithIndex()     sortByKey()       partitionBy()
sample()                     join()            ...
union()                      cogroup()         ...

- Most transformations are element-wise (they work on one element at a time), but this is not true for all transformations
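A minimal sketch of what “lazy” means here (assumes the sc shell context):

nums = sc.parallelize(range(1, 11))
evens = nums.filter(lambda n: n % 2 == 0)   # returns immediately; no job runs
doubled = evens.map(lambda n: n * 2)        # still no job
print(doubled.collect())                    # the action triggers the job: [4, 8, 12, 16, 20]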
![Page 67: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/67.jpg)
ACTIONS

reduce()            takeOrdered()
collect()           saveAsTextFile()
count()             saveAsSequenceFile()
first()             saveAsObjectFile()
take()              countByKey()
takeSample()        foreach()
saveToCassandra()   ...
![Page 68: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/68.jpg)
TYPES OF RDDS

• HadoopRDD
• FilteredRDD
• MappedRDD
• PairRDD
• ShuffledRDD
• UnionRDD
• PythonRDD
• DoubleRDD
• JdbcRDD
• JsonRDD
• SchemaRDD
• VertexRDD
• EdgeRDD
• CassandraRDD (DataStax)
• GeoRDD (ESRI)
• EsSpark (Elasticsearch)
![Page 69: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/69.jpg)
![Page 70: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/70.jpg)
![Page 71: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/71.jpg)
“Simple things should be simple, complex things should be possible” - Alan Kay
![Page 72: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/72.jpg)
DEMO: DATABRICKS GUI
![Page 73: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/73.jpg)
https://classeast01.cloud.databricks.com
- 60 user accounts
- 60 user clusters + 1 community cluster
- Users: 1000 – 1980

https://classeast02.cloud.databricks.com
- 60 user accounts
- 60 user clusters + 1 community cluster
- Users: 1000 – 1980
![Page 74: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/74.jpg)
Databricks Guide (5 mins)
DevOps 101 (30 mins)
DevOps 102 (30 mins)

Transformations & Actions (30 mins)
SQL 101 (30 mins)
DataFrames (20 mins)
![Page 75: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/75.jpg)
Switch to Transformations & Actions slide deck….
![Page 76: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/76.jpg)
SPARK SQL + DATAFRAMES
UserID   | Name            | Age | Location    | Pet
---------+-----------------+-----+-------------+----------
28492942 | John Galt       | 32  | New York    | Sea Horse
95829324 | Winston Smith   | 41  | Oceania     | Ant
92871761 | Tom Sawyer      | 17  | Mississippi | Raccoon
37584932 | Carlos Hinojosa | 33  | Orlando     | Cat
73648274 | Luis Rodriguez  | 34  | Orlando     | Dogs
![Page 78: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/78.jpg)
![Page 79: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/79.jpg)
https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html
![Page 80: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/80.jpg)
DATAFRAMES

What is a DataFrame?
• A distributed collection of data organized into named columns
• Like a table in a relational database

• Announced Feb 2015
• Inspired by data frames in R and Pandas in Python
• Works in:

Features
• Scales from KBs to PBs
• Supports a wide array of data formats and storage systems (Hive, existing RDDs, etc.)
• State-of-the-art optimization and code generation via the Spark SQL Catalyst optimizer
• APIs in Python, Java
![Page 81: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/81.jpg)
DATAFRAMES — Step 1: Construct a DataFrame

from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)

df = sqlContext.jsonFile("examples/src/main/resources/people.json")

# Displays the content of the DataFrame to stdout
df.show()
## age  name
## null Michael
## 30   Andy
## 19   Justin
![Page 82: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/82.jpg)
DATAFRAMES — Step 2: Use the DataFrame

# Print the schema in a tree format
df.printSchema()
## root
## |-- age: long (nullable = true)
## |-- name: string (nullable = true)

# Select only the "name" column
df.select("name").show()
## name
## Michael
## Andy
## Justin

# Select everybody, but increment the age by 1
df.select("name", df.age + 1).show()
## name    (age + 1)
## Michael null
## Andy    31
## Justin  20
![Page 83: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/83.jpg)
DATAFRAMES — SQL Integration

from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)

df = sqlContext.sql("SELECT * FROM table")
![Page 84: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/84.jpg)
DATAFRAMES — SQL + RDD Integration

2 methods for converting existing RDDs into DataFrames:

1. Use reflection to infer the schema of an RDD that contains different types of objects (more concise)

2. Use a programmatic interface that allows you to construct a schema and then apply it to an existing RDD (more verbose)
![Page 85: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/85.jpg)
DATAFRAMES — SQL + RDD Integration: via reflection

# sc is an existing SparkContext.
from pyspark.sql import SQLContext, Row
sqlContext = SQLContext(sc)

# Load a text file and convert each line to a Row.
lines = sc.textFile("examples/src/main/resources/people.txt")
parts = lines.map(lambda l: l.split(","))
people = parts.map(lambda p: Row(name=p[0], age=int(p[1])))

# Infer the schema, and register the DataFrame as a table.
schemaPeople = sqlContext.inferSchema(people)
schemaPeople.registerTempTable("people")
![Page 86: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/86.jpg)
DATAFRAMES — SQL + RDD Integration: via reflection

# SQL can be run over DataFrames that have been registered as a table.
teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")

# The results of SQL queries are RDDs and support all the normal RDD operations.
teenNames = teenagers.map(lambda p: "Name: " + p.name)
for teenName in teenNames.collect():
    print teenName
![Page 87: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/87.jpg)
DATAFRAMES — SQL + RDD Integration: via programmatic schema

A DataFrame can be created programmatically in 3 steps:

1. Create an RDD of tuples or lists from the original RDD

2. Create the schema, represented by a StructType, matching the structure of the tuples or lists in the RDD created in step 1

3. Apply the schema to the RDD via the createDataFrame method provided by SQLContext
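A minimal sketch of those 3 steps, reusing the same people.txt file as the reflection example (ages are kept as strings to keep the sketch short):

from pyspark.sql import SQLContext
from pyspark.sql.types import StructType, StructField, StringType

sqlContext = SQLContext(sc)

# Step 1: an RDD of tuples from the original RDD
lines = sc.textFile("examples/src/main/resources/people.txt")
parts = lines.map(lambda l: l.split(","))
people = parts.map(lambda p: (p[0], p[1].strip()))

# Step 2: the schema as a StructType matching the tuples
schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", StringType(), True)])

# Step 3: apply the schema via SQLContext's createDataFrame
schemaPeople = sqlContext.createDataFrame(people, schema)
schemaPeople.registerTempTable("people")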
![Page 88: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/88.jpg)
DATAFRAMES — Step 1: Construct a DataFrame

# Constructs a DataFrame from the users table in Hive.
users = context.table("users")

# from JSON files in S3
logs = context.load("s3n://path/to/data.json", "json")
![Page 89: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/89.jpg)
DATAFRAMES — Step 2: Use the DataFrame

# Create a new DataFrame that contains "young users" only
young = users.filter(users.age < 21)

# Alternatively, using Pandas-like syntax
young = users[users.age < 21]

# Increment everybody's age by 1
young.select(young.name, young.age + 1)

# Count the number of young users by gender
young.groupBy("gender").count()

# Join young users with another DataFrame called logs
young.join(logs, logs.userId == users.userId, "left_outer")
![Page 90: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/90.jpg)
SPARK UI
![Page 91: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/91.jpg)
![Page 92: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/92.jpg)
![Page 93: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/93.jpg)
![Page 94: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/94.jpg)
![Page 95: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/95.jpg)
![Page 96: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/96.jpg)
![Page 97: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/97.jpg)
![Page 98: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/98.jpg)
![Page 99: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/99.jpg)
1.4.0: Event timeline — all-jobs page
![Page 100: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/100.jpg)
1.4.0: Event timeline — within 1 job
![Page 101: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/101.jpg)
1.4.0: Event timeline — within 1 stage
![Page 102: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/102.jpg)
1.4.0
sc.textFile("blog.txt")
  .cache()
  .flatMap { line => line.split(" ") }
  .map { word => (word, 1) }
  .reduceByKey { case (count1, count2) => count1 + count2 }
  .collect()
![Page 103: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/103.jpg)
1.4.0
![Page 104: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/104.jpg)
SPARK RESOURCE MANAGERS
![Page 105: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/105.jpg)
WHAT ARE TASKS?
[Diagram: logLinesRDD → errorsRDD, with Task-1 through Task-4 each processing one partition]
![Page 106: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/106.jpg)
HOW WILL YOU DEPLOY SPARK? (Survey completed by 58 out of 115 students)

[Bar chart: Databricks Cloud, Hadoop YARN, C* + Standalone, Apache + Standalone, Mesos, Don't know]
![Page 107: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/107.jpg)
WHERE WILL YOU DEPLOY SPARK? (Survey completed by 58 out of 115 students)

[Bar chart: Amazon Cloud, On-prem, Different Cloud]
![Page 108: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/108.jpg)
History:

[Diagram: classic Hadoop MapReduce — a JobTracker (JT) schedules Map (M) and Reduce (R) tasks onto TaskTrackers (TT) that run alongside DataNodes (DN) on each machine's OS, with a NameNode (NN) for HDFS metadata; shown with 2 MR apps running]
![Page 110: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/110.jpg)
LOCAL MODE
![Page 111: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/111.jpg)
LOCAL MODE

[Diagram: one worker machine running a single JVM that holds the Executor + Driver, RDD partitions P1–P3, one task slot per CPU core, and internal threads]

3 options:
- local
- local[N]
- local[*]

val conf = new SparkConf()
  .setMaster("local[12]")
  .setAppName("MyFirstApp")
  .set("spark.executor.memory", "3g")
val sc = new SparkContext(conf)

> ./bin/spark-shell --master local[12]

> ./bin/spark-submit --name "MyFirstApp" --master local[12] myApp.jar
![Page 112: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/112.jpg)
STANDALONE MODE
![Page 113: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/113.jpg)
SPARK STANDALONE

[Diagram: a Spark Master coordinating four worker machines; each machine runs a Worker (W) and an Executor (Ex) holding RDD partitions and task slots (T), with internal threads, an OS disk, and 4 SSDs for local scratch space]

Each machine can have a different spark-env.sh:
- SPARK_WORKER_CORES
- SPARK_LOCAL_DIRS

> ./bin/spark-submit --name "SecondApp" --master spark://host4:port1 myApp.jar
![Page 114: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/114.jpg)
PLUGGABLE RESOURCE MANAGEMENT

Mode       | Spark Central Master | Who starts Executors? | Tasks run in
-----------+----------------------+-----------------------+-------------
Local      | [none]               | Human being           | Executor
Standalone | Standalone Master    | Worker JVM            | Executor
YARN       | YARN App Master      | Node Manager          | Executor
Mesos      | Mesos Master         | Mesos Slave           | Executor
![Page 115: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/115.jpg)
DEPLOYING AN APP TO THE CLUSTER

spark-submit provides a uniform interface for submitting jobs across all cluster managers:

bin/spark-submit --master spark://host:7077 --executor-memory 10g my_script.py
Source: Learning Spark
![Page 116: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/116.jpg)
MEMORY AND PERSISTENCE
![Page 117: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/117.jpg)
[Diagram: an Executor JVM holding cached RDD partitions and running tasks on internal threads]

Recommendation: use at most 75% of a machine’s memory for Spark

Minimum Executor heap size should be 8 GB

Max Executor heap size depends… maybe 40 GB (watch GC)

Memory usage is greatly affected by storage level and serialization format
![Page 118: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/118.jpg)
![Page 119: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/119.jpg)
Persistence level                | Description
---------------------------------+-----------------------------------------------------------
MEMORY_ONLY                      | Store RDD as deserialized Java objects in the JVM
MEMORY_AND_DISK                  | Store RDD as deserialized Java objects in the JVM and spill to disk
MEMORY_ONLY_SER                  | Store RDD as serialized Java objects (one byte array per partition)
MEMORY_AND_DISK_SER              | Spill partitions that don't fit in memory to disk instead of recomputing them on the fly each time they're needed
DISK_ONLY                        | Store the RDD partitions only on disk
MEMORY_ONLY_2, MEMORY_AND_DISK_2 | Same as the levels above, but replicate each partition on two cluster nodes
OFF_HEAP                         | Store RDD in serialized format in Tachyon
![Page 120: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/120.jpg)
RDD.cache() == RDD.persist(MEMORY_ONLY)

(deserialized in the JVM — the most CPU-efficient option)
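A minimal PySpark sketch of choosing a level from the table (the file path is hypothetical):

from pyspark import StorageLevel

rdd = sc.textFile("/path/to/log.txt")
rdd.persist(StorageLevel.MEMORY_AND_DISK)   # any level from the table above
# rdd.cache() would be shorthand for rdd.persist(StorageLevel.MEMORY_ONLY)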
![Page 121: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/121.jpg)
![Page 122: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/122.jpg)
RDD.persist(MEMORY_ONLY_SER)

(serialized in the JVM)
![Page 123: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/123.jpg)
.persist(MEMORY_AND_DISK)

[Diagram: deserialized partitions in the Executor JVM; partitions that don't fit (RDD-P1, RDD-P2) spill to the OS disk / SSD]
![Page 124: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/124.jpg)
.persist(MEMORY_AND_DISK_SER)

(serialized in the JVM, spilling to disk)
![Page 125: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/125.jpg)
.persist(DISK_ONLY)

(partitions stored only on disk)
![Page 126: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/126.jpg)
RDD.persist(MEMORY_ONLY_2)

(deserialized, replicated on the JVMs of two nodes, X and Y)
![Page 127: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/127.jpg)
.persist(MEMORY_AND_DISK_2)

(deserialized on the JVMs of two nodes, each spilling to disk)
![Page 128: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/128.jpg)
.persist(OFF_HEAP)

[Diagram: serialized data stored in Tachyon, shared across JVM-1 / App-1, JVM-2 / App-1, and JVM-7 / App-2]
![Page 129: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/129.jpg)
.unpersist()
JVM
![Page 130: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/130.jpg)
JVM
?
![Page 131: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/131.jpg)
Intermediate data is automatically persisted during shuffle operations
Remember!
![Page 132: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/132.jpg)
Default Memory Allocation in an Executor JVM:

- 60% — Cached RDDs (spark.storage.memoryFraction)
- 20% — Shuffle memory (spark.shuffle.memoryFraction)
- 20% — User Programs (remainder)
![Page 133: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/133.jpg)
MEMORY

Spark uses memory for:

RDD Storage: when you call .persist() or .cache(). Spark will limit the amount of memory used when caching to a certain fraction of the JVM’s overall heap, set by spark.storage.memoryFraction.

Shuffle and aggregation buffers: when performing shuffle operations, Spark will create intermediate buffers for storing shuffle output data. These buffers are used to store intermediate results of aggregations, in addition to buffering data that is going to be directly output as part of the shuffle.

User code: Spark executes arbitrary user code, so user functions can themselves require substantial memory. For instance, if a user application allocates large arrays or other objects, these will contend for overall memory usage. User code has access to everything “left” in the JVM heap after the space for RDD storage and shuffle storage is allocated.
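A minimal sketch of tuning the two fractions above in application code (Spark 1.x property names; the values shown are the defaults):

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("MemoryFractions")
        .set("spark.storage.memoryFraction", "0.6")    # cached RDDs
        .set("spark.shuffle.memoryFraction", "0.2"))   # shuffle & aggregation buffers
sc = SparkContext(conf=conf)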
![Page 134: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/134.jpg)
DATA SERIALIZATION
![Page 135: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/135.jpg)
SERIALIZATION

Serialization is used when:
- Transferring data over the network
- Spilling data to disk
- Caching to memory in serialized form
- Broadcasting variables
![Page 136: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/136.jpg)
Java serialization vs. Kryo serialization

Java serialization:
• Uses Java’s ObjectOutputStream framework
• Works with any class you create that implements java.io.Serializable
• You can control the performance of serialization more closely by extending java.io.Externalizable
• Flexible, but quite slow
• Leads to large serialized formats for many classes

Kryo serialization:
• Recommended serialization for production apps
• Use Kryo version 2 for speedy serialization (10x) and more compactness
• Does not support all Serializable types
• Requires you to register the classes you’ll use in advance
• If set, will be used for serializing shuffle data between nodes and also for serializing RDDs to disk

conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
![Page 137: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/137.jpg)
BROADCAST VARIABLES & ACCUMULATORS
![Page 138: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/138.jpg)
[Diagram: a value x = 5 being made available to every task (T) on three Executors]
![Page 139: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/139.jpg)
USE CASES:

• Broadcast variables – send a large read-only lookup table to all the nodes, or send a large feature vector in an ML algorithm to all nodes

• Accumulators – count events that occur during job execution, for debugging purposes. Example: how many lines of the input file were blank? Or how many corrupt records were in the input dataset?
![Page 140: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/140.jpg)
Spark supports 2 types of shared variables:

• Broadcast variables – allow your program to efficiently send a large, read-only value to all the worker nodes for use in one or more Spark operations. Like sending a large, read-only lookup table to all the nodes.

• Accumulators – allow you to aggregate values from worker nodes back to the driver program. Can be used to count the # of errors seen in an RDD of lines spread across 100s of nodes. Only the driver can access the value of an accumulator; tasks cannot. For tasks, accumulators are write-only.
![Page 141: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/141.jpg)
Broadcast variables let the programmer keep a read-only variable cached on each machine, rather than shipping a copy of it with tasks
For example, to give every node a copy of a large input dataset efficiently
Spark also attempts to distribute broadcast variables using efficient broadcast algorithms to reduce communication cost
BROADCAST VARIABLES
![Page 142: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/142.jpg)
BROADCAST VARIABLES

Scala:

val broadcastVar = sc.broadcast(Array(1, 2, 3))
broadcastVar.value

Python:

broadcastVar = sc.broadcast(list(range(1, 4)))
broadcastVar.value
![Page 143: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/143.jpg)
ACCUMULATORS
Accumulators are variables that can only be “added” to through an associative operation
Used to implement counters and sums, efficiently in parallel
Spark natively supports accumulators of numeric value types and standard mutable collections, and programmers can extend for new types
Only the driver program can read an accumulator’s value, not the tasks
![Page 144: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/144.jpg)
ACCUMULATORS

Scala:

val accum = sc.accumulator(0)
sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum += x)
accum.value

Python:

accum = sc.accumulator(0)
rdd = sc.parallelize([1, 2, 3, 4])

def f(x):
    global accum
    accum += x

rdd.foreach(f)
accum.value
![Page 145: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/145.jpg)
Next slide is only for on-site students…
![Page 146: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/146.jpg)
tinyurl.com/spark summit intro survey
![Page 147: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/147.jpg)
STREAMING
TwitterUtils.createStream(...)
  .filter(_.getText.contains("Spark"))
  .countByWindow(Seconds(5))
![Page 148: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/148.jpg)
STREAMING

[Diagram: input sources (Kafka, Flume, HDFS, S3, Kinesis, TCP socket) → Spark Streaming → output sinks (HDFS, Cassandra, Dashboards, Databases)]

- Scalable
- High-throughput
- Fault-tolerant

Complex algorithms can be expressed using:
- Spark transformations: map(), reduce(), join(), etc.
- MLlib + GraphX
- SQL
![Page 149: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/149.jpg)
STREAMING
Batch + Realtime: one unified API
![Page 150: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/150.jpg)
Tathagata Das (TD)
- Lead developer of Spark Streaming + Committer on Apache Spark core
- Helped re-write Spark Core internals in 2012 to make it 10x faster, to support streaming use cases
- On leave from the UC Berkeley PhD program
- Ex: Intern @ Amazon, Intern @ Conviva, Research Assistant @ Microsoft Research India

STREAMING
- Scales to 100s of nodes
- Batch sizes as small as half a second
- Processing latency as low as 1 second
- Exactly-once semantics no matter what fails
![Page 151: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/151.jpg)
STREAMING USE CASES

Page views → Kafka for buffering → Spark for processing → (live statistics)
![Page 152: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/152.jpg)
STREAMING USE CASES

Smart meter readings + live weather data → join 2 live data sources → (anomaly detection)
![Page 153: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/153.jpg)
STREAMING

[Diagram: input data stream → Spark Streaming → batches every X seconds → Spark Core → batches of processed data]
![Page 154: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/154.jpg)
STREAMING

[Diagram: multiple input data streams → receivers (R) → batches every X seconds → Spark Core → batches of processed data]
![Page 155: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/155.jpg)
DSTREAM (Discretized Stream)

Batch interval = 5 seconds

[Diagram: an InputDStream — blocks #1–#3 form the RDD @ T=5; the next blocks #1–#3 form the RDD @ T=10]

One RDD is created every 5 seconds
![Page 156: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/156.jpg)
TRANSFORMING DSTREAMS

[Diagram: every 5 sec, the linesDStream yields a linesRDD (partitions #1–#3); flatMap() turns it into the wordsRDD (partitions #1–#3) of the wordsDStream]
![Page 157: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/157.jpg)
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

# Create a local StreamingContext with two working threads and a batch interval of 5 seconds
sc = SparkContext("local[2]", "NetworkWordCount")
ssc = StreamingContext(sc, 5)

# Create a DStream that will connect to hostname:port, like localhost:9999
linesDStream = ssc.socketTextStream("localhost", 9999)

# Split each line into words
wordsDStream = linesDStream.flatMap(lambda line: line.split(" "))

# Count each word in each batch
pairsDStream = wordsDStream.map(lambda word: (word, 1))
wordCountsDStream = pairsDStream.reduceByKey(lambda x, y: x + y)

# Print the first ten elements of each RDD generated in this DStream to the console
wordCountsDStream.pprint()

ssc.start()             # Start the computation
ssc.awaitTermination()  # Wait for the computation to terminate
![Page 158: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/158.jpg)
Terminal #1:
$ nc -lk 9999
hello hello world

Terminal #2:
$ ./network_wordcount.py localhost 9999
. . .
--------------------------
Time: 2015-04-25 15:25:21
--------------------------
(hello, 2)
(world, 1)
![Page 159: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/159.jpg)
Batch interval = 600 ms

[Diagram: a Driver and two worker machines; a Receiver (R) occupies one task slot on an Executor and collects incoming data into block P1, which is replicated to the other Executor]
![Page 160: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/160.jpg)
Batch interval = 600 ms

[Diagram: 200 ms later — the Receiver has collected block P2, which lands on a third Executor and is replicated to another]
![Page 161: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/161.jpg)
Batch interval = 600 ms

[Diagram: another 200 ms later — block P3 is added and replicated; the blocks P1–P3 now cover the whole 600 ms batch]
![Page 162: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/162.jpg)
Batch interval = 600 ms

[Diagram: when the batch interval ends, the blocks are materialized as partitions P1–P3 of an RDD spread across the Executors]
![Page 163: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/163.jpg)
![Page 164: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/164.jpg)
1.4.0
New UI for Streaming
![Page 165: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/165.jpg)
1.4.0
DAG Visualization for Streaming
![Page 166: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/166.jpg)
Batch interval = 600 ms

2 input DStreams

[Diagram: two Receivers (R) on different Executors, each collecting its own block P1, replicated across Executors]
![Page 167: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/167.jpg)
Batch interval = 600 ms

[Diagram: both Receivers keep producing blocks P2 and P3, each replicated to another Executor]
![Page 168: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/168.jpg)
Batch interval = 600 ms

Materialize!

[Diagram: at the end of the batch interval, each stream's blocks materialize as RDD partitions P1–P3 across the Executors]
![Page 169: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/169.jpg)
Batch interval = 600 ms

Union!

[Diagram: the two streams' RDDs are unioned into a single RDD with partitions P1–P6 spread across the Executors]
![Page 170: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/170.jpg)
BASIC (sources directly available in the StreamingContext API): file systems, socket connections

ADVANCED (requires linking against extra dependencies): Kafka, Flume, Twitter

CUSTOM (requires implementing a user-defined receiver): anywhere
![Page 171: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/171.jpg)
![Page 172: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/172.jpg)
![Page 173: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/173.jpg)
TRANSFORMATIONS ON DSTREAMS

map( )                       count()                            join(otherStream, [numTasks])
flatMap( )                   reduce( )                          cogroup(otherStream, [numTasks])
filter( )                    countByValue()                     transform( )
repartition(numPartitions)   reduceByKey( , [numTasks])         updateStateByKey( )*
union(otherStream)
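A minimal sketch of the stateful updateStateByKey( ), reusing pairsDStream and ssc from the NetworkWordCount example (the checkpoint path is hypothetical; stateful operators require one):

ssc.checkpoint("/tmp/checkpoint")

def updateFunc(newValues, runningCount):
    # Fold this batch's counts for a key into the running total
    return sum(newValues) + (runningCount or 0)

runningCountsDStream = pairsDStream.updateStateByKey(updateFunc)
runningCountsDStream.pprint()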
![Page 174: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/174.jpg)
OUTPUT OPERATIONS ON DSTREAMS

print()
saveAsTextFile(prefix, [suffix])
saveAsObjectFiles(prefix, [suffix])
saveAsHadoopFiles(prefix, [suffix])
foreachRDD( )
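A minimal sketch of foreachRDD( ), the generic output operation, again reusing wordCountsDStream from the earlier example (the print loop is a stand-in for a real sink):

def sendToSink(rdd):
    # collect() is fine for small demo batches; use foreachPartition
    # with a real connection for production sinks
    for record in rdd.collect():
        print(record)

wordCountsDStream.foreachRDD(sendToSink)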
![Page 175: Intro to Spark development](https://reader036.vdocuments.us/reader036/viewer/2022062412/58f9a967760da3da068b6e3c/html5/thumbnails/175.jpg)