using spark efficiently - garrens.com€¦ · 29/07/2017 · org.apache.spark.sql.dataset[t] using...
TRANSCRIPT
DATASETS | DATAFRAMES | SQLUSING SPARK EFFICIENTLY
NAME: GARREN STAUBLI
CONSULTANT @ BLUEPRINT CONSULTING SERVICES
SPECIALTY: BIG DATA ENGINEERING
1 © 2017 Garren Staubli | Presentation resources: garrens.com/spark729
DISCLAIMERS
2 © 2017 Garren Staubli | Presentation resources: garrens.com/spark729
I am not a Robot.
Non-OCD friendly formatting may follow.
The views expressed in this presentation probably belong to various different companies to whom I have sold my soul unwittingly by mindlessly accepting Terms of Service and Use.
USING SPARK EFFICIENTLY
3
Spark Core:• RDD (Resilient Distributed Dataset)
Spark SQL:• Dataset (e.g. Dataset[CountryGDP])
• Typed templates• Matching schema• Relational and functional operations• Lazily Evaluated
• DataFrame (aka Dataset[Row])• Uses generic Row object
DISTRIBUTED DATA STRUCTURES
© 2017 Garren Staubli | Presentation resources: garrens.com/spark729
© 2017 Garren Staubli | Presentation resources: garrens.com/spark7294
org.apache.spark.sql.Dataset[T]
USING SPARK EFFICIENTLY
Optimized
Parallelized
Immutable
Encoded
Strongly Typed
DATASETS
O(PIES)“A DATASET IS A STRONGLY
TYPED COLLECTION OF DOMAIN-SPECIFIC OBJECTS THAT CAN BE
TRANSFORMED IN PARALLEL USING FUNCTIONAL OR
RELATIONAL OPERATIONS.”
© 2017 Garren Staubli | Presentation resources: garrens.com/spark7295
org.apache.spark.sql.Dataset[T]
USING SPARK EFFICIENTLY
DATASETSO(PIES)
OptimizedStorage TungstenExecution Catalyst
© 2017 Garren Staubli | Presentation resources: garrens.com/spark7296
org.apache.spark.sql.Dataset[T]
USING SPARK EFFICIENTLY
Optimized [Storage] => TungstenDATASETS
O(PIES)
“JAVA OBJECTS HAVE A LARGE INHERENT MEMORY OVERHEAD.”
© 2017 Garren Staubli | Presentation resources: garrens.com/spark7297
org.apache.spark.sql.Dataset[T]
USING SPARK EFFICIENTLY
Optimized [Storage] => TungstenDATASETS
O(PIES)
Tungsten Objectives:
1. Improve CPU and Memory efficiency
2. Push performance closer to bare metal
© 2017 Garren Staubli | Presentation resources: garrens.com/spark7298
org.apache.spark.sql.Dataset[T]
USING SPARK EFFICIENTLY
Optimized [Storage] => TungstenDATASETS
O(PIES)
Memory Management and Binary Processing• Manage memory explicitly • Eliminate JVM object model overhead• Reduce garbage collection
Cache-aware computation• Algorithms and data structures to exploit memory hierarchy
Code generation• Using code generation to exploit modern compilers and CPUs
© 2017 Garren Staubli | Presentation resources: garrens.com/spark7299
org.apache.spark.sql.Dataset[T]
USING SPARK EFFICIENTLY
Optimized [Execution] => CatalystDATASETS
O(PIES)
YOUR REQUEST
SPARK CATALYST
EXECUTION PLANNING
GENERATED CODE
© 2017 Garren Staubli | Presentation resources: garrens.com/spark72910
org.apache.spark.sql.Dataset[T]
USING SPARK EFFICIENTLY
Optimized [Execution] => CatalystDATASETS
O(PIES)
YOUR REQUEST
© 2017 Garren Staubli | Presentation resources: garrens.com/spark72911
org.apache.spark.sql.Dataset[T]
USING SPARK EFFICIENTLY
Optimized [Execution] => CatalystDATASETS
O(PIES)
SPARK CATALYST
© 2017 Garren Staubli | Presentation resources: garrens.com/spark72912
org.apache.spark.sql.Dataset[T]
USING SPARK EFFICIENTLY
Optimized [Execution] => CatalystDATASETS
O(PIES)
EXECUTION PLANNING
GENERATED CODE
© 2017 Garren Staubli | Presentation resources: garrens.com/spark72913
org.apache.spark.sql.Dataset[T]
USING SPARK EFFICIENTLY
Optimized [Execution] => CatalystDATASETS
O(PIES)
© 2017 Garren Staubli | Presentation resources: garrens.com/spark72914
org.apache.spark.sql.Dataset[T]
USING SPARK EFFICIENTLY
ParallelizedDATASETS
BEFORE AFTER
O(PIES)
© 2017 Garren Staubli | Presentation resources: garrens.com/spark72915
org.apache.spark.sql.Dataset[T]
USING SPARK EFFICIENTLY
ParallelizedDATASETS
O(PIES)
Partitions• Logical splits of data
• 64MB like HDFS Blocks
Tasks• Individual unit of execution
© 2017 Garren Staubli | Presentation resources: garrens.com/spark72916
org.apache.spark.sql.Dataset[T]
USING SPARK EFFICIENTLY
ParallelizedDATASETS
O(PIES)
© 2017 Garren Staubli | Presentation resources: garrens.com/spark72917
org.apache.spark.sql.Dataset[T]
USING SPARK EFFICIENTLY
ImmutableDATASETS
O(PIES)
© 2017 Garren Staubli | Presentation resources: garrens.com/spark72918
org.apache.spark.sql.Dataset[T]
USING SPARK EFFICIENTLY
ImmutableDATASETS
O(PIES)
© 2017 Garren Staubli | Presentation resources: garrens.com/spark72919
org.apache.spark.sql.Dataset[T]
USING SPARK EFFICIENTLY
EncodedDATASETS
O(PIES)
Data validation
© 2017 Garren Staubli | Presentation resources: garrens.com/spark72920
org.apache.spark.sql.Dataset[T]
USING SPARK EFFICIENTLY
EncodedDATASETS
O(PIES)
© 2017 Garren Staubli | Presentation resources: garrens.com/spark72921
org.apache.spark.sql.Dataset[T]
USING SPARK EFFICIENTLY
Strongly TypedDATASETS
O(PIES)
© 2017 Garren Staubli | Presentation resources: garrens.com/spark72922
org.apache.spark.sql.Dataset[T]
USING SPARK EFFICIENTLY
Strongly TypedDATASETS
O(PIES)
23
• Named Columns
• Weakly typed Row representation
• Similar structures
• Python/R DataFrame
• Relational DB table
USING SPARK EFFICIENTLY
© 2017 Garren Staubli | Presentation resources: garrens.com/spark729
org.apache.spark.sql.Dataset[Row]DATAFRAMES
24
USING SPARK EFFICIENTLY
DISTRIBUTED DATA STRUCTURES RECAP
© 2017 Garren Staubli | Presentation resources: garrens.com/spark729
25
DataFrame
• Exploratory Data Analysis
• Transforming Columns
USING SPARK EFFICIENTLY
© 2017 Garren Staubli | Presentation resources: garrens.com/spark729
org.apache.spark.sql.Dataset[Row]USE DATASET OR DATAFRAME?
Dataset
• Known object requirements
• Strong typing in IDE
• Needs compilation errors
26
USING SPARK EFFICIENTLY
© 2017 Garren Staubli | Presentation resources: garrens.com/spark729
CACHINGCaching best practices
- When
- Fits in memory
- Computationally expensive
- Broadcast hint
- Eager vs Lazy (default) cache
27
USING SPARK EFFICIENTLY
© 2017 Garren Staubli | Presentation resources: garrens.com/spark729
USING SQL
UDF Support • Spark • Hive
Create Permanent Hive table
Access Hive/Temporary Table
28
USING SPARK EFFICIENTLY
© 2017 Garren Staubli | Presentation resources: garrens.com/spark729
https://i.ytimg.com/vi/p8MEiZGjETE/maxresdefault.jpghttps://en.wikipedia.org/wiki/Donald_Trumphttp://why-not-learn-something.blogspot.com/2016/07/apache-spark-rdd-vs-dataframe-vs-dataset.htmlhttp://sueduff.com/wp-content/uploads/2014/04/bigstock-Magic-Book-9990005.jpg
CREDITS
Images
Databricks: Datasets: https://databricks.com/blog/2016/01/04/introducing-apache-spark-datasets.htmlCatalyst: https://www.slideshare.net/databricks/a-deep-dive-into-spark-sqls-catalyst-optimizer-with-yin-huaihttps://de.slideshare.net/SparkSummit/deep-dive-into-catalyst-apache-spark-20s-optimizer-63071120Tungsten: https://databricks.com/blog/2015/04/28/project-tungsten-bringing-spark-closer-to-bare-metal.htmlMatrix: https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.htmlGeneral Reference https://jaceklaskowski.gitbooks.io/mastering-apache-spark/https://spark.apache.org/docs/latest/sql-programming-guide.html
29
USING SPARK EFFICIENTLY
© 2017 Garren Staubli | Presentation resources: garrens.com/spark729
https://seesouthernstars.files.wordpress.com/2015/06/porkey.jpg