Anatomy of Spark Catalyst - Part 1
Deep dive into Spark Catalyst API
https://github.com/phatak-dev/anatomy-of-spark-catalyst
● Madhukara Phatak
● Technical Lead at Tellius
● Consultant and Trainer at datamantra.io
● Consults in Hadoop, Spark and Scala
● www.madhukaraphatak.com
Agenda
● DataFrame Internals
● Introduction to Catalyst
● Catalyst Trees
● Expressions
● Datatypes
● Row and InternalRow
● Custom expressions
● Codegen in expressions
● Janino
DataFrame internals
● Each DataFrame is internally represented as a logical plan in Spark
● These logical plans are converted into physical plans that execute on the RDD abstraction
● All plans are represented using a tree structure
● These trees, and the optimizations over them, come from Catalyst (see the sketch below)
● Ex: DFExample.scala
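As a minimal sketch of what DFExample.scala demonstrates (assuming a Spark 2.x-style SparkSession; `queryExecution` is a developer-facing API), every DataFrame carries a Catalyst logical plan that is optimized and then turned into a physical plan:

```scala
import org.apache.spark.sql.SparkSession

object DFExampleSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("df-plans").getOrCreate()
    import spark.implicits._

    val df = Seq((1, "a"), (2, "b")).toDF("id", "name").filter($"id" > 1)

    println(df.queryExecution.logical)       // Catalyst logical plan behind the DataFrame
    println(df.queryExecution.optimizedPlan) // logical plan after Catalyst optimizations
    println(df.queryExecution.executedPlan)  // physical plan that runs on the RDD abstraction

    spark.stop()
  }
}
```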
Code organization of Spark SQL
● The code of Spark SQL is organized into the below projects:
○ Catalyst
○ Core
○ Hive
○ Hive Thrift Server
○ Unsafe (external to Spark SQL)
● We are focusing on the Catalyst code in this session
● Core depends on Catalyst
● Catalyst depends on Unsafe for Tungsten code
Introduction to Catalyst
● An implementation-agnostic framework for manipulating trees of relational operators and expressions
● It defines all the expression, logical plan and optimization APIs for Spark SQL
● The Catalyst API is independent of RDD evaluation, so Catalyst operators and expressions can be evaluated without the RDD abstraction
● Introduced in Spark version 1.3
Catalyst in Spark SQL

Hive queries (HiveQL)       → Hive parser      ↘
Spark SQL queries (SparkQL) → SparkSQL parser  → Catalyst → Spark RDD code
DataFrame DSL               → DataFrame        ↗
Tree API
Catalyst Trees
● Everything in Catalyst is a tree
● A tree is a very powerful abstraction for defining hierarchical operators and expressions
● The important abstraction defined is TreeNode
● Every TreeNode has its own type
● Catalyst defines many helper functions, like forall and transform, which provide collection-like APIs for TreeNode
● These trees can be used in user programs
● Ex: org.apache.spark.sql.catalyst.trees.TreeNode
Expression trees
● We can define an arithmetic expression tree using the Catalyst tree abstraction
● Each node in the tree extends a trait, ArithmeticExpression
● We will have the following kinds of nodes (see the sketch after this list):
○ LeafValue
○ ADD
○ Divide
○ MUL
● Ex: com.madhukaraphatak.spark.catalyst.trees
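A minimal, self-contained sketch of such a tree, using the node names from the slide. The repo's version builds on Catalyst's TreeNode to inherit its traversal methods; here transformUp is hand-rolled, as an assumption, so the example runs without Spark on the classpath:

```scala
sealed trait ArithmeticExpression {
  def evaluate: Double

  // Apply `rule` bottom-up: rewrite children first, then this node
  // (mimics Catalyst's TreeNode.transformUp)
  def transformUp(rule: PartialFunction[ArithmeticExpression, ArithmeticExpression]): ArithmeticExpression = {
    val newNode = mapChildren(_.transformUp(rule))
    rule.applyOrElse(newNode, identity[ArithmeticExpression])
  }

  protected def mapChildren(f: ArithmeticExpression => ArithmeticExpression): ArithmeticExpression
}

case class LeafValue(value: Double) extends ArithmeticExpression {
  def evaluate: Double = value
  protected def mapChildren(f: ArithmeticExpression => ArithmeticExpression) = this
}

case class ADD(left: ArithmeticExpression, right: ArithmeticExpression) extends ArithmeticExpression {
  def evaluate: Double = left.evaluate + right.evaluate
  protected def mapChildren(f: ArithmeticExpression => ArithmeticExpression) = ADD(f(left), f(right))
}

case class MUL(left: ArithmeticExpression, right: ArithmeticExpression) extends ArithmeticExpression {
  def evaluate: Double = left.evaluate * right.evaluate
  protected def mapChildren(f: ArithmeticExpression => ArithmeticExpression) = MUL(f(left), f(right))
}

case class Divide(left: ArithmeticExpression, right: ArithmeticExpression) extends ArithmeticExpression {
  def evaluate: Double = left.evaluate / right.evaluate
  protected def mapChildren(f: ArithmeticExpression => ArithmeticExpression) = Divide(f(left), f(right))
}
```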
Creating Expression Trees
● In this example, we see how to use the earlier defined tree
● Let's take the below expression: 10 + 20
● It can be represented using the below tree: ADD(LeafValue(10), LeafValue(20))
● Calling evaluate on the root of the tree gives us the result
● Ex: ExpressionExample
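Using the sketch above (the repo's ExpressionExample may differ in detail):

```scala
// Represent 10 + 20 as a tree and evaluate from the root
val expr = ADD(LeafValue(10), LeafValue(20))
println(expr.evaluate) // 30.0
```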
Traversing trees
● Catalyst provides collection-like methods for traversing trees
● Using these methods, we can easily traverse our expression trees
● Collection-like APIs allow us to use Scala pattern matching to work with the defined nodes
● In our example, we will use the transform operation to change the evaluation of the divide-by-zero operation (sketched below)
● Ex: ExpressionTransformation
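A sketch of that transformation using the hand-rolled transformUp from earlier (the repo's ExpressionTransformation uses Catalyst's own transform): pattern matching picks out any Divide whose right child is the literal zero and rewrites it before evaluation.

```scala
val unsafe = Divide(LeafValue(10), LeafValue(0))

// Define x / 0 as 0 for this example instead of letting it evaluate to Infinity
val safe = unsafe.transformUp {
  case Divide(_, LeafValue(0.0)) => LeafValue(0)
}
println(safe.evaluate) // 0.0
```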
Optimising expressions
● One of the nice properties of collection-like APIs is the ability to optimize trees even before they are evaluated
● In our example, we will use transform and transformUp for optimizing the tree in the case of multiplication
● In this example, we optimize whenever there is a multiplication by zero (a sketch follows)
● Ex: ExpressionOptimization
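A sketch of the multiply-by-zero optimization (node names assumed from the slides; the repo's ExpressionOptimization may differ): any MUL with a zero operand collapses to LeafValue(0) without evaluating the other side, and because transformUp rewrites children first, nested zeros propagate to the root, as the next slide illustrates.

```scala
val tree = MUL(LeafValue(19), MUL(LeafValue(19), LeafValue(0)))

val optimized = tree.transformUp {
  case MUL(LeafValue(0.0), _) => LeafValue(0) // 0 * x == 0
  case MUL(_, LeafValue(0.0)) => LeafValue(0) // x * 0 == 0
}
println(optimized) // LeafValue(0.0)
```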
Understanding transformUp

MUL(LeafValue(19), MUL(LeafValue(19), LeafValue(0)))
  → transformUp → MUL(LeafValue(19), LeafValue(0))
  → transformUp → LeafValue(0)
Catalyst Expressions
Catalyst Expressions
● Just as with our simple expressions, Catalyst defines expressions for all the different kinds of SQL operations
● The base abstraction is Expression, which extends TreeNode
● Different kinds of expressions are:
○ LeafExpression
○ UnaryExpression
○ BinaryExpression
○ MathExpression
● org.apache.spark.sql.catalyst.expressions.Expression
Simple Expressions
● In this example, we look at how to represent some simple Spark SQL expressions in the Catalyst API
● We are going to look at the below ones (see the sketch after this list):
○ Literal (leaf expression)
○ UnaryMinus (unary expression)
○ Add (binary expression)
○ Cos (math expression)
● We can evaluate all these expressions without a SparkContext or an RDD
● Ex: SimpleExpression
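A sketch of evaluating these Catalyst expressions directly, with no SparkContext or RDD involved (these classes are internal API and can move between Spark versions):

```scala
import org.apache.spark.sql.catalyst.expressions.{Add, Cos, Literal, UnaryMinus}

val literal = Literal(10.0)                     // leaf expression
val minus   = UnaryMinus(literal)               // unary expression
val add     = Add(Literal(10.0), Literal(20.0)) // binary expression
val cos     = Cos(Literal(0.0))                 // math expression

// eval's input row defaults to null, which is fine for literal-only trees
println(minus.eval()) // -10.0
println(add.eval())   // 30.0
println(cos.eval())   // 1.0
```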
Datatypes of Catalyst
● All expressions in Catalyst take typed inputs and return typed outputs
● All these types are represented using DataType
● The supported data types are:
○ Native (Long, Double, String etc.)
○ Complex (StructField, Array, Map)
○ Container (StructType)
● All of these are internally mapped to Java data types
● org.apache.spark.sql.types.DataType
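A small sketch of these types in use: a schema is a StructType container holding StructFields, each carrying a native or complex DataType.

```scala
import org.apache.spark.sql.types._

val schema = StructType(Seq(
  StructField("id", LongType, nullable = false),        // native type
  StructField("scores", ArrayType(DoubleType)),         // complex type
  StructField("attrs", MapType(StringType, StringType)) // complex type
))
println(schema.treeString) // prints the schema as a tree
```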
Row
● Row is the abstraction representing the data operated on by expressions
● A Row is a sequence of values, in the order in which the schema is defined
● There are multiple implementations of Row, each optimized for specific use cases
● Row plays a central role in moving from the DataFrame world to the RDD world, as a DataFrame is represented as RDD[Row]
● Ex: RowExample
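A sketch of creating and reading a Row directly, using the user-facing factory on org.apache.spark.sql.Row:

```scala
import org.apache.spark.sql.Row

val row = Row(1L, "spark", 3.14) // values in schema order
println(row.getLong(0))          // 1
println(row.getString(1))        // spark
println(row.get(2))              // 3.14
```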
InternalRow and UnsafeRow
● InternalRow is the representation of a DataFrame row for internal usage
● The separation between the user-facing Row and InternalRow was done in version 1.4, to support custom memory management
● In the normal implementation, Row and InternalRow behave similarly
● But with custom memory management, they differ a lot
● Ex: InternalRowExample
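A sketch contrasting the two (InternalRow is internal API and can change between versions); one visible difference is that strings live as UTF8String in the internal representation:

```scala
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.unsafe.types.UTF8String

// Internally, strings are UTF8String rather than java.lang.String
val internal = InternalRow(1L, UTF8String.fromString("spark"))
println(internal.getLong(0))       // 1
println(internal.getUTF8String(1)) // spark
```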
Complex expressions
● Our earlier expressions only worked on literal values
● Now that we know what an expression works on, we can understand more complex expressions
● Some of the expressions we are going to look at (sketched after this list):
○ BoundReference
○ Add with BoundReference
○ Alias
● Ex: ComplexExpressions
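A sketch of these (internal API, subject to change between versions): a BoundReference points at an ordinal in the input row, so the expression now needs a row to evaluate against.

```scala
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.{Add, Alias, BoundReference}
import org.apache.spark.sql.types.LongType

val left  = BoundReference(0, LongType, nullable = false) // column 0 of the input row
val right = BoundReference(1, LongType, nullable = false) // column 1 of the input row
val sum   = Alias(Add(left, right), "sum")()              // named expression

println(sum.eval(InternalRow(10L, 20L))) // 30
```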
Custom Expression example
● Not only can we use the existing expressions, we can add our own custom expressions too
● As of now, adding custom expressions is a private API, so your code has to live in the org.apache.spark.sql package
● In our example, we see how to add our own custom add implementation (a sketch follows)
● Ex: CustomAddExpression
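A sketch of what such a custom add could look like (the shape here is an assumption; the talk's CustomAddExpression may differ). BinaryExpression's nullSafeEval is invoked only after both children evaluate to non-null values:

```scala
package org.apache.spark.sql // private API, so the code lives in this package

import org.apache.spark.sql.catalyst.expressions.{BinaryExpression, Expression}
import org.apache.spark.sql.catalyst.expressions.codegen.CodegenFallback
import org.apache.spark.sql.types.{DataType, LongType}

// A Long-only custom add; CodegenFallback keeps it eval-only (no codegen)
case class CustomAdd(left: Expression, right: Expression)
    extends BinaryExpression with CodegenFallback {

  override def dataType: DataType = LongType

  override protected def nullSafeEval(input1: Any, input2: Any): Any =
    input1.asInstanceOf[Long] + input2.asInstanceOf[Long]
}
```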
Catalyst CodeGen
Expression Evaluation
● Every expression has to be evaluated at runtime to produce the appropriate value; we have used eval till now
● Most of the time in Spark SQL is spent on expression evaluation
● Expression evaluation is compute-bound work
● As individual cores are not getting faster, speeding up this evaluation improves the overall time
● So from 1.4, Spark started generating code to evaluate expressions
CodeGen
● Each expression comes with a method called genCode(), which produces the Java code used to evaluate it
● This generated code can use primitive types and mutable values rather than boxed values and immutable data structures
● These expressions are compiled at runtime using the Janino compiler
● We can use the CodegenFallback trait to skip codegen
● Ex: ExpressionCodeGenExample
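A sketch of triggering codegen for an expression (GenerateUnsafeProjection is internal API whose location and behavior vary across versions): the projection below is Java code generated from the expression tree and compiled by Janino.

```scala
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.{Add, BoundReference, Literal}
import org.apache.spark.sql.catalyst.expressions.codegen.GenerateUnsafeProjection
import org.apache.spark.sql.types.LongType

val expr = Add(BoundReference(0, LongType, nullable = false), Literal(1L))

// Generates Java source for the expression and compiles it with Janino
val projection = GenerateUnsafeProjection.generate(Seq(expr))
val result = projection(InternalRow(41L))
println(result.getLong(0)) // 42
```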
Janino
● Janino is a super-small, super-fast Java compiler
● It's a runtime library for compiling Java code
● Janino is used by many frameworks, like Groovy, Logback etc.
● Spark uses Janino to compile the expression code into bytecode, and uses that bytecode to evaluate the expression
● In 2.0, this code generation is applied to a whole stage rather than a single expression at a time
● Ex: JaninoExample
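A minimal sketch of using Janino directly from Scala (the talk's JaninoExample may differ): compile a Java expression to bytecode at runtime and evaluate it.

```scala
import org.codehaus.janino.ExpressionEvaluator

val ee = new ExpressionEvaluator()
ee.setExpressionType(classOf[Int])
ee.setParameters(Array("a", "b"), Array[Class[_]](classOf[Int], classOf[Int]))
ee.cook("a + b") // compiles the Java expression to bytecode

println(ee.evaluate(Array[AnyRef](Int.box(10), Int.box(20)))) // 30
```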
Recap
● Trees
● Expressions
● Data types
● Row and InternalRow
● Eval
● CodeGen
● Janino
References
● Dataframe Anatomy: https://www.youtube.com/watch?v=iKOGBr-kOks
● Tungsten Anatomy: https://www.youtube.com/watch?v=7nIMpD5TyNs
● Janino: http://janino-compiler.github.io/janino/