Profiling & Testing with Spark
![Page 1: Profiling & Testing with Spark](https://reader031.vdocuments.us/reader031/viewer/2022022419/589e77021a28ab300b8b5693/html5/thumbnails/1.jpg)
Profiling & Testing with Spark: Apache Spark 2.0 Improvements, Flame Graphs & Testing
![Page 2: Profiling & Testing with Spark](https://reader031.vdocuments.us/reader031/viewer/2022022419/589e77021a28ab300b8b5693/html5/thumbnails/2.jpg)
Outline
Overview
Spark 2.0 Improvements
Profiling with Flame Graphs
How-to Flame Graphs
Testing in Spark
![Page 3: Profiling & Testing with Spark](https://reader031.vdocuments.us/reader031/viewer/2022022419/589e77021a28ab300b8b5693/html5/thumbnails/3.jpg)
Overview
Apache Spark™ is a fast and general engine for large-scale data processing
Speed: runs computations in memory, up to 100x faster than MapReduce
Ease of use: bindings for Java, Scala, Python and R
Generality: supports SQL, streaming and complex analytics (ML)
Portability: runs on YARN, Mesos, standalone or in the cloud
![Page 4: Profiling & Testing with Spark](https://reader031.vdocuments.us/reader031/viewer/2022022419/589e77021a28ab300b8b5693/html5/thumbnails/4.jpg)
Overview (Big Picture)
![Page 5: Profiling & Testing with Spark](https://reader031.vdocuments.us/reader031/viewer/2022022419/589e77021a28ab300b8b5693/html5/thumbnails/5.jpg)
Overview (architecture)
![Page 6: Profiling & Testing with Spark](https://reader031.vdocuments.us/reader031/viewer/2022022419/589e77021a28ab300b8b5693/html5/thumbnails/6.jpg)
Overview (code sample)
Monte Carlo π calculation
“This code estimates π by "throwing darts" at a circle. We pick random points in the unit square ((0, 0) to (1,1)) and see how many fall in the unit circle. The fraction should be π / 4, so we use this to get our estimate.”
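The code itself is only visible in the slide image; a minimal plain-Python sketch of the same dart-throwing logic follows (the Spark version would simply distribute the sampling loop, e.g. via `sc.parallelize(...).map(...).reduce(...)` — the function name and seed below are illustrative, not from the slides):

```python
import random

def estimate_pi(num_samples: int, seed: int = 42) -> float:
    """Estimate pi by throwing `num_samples` random darts at the unit square
    and counting how many land inside the circle of radius 1."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(num_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y < 1.0:  # dart landed inside the circle
            inside += 1
    # fraction inside the circle ~ pi / 4
    return 4.0 * inside / num_samples

print(estimate_pi(100_000))
```

The accuracy improves with the sample count, which is exactly why the Spark version parallelizes the loop across executors.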
![Page 7: Profiling & Testing with Spark](https://reader031.vdocuments.us/reader031/viewer/2022022419/589e77021a28ab300b8b5693/html5/thumbnails/7.jpg)
Main Takeaway
Spark SQL:
Provides affordable parallelism at scale
Scales out SQL on storage for Big Data volumes
Scales out on CPU for memory-intensive queries
Makes offloading reports from an RDBMS attractive
Spark 2.0 improvements:
Considerable speedup of CPU-intensive queries
![Page 8: Profiling & Testing with Spark](https://reader031.vdocuments.us/reader031/viewer/2022022419/589e77021a28ab300b8b5693/html5/thumbnails/8.jpg)
Spark 2.0 Improvements
![Page 9: Profiling & Testing with Spark](https://reader031.vdocuments.us/reader031/viewer/2022022419/589e77021a28ab300b8b5693/html5/thumbnails/9.jpg)
SQL Queries
sqlContext.sql("""
  SELECT a.bucket, sum(a.val2) tot
  FROM t1 a, t1 b
  WHERE a.bucket = b.bucket
    AND a.val1 + b.val1 < 1000
  GROUP BY a.bucket
  ORDER BY a.bucket""").show()
A complex and resource-intensive SELECT statement; inspect it with the EXPLAIN directive (execution plan)
![Page 10: Profiling & Testing with Spark](https://reader031.vdocuments.us/reader031/viewer/2022022419/589e77021a28ab300b8b5693/html5/thumbnails/10.jpg)
Execution Plan
The execution plan:
First instrumentation point for SQL tuning
Shows how Spark wants to execute the query (break-down)
Main players:
Catalyst: the query optimizer
![Page 11: Profiling & Testing with Spark](https://reader031.vdocuments.us/reader031/viewer/2022022419/589e77021a28ab300b8b5693/html5/thumbnails/11.jpg)
Catalyst (query optimizer)
Logical Plan:
Describes the computation on the data sets without specifying how to carry it out
Physical Plan:
Defines how the computation is carried out on each dataset (concrete operators)
![Page 12: Profiling & Testing with Spark](https://reader031.vdocuments.us/reader031/viewer/2022022419/589e77021a28ab300b8b5693/html5/thumbnails/12.jpg)
Project Tungsten (Goal)
“Improves the memory and CPU efficiency of Spark backend execution by pushing performance close to the limits of modern hardware.”
![Page 13: Profiling & Testing with Spark](https://reader031.vdocuments.us/reader031/viewer/2022022419/589e77021a28ab300b8b5693/html5/thumbnails/13.jpg)
Project Tungsten
Perform manual memory management instead of relying on Java objects:
Reduce memory footprint
Eliminate garbage collection overheads
Use sun.misc.Unsafe and off-heap memory
Code generation for expression evaluation:
Reduce virtual function calls and interpretation overhead (JVM)
![Page 14: Profiling & Testing with Spark](https://reader031.vdocuments.us/reader031/viewer/2022022419/589e77021a28ab300b8b5693/html5/thumbnails/14.jpg)
Project Tungsten (Code-Gen)
![Page 15: Profiling & Testing with Spark](https://reader031.vdocuments.us/reader031/viewer/2022022419/589e77021a28ab300b8b5693/html5/thumbnails/15.jpg)
Project Tungsten (Code-Gen)
The Volcano Iterator Model:
The standard for 30 years: almost all databases use it.
Each operator is an “iterator” that consumes records from its input operator.
![Page 16: Profiling & Testing with Spark](https://reader031.vdocuments.us/reader031/viewer/2022022419/589e77021a28ab300b8b5693/html5/thumbnails/16.jpg)
Project Tungsten (Code-Gen)
Downsides of the Volcano Iterator Model:
Too many virtual function calls
at least 3 calls for each row in Aggregate phase
Can’t take advantage of modern CPU features:
pipelining, prefetching, branch prediction, SIMD, ILP, exploiting the instruction cache, ...
![Page 17: Profiling & Testing with Spark](https://reader031.vdocuments.us/reader031/viewer/2022022419/589e77021a28ab300b8b5693/html5/thumbnails/17.jpg)
Project Tungsten (Code-Gen)
What if we hire a college freshman to implement this query in Java in 10 mins?
![Page 18: Profiling & Testing with Spark](https://reader031.vdocuments.us/reader031/viewer/2022022419/589e77021a28ab300b8b5693/html5/thumbnails/18.jpg)
Whole-stage Code-Gen: Spark as a “Compiler”
![Page 19: Profiling & Testing with Spark](https://reader031.vdocuments.us/reader031/viewer/2022022419/589e77021a28ab300b8b5693/html5/thumbnails/19.jpg)
Project Tungsten (Code-Gen)
A student beating 30 years of science ...
![Page 20: Profiling & Testing with Spark](https://reader031.vdocuments.us/reader031/viewer/2022022419/589e77021a28ab300b8b5693/html5/thumbnails/20.jpg)
Project Tungsten (Code-Gen)
Volcano
● Many virtual function calls
● Data in memory (or cache)
● No loop unrolling, SIMD
Hand-written code
● No virtual function calls
● Data in CPU registers
● Exploit compiler optimizations
○ loop unrolling, SIMD, pipelining
Take advantage of all the information that is known after query compilation
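The contrast above can be sketched as a toy model in plain Python, with generators standing in for Volcano iterators and a single fused loop standing in for whole-stage generated code (the query and data are made up for illustration):

```python
# Toy model of the two execution styles for the same query:
#   SELECT count(*) FROM data WHERE value > 10

data = list(range(100))  # stand-in for a scanned table

# Volcano style: each operator pulls rows one at a time from its child.
def scan(rows):
    for row in rows:            # one per-row call into the child
        yield row

def filter_gt_10(child):
    for row in child:           # another per-row call
        if row > 10:
            yield row

def count(child):
    n = 0
    for _ in child:             # and another
        n += 1
    return n

volcano_result = count(filter_gt_10(scan(data)))

# "Hand-written" (whole-stage codegen) style: one fused loop,
# no per-row calls between operators, data stays in registers.
fused_result = 0
for row in data:
    if row > 10:
        fused_result += 1

# Same answer, very different CPU profile.
assert volcano_result == fused_result
print(volcano_result)
```

In the real engine the fused loop is Java code generated by Janino at query-compile time, which is what lets the JIT and CPU apply the optimizations listed above.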
![Page 21: Profiling & Testing with Spark](https://reader031.vdocuments.us/reader031/viewer/2022022419/589e77021a28ab300b8b5693/html5/thumbnails/21.jpg)
Execution plan comparison (legacy vs whole stage code-gen)
WholeStageCodeGen
![Page 22: Profiling & Testing with Spark](https://reader031.vdocuments.us/reader031/viewer/2022022419/589e77021a28ab300b8b5693/html5/thumbnails/22.jpg)
Profiling with Flame Graphs
![Page 23: Profiling & Testing with Spark](https://reader031.vdocuments.us/reader031/viewer/2022022419/589e77021a28ab300b8b5693/html5/thumbnails/23.jpg)
Root Cause Analysis
Benchmarking:
Run the workload and measure it with the relevant diagnostic tools
Goals: understand the bottleneck(s) and find root causes
Limitations:
Our tools & time available for analysis are limiting factors
![Page 24: Profiling & Testing with Spark](https://reader031.vdocuments.us/reader031/viewer/2022022419/589e77021a28ab300b8b5693/html5/thumbnails/24.jpg)
Profiling CPU-Bound workloads
Flame graph visualization of stack profiles:
● Brainchild of Brendan Gregg (Dec 2011)
● Code: https://github.com/brendangregg/FlameGraph
● Now very popular, available for many languages, also for JVM
Shows which parts of the code are hot
● Very useful to understand where CPU cycles are spent
![Page 25: Profiling & Testing with Spark](https://reader031.vdocuments.us/reader031/viewer/2022022419/589e77021a28ab300b8b5693/html5/thumbnails/25.jpg)
Flame Graph Visualization
Recipe:
● Gather multiple stack traces
● Aggregate them by sorting alphabetically by function/method name
● Visualize them as stacked colored boxes
● Width of each box is proportional to the time spent there
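The aggregation step of the recipe can be sketched in a few lines: collapse each sampled stack into a single semicolon-joined line and count duplicates, producing the “folded” format that flame-graph renderers consume (in practice the `stackcollapse-*.pl` scripts in Brendan Gregg’s FlameGraph repo do this; the sample data below is invented):

```python
from collections import Counter

def fold_stacks(samples):
    """Aggregate raw stack samples into the 'folded' format:
    one line per unique stack, followed by its sample count."""
    counts = Counter(";".join(stack) for stack in samples)
    return [f"{stack} {n}" for stack, n in sorted(counts.items())]

# Four captured samples; main->work->parse occurred twice.
samples = [
    ["main", "work", "parse"],
    ["main", "work", "parse"],
    ["main", "work", "render"],
    ["main", "idle"],
]
for line in fold_stacks(samples):
    print(line)
```

Each output line maps directly to one box stack in the final SVG, with the count determining the box width.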
![Page 26: Profiling & Testing with Spark](https://reader031.vdocuments.us/reader031/viewer/2022022419/589e77021a28ab300b8b5693/html5/thumbnails/26.jpg)
Flame Graph (Spark 1.6)
![Page 27: Profiling & Testing with Spark](https://reader031.vdocuments.us/reader031/viewer/2022022419/589e77021a28ab300b8b5693/html5/thumbnails/27.jpg)
Flame Graph (Spark 2.0)
![Page 28: Profiling & Testing with Spark](https://reader031.vdocuments.us/reader031/viewer/2022022419/589e77021a28ab300b8b5693/html5/thumbnails/28.jpg)
Spark CodeGen vs. Volcano
Code generation improves CPU-intensive workloads
● Replaces loops and virtual function calls with code generated for the query
● The use of vector operations (e.g. SIMD) is also beneficial
● Codegen is crucial for modern in-memory DBs
Commercial RDBMS engines
● Typically use the slower volcano model (with loops and virtual function calls)
● In the past optimizing for I/O latency was more important; now CPU cycles matter more
![Page 29: Profiling & Testing with Spark](https://reader031.vdocuments.us/reader031/viewer/2022022419/589e77021a28ab300b8b5693/html5/thumbnails/29.jpg)
Flame Graphs
Pros: good to understand where CPU cycles are spent
● Useful for performance troubleshooting
● Functions at the top of the graph are the ones using CPU
● Parent methods/functions provide context
Limitations:
● Off-CPU and wait time are not charted (off-CPU flame graphs are still experimental)
● Interpreting flame graphs requires experience/knowledge
● Not included in the Spark monitoring suite
![Page 30: Profiling & Testing with Spark](https://reader031.vdocuments.us/reader031/viewer/2022022419/589e77021a28ab300b8b5693/html5/thumbnails/30.jpg)
How-to Flame Graphs
![Page 31: Profiling & Testing with Spark](https://reader031.vdocuments.us/reader031/viewer/2022022419/589e77021a28ab300b8b5693/html5/thumbnails/31.jpg)
CERN Java Flight Recorder Approach (1/2)
Enable Java Flight Recorder (JFR)
● Extra options in spark-defaults.conf or CLI. Example:
Collect data with jcmd:
● Example, sampling for 10 sec:
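The concrete options were shown on the slide image; a typical setup might look like the fragment below (the flags are for Oracle/OpenJDK 8 — JDK 11+ no longer needs the unlock flag — and the pid and file paths are placeholders):

```shell
# spark-defaults.conf: enable JFR on driver and executors (JDK 8 flags)
spark.driver.extraJavaOptions   -XX:+UnlockCommercialFeatures -XX:+FlightRecorder
spark.executor.extraJavaOptions -XX:+UnlockCommercialFeatures -XX:+FlightRecorder

# Sample a running executor JVM for 10 seconds with jcmd:
jcmd <executor-pid> JFR.start duration=10s filename=/tmp/executor.jfr
```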
![Page 32: Profiling & Testing with Spark](https://reader031.vdocuments.us/reader031/viewer/2022022419/589e77021a28ab300b8b5693/html5/thumbnails/32.jpg)
CERN Java Flight Recorder Approach (2/2)
Process the jfr file:
● From .jfr to merged stacks
● Produce the .svg file with the flame graph
● Find details in Kay Ousterhout’s article: https://gist.github.com/kayousterhout/7008a8ebf2babeedc7ce6f8723fd1bf4
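In outline, the post-processing is a two-step pipeline (the exact .jfr-to-folded-stacks converter is the one described in Kay Ousterhout’s gist; it is shown here as a placeholder, and `flamegraph.pl` is from the FlameGraph repo):

```shell
# Step 1: convert the .jfr recording into merged (folded) stacks
<jfr-to-folded-converter> /tmp/executor.jfr > stacks.folded

# Step 2: render the flame graph as an SVG
flamegraph.pl stacks.folded > flamegraph.svg
```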
![Page 33: Profiling & Testing with Spark](https://reader031.vdocuments.us/reader031/viewer/2022022419/589e77021a28ab300b8b5693/html5/thumbnails/33.jpg)
PayPal Approach
https://www.paypal-engineering.com/2016/09/08/spark-in-flames-profiling-spark-applications-using-flame-graphs/
![Page 34: Profiling & Testing with Spark](https://reader031.vdocuments.us/reader031/viewer/2022022419/589e77021a28ab300b8b5693/html5/thumbnails/34.jpg)
CERN HProfiler Approach
HProfiler (CERN home-built tool)
● Automates collection and aggregation of stack traces into flame graphs for distributed applications
● Integrates with YARN to identify the processes to trace across the cluster
Based on Linux perf_events stack sampling (bare metal)
Experimental tool
● Author: Joeri Hermans @ CERN
● https://github.com/cerndb/Hadoop-Profiler
● Blog post: “Hadoop performance troubleshooting with stack tracing”
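HProfiler’s collection layer builds on the classic perf_events flame-graph recipe; a generic sketch of that recipe follows (the pid is a placeholder, and `stackcollapse-perf.pl` / `flamegraph.pl` are scripts from Brendan Gregg’s FlameGraph repo):

```shell
# Sample kernel+user stacks of one process at 99 Hz for 30 seconds
perf record -F 99 -g -p <pid> -- sleep 30

# Fold the samples and render the flame graph
perf script | ./stackcollapse-perf.pl > stacks.folded
./flamegraph.pl stacks.folded > flamegraph.svg
```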
![Page 35: Profiling & Testing with Spark](https://reader031.vdocuments.us/reader031/viewer/2022022419/589e77021a28ab300b8b5693/html5/thumbnails/35.jpg)
Testing in Spark
![Page 36: Profiling & Testing with Spark](https://reader031.vdocuments.us/reader031/viewer/2022022419/589e77021a28ab300b8b5693/html5/thumbnails/36.jpg)
Testing in Spark
● Why run Spark outside of a cluster
● What to test
● Running Local
● Running as a Unit Test
● Data Structures
![Page 37: Profiling & Testing with Spark](https://reader031.vdocuments.us/reader031/viewer/2022022419/589e77021a28ab300b8b5693/html5/thumbnails/37.jpg)
Testing in Spark
Why run Spark outside of a cluster
● Time
● Trusted Deployment
● Money
![Page 38: Profiling & Testing with Spark](https://reader031.vdocuments.us/reader031/viewer/2022022419/589e77021a28ab300b8b5693/html5/thumbnails/38.jpg)
Testing in Spark
What to test
● Experiments
● Complex logic
● Data samples
● Business generated scenarios
![Page 39: Profiling & Testing with Spark](https://reader031.vdocuments.us/reader031/viewer/2022022419/589e77021a28ab300b8b5693/html5/thumbnails/39.jpg)
Testing in Spark (Running Local)
Running Local
● A test doesn’t always need to be a unit test
● Notebook UIs like Zeppelin are fine for quick feedback, but lack IDE features
● Running locally in your IDE is priceless
![Page 40: Profiling & Testing with Spark](https://reader031.vdocuments.us/reader031/viewer/2022022419/589e77021a28ab300b8b5693/html5/thumbnails/40.jpg)
Testing in Spark (Running Local)
Example
● Use runLocal flag to set a local SparkContext
● Separate out testable work from driver code
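The second point can be sketched in plain Python: keep the transformation logic in a function that takes and returns ordinary iterables, so it can be tested without any SparkContext at all (the function name and data format below are illustrative, not from the slides):

```python
def clean_amounts(lines):
    """Parse 'id,amount' lines, dropping malformed rows.

    Pure function over plain iterables: no Spark dependency,
    so it is trivially unit-testable.
    """
    out = []
    for line in lines:
        parts = line.split(",")
        if len(parts) == 2:
            try:
                out.append((parts[0], float(parts[1])))
            except ValueError:
                pass  # non-numeric amount: drop the row
    return out

# Tested directly on a plain list...
assert clean_amounts(["a,1.5", "bad", "b,2"]) == [("a", 1.5), ("b", 2.0)]

# ...while the driver code would hand the same function to Spark, e.g.:
#   rdd.mapPartitions(clean_amounts)
print(clean_amounts(["a,1.5", "bad,row,x", "b,2"]))
```

The driver then stays a thin shell (build the context, load the data, call the function), which is the part you exercise only in integration tests.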
![Page 41: Profiling & Testing with Spark](https://reader031.vdocuments.us/reader031/viewer/2022022419/589e77021a28ab300b8b5693/html5/thumbnails/41.jpg)
Testing in Spark (Unit Testing)
Example
FunSuite: a TDD-style unit testing suite for Scala (part of ScalaTest)
![Page 42: Profiling & Testing with Spark](https://reader031.vdocuments.us/reader031/viewer/2022022419/589e77021a28ab300b8b5693/html5/thumbnails/42.jpg)
Testing in Spark (Data Structures)
Working with “hand-written” DataFrames:
![Page 43: Profiling & Testing with Spark](https://reader031.vdocuments.us/reader031/viewer/2022022419/589e77021a28ab300b8b5693/html5/thumbnails/43.jpg)
Testing in Spark (Hive)
Testing with Hive:
● Spin-up a docker-hive container for Apache Hive (Big Data Europe)
● Enables real interactions, allowing you to:
○ create, delete, write, ...
![Page 44: Profiling & Testing with Spark](https://reader031.vdocuments.us/reader031/viewer/2022022419/589e77021a28ab300b8b5693/html5/thumbnails/44.jpg)
Testing in Spark (Hive)
Putting Hive + Spark together:
● Create a custom hive-site.xml
● Start Spark with the provided hive-site.xml
○ spark-shell --files /PATH/hive-site.xml
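A minimal sketch of such a hive-site.xml, assuming the docker-hive metastore is exposed on localhost:9083 (host and port are assumptions for a local setup):

```xml
<!-- hive-site.xml: point Spark at the container's Hive metastore -->
<configuration>
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://localhost:9083</value>
  </property>
</configuration>
```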
![Page 45: Profiling & Testing with Spark](https://reader031.vdocuments.us/reader031/viewer/2022022419/589e77021a28ab300b8b5693/html5/thumbnails/45.jpg)
Testing in Spark (Hive)
Start Spark with the provided hive-site.xml:
![Page 46: Profiling & Testing with Spark](https://reader031.vdocuments.us/reader031/viewer/2022022419/589e77021a28ab300b8b5693/html5/thumbnails/46.jpg)
Testing in Spark (Mini-Clusters)
Mini-Clusters
● Hadoop-mini-cluster
● Spark-unit-testing-with-hdfs
● Support for:
○ HBase & Hive
○ Kafka & Storm
○ Zookeeper
○ HDFS
● Access HDFS files, test code, and copy files from the local FS to HDFS
![Page 47: Profiling & Testing with Spark](https://reader031.vdocuments.us/reader031/viewer/2022022419/589e77021a28ab300b8b5693/html5/thumbnails/47.jpg)
Conclusions
![Page 48: Profiling & Testing with Spark](https://reader031.vdocuments.us/reader031/viewer/2022022419/589e77021a28ab300b8b5693/html5/thumbnails/48.jpg)
Conclusions
Apache Spark 2.0 Improvements (HDP 2.5 in tech preview)
● Scalability and performance on commodity hardware
● Spark SQL is useful for offloading queries from traditional RDBMS
● Code generation speeds up CPU-bound workloads by up to an order of magnitude
Diagnostics
● Profiling tools are important in the MPP world
● Execution plans can be analyzed with flame graphs
● Cons: the solutions are still very immature
Testing
● Testing locally saves time and money, and takes advantage of IDE features
● Elegant ways to test code by using a local SparkContext
● Easy ways to recreate environments for testing real interactions (such as Hadoop)
![Page 49: Profiling & Testing with Spark](https://reader031.vdocuments.us/reader031/viewer/2022022419/589e77021a28ab300b8b5693/html5/thumbnails/49.jpg)
Profiling & Testing with Spark
THANK YOU!
![Page 50: Profiling & Testing with Spark](https://reader031.vdocuments.us/reader031/viewer/2022022419/589e77021a28ab300b8b5693/html5/thumbnails/50.jpg)
References
● Deep dive into Catalyst: Apache Spark 2.0
● http://es.slideshare.net/databricks/spark-performance-whats-next
● https://paperhub.s3.amazonaws.com/dace52a42c07f7f8348b08dc2b186061.pdf
● http://www.brendangregg.com/flamegraphs.html
● http://db-blog.web.cern.ch/
● http://www.slideshare.net/SparkSummit/spark-summit-eu-talk-by-ted-malaska
![Page 51: Profiling & Testing with Spark](https://reader031.vdocuments.us/reader031/viewer/2022022419/589e77021a28ab300b8b5693/html5/thumbnails/51.jpg)
Q & A