Profiling & Testing with Spark
![Page 1: Profiling & Testing with Spark](https://reader031.vdocuments.us/reader031/viewer/2022022419/589e77021a28ab300b8b5693/html5/thumbnails/1.jpg)
Profiling & Testing with Spark: Apache Spark 2.0 Improvements, Flame Graphs & Testing
![Page 2: Profiling & Testing with Spark](https://reader031.vdocuments.us/reader031/viewer/2022022419/589e77021a28ab300b8b5693/html5/thumbnails/2.jpg)
Outline
Overview
Spark 2.0 Improvements
Profiling with Flame Graphs
How-to Flame Graphs
Testing in Spark
![Page 3: Profiling & Testing with Spark](https://reader031.vdocuments.us/reader031/viewer/2022022419/589e77021a28ab300b8b5693/html5/thumbnails/3.jpg)
Overview
Apache Spark™ is a fast and general engine for large-scale data processing
Speed: runs computations in memory, up to 100x faster than MapReduce
Ease of use: bindings for Java, Scala, Python and R
Generality: supports SQL, streaming and complex analytics (ML)
Portability: runs on YARN, Mesos, standalone or in the cloud
![Page 4: Profiling & Testing with Spark](https://reader031.vdocuments.us/reader031/viewer/2022022419/589e77021a28ab300b8b5693/html5/thumbnails/4.jpg)
Overview (Big Picture)
![Page 5: Profiling & Testing with Spark](https://reader031.vdocuments.us/reader031/viewer/2022022419/589e77021a28ab300b8b5693/html5/thumbnails/5.jpg)
Overview (architecture)
![Page 6: Profiling & Testing with Spark](https://reader031.vdocuments.us/reader031/viewer/2022022419/589e77021a28ab300b8b5693/html5/thumbnails/6.jpg)
Overview (code sample)
Monte Carlo π calculation
“This code estimates π by "throwing darts" at a circle. We pick random points in the unit square ((0, 0) to (1,1)) and see how many fall in the unit circle. The fraction should be π / 4, so we use this to get our estimate.”
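The code itself is only visible in the slide image; a minimal plain-Python sketch of the same dart-throwing logic follows (the Spark version would simply distribute the sampling loop, e.g. via `sc.parallelize(...).map(...).reduce(...)` — the function name and seed below are illustrative, not from the slides):

```python
import random

def estimate_pi(num_samples: int, seed: int = 42) -> float:
    """Estimate pi by throwing `num_samples` random darts at the unit square
    and counting how many land inside the circle of radius 1."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(num_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y < 1.0:  # dart landed inside the circle
            inside += 1
    # fraction inside the circle ~ pi / 4
    return 4.0 * inside / num_samples

print(estimate_pi(100_000))
```

The accuracy improves with the sample count, which is exactly why the Spark version parallelizes the loop across executors.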
![Page 7: Profiling & Testing with Spark](https://reader031.vdocuments.us/reader031/viewer/2022022419/589e77021a28ab300b8b5693/html5/thumbnails/7.jpg)
Main Takeaway
Spark SQL:
Provides affordable parallelism at scale
Scales out SQL on storage for Big Data volumes
Scales out on CPU for memory-intensive queries
Makes offloading reports from an RDBMS attractive
Spark 2.0 improvements:
Considerable speedup of CPU-intensive queries
![Page 8: Profiling & Testing with Spark](https://reader031.vdocuments.us/reader031/viewer/2022022419/589e77021a28ab300b8b5693/html5/thumbnails/8.jpg)
Spark 2.0 Improvements
![Page 9: Profiling & Testing with Spark](https://reader031.vdocuments.us/reader031/viewer/2022022419/589e77021a28ab300b8b5693/html5/thumbnails/9.jpg)
SQL Queries
sqlContext.sql("""
  SELECT a.bucket, sum(a.val2) tot
  FROM t1 a, t1 b
  WHERE a.bucket = b.bucket
    AND a.val1 + b.val1 < 1000
  GROUP BY a.bucket
  ORDER BY a.bucket""").show()
A complex and resource-intensive SELECT statement; inspect it with the EXPLAIN directive (execution plan)
![Page 10: Profiling & Testing with Spark](https://reader031.vdocuments.us/reader031/viewer/2022022419/589e77021a28ab300b8b5693/html5/thumbnails/10.jpg)
Execution Plan
The execution plan:
First instrumentation point for SQL tuning
Shows how Spark wants to execute the query (break-down)
Main players:
Catalyst: the query optimizer
![Page 11: Profiling & Testing with Spark](https://reader031.vdocuments.us/reader031/viewer/2022022419/589e77021a28ab300b8b5693/html5/thumbnails/11.jpg)
Catalyst (query optimizer)
Logical Plan:
Describes the computation on the data sets without specifying how to carry it out
Physical Plan:
Defines how the computation is carried out on each dataset (concrete operators)
![Page 12: Profiling & Testing with Spark](https://reader031.vdocuments.us/reader031/viewer/2022022419/589e77021a28ab300b8b5693/html5/thumbnails/12.jpg)
Project Tungsten (Goal)
“Improves the memory and CPU efficiency of Spark backend execution by pushing performance close to the limits of modern hardware.”
![Page 13: Profiling & Testing with Spark](https://reader031.vdocuments.us/reader031/viewer/2022022419/589e77021a28ab300b8b5693/html5/thumbnails/13.jpg)
Project Tungsten
Perform manual memory management instead of relying on Java objects:
Reduce memory footprint
Eliminate garbage collection overheads
Use sun.misc.Unsafe and off-heap memory
Code generation for expression evaluation:
Reduce virtual function calls and interpretation overhead (JVM)
![Page 14: Profiling & Testing with Spark](https://reader031.vdocuments.us/reader031/viewer/2022022419/589e77021a28ab300b8b5693/html5/thumbnails/14.jpg)
Project Tungsten (Code-Gen)
![Page 15: Profiling & Testing with Spark](https://reader031.vdocuments.us/reader031/viewer/2022022419/589e77021a28ab300b8b5693/html5/thumbnails/15.jpg)
Project Tungsten (Code-Gen)
The Volcano Iterator Model:
The standard for 30 years: almost all databases use it.
Each operator is an “iterator” that consumes records from its input operator.
![Page 16: Profiling & Testing with Spark](https://reader031.vdocuments.us/reader031/viewer/2022022419/589e77021a28ab300b8b5693/html5/thumbnails/16.jpg)
Project Tungsten (Code-Gen)
Downsides of the Volcano Iterator Model:
Too many virtual function calls
at least 3 calls for each row in Aggregate phase
Can’t take advantage of modern CPU features:
pipelining, prefetching, branch prediction, SIMD, ILP, exploiting the instruction cache, ...
![Page 17: Profiling & Testing with Spark](https://reader031.vdocuments.us/reader031/viewer/2022022419/589e77021a28ab300b8b5693/html5/thumbnails/17.jpg)
Project Tungsten (Code-Gen)
What if we hire a college freshman to implement this query in Java in 10 mins?
![Page 18: Profiling & Testing with Spark](https://reader031.vdocuments.us/reader031/viewer/2022022419/589e77021a28ab300b8b5693/html5/thumbnails/18.jpg)
Whole-stage Code-Gen: Spark as a “Compiler”
![Page 19: Profiling & Testing with Spark](https://reader031.vdocuments.us/reader031/viewer/2022022419/589e77021a28ab300b8b5693/html5/thumbnails/19.jpg)
Project Tungsten (Code-Gen)
A student beating 30 years of science ...
![Page 20: Profiling & Testing with Spark](https://reader031.vdocuments.us/reader031/viewer/2022022419/589e77021a28ab300b8b5693/html5/thumbnails/20.jpg)
Project Tungsten (Code-Gen)
Volcano
● Many virtual function calls
● Data in memory (or cache)
● No loop unrolling, SIMD
Hand-written code
● No virtual function calls
● Data in CPU registers
● Exploit compiler optimizations
○ loop unrolling, SIMD, pipelining
Take advantage of all the information that is known after query compilation
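The contrast above can be sketched as a toy model in plain Python, with generators standing in for Volcano iterators and a single fused loop standing in for whole-stage generated code (the query and data are made up for illustration):

```python
# Toy model of the two execution styles for the same query:
#   SELECT count(*) FROM data WHERE value > 10

data = list(range(100))  # stand-in for a scanned table

# Volcano style: each operator pulls rows one at a time from its child.
def scan(rows):
    for row in rows:            # one per-row call into the child
        yield row

def filter_gt_10(child):
    for row in child:           # another per-row call
        if row > 10:
            yield row

def count(child):
    n = 0
    for _ in child:             # and another
        n += 1
    return n

volcano_result = count(filter_gt_10(scan(data)))

# "Hand-written" (whole-stage codegen) style: one fused loop,
# no per-row calls between operators, data stays in registers.
fused_result = 0
for row in data:
    if row > 10:
        fused_result += 1

# Same answer, very different CPU profile.
assert volcano_result == fused_result
print(volcano_result)
```

In the real engine the fused loop is Java code generated by Janino at query-compile time, which is what lets the JIT and CPU apply the optimizations listed above.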
![Page 21: Profiling & Testing with Spark](https://reader031.vdocuments.us/reader031/viewer/2022022419/589e77021a28ab300b8b5693/html5/thumbnails/21.jpg)
Execution plan comparison (legacy vs whole stage code-gen)
WholeStageCodeGen
![Page 22: Profiling & Testing with Spark](https://reader031.vdocuments.us/reader031/viewer/2022022419/589e77021a28ab300b8b5693/html5/thumbnails/22.jpg)
Profiling with Flame Graphs
![Page 23: Profiling & Testing with Spark](https://reader031.vdocuments.us/reader031/viewer/2022022419/589e77021a28ab300b8b5693/html5/thumbnails/23.jpg)
Root Cause Analysis
Benchmarking:
Run the workload and measure it with the relevant diagnostic tools
Goals: understand the bottleneck(s) and find root causes
Limitations:
Our tools & time available for analysis are limiting factors
![Page 24: Profiling & Testing with Spark](https://reader031.vdocuments.us/reader031/viewer/2022022419/589e77021a28ab300b8b5693/html5/thumbnails/24.jpg)
Profiling CPU-Bound workloads
Flame graph visualization of stack profiles:
● Brainchild of Brendan Gregg (Dec 2011)
● Code: https://github.com/brendangregg/FlameGraph
● Now very popular, available for many languages, also for JVM
Shows which parts of the code are hot
● Very useful to understand where CPU cycles are spent
![Page 25: Profiling & Testing with Spark](https://reader031.vdocuments.us/reader031/viewer/2022022419/589e77021a28ab300b8b5693/html5/thumbnails/25.jpg)
Flame Graph Visualization
Recipe:
● Gather multiple stack traces
● Aggregate them by sorting alphabetically by function/method name
● Visualize them as stacked colored boxes
● Width of each box is proportional to the time spent there
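The aggregation step of the recipe can be sketched in a few lines: collapse each sampled stack into a single semicolon-joined line and count duplicates, producing the “folded” format that flame-graph renderers consume (in practice the `stackcollapse-*.pl` scripts in Brendan Gregg’s FlameGraph repo do this; the sample data below is invented):

```python
from collections import Counter

def fold_stacks(samples):
    """Aggregate raw stack samples into the 'folded' format:
    one line per unique stack, followed by its sample count."""
    counts = Counter(";".join(stack) for stack in samples)
    return [f"{stack} {n}" for stack, n in sorted(counts.items())]

# Four captured samples; main->work->parse occurred twice.
samples = [
    ["main", "work", "parse"],
    ["main", "work", "parse"],
    ["main", "work", "render"],
    ["main", "idle"],
]
for line in fold_stacks(samples):
    print(line)
```

Each output line maps directly to one box stack in the final SVG, with the count determining the box width.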
![Page 26: Profiling & Testing with Spark](https://reader031.vdocuments.us/reader031/viewer/2022022419/589e77021a28ab300b8b5693/html5/thumbnails/26.jpg)
Flame Graph (Spark 1.6)
![Page 27: Profiling & Testing with Spark](https://reader031.vdocuments.us/reader031/viewer/2022022419/589e77021a28ab300b8b5693/html5/thumbnails/27.jpg)
Flame Graph (Spark 2.0)
![Page 28: Profiling & Testing with Spark](https://reader031.vdocuments.us/reader031/viewer/2022022419/589e77021a28ab300b8b5693/html5/thumbnails/28.jpg)
Spark CodeGen vs. Volcano
Code generation improves CPU-intensive workloads
● Replaces loops and virtual function calls with code generated for the query
● The use of vector operations (e.g. SIMD) is also beneficial
● Codegen is crucial for modern in-memory DBs
Commercial RDBMS engines
● Typically use the slower volcano model (with loops and virtual function calls)
● In the past optimizing for I/O latency was more important; now CPU cycles matter more
![Page 29: Profiling & Testing with Spark](https://reader031.vdocuments.us/reader031/viewer/2022022419/589e77021a28ab300b8b5693/html5/thumbnails/29.jpg)
Flame Graphs
Pros: good to understand where CPU cycles are spent
● Useful for performance troubleshooting
● Functions at the top of the graph are the ones using CPU
● Parent methods/functions provide context
Limitations:
● Off-CPU and wait time are not charted (off-CPU flame graphs are still experimental)
● Interpreting flame graphs requires experience/knowledge
● Not included in the Spark monitoring suite
![Page 30: Profiling & Testing with Spark](https://reader031.vdocuments.us/reader031/viewer/2022022419/589e77021a28ab300b8b5693/html5/thumbnails/30.jpg)
How-to Flame Graphs
![Page 31: Profiling & Testing with Spark](https://reader031.vdocuments.us/reader031/viewer/2022022419/589e77021a28ab300b8b5693/html5/thumbnails/31.jpg)
CERN Java Flight Recorder Approach (1/2)
Enable Java Flight Recorder (JFR)
● Extra options in spark-defaults.conf or CLI. Example:
Collect data with jcmd:
● Example, sampling for 10 sec:
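The concrete options were shown on the slide image; a typical setup might look like the fragment below (the flags are for Oracle/OpenJDK 8 — JDK 11+ no longer needs the unlock flag — and the pid and file paths are placeholders):

```shell
# spark-defaults.conf: enable JFR on driver and executors (JDK 8 flags)
spark.driver.extraJavaOptions   -XX:+UnlockCommercialFeatures -XX:+FlightRecorder
spark.executor.extraJavaOptions -XX:+UnlockCommercialFeatures -XX:+FlightRecorder

# Sample a running executor JVM for 10 seconds with jcmd:
jcmd <executor-pid> JFR.start duration=10s filename=/tmp/executor.jfr
```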
![Page 32: Profiling & Testing with Spark](https://reader031.vdocuments.us/reader031/viewer/2022022419/589e77021a28ab300b8b5693/html5/thumbnails/32.jpg)
CERN Java Flight Recorder Approach (2/2)
Process the jfr file:
● From .jfr to merged stacks
● Produce the .svg file with the flame graph
● Find details in Kay Ousterhout’s article: https://gist.github.com/kayousterhout/7008a8ebf2babeedc7ce6f8723fd1bf4
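In outline, the post-processing is a two-step pipeline (the exact .jfr-to-folded-stacks converter is the one described in Kay Ousterhout’s gist; it is shown here as a placeholder, and `flamegraph.pl` is from the FlameGraph repo):

```shell
# Step 1: convert the .jfr recording into merged (folded) stacks
<jfr-to-folded-converter> /tmp/executor.jfr > stacks.folded

# Step 2: render the flame graph as an SVG
flamegraph.pl stacks.folded > flamegraph.svg
```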
![Page 33: Profiling & Testing with Spark](https://reader031.vdocuments.us/reader031/viewer/2022022419/589e77021a28ab300b8b5693/html5/thumbnails/33.jpg)
PayPal Approach
https://www.paypal-engineering.com/2016/09/08/spark-in-flames-profiling-spark-applications-using-flame-graphs/
![Page 34: Profiling & Testing with Spark](https://reader031.vdocuments.us/reader031/viewer/2022022419/589e77021a28ab300b8b5693/html5/thumbnails/34.jpg)
CERN HProfiler Approach
HProfiler (CERN home-built tool)
● Automates collection and aggregation of stack traces into flame graphs for distributed applications
● Integrates with YARN to identify the processes to trace across the cluster
Based on Linux perf_events stack sampling (bare metal)
Experimental tool
● Author: Joeri Hermans @ CERN
● https://github.com/cerndb/Hadoop-Profiler
● Blog post: “Hadoop performance troubleshooting with stack tracing”
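HProfiler’s collection layer builds on the classic perf_events flame-graph recipe; a generic sketch of that recipe follows (the pid is a placeholder, and `stackcollapse-perf.pl` / `flamegraph.pl` are scripts from Brendan Gregg’s FlameGraph repo):

```shell
# Sample kernel+user stacks of one process at 99 Hz for 30 seconds
perf record -F 99 -g -p <pid> -- sleep 30

# Fold the samples and render the flame graph
perf script | ./stackcollapse-perf.pl > stacks.folded
./flamegraph.pl stacks.folded > flamegraph.svg
```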
![Page 35: Profiling & Testing with Spark](https://reader031.vdocuments.us/reader031/viewer/2022022419/589e77021a28ab300b8b5693/html5/thumbnails/35.jpg)
Testing in Spark
![Page 36: Profiling & Testing with Spark](https://reader031.vdocuments.us/reader031/viewer/2022022419/589e77021a28ab300b8b5693/html5/thumbnails/36.jpg)
Testing in Spark
● Why run Spark outside of a cluster
● What to test
● Running Local
● Running as a Unit Test
● Data Structures
![Page 37: Profiling & Testing with Spark](https://reader031.vdocuments.us/reader031/viewer/2022022419/589e77021a28ab300b8b5693/html5/thumbnails/37.jpg)
Testing in Spark
Why run Spark outside of a cluster
● Time
● Trusted Deployment
● Money
![Page 38: Profiling & Testing with Spark](https://reader031.vdocuments.us/reader031/viewer/2022022419/589e77021a28ab300b8b5693/html5/thumbnails/38.jpg)
Testing in Spark
What to test
● Experiments
● Complex logic
● Data samples
● Business generated scenarios
![Page 39: Profiling & Testing with Spark](https://reader031.vdocuments.us/reader031/viewer/2022022419/589e77021a28ab300b8b5693/html5/thumbnails/39.jpg)
Testing in Spark (Running Local)
Running Local
● A test doesn’t always need to be a unit test
● Notebook UIs like Zeppelin are fine for quick feedback, but lack IDE features
● Running locally in your IDE is priceless
![Page 40: Profiling & Testing with Spark](https://reader031.vdocuments.us/reader031/viewer/2022022419/589e77021a28ab300b8b5693/html5/thumbnails/40.jpg)
Testing in Spark (Running Local)
Example
● Use runLocal flag to set a local SparkContext
● Separate out testable work from driver code
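The second point can be sketched in plain Python: keep the transformation logic in a function that takes and returns ordinary iterables, so it can be tested without any SparkContext at all (the function name and data format below are illustrative, not from the slides):

```python
def clean_amounts(lines):
    """Parse 'id,amount' lines, dropping malformed rows.

    Pure function over plain iterables: no Spark dependency,
    so it is trivially unit-testable.
    """
    out = []
    for line in lines:
        parts = line.split(",")
        if len(parts) == 2:
            try:
                out.append((parts[0], float(parts[1])))
            except ValueError:
                pass  # non-numeric amount: drop the row
    return out

# Tested directly on a plain list...
assert clean_amounts(["a,1.5", "bad", "b,2"]) == [("a", 1.5), ("b", 2.0)]

# ...while the driver code would hand the same function to Spark, e.g.:
#   rdd.mapPartitions(clean_amounts)
print(clean_amounts(["a,1.5", "bad,row,x", "b,2"]))
```

The driver then stays a thin shell (build the context, load the data, call the function), which is the part you exercise only in integration tests.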
![Page 41: Profiling & Testing with Spark](https://reader031.vdocuments.us/reader031/viewer/2022022419/589e77021a28ab300b8b5693/html5/thumbnails/41.jpg)
Testing in Spark (Unit Testing)
Example
FunSuite: a TDD-style unit testing suite for Scala (part of ScalaTest)
![Page 42: Profiling & Testing with Spark](https://reader031.vdocuments.us/reader031/viewer/2022022419/589e77021a28ab300b8b5693/html5/thumbnails/42.jpg)
Testing in Spark (Data Structures)
Working with “hand-written” DataFrames:
![Page 43: Profiling & Testing with Spark](https://reader031.vdocuments.us/reader031/viewer/2022022419/589e77021a28ab300b8b5693/html5/thumbnails/43.jpg)
Testing in Spark (Hive)
Testing with Hive:
● Spin-up a docker-hive container for Apache Hive (Big Data Europe)
● Enables real interactions, allowing you to:
○ create, delete, write, ...
![Page 44: Profiling & Testing with Spark](https://reader031.vdocuments.us/reader031/viewer/2022022419/589e77021a28ab300b8b5693/html5/thumbnails/44.jpg)
Testing in Spark (Hive)
Putting Hive + Spark together:
● Create a custom hive-site.xml
● Start Spark with the provided hive-site.xml
○ spark-shell --files /PATH/hive-site.xml
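A minimal sketch of such a hive-site.xml, assuming the docker-hive metastore is exposed on localhost:9083 (host and port are assumptions for a local setup):

```xml
<!-- hive-site.xml: point Spark at the container's Hive metastore -->
<configuration>
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://localhost:9083</value>
  </property>
</configuration>
```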
![Page 45: Profiling & Testing with Spark](https://reader031.vdocuments.us/reader031/viewer/2022022419/589e77021a28ab300b8b5693/html5/thumbnails/45.jpg)
Testing in Spark (Hive)
Start Spark with the provided hive-site.xml:
![Page 46: Profiling & Testing with Spark](https://reader031.vdocuments.us/reader031/viewer/2022022419/589e77021a28ab300b8b5693/html5/thumbnails/46.jpg)
Testing in Spark (Mini-Clusters)
Mini-Clusters
● Hadoop-mini-cluster
● Spark-unit-testing-with-hdfs
● Support for:
○ HBase & Hive
○ Kafka & Storm
○ Zookeeper
○ HDFS
● Access HDFS files, test code, and copy files from the local FS to HDFS
![Page 47: Profiling & Testing with Spark](https://reader031.vdocuments.us/reader031/viewer/2022022419/589e77021a28ab300b8b5693/html5/thumbnails/47.jpg)
Conclusions
![Page 48: Profiling & Testing with Spark](https://reader031.vdocuments.us/reader031/viewer/2022022419/589e77021a28ab300b8b5693/html5/thumbnails/48.jpg)
Conclusions
Apache Spark 2.0 Improvements (HDP 2.5 in tech preview)
● Scalability and performance on commodity hardware
● Spark SQL is useful for offloading queries from traditional RDBMS
● Code generation speeds up CPU-bound workloads by up to an order of magnitude
Diagnostics
● Profiling tools are important in the MPP world
● Execution plans can be analyzed with flame graphs
● Cons: the solutions are still very immature
Testing
● Testing locally saves time and money, and takes advantage of IDE features
● Elegant ways to test code by using a local SparkContext
● Easy ways to recreate environments for testing real interactions (such as Hadoop)
![Page 49: Profiling & Testing with Spark](https://reader031.vdocuments.us/reader031/viewer/2022022419/589e77021a28ab300b8b5693/html5/thumbnails/49.jpg)
Profiling & Testing with Spark
THANK YOU!
![Page 50: Profiling & Testing with Spark](https://reader031.vdocuments.us/reader031/viewer/2022022419/589e77021a28ab300b8b5693/html5/thumbnails/50.jpg)
References
● Deep dive into Catalyst: Apache Spark 2.0
● http://es.slideshare.net/databricks/spark-performance-whats-next
● https://paperhub.s3.amazonaws.com/dace52a42c07f7f8348b08dc2b186061.pdf
● http://www.brendangregg.com/flamegraphs.html
● http://db-blog.web.cern.ch/
● http://www.slideshare.net/SparkSummit/spark-summit-eu-talk-by-ted-malaska
![Page 51: Profiling & Testing with Spark](https://reader031.vdocuments.us/reader031/viewer/2022022419/589e77021a28ab300b8b5693/html5/thumbnails/51.jpg)
Q & A