Spark Tuning for Enterprise System Administrators, Spark Summit East 2016
TRANSCRIPT
Spark Tuning for Enterprise System Administrators
Anya T. Bida, PhD, and Rachel B. Warren
Don't worry about missing something...
Presentation: http://www.slideshare.net/anyabida
Cheat-sheet: http://techsuppdiva.github.io/
Anya: https://www.linkedin.com/in/anyabida
Rachel: https://www.linkedin.com/in/rachelbwarren
About Anya: Operations Engineer
About Rachel: Spark & Scala Enthusiast / Data Engineer
About Alpine Data: alpinenow.com
Alpine deploys Spark in production for our enterprise customers.
About You: Enterprise System Administrators
[Figure: mySparkApp success scale (Intermittent, Reliable, Optimal)]
Default != Recommended
Example: By default, spark.executor.memory = 1g. 1g allows small jobs to finish out of the box. Spark assumes you'll increase this parameter.
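One way to override the default is to pass the setting on the spark-submit command line with `--conf`. The sketch below only builds the command string. The helper name `to_submit_flags`, the script name `my_spark_app.py`, and the 4g value are assumptions for illustration, not recommendations from the talk.

```python
def to_submit_flags(conf):
    """Turn {"key": "value"} into ["--conf", "key=value", ...], sorted by key."""
    flags = []
    for key in sorted(conf):
        flags += ["--conf", f"{key}={conf[key]}"]
    return flags


# Assumed value: pick a size that fits your own pool and nodes.
overrides = {"spark.executor.memory": "4g"}

# my_spark_app.py is a placeholder application name.
command = ["spark-submit"] + to_submit_flags(overrides) + ["my_spark_app.py"]
print(" ".join(command))
```

The same key and value can also be set in spark-defaults.conf so every job on the cluster picks it up.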
Which parameters are important?
How do I configure them?
Default != Recommended
• Filter* data before an expensive reduce or aggregation
• Consider* coalesce(
• Use* data structures that require less memory
• Serialize*
  • PySpark: serializing is built-in
  • Scala/Java: persist(StorageLevel.[*]_SER)
  • Recommended: KryoSerializer *
tuning.html#tuning-data-structures
See "Optimize partitions." *
See "GC investigation." *
See "Checkpointing." *
The Spark Tuning Cheat-Sheet
[Figure: mySparkApp success scale (Intermittent, Reliable, Optimal), annotated with: mySparkApp memory issues, Shared Cluster]
Fair Schedulers
YARN
<allocations>
  <queue name="sample_queue">
    <minResources>4000 mb,0vcores</minResources>
    <maxResources>8000 mb,8vcores</maxResources>
    <maxRunningApps>10</maxRunningApps>
    <weight>2.0</weight>
    <schedulingPolicy>fair</schedulingPolicy>
  </queue>
</allocations>
SPARK
<allocations>
  <pool name="sample_queue">
    <schedulingMode>FAIR</schedulingMode>
    <weight>1</weight>
    <minShare>2</minShare>
  </pool>
</allocations>
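A quick sanity check before deploying a Spark pool file is to parse it and list each pool's settings. The sketch below uses only the Python standard library and the sample pool shown above. The helper name `read_pools` and the file path are assumptions; `spark.scheduler.mode` and `spark.scheduler.allocation.file` are the Spark properties that enable fair scheduling and point an application at the file.

```python
import xml.etree.ElementTree as ET

# The sample Spark pool definition from the slide.
POOL_XML = """<allocations>
  <pool name="sample_queue">
    <schedulingMode>FAIR</schedulingMode>
    <weight>1</weight>
    <minShare>2</minShare>
  </pool>
</allocations>"""


def read_pools(xml_text):
    """Return {pool name: {setting: value}} for every <pool> in the file."""
    root = ET.fromstring(xml_text)
    return {
        pool.get("name"): {child.tag: child.text for child in pool}
        for pool in root.findall("pool")
    }


pools = read_pools(POOL_XML)
print(pools)

# Settings an application needs in order to use the file (path is a placeholder).
fair_conf = {
    "spark.scheduler.mode": "FAIR",
    "spark.scheduler.allocation.file": "/path/to/fairscheduler.xml",
}
```

Inside a running application, jobs are routed to a pool by setting the local property `spark.scheduler.pool` on the SparkContext, for example to "sample_queue".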
Configure these parameters too!
Fair Schedulers
YARN
<allocations>
  <user name="sample_user">
    <maxRunningApps>6</maxRunningApps>
  </user>
  <userMaxAppsDefault>5</userMaxAppsDefault>
</allocations>
What is the memory limit for mySparkApp?
Max Memory in "pool" x 3/4 = mySparkApp_mem_limit
Limitation: <maxResources>8000 mb</maxResources>
Reserve 25% for overhead.
mySparkApp_mem_limit = driver.memory + (executor.memory x dynamicAllocation.maxExecutors)
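A worked example of the two formulas above, as a sketch. The 8000 mb pool size is the sample <maxResources> value; the driver size, executor size, and executor count are assumed values chosen only to show the arithmetic, and the function names are hypothetical.

```python
def app_mem_limit_mb(pool_max_mb):
    """Reserve 25% of the pool for overhead; the app may use the remaining 3/4."""
    return pool_max_mb * 3 // 4


def requested_mb(driver_mb, executor_mb, max_executors):
    """driver.memory + (executor.memory x dynamicAllocation.maxExecutors)."""
    return driver_mb + executor_mb * max_executors


limit = app_mem_limit_mb(8000)          # pool <maxResources> of 8000 mb
request = requested_mb(1024, 1024, 4)   # assumed: 1g driver, 1g executors, 4 executors
fits = request <= limit
print(limit, request, fits)
```

With these assumed sizes, four executors fit under the limit, while a fifth 1g executor would push the request past it.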
Limitation: The driver and each executor must fit on a single node; neither may request more memory than one node provides.
(yarn.nodemanager.resource.memory-mb - 1 GB) / executor.memory ~ # executors per node
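The executors-per-node estimate can be sketched as integer division. The node size and executor size below are assumed values, and the function name is hypothetical. The result is approximate, as on the slide: YARN also adds a per-executor memory overhead, which this sketch ignores.

```python
def executors_per_node(node_manager_mb, executor_mb, reserved_mb=1024):
    """(yarn.nodemanager.resource.memory-mb - 1 GB) / executor.memory, rounded down."""
    if executor_mb <= 0:
        raise ValueError("executor_mb must be positive")
    usable_mb = node_manager_mb - reserved_mb
    return max(usable_mb // executor_mb, 0)


# Assumed sizes: a 16 GB NodeManager allocation and 4 GB executors.
print(executors_per_node(16384, 4096))
```

A result of 0 means a single executor of that size does not fit on the node at all.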
Limitation: maxExecutors should not exceed pool allocation.
YARN: <maxResources>8vcores</maxResources>
I want a little more information...
Top 5 Mistakes When Writing Spark Applications, by Mark Grover and Ted Malaska of Cloudera
http://www.slideshare.net/hadooparchbook/top-5-mistakes-when-writing-spark-applications
How-to: Tune Your Apache Spark Jobs (Part 2), by Sandy Ryza of Cloudera
http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/
I want lots more...
mySparkApp memory issues
Reduce the memory needed for mySparkApp. How?
persist(StorageLevel.[*]_SER)
Recommended: KryoSerializer *
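Kryo is enabled through the `spark.serializer` property. The sketch below renders that setting in spark-defaults.conf style (one whitespace-separated key and value per line). The helper name `render_defaults` is an assumption for illustration.

```python
# Fully qualified class name of Spark's Kryo serializer.
KRYO = "org.apache.spark.serializer.KryoSerializer"


def render_defaults(conf):
    """Render settings in spark-defaults.conf style: one 'key value' pair per line."""
    return "\n".join(f"{key} {value}" for key, value in sorted(conf.items()))


print(render_defaults({"spark.serializer": KRYO}))
```

This setting affects serialization on the JVM side, which is why the slide lists it under Scala/Java.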
Gracefully handle memory limitations. How?
Here, let's talk about one scenario.
Symptoms:
• mySparkApp runs for several hours, then a container is lost.
• I notice one container fails, then the rest fail one by one.
• The first container to fail was the driver.
• The driver is a single point of failure (SPOF).
Investigate:
• Driver failures are often caused by collecting unbounded data to the driver.
• I verified that only bounded data is brought to the driver, but the driver still fails intermittently.
Potential Solution: RDD.checkpoint()
Use in these cases:
• high-traffic cluster
• network blips
• preemption
• disk space nearly full
Function:
• saves the RDD to stable storage (e.g. HDFS or S3)
How-to:
SparkContext.setCheckpointDir(directory: String)
RDD.checkpoint()
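The two how-to calls can be wrapped in a small helper, sketched here for PySpark. The helper name `checkpoint_rdd` and the directory are assumptions; `setCheckpointDir` and `checkpoint` are the PySpark equivalents of the calls named above. The helper only marks the RDD; the data is written to the checkpoint directory when an action runs.

```python
def checkpoint_rdd(sc, rdd, directory):
    """Mark `rdd` for checkpointing under `directory` and return it.

    `sc` is expected to be a SparkContext and `rdd` an RDD. The checkpoint
    directory must be set before checkpoint() is called on the RDD.
    """
    sc.setCheckpointDir(directory)
    rdd.checkpoint()
    return rdd


# Usage with a live PySpark context (requires pyspark; the path is a placeholder):
#   from pyspark import SparkContext
#   sc = SparkContext(appName="mySparkApp")
#   rdd = checkpoint_rdd(sc, sc.parallelize(range(1000)), "hdfs:///tmp/checkpoints")
#   rdd.count()  # the checkpoint is written when an action runs
```

Persisting the RDD before checkpointing is generally advisable, since otherwise writing the checkpoint can require recomputing the RDD.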
Instead of 2.5 hours, mySparkApp completes in 1 hour.
Cheat-sheet: techsuppdiva.github.io/
HighPerformanceSpark.com
Further Reading:
• Learning Spark, by H. Karau, A. Konwinski, P. Wendell, M. Zaharia, 2015, O'Reilly: https://databricks.com/blog/2015/02/09/learning-spark-book-available-from-oreilly.html
• Scheduling: https://spark.apache.org/docs/latest/job-scheduling.html#scheduling-within-an-application
• Tuning the Spark Conf: Mark Grover and Ted Malaska (Cloudera), http://www.slideshare.net/hadooparchbook/top-5-mistakes-when-writing-spark-applications; Sandy Ryza (Cloudera), http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/
• Checkpointing: http://spark.apache.org/docs/latest/streaming-programming-guide.html#checkpointing
• Troubleshooting: Miklos Christine (Databricks), https://spark-summit.org/east-2016/events/operational-tips-for-deploying-spark/
• High Performance Spark, by R. Warren, H. Karau, coming in 2016, O'Reilly: http://highperformancespark.com/
More Questions?
Presentation: http://www.slideshare.net/anyabida
Cheat-sheet: http://techsuppdiva.github.io/
Anya: https://www.linkedin.com/in/anyabida
Rachel: https://www.linkedin.com/in/rachelbwarren