© 2016, Conversant, LLC. All rights reserved.
PROFILING SPARK APPLICATIONS
APACHE BIG DATA NORTH AMERICA 2017
PRESENTED BY: JAYESH THAKRAR, SENIOR SOFTWARE ENGINEER
The Quest For Spark Profiling...
IT ALL BEGAN WITH THE SPARK WEB UI...
THE MISSING SUMMARY...
SUMMARY FOR SPARK APPS? BUT SPARK APPS ARE A MIXED BAG...
• Simple batch to complex ETL and highly iterative ML
• Conditional data flows
• Multiple inputs, multiple outputs
• Executors, jobs, stages, tasks
BUT STILL NEEDED FOR OPERATIONS, TUNING AND TROUBLESHOOTING
• Why did my app take 2x normal time last night?
• Impact of JVM tuning and other changes?
• Impact of input/output formats?
• Impact of configuration or other changes?
BTW, WHAT'S PROFILING?
Source: https://en.wikipedia.org/wiki/Profiling_(computer_programming)
Introducing... Your Events
WHAT DRIVES SPARK UI AND HISTORY SERVER?
SPARK EVENTS
[Diagram: Task, Task, Driver + Listener, Event Bus]
Configuration Parameters
spark.eventLog.enabled = true
spark.eventLog.dir = hdfs://<dir>
Sample Events
{"Event":"SparkListenerLogStart","Spark Version":"1.6.0"}
{"Event":"SparkListenerBlockManagerAdded","Block Manager ID":{"Executor ID":"driver","Host":"10.110.104.43","Port":33287},"Maximum Memory":556038881,"Timestamp":1481061154984}
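Spark writes the event log as one JSON object per line, so a first look at it needs very little code. The following is a minimal sketch in Python (standard library only), not part of Spark or sparkprofiler. It assumes the log lines are already available locally and that key names carry spaces as in Spark's own logs (e.g. "Spark Version"); the helper name `count_events` is hypothetical.

```python
import json
from collections import Counter


def count_events(lines):
    """Count events by their "Event" type in an iterable of JSON lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        event = json.loads(line)
        counts[event.get("Event", "unknown")] += 1
    return counts


# The two sample events from the slide, one JSON object per line.
sample = [
    '{"Event":"SparkListenerLogStart","Spark Version":"1.6.0"}',
    '{"Event":"SparkListenerBlockManagerAdded",'
    '"Block Manager ID":{"Executor ID":"driver","Host":"10.110.104.43","Port":33287},'
    '"Maximum Memory":556038881,"Timestamp":1481061154984}',
]

for name, n in sorted(count_events(sample).items()):
    print(name, n)
```

For a real log, the same function could be fed an open file object instead of the `sample` list.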
SPARK CONFIGURATION
SPARK EVENTS FRAMEWORK
• Event logging = built-in (a thread in the driver)
• Ability to create and register custom listeners
sparkContext.addSparkListener(listener: SparkListenerInterface)
• StatsReportListener = Task summary after each stage
• StatsReportListener = Also available for streaming
• See org.apache.spark.scheduler.SparkListener for details
And That Leads To...
SPARKPROFILER PROJECT ON GITHUB
https://github.com/conversant/spark-profiler
CONVERSANT: MORE TOOLS AND STUFF....
http://engineering.conversantmedia.com/posts/
https://github.com/conversant
SPARKPROFILER PROJECT OVERVIEW
Application Hierarchy
Application
Job
Stage
Task
SPARKPROFILER PROJECT: PARSER
SPARKPROFILER PROJECT: PROFILER
SPARKPROFILER PROJECT: SUMMARYGENERATOR
KEY ENTITY IN SPARKPROFILER: TASK
Metrics
• taskDuration
• peakMemory
• inputRows
• outputRows
• resultSize
• bytesRead
• recordsRead
• shuffleBytesWritten
• shuffleRecordsWritten
• remoteBlocksFetched
• localBlocksFetched
• remoteBytesRead
• localBytesRead
• totalRecordsRead
Identifying Attributes
• ApplicationName
• ApplicationId
• JobId
• StageId
• StageAttemptId
• TaskId
• AttemptId
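The split between identifying attributes and metrics can be pictured as a small record type. This is a sketch in Python, not sparkprofiler's actual data model: the class name `TaskRecord`, the field types, and the sample values are all assumptions made for illustration; only the attribute and metric names come from the slide.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class TaskRecord:
    # Identifying attributes (names from the slide; types are assumed).
    application_name: str
    application_id: str
    job_id: int
    stage_id: int
    stage_attempt_id: int
    task_id: int
    attempt_id: int
    # Metrics keyed by the slide's names, e.g. "taskDuration", "bytesRead".
    metrics: Dict[str, int] = field(default_factory=dict)

    def stage_key(self) -> Tuple[str, int, int, int]:
        """Key shared by all tasks of the same stage attempt."""
        return (self.application_id, self.job_id,
                self.stage_id, self.stage_attempt_id)


# Illustrative values only; they are not taken from a real run.
task = TaskRecord("etl-nightly", "app-0001", 0, 1, 0, 7, 0,
                  {"taskDuration": 1200, "bytesRead": 4096})
print(task.stage_key())
```

Keeping the identifiers separate from the metrics makes it straightforward to roll tasks up to the stage, job, or application level of the hierarchy.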
PROFILING OVERVIEW
Run Spark application
{"Event":"SparkListenerLogStart","Spark Version":"1.6.0"}
{"Event":"SparkListenerBlockManagerAdded","Block Manager ID":{"Executor ID":"driver","Host":"10.110.104.43","Port":33287},"Maximum Memory":556038881,"Timestamp":1481061154984}
SummaryGenerator
Save to Datastore
Events Output File
Interactive Analysis
SUMMARY GENERATOR
• Program to analyze (profile) jobs, stages and tasks and provide summary
• Summary of all metrics across all tasks
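The kind of roll-up described here can be sketched in a few lines of Python. This is not the project's SummaryGenerator. It assumes tasks are plain dictionaries with hypothetical `jobId` and `stageId` keys plus one numeric metric, and the function name `summarize` and the sample numbers are illustrative.

```python
from collections import defaultdict
from statistics import mean


def summarize(tasks, metric):
    """Group task dicts by (jobId, stageId) and summarize one metric."""
    groups = defaultdict(list)
    for t in tasks:
        groups[(t["jobId"], t["stageId"])].append(t[metric])
    return {
        key: {
            "tasks": len(values),
            "total": sum(values),
            "min": min(values),
            "max": max(values),
            "mean": mean(values),
        }
        for key, values in groups.items()
    }


# Illustrative task records, not taken from a real run.
tasks = [
    {"jobId": 0, "stageId": 0, "taskDuration": 100},
    {"jobId": 0, "stageId": 0, "taskDuration": 300},
    {"jobId": 0, "stageId": 1, "taskDuration": 50},
]

summary = summarize(tasks, "taskDuration")
for key in sorted(summary):
    print(key, summary[key])
```

A wide gap between `min` and `max` within one stage is a common hint of skewed tasks, which is one reason to keep both alongside the total.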
INTERACTIVE ANALYSIS
• Troubleshooting: Which jobs/stages are expensive? Compare time, input/output, shuffle volume, memory, etc. between different jobs and stages of a single run
• Tuning: Compare different runs
§ Impact of changing input/output formats
§ Runtime variations
§ Performance tuning and optimization, e.g. JVM tuning, parallelism, compression
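Comparing runs, as in the "why did my app take 2x normal time" question, can be reduced to a ratio per stage. The sketch below is in Python and is not part of sparkprofiler. It assumes each run has already been summarized into a dictionary of per-stage totals; the function name `compare_runs`, the 1.5x threshold, and the numbers are illustrative assumptions.

```python
def compare_runs(baseline, candidate, threshold=1.5):
    """Return stages whose value in `candidate` is >= threshold x baseline."""
    flagged = {}
    for stage, base_value in baseline.items():
        new_value = candidate.get(stage)
        if new_value is None or base_value == 0:
            continue  # stage missing in the other run, or nothing to compare
        ratio = new_value / base_value
        if ratio >= threshold:
            flagged[stage] = round(ratio, 2)
    return flagged


# Illustrative per-stage total durations for two runs of the same app.
baseline = {"stage-0": 400, "stage-1": 50}
candidate = {"stage-0": 420, "stage-1": 110}

print(compare_runs(baseline, candidate))
```

The same comparison works for any metric, such as shuffle bytes or peak memory, as long as both runs are summarized by the same stage keys.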
POTENTIAL FUTURE ENHANCEMENTS
• RDD profiling/analysis
• Dynamic executors
• Handling failed jobs, stages, tasks
• Analysis of streaming jobs
• Automatic advisor for tuning/optimization
• Visualizations, e.g. using Zeppelin
• Save events and summary for
§ Historical analysis
§ Cluster utilization overview
§ Sizing and prediction
Questions?