Spark Autotuning: Spark Summit East talk by Lawrence Spracklen
TRANSCRIPT
We use Spark
• End-to-end support
– Problem inception through model operationalization
• Extensive use of Spark at all stages
– ETL
– Feature engineering
– Model training
– Model scoring
Spark Configuration
Data scientists must set:
• Size of driver
• Size of executors
• Number of executors
• Number of partitions
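These map onto standard Spark configuration properties. A minimal sketch in Scala (the values are purely illustrative, not recommendations; note that spark.driver.memory normally has to be supplied at launch time, e.g. via spark-submit):

import org.apache.spark.sql.SparkSession

// Illustrative values only; the rest of the talk is about choosing these automatically.
val spark = SparkSession.builder()
  .appName("manually-tuned-job")
  .config("spark.driver.memory", "4g")           // size of driver (set at launch in practice)
  .config("spark.executor.memory", "8g")         // size of each executor
  .config("spark.executor.cores", "4")           // cores per executor
  .config("spark.executor.instances", "10")      // number of executors
  .config("spark.sql.shuffle.partitions", "200") // number of partitions
  .getOrCreate()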
Inefficient
• Manual Spark tuning is a time-consuming and inefficient process
– Frequently results in a “trial-and-error” approach
– Can be hours (or more) before OOM failures occur
• Less problematic for unchanging operationalized jobs
– Run the same analysis every hour/day/week on new data
– Worth spending time & resources to fine-tune
AutoML
[Diagram: a data set is run through candidate algorithms Alg #1 … Alg #N]
1) Investigate N ML algorithms
2) Tune top-performing algorithms
[Diagram: feature engineering and feature elimination applied across the candidate algorithms]
It depends…
• Size & complexity of data
• Complexity of algorithm(s)
• Nuances of the implementation
• Cluster size
• YARN configuration
• Cluster utilization
Default settings?
• Spark defaults are targeted at toy data sets & small clusters
– Not applicable to real-world data science
• Resource computations often assume a dedicated cluster
– Looking at total vCPU and memory resources
• Enterprise clusters are typically shared
– Applying dedicated-cluster computations is wasteful
Challenges
• Algorithm needs to
– Determine compute requirements for the current analysis
– Reconcile requirements with available resources
– Allow users to set specific variables
• Changing one parameter will generally impact others
1. Increase memory per executor → decrease number of executors
2. Increase cores per executor → decrease memory per core
[N.B. Not just a single “correct” configuration!!!]
Understanding the inputs
• Sample from input data set(s)
• Estimate
– Schema inference
– Dimensionality
– Cardinality
– Row count
• Remember about compressed formats
• Offline background scans feasible
– Retain snapshots in metastore
• Take into account impact of downstream operations
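As a hedged illustration of the sampling step (the estimateInput helper, the CSV format, and the 1% fraction are assumptions made for this sketch, not details from the talk):

import org.apache.spark.sql.SparkSession

// Hypothetical sketch: estimate row count and dimensionality from a 1% sample.
def estimateInput(spark: SparkSession, path: String): (Long, Int) = {
  val fraction = 0.01
  val sample = spark.read
    .option("header", "true")
    .option("inferSchema", "true") // schema inference
    .csv(path)
    .sample(withReplacement = false, fraction = fraction)

  val estimatedRows = (sample.count() / fraction).toLong // scale the sample back up
  val dimensionality = sample.columns.length             // column count
  (estimatedRows, dimensionality)
}

For compressed formats, remember that the on-disk size understates the in-memory footprint, so estimates like this should be based on decompressed, deserialized rows.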
Algorithmic complexity
• Each operation has unique requirements
– Wide range in memory and compute requirements across algorithms
• Provide levers for algorithms to provide input on resource asks
– Resource heavy or resource light (memory & CPU)
• Per-algorithm modifiers computed for all algorithms in the application toolbox
Wrapping OSS
• Autotuning APIs are made available to users, & OSS algorithms are wrapped using our extensions SDK
– H2O
– MLlib
– Etc.
• Provide input on relative memory requirements and computational complexity
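A minimal sketch of what such a lever might look like (the ResourceHints trait and its fields are hypothetical illustrations, not the actual extensions SDK API):

// Hypothetical per-algorithm resource hints; the multipliers feed the
// MemHungry / ComputeHungry terms used in the sizing formulas later on.
trait ResourceHints {
  def memHungry: Double     // > 1.0 means memory heavy
  def computeHungry: Double // > 1.0 means compute heavy
}

object LogisticRegressionHints extends ResourceHints {
  val memHungry = 1.0     // modest memory needs (illustrative value)
  val computeHungry = 1.5 // iterative; benefits from more executors (illustrative value)
}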
YARN Queues
• Queues frequently used in enterprise settings
– Control resource utilization
• Variety of queuing strategies
– Users bound to queues
– Groups bound to queues
– Applications bound to queues
• Queues can have their own maximum container sizes
• Queues can have different scheduling policies
Dynamic allocation
• Spark provides support for auto-scaling
– Dynamically add/remove executors according to demand
• Is preemption also enabled?
– Minimize wait time for other users
• Still need to determine optimal executor and driver sizing
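For reference, dynamic allocation is driven by standard Spark properties; a minimal sketch (the bounds are illustrative):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.shuffle.service.enabled", "true") // required for dynamic allocation on YARN
  .set("spark.dynamicAllocation.minExecutors", "2")
  .set("spark.dynamicAllocation.maxExecutors", "50")
// Executor and driver sizing still has to be chosen explicitly.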
Query cluster
• Query the YARN ResourceManager to get cluster state
– Kerberised REST API
• Determine key resource considerations
– Preemption enabled?
– Queue mapping
– Queue type
– Queue resources
– Real-time node utilization
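A minimal sketch of the query (the endpoint paths are the standard YARN ResourceManager REST API; the host/port is an assumption, and a Kerberised cluster would additionally require SPNEGO authentication rather than these bare GETs):

import scala.io.Source

val rm = "http://resourcemanager:8088" // assumed host/port

// Cluster-wide metrics: total/available memory & vcores, node counts.
val metrics = Source.fromURL(s"$rm/ws/v1/cluster/metrics").mkString

// Scheduler info: queue hierarchy, capacities, and scheduling policy.
val scheduler = Source.fromURL(s"$rm/ws/v1/cluster/scheduler").mkString

// Per-node state: real-time memory/vcore utilization on each node.
val nodes = Source.fromURL(s"$rm/ws/v1/cluster/nodes").mkString
// Parse the JSON responses with the library of your choice.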
Executor Sizing
• Try to make efficient use of the available resources
– Start with a balanced strategy
• Consume resources based on their relative abundance

MemPerCore = (QueueMem / QueueCores) * MemHungry
CoresPerExecutor = min(MaxUsableMemoryPerContainer / MemPerCore, 5)
MemPerExecutor = max(1 GB, MemPerCore * CoresPerExecutor)
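The three formulas above, transcribed into Scala as a sketch (memory values in MB are an assumed convention, and the floor of one core is an added guard):

// Balanced executor sizing driven by queue-level resource abundance.
def sizeExecutors(queueMemMb: Long, queueCores: Int,
                  maxUsableMemPerContainerMb: Long,
                  memHungry: Double): (Int, Long) = {
  val memPerCoreMb = (queueMemMb.toDouble / queueCores) * memHungry
  // Cap at 5 cores per executor (the slide's min(..., 5)); floor at 1.
  val coresPerExecutor =
    math.max(1, math.min((maxUsableMemPerContainerMb / memPerCoreMb).toInt, 5))
  val memPerExecutorMb =
    math.max(1024L, (memPerCoreMb * coresPerExecutor).toLong) // at least 1 GB
  (coresPerExecutor, memPerExecutorMb)
}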
Number of executors (1/2)
• Determine how many executors can be accommodated per node
• Consider queue constraints
• Consider testing per node with different core values

AvailableExecutors = Σ over ActiveNodes of min(AvailableNodeMemory / (MemPerExecutor + Overheads), AvailableNodeCores / CoresPerExecutor)
UseableExecutors = min(AvailableExecutors, QueueMemory / MemPerExecutor, QueueCores / CoresPerExecutor)
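A sketch of the two formulas above (the NodeState shape and MB units are assumptions):

// Per-node capacity, summed over active nodes, then clamped by queue limits.
case class NodeState(availableMemMb: Long, availableCores: Int)

def usableExecutors(nodes: Seq[NodeState],
                    memPerExecutorMb: Long, overheadMb: Long,
                    coresPerExecutor: Int,
                    queueMemMb: Long, queueCores: Int): Long = {
  val availableExecutors = nodes.map { n =>
    math.min(n.availableMemMb / (memPerExecutorMb + overheadMb), // memory bound
             (n.availableCores / coresPerExecutor).toLong)       // core bound
  }.sum
  // Queue constraints cap what the cluster could physically host.
  math.min(availableExecutors,
           math.min(queueMemMb / memPerExecutorMb,
                    (queueCores / coresPerExecutor).toLong))
}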
Number of executors (2/2)
• Compute executor count required for memory requirements
• Determine final executor count

NeededExecutors = CacheSize / (SparkMemPerExecutor * SparkStorageFraction)
FinalExecutors = min(UseableExecutors, NeededExecutors * ComputeHungry)
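And the final count as a sketch (the ceil calls are added so fractional requirements round up):

// Executors needed to cache the data set, scaled by the per-algorithm
// computeHungry modifier, capped by what the cluster/queue can host.
def finalExecutors(cacheSizeMb: Long, sparkMemPerExecutorMb: Long,
                   sparkStorageFraction: Double, // spark.memory.storageFraction
                   usableExecutors: Long, computeHungry: Double): Long = {
  val needed =
    math.ceil(cacheSizeMb / (sparkMemPerExecutorMb * sparkStorageFraction)).toLong
  math.min(usableExecutors, math.ceil(needed * computeHungry).toLong)
}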
Data Partitioning
• Minimally want a partition per executor core
• Increase number of partitions if memory constrained

MinPartitions = NumExecutors * CoresPerExecutor
MemPerTask = SparkMemPerExecutor * (1 − SparkStorageFraction) / CoresPerExecutor
MemPartitions = CacheSize / MemPerTask
FinalPartitions = max(MinPartitions, MemPartitions)
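The partitioning formulas as a sketch (MB units assumed):

// At least one partition per executor core; more if a task's share of
// execution memory can't hold its slice of the cached data.
def finalPartitions(numExecutors: Long, coresPerExecutor: Int,
                    sparkMemPerExecutorMb: Long, sparkStorageFraction: Double,
                    cacheSizeMb: Long): Long = {
  val minPartitions = numExecutors * coresPerExecutor
  val memPerTaskMb =
    sparkMemPerExecutorMb * (1.0 - sparkStorageFraction) / coresPerExecutor
  val memPartitions = math.ceil(cacheSizeMb / memPerTaskMb).toLong
  math.max(minPartitions, memPartitions)
}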
Driver sizing
• Leverage executor size as a baseline
• Increase driver memory if the analysis requires it
– Some algorithms are very driver-memory heavy
• Reduce executor count if needed to reflect the presence of a cluster-side driver
Sanity checks
• Make sure none of the chosen parameters look crazy!
– Massive partition counts
– Crazy resourcing for small files
– Complex jobs squeezed onto overwhelmed clusters
• Think about edge cases
• Iterate if necessary!
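A hedged sketch of one such check (the 128 MB-per-partition floor is an invented threshold, purely illustrative):

// Clamp runaway partition counts for small inputs.
def clampPartitions(partitions: Long, inputSizeMb: Long, executors: Long): Long = {
  val maxSensible = math.max(executors, inputSizeMb / 128 + 1) // ~128 MB per partition
  math.min(partitions, maxSensible)
}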
Future opportunities
Cloud scaling
• Autoscale cloud EMR clusters
– Leverage understanding of required resources to dynamically manage the number of compute nodes
– Significant performance benefits and cost savings
Iterative improvements
• Retain decisions from prior runs in metastore
• Harvest additional sizing and runtime information
– Spark REST APIs
• Leverage data to fine-tune future runs
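The Spark monitoring REST API referenced above exposes per-run metrics; a minimal sketch (the history-server host/port and the app-id placeholder are assumptions):

import scala.io.Source

val history = "http://spark-history:18080" // assumed host/port

// List completed applications.
val apps = Source.fromURL(s"$history/api/v1/applications").mkString

// Per-executor metrics for one run: memory used, task time, GC time, etc.
// "app-id" is a placeholder for a real application ID from the list above.
val executors =
  Source.fromURL(s"$history/api/v1/applications/app-id/executors").mkString
// Persist these alongside the chosen configuration to refine future estimates.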
Conclusions
• Enterprise clusters are noisy, shared environments
• Configuring Spark can be labor intensive & error prone
– Significant trial-and-error process
• Important to automate the Spark configuration process
– Removes the requirement for data scientists to understand Spark and cluster configuration details
• Software automatically reconciles analysis requirements with available resources
• Even simple estimators can do a good job
More details?
• Too little time…
– Will blog more details, example resourcing, and Scala code fragments
alpinedata.com/blog