1
SOME WORKLOAD SCHEDULING ALTERNATIVES IN THE HIGH PERFORMANCE COMPUTING ENVIRONMENT
Jim McGalliard, FEDSIM Tyler Simon,
University of Maryland Baltimore County
CMG Southern RegionRichmond April 25, 2013
2
TOPICSMAINFRAME VS. HPC WORKLOADSHPC WORKLOAD SCHEDULINGBACKFILLDYNAMIC PRIORITIZATIONMAPREDUCE FRAMEWORK & HADOOPHADOOP WORKLOAD SCHEDULINGDYNAMIC PRIORITIZATION IN HADOOPSOME RESULTS
3
HPC Vs. Mainframe Workload SchedulingCMG’s historical context is MVS on S/360In a traditional mainframe environment, a workload is
runs on a single CPU or perhaps a few CPUsProcessor utilization is optimized on one level by
dispatching a new job on a processor when the current job idles that processor, e.g., by starting an I/O
Possible to achieve very high actual processor utilization (e.g., 103%) due to a variety of optimizations, of which job swapping is one
Not how it’s done on HPC
4
HPC vs. Mainframe Workload SchedulingCurrent generation HPCs have many CPUs – say, tens
of thousands. Due to scale economies, clusters of commodity processors are the most common design. Custom-designed processors (e.g., Cray for many years) are too expensive in terms of price vs. performance.
Many HPC applications map some system – typically, a physical system (e.g., Earth’s atmosphere) – into a matrix or network of many similar nodes.
6
HPC vs. Mainframe Workload SchedulingThe application software simulates the behavior or
dynamics of the system represented by assigning parts of the system to nodes of the cluster. After processing, final results are collected from the nodes.
Often, more nodes can represent the behavior of the system more accurately using more nodes.
7
HPC Workload SchedulingSo, in contrast to a traditional mainframe workload, HPC
jobs may use hundreds or thousands of processor nodes simultaneously.
Halting job execution on 1,000 CPUs so that a single CPU can start an I/O operation is not efficient.
Some HPCs have checkpoint/restart capabilities that could allow job interruption and also easier recovery from a system failure. Most do not.
On an HPC system, typically, once a job is dispatched, it is allowed to run uninterrupted until completion.
8
HPC Workload SchedulingThe use of queues and priority classification is common in
HPC environments, as it is with mainframes.A job class might be categorized by the maximum number
of processors a job might require, the estimated length of processing time, or the need for specialized resources.
Users or user groups may be allocated budgets of CPU time or of other resources.
A consequence of such queue structures is that users must estimated their jobs’ processing times. These may not be accurate.
9
HPC Workload SchedulingCluster HPCs often have very low actual processor
utilization – say, 10 or 20%.High reported utilization may measure CPU allocation
rather than CPU-busy timeWorkload optimization by the sys admin is done at the
job rather than CPU level – deciding which jobs will be dispatched on which processors in what sequence.
The ability of an individual job to use its allocated processors is of interest and subject to Amdahl's Law, among other constraints, but is outside the scope of this presentation.
10
The Knapsack Problem
Finding the optimum sequence of jobs, such that, e.g., utilization or throughput is maximized or expansion is minimized, is an example of the Knapsack Problem.
Finding the optimum in a Knapsack Problem is often impossible.
Some variation of the “Greedy algorithm,” in which jobs are sorted on their value, is often implemented.
11
Backfill
HPCs are usually over-subscribed.
Backfill is commonly used to increase processor utilization in an HPC environment.
15
Backfill
Backfill is a common HPC workload scheduling optimization at the job-level.
Note, it depends on the programmer’s estimate of the execution time, which is likely to be influenced by the priority queue structure for that data center and machine.
16
Dynamic, Multi-Factor Prioritization
Bill Ward and Tyler Simon, among others, have proposed setting job priorities as the product of several factors:
(Wait Time)Wait Parameter
* (Est’d Proc Time)Proc Parameter
* (CPUs Requested)CPU Parameter
* Queue = Priority
17
Dynamic Priorities with Budgets
Prioritization can also include a user allocation or budget.
Users can increase or decrease the priorities of their jobs by bidding dollars or some measure of their allocation.
Higher bids mean faster results.Lower bids mean more jobs can be run, at a cost of
waiting longer for results.Similar to 1970s time-sharing charge-back systems.
18
MapReduce
As mentioned, Amdahl’s Law explains why it is in an important HPC environment for the application to be coded to exploit parallelism.
Doing so – programming for parallelism - is an historically difficult problem.
MadReduce is a response to this problem.
19
MapReduceMapReduce is a method for simple implementation of
parallelism in a program. The programmer inserts map() and reduce() calls in the code, the compiler, dispatcher, and operating system take care of distributing the code and data onto multiple nodes.
The map() function distributes the input file data onto many disks and the code onto many processors. The code processes this data in parallel. For example, weather parameters in discrete patches of the surface of the globe.
20
MapReduce
The reduce() function gathers and combines the data from many nodes into generates the output data set.
MapReduce makes it easier for programmers to implement parallel versions of their applications by taking care of the distribution, management, shuffling, and return to and from the many processors and their associated data storage.
21
MapReduce
MapReduce can also take care of reliability by distributing the data redundantly.
Also takes care of load-balancing and performance optimization by distributing the code and data fairly among the many nodes in the cluster.
MapReduce may also be used to implement prioritization by controlling the number of map and reduce slots created and allocated to various users, classes of users, or classes of jobs – the more slots allocated, the faster that user or job will run.
23
MapReduce
MapReduce was originally developed for use by small teams where FIFO or “social scheduling” was sufficient for workload scheduling.
It has grown to be used for very large problems, such as sorting, searching, and indexing large data sets.
Google uses MapReduce to index the world wide web and holds a patent on the method.
MapReduce is not the only framework for parallel processing and has opponents as well as advocates.
24
HADOOP
Hadoop is a specific open source implementation of the MapReduce framework written in Java and licensed by Apache
Hadoop also includes a distributed file systemDesigned for very large (thousands of processors) systems
using commodity processors, including grid systemsImplements data replication for checkpoint (failover) and
localityLocality of processors with their data help network
bandwidth
25
Applications of Hadoop
Data intensive applicationsApplications needing fault
toleranceNot a DBMSYahoo; Facebook; LinkedIn;
AmazonWatson, the IBM system that won
on Jeopardy used Hadoop
27
Hadoop Workload Scheduling
Task Tracker attempts to dispatch processors on nodes near the data (e.g., same rack) the application requires. Hadoop HDFS is “rack aware”
Default Hadoop scheduling is FCFS=FIFOWorkload scheduling to optimize data locality and
also optimize processor utilization are in conflict.There are various alternatives to HDFS and the
default Hadoop scheduler
28
Hadoop Scheduling Alternatives: Fair-Share Scheduler
Fair-Share is a workload scheduler designed to replace FCFS
Organized into pools (a default of one per user)Goal: Ensure Fair Resource UseEach pool entitled to a minimum share of the clusterCan be FIFO or Fair-Share within a poolShares map() and reduce() slots
29
Fair-Share
If excess capacity is available, can be given to a pool that has its minimum
Allows maximum jobs/userAllows maximum jobs/poolAllows pre-emption of jobsCan use size (# of tasks) or priorities
30
Hadoop Scheduling Alternatives: Capacity Scheduler
Organized into queuesQueues share a percent of the clusterGoal: Maximize utilization & throughput, e.g., fast
response time for small jobs and guaranteed response time for production jobs
FIFO within queuesUses pre-emptionQueues can be started and stopped at run timeIf excess capacity is available, can be given to any queue
31
Dynamic Prioritization in Hadoop
Tyler Simon, a collaborator of mine in prior Southern CMG presentations, has studied the application of dynamic prioritization in the Hadoop environment.
An additional consideration in his study: fractional job allocation. Dispatching the job on less than the total number of CPUs requested by the user. Massively parallel jobs can tolerate this.
32
Dynamic Prioritization in Hadoop
Priorities calculated based on same factors as before: actual wait time, estimated processing time, and number of CPUs requested.
Queues are typically structured around those same factors. Using a single queue and reflecting the significance of those factors in the priority calculation is more flexible than fixed queues.
System administrator must still set parameters (relative importance of each factor).
33
Dynamic Prioritization in Hadoop Vs. Fair ShareBackfillNo Pre-emptionFixed shares are optionalImproved Workload Response TimeImproved UtilizationPriorities are dynamic – recalculated at each
scheduling interval
34
Dynamic Prioritization in Hadoop Vs. Ward
Compare to Ward’s proposed dynamic prioritization Use of parameters is the sameSysAdmin implements policy by selecting
parameters the same, but…..Only one queueIncludes fractional Knapsack – fraction of a job may be dispatched to fill the Knapsack
35
Dynamic Prioritization in Hadoop Vs. Ward
Option: System can select its own parameters – self-tuning: run the optimization globally, then calculate priorities when a job completes (or is submitted, or on a schedule)
Or, run the optimization whenever a job completes, selecting the job(s) to run next to yield near-optimum performance for jobs currently in the queue
Can be extended to prioritize on other parameters – e.g., time of day, day of week, reliability, power consumption, etc.
36
Algorithm – Prioritization
Source:“Multiple Objective Scheduling of HPC Workloads Through Dynamic Prioritization”Tyler Simon,Phoung Nguyen,& Milton Halem
42
In Conclusion…
Tyler’s results indicate about a 35% reduction in workload response time (batch turnaround time) compared to the default Hadoop scheduler
Many ideas have been published about workload scheduling and optimization in both HPC and Hadoop
Not clear who will come out on topA lot outside the scope of this presentation
43
Bibliography
Glassbrook, Richard, and McGalliard, James. “Performance Management at an Earth Science Supercomputer Center.” CMG 2003.
Heger, Dominique. “Hadoop Performance Tuning – A Pragmatic & Iterative Approach” www.cmg.org/measureit/issues/mit97/m_97_3.pdf
Kumar, Rajeev and Banerjee, Nilanjan. “Analysis of a Multiobjective Evolutionary Algorithm on the 0-1 Knapsack Problem” Theoretical Computer Science 358 (2006).
44
Bibliography• Nguyen, Phoung; Simon, Tyler A.; Halem, Milton;
Chapman, David; and Li, Quang Li. A Hybrid Scheduling Algorithm for Data Intensive Workloads in a Map Reduce Environment, 5th IEEE/ACM International Conference on Utility and Cloud Computing (UCC2012), pp. 161-167, Chicago, Il, (Nov. 5-8, 2012.)
• Rao, B. Thirumala and others. “Performance Issues of Heterogeneous Hadoop Clusters in Cloud Computing.” Global Journal of Computer Science and Technology Volume XI Issue VII May 2011.
• Sandholm, Thomas, and Kai, Kevin. “Dynamic Proportional Share Scheduling in Hadoop,” Proceedings of the 15th International Conference on Job Scheduling Strategies for Parallel Processing, JSSP ‘10. Berlin, Spring-Verlag.
45
Bibliography
Scavlex. http://compprog.wordpress.com/2007/11/20/the-fractional-knapsack-problem.
Simon, Tyler and McGalliard, James. “Multi-Core Processor Memory Contention Benchmark Analysis Case Study,” CMG 2009, Dallas.
Simon, Tyler and others. “Multiple Objective Scheduling of HPC Workoads Through Dynamic Prioritization” HPC 2013, Spring Simulation Conference, The Society for Modeling & Simulation International.
46
BibliographySpear, Carrie and McGalliard, James. “A Queue Simulation
Tool for a High Performance Scientific Computing Center.” CMG 2007, San Diego.
Ward, William A. and others. “Scheduling Jobs on Parallel Systems Using a Relaxed Backfill Strategy.” Revised Papers from the 8th International Workshop on Job Scheduling Strategies for Parallel Processing, JSSP ’02, London, Springer-Verlag.