Energy- and Performance-Aware Scheduling of Tasks on Parallel and Distributed Systems

HAFIZ FAHAD SHEIKH, University of Texas at Arlington
HENGXING TAN, University of Florida at Gainesville
ISHFAQ AHMAD, University of Texas at Arlington
SANJAY RANKA and PHANISEKHAR BV, University of Florida at Gainesville

Enabled by high-speed networking in commercial, scientific, and government settings, the realm of high performance is burgeoning with greater amounts of computational and storage resources. Large-scale systems such as computational grids consume a significant amount of energy due to their massive sizes. The energy and cooling costs of such systems are often comparable to the procurement costs over a year. In this survey, we discuss allocation and scheduling algorithms, systems, and software for reducing the power and energy dissipation of workflows on the target platforms of single processors, multicore processors, and distributed systems. Furthermore, we investigate recent research achievements that deal with power and energy efficiency via different power management techniques and application scheduling algorithms. The article provides a comprehensive presentation of the architectural, software, and algorithmic issues for energy-aware scheduling of workflows on single, multicore, and parallel architectures. It also includes a systematic taxonomy of the algorithms developed in the literature based on the overall optimization goals and characteristics of applications.

Categories and Subject Descriptors: C.4 [Computer System Organization]: Performance of Systems

General Terms: Algorithms, Performance, Measurement

Additional Key Words and Phrases: Energy-aware scheduling, task allocation algorithms, dynamic voltage and frequency scaling, dynamic power management

ACM Reference Format:
Sheikh, H. F., Tan, H., Ahmad, I., Ranka, S., and Bv, P. 2012. Energy- and performance-aware scheduling of tasks on parallel and distributed systems. ACM J. Emerg. Technol. Comput. Syst. 8, 4, Article 32 (October 2012), 37 pages.
DOI = 10.1145/2367736.2367743 http://doi.acm.org/10.1145/2367736.2367743

1. INTRODUCTION

Massive energy consumption is an escalating threat to the environment. The explosive growth of computers significantly increases the consumption of precious natural resources such as oil and coal, aggravating the looming crisis of energy shortage. Studies have reported that computers consume more than 8% of the total energy produced, and this fraction is growing [Andreae 1991; Green Grid 2012]. A report by Dataquest [1992] stated that the total worldwide power expenditure of processors in PCs was 160 MW in 1992 and had grown to 9000 MW by the

This work was supported by a grant from the National Science Foundation under contract nos. CCF-0905308 and CRS-0905196.
Authors' addresses: H. F. Sheikh, University of Texas at Arlington, TX; email: Hafizfahad.sheikh@mavs.uta.edu; H. Tan, University of Florida at Gainesville, FL; I. Ahmad (corresponding author), University of Texas at Arlington, TX; email: [email protected]; S. Ranka, P. Bv, University of Florida at Gainesville, FL.
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies show this notice on the first page or initial screen of a display along with the full citation. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any component of this work in other works requires prior specific permission and/or a fee. Permissions may be requested from Publications Dept., ACM, Inc., 2 Penn Plaza, Suite 701, New York, NY 10121-0701 USA, fax +1 (212) 869-0481, or [email protected].
© 2012 ACM 1550-4832/2012/10-ART32 $15.00

DOI 10.1145/2367736.2367743 http://doi.acm.org/10.1145/2367736.2367743

ACM Journal on Emerging Technologies in Computing Systems, Vol. 8, No. 4, Article 32, Pub. date: October 2012.


year 2001. A large percentage (around 42%) of U.S. firms already expect to reach their power and cooling capacities within the next few years [Uptime 2012]. By addressing these issues, IT organizations are motivated to seek better management methods for meeting increasing computing, network, and storage demands while lowering energy usage, remaining competitive, and meeting future business needs. Power-aware computing is important to large-scale systems because they consume a large portion of the energy used by IT devices (see the EPA's report to Congress [USEPA 2007]). Novel resource management strategies that treat energy consumption upfront as a precious resource and provide means to manage it at scale along with performance are sorely needed.

Enabled by high-speed networking in commercial, scientific, and government settings [ATLAS 1999; CMS 2012; Loveday 2002], the realm of high-performance computing is burgeoning with both computational and storage resources [ATLAS 1999; Loveday 2002; NASAES 2012; CMS 2012]. Large-scale systems such as computational grids consume a significant amount of energy due to their massive sizes [Aea 2008; Feng and Cameron 2007]. The energy and cooling costs of such systems are often comparable to the procurement costs over a three-year period [ORNL 2012; Bland 2006].

In this survey, we discuss allocation and scheduling algorithms, systems, and software for reducing power and energy dissipation of workflows on the target platforms of single processors, multicore processors, and distributed systems. Furthermore, we investigate recent research achievements that deal with power and energy efficiency via different power management techniques and scheduling algorithms.

The key contributions of our survey are the following:

(1) It provides a comprehensive presentation of the architectural, software, and algorithmic issues for energy-aware scheduling of workflows on singlecore, multicore, and parallel architectures.

(2) It provides a systematic taxonomy of the algorithms developed in the literature based on the overall optimization goals and characteristics of applications.

The rest of the article is organized as follows. In Section 2, we define the workflow allocation problem. In Section 3, we provide a general overview of different energy-aware mechanisms and systems. Section 4 organizes the research efforts on energy-aware task allocation into different categories while providing details of each work. Section 5 discusses some important aspects based on the details presented in Section 4. Section 6 concludes the work and highlights some future research directions.

2. ENERGY-AWARE TASK ALLOCATION (EATA) PROBLEM

In this section, we first formulate the energy-aware task allocation problem in general terms, without assuming a particular platform or workload. We then present various models used to represent different types of applications, as well as systems, while solving this problem.

2.1. Problem Formulation

The goal of the energy-aware task scheduling problem is to assign tasks to one or more cores so that performance and energy objectives are simultaneously met. Thus, algorithms for EATA typically need to solve a multiobjective optimization problem (i.e., reduce energy, enhance performance, or both). There is no unique solution [Ghazaleh et al. 2003; Subrata et al. 2008] to a MultiObjective Optimization (MOO) problem. As illustrated in Figure 1, a MOO problem arises when multiple objective functions conflict: for a given application and processor system, termed (J, M), neither power consumption nor schedule length can be improved without trading off the other one,


Fig. 1. Illustration of the MOO-NLP problem (adapted from Ahmad et al. [2008]).

along the curve AB, the so-called Pareto-optimal frontier. Point A is the operating point at which power consumption is minimized, while point B minimizes the schedule length. From the given Pareto front, we can select appropriate operating points along curve AB to suit various constraints and requirements. The MOO problem combined with Multiple Elements optimization (ME) becomes the MEMO problem.
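For a finite set of candidate schedules, the Pareto-optimal frontier can be extracted with a simple dominance filter. The sketch below is illustrative only; the (energy, schedule-length) pairs are hypothetical values, not taken from the survey:

```python
def pareto_front(points):
    """Keep the non-dominated (energy, schedule length) pairs.

    A point p is dominated if some other point q is no worse in both
    objectives and differs from p (i.e., is strictly better in at least one).
    """
    front = []
    for p in points:
        dominated = any(q[0] <= p[0] and q[1] <= p[1] and q != p for q in points)
        if not dominated:
            front.append(p)
    return front

# Hypothetical candidate operating points: (energy in J, schedule length in s).
candidates = [(10.0, 9.0), (12.0, 6.0), (15.0, 5.0), (14.0, 7.0), (11.0, 8.0)]
print(sorted(pareto_front(candidates)))
# [(10.0, 9.0), (11.0, 8.0), (12.0, 6.0), (15.0, 5.0)] -- (14.0, 7.0) is dominated
```

Every surviving point corresponds to a position on the curve AB: moving along the sorted front trades energy for schedule length, and the dominated point (14.0, 7.0) lies strictly inside the frontier.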

There are three broad goals that have been found to be of practical importance.

(1) Minimize energy consumption with an allowable reduction in quality of service, known here as Performance-Constrained Energy Optimization (PCEO). Energy is minimized subject to the constraint that the additional time needed to complete tasks (the performance degradation) does not push the execution (response) time beyond a given requirement; the tolerated degradation can lead to direct and indirect energy savings.

(2) Minimize the time required for execution when given a total energy budget. This is the opposite end of the spectrum from the first objective: energy is fixed and performance is maximized. This survey refers to this kind of approach as Energy-Constrained Performance Optimization (ECPO).

(3) Optimize a combination of the previous two goals, balancing performance and energy efficiency simultaneously. The target of such approaches is to satisfy the imposed time and energy requirements and, at the same time, minimize the penalty induced by requirement violations, known here as Dual Energy and Performance Optimization (DEPO).


2.2. Power and Energy Model

Before introducing power management techniques, we begin with the fundamental terms and equations. The relationship between power and energy can be formally defined as

E = ∫_T P · dt,

where P denotes power, T denotes the time duration, and E denotes the energy consumed. Energy represents the accumulated work performed during a time period, measured in joules (J), whereas power is the rate of work over time, measured in watts (W); as the time interval shrinks to a point, power becomes the instantaneous rate of work at that instant. Recognizing this difference makes clear that, in some cases, reducing power consumption may not reduce energy usage. For instance, if we lower the CPU speed, power consumption decreases; however, the execution time may be prolonged by the reduced performance, which can lead to the same energy usage to complete the assigned workloads. An estimate of the power consumption of a task or a set of tasks is a key element in static scheduling and allocation schemes for improving the energy consumption of different platforms. Accurate power consumption profiles can be obtained via profiling, but this can only be done completely when the workload is fully known ahead of time. Since this is often not the case, power consumption needs to be estimated.
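The power/energy distinction above can be checked numerically. In this sketch (hypothetical power traces with uniform sampling, not measurements from any real system), halving the power while doubling the execution time leaves the consumed energy unchanged:

```python
def energy(power_samples, dt):
    """Approximate E = integral of P dt with a left Riemann sum
    over uniformly spaced power samples."""
    return sum(power_samples) * dt

# Hypothetical task: at full speed it draws 20 W for 10 s...
fast = energy([20.0] * 100, dt=0.1)   # 100 samples, 0.1 s apart
# ...at half speed it draws 10 W but runs for 20 s.
slow = energy([10.0] * 200, dt=0.1)
print(fast, slow)  # 200.0 200.0 -- half the power, same energy
```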

Leakage and Dynamic Power. Power expenditure in CMOS circuits has two components: dynamic and leakage power. Leakage current in any circuit with an applied voltage generates leakage power [Chandrakasan et al. 1992]. This kind of power is therefore determined only by the voltage and circuit physics. Since leakage power has no relationship with the clock rate of computer components or the workloads applied to them, it was ignored by most early research on power-aware computing.

By contrast, dynamic power consumption results from switching activities in the circuit and, in general, depends on the clock rate and the application workloads applied to the components. Formally, dynamic power consumption comes from two sources: switching capacitance and short-circuit current [Venkatachalam and Franz 2005]. Of the two, the switched capacitance plays the primary role in dynamic power dissipation, and the short-circuit current is hard to reduce. Thus, the dynamic power model is usually abstracted as

P_Dynamic = Cc · V² · f,

where dynamic power, P_Dynamic, is determined by the switching capacitance Cc, the frequency f (clock rate), and the supply voltage V [Venkatachalam and Franz 2005]. Among these parameters, the switching capacitance is a physical characteristic of the circuit and is therefore considered constant. Many techniques can scale the supply voltage and frequency over a large range, so these two parameters attract a large portion of the attention in power-conscious computing research.

However, only some researchers have yet considered leakage power effects alongside dynamic power. In general, with each new technology generation targeted at the increasing demand for computation, the size of CMOS devices shrinks by approximately 30%. The decrease in device dimensions, coupled with the reduction of threshold voltage, can increase leakage power by as much as five times per technology generation [Borkar 1999]. Therefore, it is necessary to consider both leakage and dynamic power during energy-aware scheduling. It is possible to reduce


dynamic power consumption using DVFS by reducing the voltage level when slack is available. However, the longer execution time caused by voltage reduction leads to higher dissipation of leakage power. Reducing the voltage level beyond a certain point is not beneficial because leakage power dominates [Jejurikar et al. 2004]: the increase in leakage power exceeds the decrease in dynamic power. Both leakage and dynamic power should therefore be considered during scheduling decisions.
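The leakage/dynamic trade-off can be illustrated with a toy model: the energy for a fixed number of cycles is (constant leakage power + cubic dynamic power) times execution time. All constants below (leakage power, switching coefficient, cycle count, frequency range) are made-up values for illustration; with them, the energy-minimal frequency sits in the middle of the range rather than at the lowest setting, matching the point about leakage dominance at low voltage:

```python
def total_energy(f_hz, cycles=1e9, p_leak=0.5, k=1e-27):
    """Energy to execute `cycles` cycles at frequency f_hz:
    (leakage power + dynamic power k*f^3) * execution time (cycles / f)."""
    t = cycles / f_hz
    return (p_leak + k * f_hz ** 3) * t

# Candidate frequencies from 100 MHz to 1 GHz.
freqs = [i * 1e8 for i in range(1, 11)]
best = min(freqs, key=total_energy)
print(best / 1e6, "MHz")  # 600.0 MHz -- neither the slowest nor the fastest setting
```

Below this "critical speed," the extra leakage energy from the longer runtime outweighs the dynamic energy saved by slowing down.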

2.3. Application Model

Parallel applications, such as scientific, mathematical, and bioinformatics problems, typically consist of workflows involving multiple tasks, with or without precedence constraints. In some cases, the user or application is interested in the completion of the entire workflow; in other cases, tasks are considered individually, without dependency constraints. In the following subsections, we briefly describe the characteristics of a large class of these applications.

Independent tasks. Numerous studies [Lu et al. 2000; Yu and Prasanna 2002; Aydin et al. 2004; Ge et al. 2005; Zhu and Mueller 2005; Shin and Choi 2007; Seo et al. 2008; Zhang and Chatha 2007] build on an independent task model that does not consider precedence relations between tasks. These models are most useful for general-purpose CPUs and applications such as desktops and servers. For parallel computing environments, communication between tasks is not considered in the independent task model.

In general, three kinds of independent task models are of interest: periodic, aperiodic, and sporadic tasks [Buttazzo 2005]. Identical jobs that repeat continuously after a constant period are called periodic tasks. If a periodic task with period T is initialized at ϕ, then the triggering/activation time of the kth instance is ϕ + (k − 1)T. A periodic task is usually characterized by its worst-case execution time (wcet) and its relative deadline (D). Aperiodic tasks differ from periodic tasks in having a variable time between task repetitions. A hard aperiodic task can be characterized by its arrival time, worst-case execution time, and relative deadline; a soft aperiodic task has no strict deadline. Sporadic tasks consist of an infinite sequence of identical activities with irregular activations. Though tasks can be triggered arbitrarily, every consecutive pair of activations is separated by at least a defined minimum inter-arrival time. The attributes of a sporadic task are usually its relative deadline, minimum inter-arrival time (λ), and wcet. Scheduling algorithms based on these task models are investigated in later sections.
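These task models can be encoded directly from their definitions. The sketch below captures the periodic activation rule ϕ + (k − 1)T and the sporadic minimum inter-arrival constraint; the class and function names are our own, not from the survey:

```python
from dataclasses import dataclass

@dataclass
class PeriodicTask:
    phase: float     # phi: activation of the first instance
    period: float    # T
    wcet: float      # worst-case execution time
    deadline: float  # D: deadline relative to each activation

    def activation(self, k):
        """Activation time of the k-th instance: phi + (k - 1) * T."""
        return self.phase + (k - 1) * self.period

    def absolute_deadline(self, k):
        return self.activation(k) + self.deadline

def valid_sporadic(arrivals, min_interarrival):
    """Sporadic model: consecutive activations at least lambda apart."""
    return all(b - a >= min_interarrival for a, b in zip(arrivals, arrivals[1:]))

t = PeriodicTask(phase=2.0, period=10.0, wcet=3.0, deadline=8.0)
print(t.activation(4), t.absolute_deadline(4))  # 32.0 40.0
print(valid_sporadic([0.0, 5.0, 12.0], 4.0))    # True
```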

In heterogeneous grid computing environments, applications are often modeled with the bag-of-tasks (BoT) application model [Chung et al. 1999; Kim et al. 2007; Lee and Zomaya 2007], which consists of a set of independent tasks without any inter-task communication. An application is only considered finished when the whole task set completes its execution. A typical BoT application, termed J, is composed of multiple independent heterogeneous tasks, denoted by {T1, T2, . . . , Tn}, with no precedence relationships among them. BoT models can be distinguished as compute intensive or data intensive [Lee and Zomaya 2007]. In the compute-intensive BoT model, the major cost (execution time) comes from computation, while the time spent retrieving data and transporting it from one node to another is small and can be ignored. In the data-intensive BoT model, one needs to consider the cost of retrieving data. The data sources of task Ti in job J can be a set of objects, denoted by {Ii,1, Ii,2, . . . , Ii,d}. The cost of transferring input data among tasks, that is, of changing the relationship between data objects and tasks, is considerable and can impact total performance. Furthermore, a data source can supply data to multiple tasks. This data-intensive model is more realistic than the compute-intensive model, especially in distributed systems, because data transportation may happen frequently between nodes or servers.
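A minimal cost model separating the compute-intensive and data-intensive BoT views might look as follows. All quantities (operations, bytes, ops/s, bytes/s) are hypothetical; transfer time is charged only when a task's data source resides on a different node:

```python
def task_cost(compute_ops, data_bytes, data_local, cpu_speed, bandwidth):
    """Execution cost of one BoT task: compute time plus, in the
    data-intensive view, transfer time when the data source is remote."""
    transfer = 0.0 if data_local else data_bytes / bandwidth
    return compute_ops / cpu_speed + transfer

# Hypothetical task: 1e9 operations, 2e8 bytes of input data,
# a 1e9 ops/s processor, and a 1e8 bytes/s link between nodes.
print(task_cost(1e9, 2e8, data_local=False, cpu_speed=1e9, bandwidth=1e8))  # 3.0
print(task_cost(1e9, 2e8, data_local=True, cpu_speed=1e9, bandwidth=1e8))   # 1.0
```

With a remote data source the transfer dominates the compute time, which is exactly why placement of data objects matters in the data-intensive model.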


Fig. 2. A simple task graph.

Dependent tasks. When tasks are constrained by precedence relationships, the DAG (Directed Acyclic Graph) model is commonly used to represent them [Kang and Ranka 2008a, 2008b; Lee and Zomaya 2009]. A DAG (Figure 2) can illustrate the workflow among the tasks of an application in terms of G = (N, E), where N denotes the set of nodes representing tasks and E denotes the set of edges representing the dependency relationships among nodes. An edge between two nodes can also represent inter-task communication. The critical path is the longest path in the graph; given a task, its closest predecessor (the one with the latest finish time) is called its MIP (Most Influential Parent).

By attaching a weight attribute to each node, the computation cost can be quantified, while the weight on each edge between two tasks represents the communication cost between them. In practice, most algorithms only count edges whose end nodes are assigned to different processors or servers; that is, the communication cost between tasks allocated to the same processor or server is ignored. A typical use of the DAG application model is task scheduling via slack reclamation, the details of which will be discussed in the following sections. When distributing slack to tasks (equivalent to inserting a task into the free time slots between two other tasks), the precedence relationships reflected by the graph cannot be violated.
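The critical-path and MIP notions above can be sketched over a weighted DAG. The graph below is a small hypothetical example (not the graph of Figure 2), and for simplicity it assumes every edge crosses processors, so every communication cost is paid:

```python
def earliest_finish(node, weights, edges, memo=None):
    """Earliest finish time of `node`, assuming every edge crosses
    processors so its communication cost is always paid."""
    if memo is None:
        memo = {}
    if node not in memo:
        preds = [(u, w) for (u, v), w in edges.items() if v == node]
        start = max((earliest_finish(u, weights, edges, memo) + w
                     for u, w in preds), default=0)
        memo[node] = start + weights[node]
    return memo[node]

def most_influential_parent(node, weights, edges):
    """MIP: the predecessor delivering its result to `node` the latest."""
    preds = [(u, w) for (u, v), w in edges.items() if v == node]
    return max(preds, key=lambda uw: earliest_finish(uw[0], weights, edges) + uw[1])[0]

# node -> computation cost, (u, v) -> communication cost.
weights = {"T0": 6, "T1": 5, "T2": 3, "T3": 5}
edges = {("T0", "T1"): 2, ("T0", "T2"): 4, ("T1", "T3"): 1, ("T2", "T3"): 3}
print(earliest_finish("T3", weights, edges))          # 21 (critical-path length)
print(most_influential_parent("T3", weights, edges))  # T2
```

Slack-reclamation schedulers use exactly these quantities: tasks off the critical path have slack that can be absorbed by lowering their voltage/frequency without stretching the schedule.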


As observed from the previous discussion, tasks represented by different models may require different considerations during the task scheduling and allocation phase. Similarly, the opportunities for saving energy can differ based on the type of task model.

3. ENERGY-AWARE MECHANISMS AND SYSTEMS

In this section we will discuss basic underlying techniques used by various energy-aware task scheduling and allocation schemes.

3.1. Mechanisms and Methodologies for Power Management

3.1.1. ACPI. Almost all modern computer systems provide power management functionality. Initially, such power-aware policies or mechanisms were implemented inside the computer BIOS or chip firmware, making it inflexible to configure power events for devices. Several leading organizations in the PC industry agreed on a common standard configuration interface between computer hardware and software: ACPI (Advanced Configuration and Power Interface), published in 1999 [ACPI 1999], which enables operating-system-level power management (OSPM) to take over control of device configuration and power management events from the hardware. Thus the OS, as well as upper-layer applications, has a general interface for accessing the configuration and power information of either individual devices or the whole computer system. Furthermore, ACPI defines multiple power states for both the system and individual devices; the system or device works in one of these states, each representing a scenario with a different upper bound on power supply. For example, G0–G3 are defined as the global power states of the whole system: G0 represents the working state, in which the system has maximal power consumption, while G3 represents the mechanical-off state, in which the system is almost off and the only power consumption is from the RTC battery on the motherboard. Similarly, device power states are represented by D0–D3, and C0–C3 represent the processor power states, since the CPU occupies the largest portion of the power consumption of the whole system. Moreover, ACPI also defines performance states, represented by P0–Pn, and sleep states, represented by S0–Sn, where the maximal value of n depends on the device. Given the performance states, ACPI provides system functionality to dynamically scale the voltage or frequency to achieve different performance levels. Among the sleep states, deeper states imply less idle power consumption; however, more power is necessary to reactivate the device from such a state. ACPI now provides more advanced features such as mobile device management and flexible thermal management, and it brings significant flexibility and convenience for designing and implementing power management policies and algorithms at the OS and application level.

3.1.2. DVFS. From the dynamic power equation in Section 2.2, power consumption can be reduced by scaling the voltage and frequency of the CPU. Since the supply voltage applied to a circuit interacts with its frequency, dynamic frequency scaling is in most scenarios considered the same technique as dynamic voltage scaling, together termed Dynamic Voltage and Frequency Scaling (DVFS). The relationship between supply voltage and frequency can be approximated as [Venkatachalam and Franz 2005]

f ∝ (V − Vth)^α / V,

where f denotes the frequency, V the supply voltage, Vth the threshold voltage, and α a constant determined by circuit physics. Since, for a given circuit, the threshold voltage is almost constant, power consumption can be modeled


as a function of frequency or supply voltage. In particular, when α is approximated as 2, the relationship between frequency and supply voltage is linear, so dynamic power is considered cubic in frequency or supply voltage. More importantly, since the execution time on a CPU is inversely proportional to its clock rate, that is, its frequency, the problem of minimizing power consumption under a performance constraint can be modeled as a convex optimization problem. This is the basis of many energy-aware scheduling algorithms, which we investigate in the following sections.

In practice, DVFS may not achieve ideal experimental results due to two kinds of problems [Venkatachalam and Franz 2005]. First, the nature of workloads may be unpredictable; such complications can result from execution preemption by high-priority processes or from inaccurate estimates of future task execution times. Second, nondeterminism and anomalies in real systems can introduce errors into the idealized dynamic power model, in which dynamic power dissipation is cubic in supply voltage and execution time is inversely proportional to clock frequency.

Several empirical approaches apply DVFS to real systems; most can be categorized as interval-based, intertask, or intratask approaches [Venkatachalam and Franz 2005]. Interval-based approaches estimate CPU utilization in the near future by analyzing historical CPU utilization. Instead of analyzing CPU utilization, intertask approaches observe the characteristics of individual tasks, such as execution time and deadline, and assign appropriate voltages and CPU frequencies to each, thus reducing total power usage. Intratask approaches, on the other hand, exploit the fact that the workload and resource requirements of a task may vary over its execution period; they divide each task into fine-grained time slots and assign a CPU voltage and frequency to each slot.
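An interval-based approach can be sketched as a simple governor: predict the next interval's utilization from a window of past intervals, then choose the slowest available frequency whose capacity still covers the predicted load. The frequency list, window size, and headroom threshold below are arbitrary illustrative choices, not values from any real governor:

```python
def next_frequency(util_history, freqs, threshold=0.8, window=3):
    """Interval-based DVFS heuristic: predict next-interval utilization as
    the mean of the last `window` intervals, then pick the slowest frequency
    whose capacity (relative to the fastest) covers the predicted load."""
    recent = util_history[-window:]
    predicted = sum(recent) / len(recent)
    f_max = max(freqs)
    for f in sorted(freqs):
        if predicted * f_max <= threshold * f:
            return f
    return f_max

freqs = [600, 800, 1000, 1200]  # MHz, hypothetical P-states
print(next_frequency([0.4, 0.3, 0.2], freqs))   # 600: light load, slow clock
print(next_frequency([0.9, 0.95, 1.0], freqs))  # 1200: heavy load, full speed
```

Real interval-based schemes differ mainly in the predictor (weighted averages, exponential smoothing) and in how aggressively they leave headroom for misprediction.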

3.1.3. Scheduling Considerations with DVFS. In general, in a DVFS-enabled processor, a higher processing speed results in faster processing of tasks and hence a shorter schedule length, but consumes more energy. In contrast, slowing cores lowers energy consumption, but at the expense of increased execution time. Thus, the two objectives are incommensurate and in conflict with one another. In designing an energy-efficient system (hardware and software), the following issues are important.

(1) Optimizing power may not optimize energy. As mentioned in previous sections, energy can be considered the accumulation of power over a period of time [Gonzalez and Horowitz 1996]. A system that aggressively decreases the clock rate to reduce power may significantly increase execution time, thereby increasing overall energy consumption.

(2) Simplistic power-aware designs may increase power or energy consumption [Stout 2006; Pering et al. 2000]. Systematic, overall reduction in power can lead to better overall energy consumption [Usami and Horowitz 1995].

(3) Optimizing the average power consumption often optimizes the maximum power consumption [Liang and Ahmad 2006]. However, in some cases this could lead to an increase in peak power requirements [Luo and Jha 2000].

(4) The benefits of voltage scaling may be limited by the cost of overheads. Some analytical estimates of such overheads can be found in Brooks et al. [2000] and Burd et al. [2000], but there still seems to be a need for more accurate simulation.

Effect of CPU switching overheads. An important consideration in scheduling is the DVFS switching overhead, which has been studied for various available processors [Albonesi 2002; Brook and Rajamani 2003]. Another factor is that unused cores can waste energy unless completely powered down (and repowering may incur new overhead). Switching takes a certain amount of time that, despite continuous improvements,

ACM Journal on Emerging Technologies in Computing Systems, Vol. 8, No. 4, Article 32, Pub. date: October 2012.

Page 9: askdfjnaiow

Energy- and Performance-Aware Scheduling of Tasks on Parallel/Distributed Systems 32:9

incurs delays [Albonesi 2002]. DVFS switching is typically done in a synchronous orasynchronous manner. Traditional DVS-enabled processors use synchronous switchingin which execution is blocked. In asynchronous switching, execution is not blocked butit incurs a “ramp-up” effect in which a surge in current occurs to raise the voltage tothe required frequency. Asynchronous overhead is relatively lower but the frequencyof switching can outweigh the benefit. Sophisticated DVFS switching algorithms havebeen proposed [Albonesi 2002; Zhu and Mueller 2005], offering the selection of anappropriate frequency to meet the task slack. These factors should be considered tofine-tune scheduling algorithms.

3.1.4. Energy-Aware Software. Power efficiency is critical to wireless and mobile devices because of the strict power budget imposed by the battery. In Pering et al. [2006], the authors propose a framework for wireless devices, called CoolSpots, that can switch among different radio interfaces such as WiFi and Bluetooth, achieving substantial energy savings and increasing battery lifetime. In Flinn and Satyanarayanan [2004], the authors investigate extending the adaptation mechanisms of the Odyssey system to low-power environments. Built on the Coda system, a distributed file system for mobile devices, Odyssey manages all the resources and decides how to scale application quality to suit different scenarios. For example, over a weak network connection, the video stream transported between devices can be set to low quality to save bandwidth. Similarly, under a limited energy budget, PowerScope, the power management tool in Odyssey, lets the server (the device running applications and performing computation) negotiate with the client (the device sending requests and receiving results) to adjust the required application quality. In this way, short-term power consumption can be predicted from the agreed quality requirements so that global power conservation can be achieved. In Oikawa and Rajkumar [1999], the authors extend the Linux kernel to integrate dynamic power management techniques including DVFS. The resource kernel (RK) can be ported with small changes not only to Linux but also to other operating systems, hence the name portable RK. RK can isolate resources for applications with different performance or power requirements, so global energy savings can be achieved.
PowerDial is a middleware system for Grid Computing environments that dynamically adapts the behavior of running applications to respond to fluctuations in load, power, or any other event that threatens the ability of the computing platform to deliver adequate computing power to satisfy demand [Hoffmann et al. 2011]. It uses dynamic influence tracing to transform static application configuration parameters into dynamic control variables stored in the address space of the running application. These control variables are exposed as a set of dynamic knobs that can change the configuration of a running application (and therefore its point in the trade-off space) without interrupting service or otherwise perturbing the execution, and a control system uses these knobs to maintain performance in the face of load and power fluctuations.

For large-scale data centers, power or energy consumption certainly occupies a major part of the operating cost. While architecture-level dynamic power management techniques such as DVFS are widely adopted by current computing servers, programmers continue to seek ways to make their programs more energy efficient. Many cluster computing environments provide software to help programmers estimate the cost of each block of their source code; however, such tools cannot evaluate and self-adjust the impact of those estimates on program behavior. In Baek and Chilimbi [2010], the authors propose Green, a framework that can provide approximations of the functions and loop blocks in programs. It also provides a model to evaluate the loss in performance (QoS) due to the approximation. During such a "calibration" process, the approximation can be adjusted in a timely and accurate fashion.

For dynamic task scheduling, where the global task queue and resource allocation are decided before execution but may change during runtime, resource and energy monitoring techniques play a key role at both the architecture and system levels. In Ge et al. [2010], the authors propose a power monitoring tool, PowerPack, which can obtain component-level power measurements at function-level granularity in large-scale systems. This framework provides functionality to monitor, track, and analyze the performance and power consumption of applications in distributed systems. The PowerPack toolkit is capable of profiling the power consumption of both individual components and application functions.

4. EATA ALGORITHMS

4.1. Overview

Scheduling algorithms for single processors are comparatively simple. Since offline task scheduling on single processors can be shown to have globally optimal solutions, online algorithms and complex task models for single processors have attracted recent research interest. Moreover, even though the CPU consumes the largest share of power among computer components, the power savings gained by scaling down other devices, such as caches and communication links, cannot be ignored when the number of CPUs in the system is small. Unlike components such as the CPU and storage devices, whose speed can be scaled, devices that cannot be scaled pose more complex power management issues. A typical case is that prolonging the execution time of tasks on a nonscaling device can increase its energy dissipation, while the same action on the CPU can reduce energy consumption. Thus, power-aware scheduling algorithms should also take the effects on these nonscaling devices into account. In Swaminathan and Chakrabarty [2005], the authors propose switch-off policies for devices that have no workload during a period; their strategy can be applied to systems without DVFS. In Jejurikar and Gupta [2005b], a heuristic is provided for systems with discrete execution states to determine the appropriate state (speed) of a given device. Similarly, Mochocki et al. [2007] take not only the CPU but also the network interface into account to conserve energy. Zhang et al. [2005] introduce a cache-tuning policy to save energy by reconfiguring cache modes.

Most of the energy supplied to multicore systems is consumed by the CPUs. Thus dynamic scaling techniques on the CPU, such as DVFS, can effectively achieve power savings for multicore systems. There has been considerable research on developing algorithms for scheduling and assigning tasks on DVFS-enabled multicore processors [Aydin et al. 2004; Zhu et al. 2003; AMD 2008], mainly addressing the independent or real-time task model. As a derivative of DVFS-based scheduling, slack reclamation techniques have been proposed for various system and application models [Felter et al. 2005; Feng and Cameron 2007; Kang and Ranka 2008a, 2008b]. The basic idea is motivated by the fact that some tasks may complete earlier than their required deadline, leaving unused time periods that can be allotted to incomplete tasks [Aydin et al. 2004; Jejurikar and Gupta 2005a]. Besides device scaling techniques, another general methodology for saving energy in multicore systems is workload consolidation [Jerger et al. 2007], which reduces power dissipation by scheduling workloads onto a minimal set of active computing resources. A special case of this technique is VM consolidation on distributed server clusters. From the perspective of scheduling algorithms, multicore systems are very similar to distributed systems. Since architectural details are out of the scope of this survey, we only give a high-level classification of uniprocessing and multiprocessing platforms where different scheduling algorithms run.

Table I. Objective Functions to Perform Energy and Execution Time Trade-Offs for Each Scenario

Function | Objective | Description
Performance-Constrained Energy Optimization (PCEO) | Minimize energy consumption with a permitted loss in quality of service. | Assume a schedule for a workflow that optimizes the execution time on a set of processors is available. Determine an adjusted schedule that minimizes the energy by considering an additional slack allowed over the execution time.
Energy-Constrained Performance Optimization (ECPO) | Minimize the execution time under a given energy requirement. | Determine the normal energy requirement of an application, and then find the minimum execution time with the total energy budget reduced by a given factor, say 70%.
Dual Energy and Performance Optimization (DEPO) | Minimize the overall penalty for violating the timing and energy constraints. | A budget is given for the energy and execution requirements of all the cores, and the violation of any constraint incurs a penalty. The overall penalty is to be minimized.
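The slack-reclamation technique mentioned above can be sketched in a few lines. This is a minimal illustrative simplification (the ordering, the minimum-speed floor, and the hand-off rule are our own assumptions), not the exact algorithm of any one surveyed paper:

```python
# Minimal slack reclamation: when a task finishes before its worst-case
# time, the leftover slack is handed to the next task, which can then run
# at a lower speed and save energy.

MIN_SPEED = 0.5  # assumed lowest DVFS speed ratio

def reclaim_speeds(tasks):
    """tasks: ordered list of (worst_case_time, actual_time) pairs, both
    measured at full speed. Returns the speed ratio chosen for each task
    (1.0 = full speed); lower ratios trade time for energy."""
    speeds, slack = [], 0.0
    for wcet, actual in tasks:
        budget = wcet + slack                    # own budget + inherited slack
        speed = max(actual / budget, MIN_SPEED)  # stretch work into the budget
        elapsed = actual / speed
        slack = budget - elapsed                 # unused time flows onward
        speeds.append(round(speed, 3))
    return speeds
```

For example, a task that needs only 2 of its 4 budgeted time units can be run at half speed, while a task with no slack available runs at full speed.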

4.2. Algorithm Taxonomy

In this section, we will first present the taxonomy for the algorithms and approaches used for energy-aware scheduling, followed by the details of work related to each category. While summarizing the corresponding research for multicore and distributed systems, we also present some details of the approaches used for single-processor systems. This helps illustrate how the scope of the problem, as well as the solution techniques, changes with the type of system under consideration. The main actions for trading off performance and energy are summarized in Table I.

(1) Scenario 1. Suppose there is a need to encode real-time H.264 video at 30 frames per second to ensure the best possible quality for a live video stream, or to solve a set of fluid dynamics equations for a weather forecast to be broadcast in that evening's news. In these cases, there are deadlines (although not hard deadlines like those in hard real-time systems) that need to be met along with the goal of minimizing energy consumption. This corresponds to the PCEO function.

(2) Scenario 2. It is an unusually hot July, the peak month in energy consumption (depending on location, the peak month could also be December or January). Energy is scarce and increasingly expensive. A system manager has to ration energy based on a given energy budget. A physics user of the machine needs to execute an application including FFT (Fast Fourier Transform) and sparse linear algebra tasks, which takes hours to days. The manager determines the application's normal power requirement and then reduces the allocated energy budget by a given factor (say 70%). This corresponds to the ECPO function.

(3) Scenario 3. A user has to complete her parallel programming assignment, for which there are penalties for late submission, but she has a limited energy quota. The scheduling algorithm should optimize both energy and performance. The objective in this case is to find the optimal matches between tasks and processors that balance minimizing the total energy used against reducing the makespan. Variations on this general scenario are possible. For example, deviation from the constraints incurs penalties, and the balance must be achieved by considering such penalties. This corresponds to the DEPO function.


Table II. Related Work for PCEO on Single Processor Platforms

Paper | Algorithmic Approach | Performance Definition | Task Model | Power Control Strategy | Components
[Tomiyama et al. 1998] | Heuristics | Instruction cache misses | Dependent tasks, represented by DAG | Instruction scheduling | Cache
[Shin and Choi 1999] | Heuristic | Deadline | Independent, periodic tasks | DVFS | CPU
[Lu et al. 2000] | Greedy heuristic | Timing constraint | Independent tasks | DPM | CPU, disk, network
[Zhang et al. 2002] | LP (algorithm) | Deadline, precedence constraints | Dependent | Discrete and continuous DVS | CPU
[Aydin et al. 2004] | Greedy heuristic | Deadline | Independent, periodic tasks | Discrete DVFS levels | CPU
[Gniady et al. 2004] | Predictor | Mis-predictions | | DPM | I/O devices
[Zhu and Mueller 2005] | Heuristic | Deadline | Independent, periodic, fully preemptive | DVFS (discrete, feedback) | CPU
[Zhuo and Chakrabarti 2005] | Greedy heuristic | Time complexity | Periodic | DVFS (continuous) | CPU
[Zhang and Chatha 2007] | Approximate (FPTAS) | Deadline, quality bound | Periodic, independent | DVFS | CPU

Additional scenarios are possible, but here we focus on the preceding three since they fundamentally capture the various energy and performance trade-offs. A specific scenario may be applied by the energy manager based on the types of applications and users. For example, some applications may have very strict performance requirements and thus need a high quality of service. To meet both the performance and energy requirements, the manager may select different scenarios over a period of time to satisfy the demands of her users.

4.2.1. Performance-Constrained Energy Optimization (PCEO) Algorithms. PCEO algorithms are scheduling algorithms that minimize energy consumption under performance constraints. The performance metric varies across scenarios and can be expressed in terms of execution time, response time, QoS, SLA, and so on. Since in popular power models power can be modeled as a polynomial in the reciprocal of a performance metric (for example, the dynamic power model), a small sacrifice in performance can yield significant power savings. Thus PCEO problems attract much attention in energy-conscious computing research.

—PCEO on Single Processor Platforms. First we discuss the related work on performance-constrained energy optimization schemes for single-processor systems (Table II).

—Compiler-Optimized Energy Savings. To reduce overall system power consumption, many high-performance embedded processors leverage on-chip caches, because driving off-chip caches is power intensive and on-chip caches reduce data transfers among chips on the system board. Even with on-chip caches, however, the power requirement of the caches remains very high, about 70% of total chip power [Ahmad et al. 1998; Tomiyama et al. 1998]. A compiler optimization technique proposed in Tomiyama et al. [1998] aims to reduce the energy consumed per instruction-cache miss. The power consumed by on-chip drivers is minimized by reducing data transfers between the on-chip cache and main memory. Although many compiler optimizations to improve cache performance have been proposed, such as prefetching of instructions and data, loop transformations for data caches, and code placement for instruction caches [Abdelzaher and Lu 2001], few strides have been made toward reducing the average energy consumption per cache miss. The technique is very effective for embedded system design since it requires neither additional hardware nor a loss in system performance. Furthermore, most compiler optimizations targeting cache-miss reduction and the technique proposed in this article are not exclusive but complementary. Experimental results show that the proposed scheduling algorithm reduces transitions between on-chip cache and memory by up to 28% without incurring excessive overhead.

—Energy Savings through Task Prioritizing. Shin and Choi [1999] take a simple but effective approach toward modifying fixed-priority scheduling, commonly used in real-time system schedulers, to achieve energy savings while still meeting hard real-time deadlines. They consider static scheduling problems in embedded real-time systems for periodic tasks on a single core making use of dynamic voltage scaling. The authors first identify the key characteristics of periodic real-time systems that can be exploited for energy savings: the variation in execution time between the worst and best cases, and idle time intervals. In addition to exploiting execution time variations and idle intervals, dynamic processor speed variation is applied for maximum benefit. They state that their power-conscious scheduling policies can be applied with only slight changes to a conventional fixed-priority scheduler implemented in a real-time system kernel.

A fixed-priority scheduler is described as using two queues: a run queue and a delay queue. The run queue contains tasks waiting to be executed, ordered by priority; the delay queue contains tasks already executed in the current period but waiting to run in the next period, ordered by release time. When the scheduler is invoked, it checks whether a task should be promoted from the delay queue to the run queue. If a task or tasks are moved into the run queue, their priorities are compared with the active task to check whether the active task should be swapped with a higher-priority task in the run queue. The Low-Power Fixed Priority Scheduler (LPFPS) adds two cases to the scheduler. When the run queue is empty and there is no active task, the scheduler brings the processor to an idle state. When the run queue is empty but there is an active task, the processor's speed is scaled down to a level just sufficient to complete the task in time for the next task's arrival; a heuristic computes the appropriate speed ratio.
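The two extra cases LPFPS adds can be sketched as a single decision step. This is our own toy rendering of the description above (the data layout and the slowdown rule are illustrative assumptions, not the paper's implementation):

```python
# One scheduler decision in the spirit of LPFPS: idle the processor when
# nothing is runnable; stretch a lone active task to just meet the next
# release; otherwise run at full speed.

def lpfps_step(run_queue, next_release, now, f_max=1.0):
    """run_queue: list of (priority, remaining_full_speed_time) entries.
    next_release: arrival time of the next periodic task instance.
    Returns ('idle', None) or ('run', speed_ratio)."""
    if not run_queue:
        return ('idle', None)            # case 1: power down until next release
    if len(run_queue) == 1:
        _, remaining = run_queue[0]
        window = next_release - now      # time available before the next arrival
        if window <= 0:
            return ('run', f_max)
        # case 2: slow the sole task so it finishes exactly at next_release
        return ('run', round(min(f_max, remaining / window), 3))
    return ('run', f_max)                # contention: run at full speed
```

For instance, a lone task with 2 time units of work and a 4-unit window before the next release runs at half speed.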

—Energy Optimization across Multiple Devices. Lu et al. [2000] present a multidevice, dynamic, performance-constrained energy optimization approach for non-real-time systems with independent tasks. This work focuses on ordering task execution through the operating system scheduler to adjust the idle-time lengths of the multiple devices that a process may use during execution. Adjusting idle-time lengths gives the scheduler better opportunities for power management by minimizing state changes and by simultaneously running processes that depend on the same devices. The authors prove that long, clustered idle times give the best results. They present a greedy scheduling algorithm that focuses on the power management of multiple devices, built around the concept of Required Device Sets (RDS): the set of devices in a system that a task needs to perform its function. It is assumed that the scheduler can predict which devices a task will use. The scheduling algorithm attempts to order tasks based on their RDS and runs in O(n log n) time. A scheduling simulator is used to analyze the performance of the proposed algorithm under a Linux-based system, using different workloads with various levels of RDS prediction accuracy and timing constraints. Energy savings are observed in all cases, and the algorithm performs better with higher RDS prediction accuracy and looser timing constraints. Experimentation shows that the authors were able to achieve power savings of up to 33% over base scheduling.
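The core intuition, clustering tasks with the same RDS so each unused device gets one long idle interval instead of many short ones, can be illustrated with a simple sort. This is our own toy rendering of the idea, not the paper's greedy algorithm:

```python
# Toy RDS-based ordering: sorting tasks by a canonical key of their
# required-device set places tasks with identical RDS back-to-back, so a
# device not in that set can sleep through one long gap.

def order_by_rds(tasks):
    """tasks: list of (name, frozenset_of_required_devices)."""
    return sorted(tasks, key=lambda t: tuple(sorted(t[1])))

tasks = [("a", frozenset({"disk"})),
         ("b", frozenset({"net"})),
         ("c", frozenset({"disk"}))]
ordered = [name for name, _ in order_by_rds(tasks)]  # disk users now adjacent
```

With this ordering the network interface idles through both disk-bound tasks in a single interval, rather than waking between them.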

—Joint DVS and Task Assignment. Zhang et al. [2002] present a two-phase framework to minimize the energy consumption of real-time tasks. The proposed approach combines task assignment, time scheduling, and voltage selection to allocate a given set of dependent real-time tasks to DVFS-enabled processors. To generate a schedule with the best slowdown opportunity (resulting in minimum performance degradation), they first apply Earliest-Deadline-First (EDF) scheduling, ordering the tasks by priority and using a best-fit strategy to allocate tasks to multiple processors. Then, the voltage selection problem for the Directed Acyclic Graph (DAG) is formulated as an Integer Programming (IP) problem; the optimal solution is obtained by solving the IP formulation exactly. By limiting the range of discrete voltage selections, the problem can be solved approximately by a heuristic in polynomial time. The proposed approximation for the problem under discrete voltage settings produces results that deviate by at most 3% from the optimal solution.
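The first phase, deadline ordering plus best-fit placement, can be sketched as follows. This is an illustrative stand-in under our own assumptions (a uniform capacity bound and a "most-loaded processor that still fits" best-fit rule), not the paper's exact formulation:

```python
# Phase-one sketch: order tasks EDF-style by deadline, then place each on
# the best-fitting processor (here: the most-loaded one that can still
# accept the task, leaving other processors free for later slowdown).

def edf_best_fit(tasks, n_procs, capacity):
    """tasks: list of (name, exec_time, deadline). Returns a mapping
    processor -> list of task names, or None if some task cannot fit."""
    load = [0.0] * n_procs
    placement = {p: [] for p in range(n_procs)}
    for name, exec_time, _ in sorted(tasks, key=lambda t: t[2]):  # EDF order
        candidates = [p for p in range(n_procs)
                      if load[p] + exec_time <= capacity]
        if not candidates:
            return None                       # infeasible under this capacity
        p = max(candidates, key=lambda q: load[q])  # best (tightest) fit
        load[p] += exec_time
        placement[p].append(name)
    return placement
```

The second phase would then assign voltages to the resulting schedule, e.g. via the IP formulation described above.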

—Static and Dynamic Voltage Assignment for Single Processor. Aydin et al. [2004] present their solutions to the problem of scheduling real-time, periodic tasks while taking into consideration power consumption and the task completion deadlines inherent to real-time systems. Their approach targets single-core DVS-capable architectures using an intertask scheduling technique, which performs processor speed assignments at task level, at task dispatch and completion time. The authors use a three-level approach: a static level, a reclaiming level, and a speculation level; algorithms for each level and their interactions are given. First, the static level computes optimal speeds for tasks, assuming a worst-case workload for each task arrival. An instance of this problem was shown by Aydin in previous work to be equivalent to an instance of the Reward-Based Scheduling (RBS) problem with concave reward functions [Aydin et al. 2001], so solutions for RBS can be reused. However, the authors state that an RBS solution alone would be too conservative as a complete solution to the real-time DVS scheduling problem, owing to the amount of variation in actual workloads. The reclaiming level is an online mechanism that improves on the static schedule by using the actual workload to reclaim energy. The authors describe their reclaiming algorithm as a "generic dynamic reclaiming algorithm" (GDRA): a greedy algorithm that attempts to allocate the largest possible amount of slack time to the first task satisfying a proper priority requirement. Task deadlines are maintained by examining the execution times of the remaining tasks in the static schedule. Finally, the speculation level is an online speculative mechanism for speed adjustment that uses the average-case workload to predict earlier finishing times for upcoming executions. This online, adaptive, and speculative speed-adjustment mechanism considers both how aggressively speculative speed reductions should be applied and how timing constraints can be guaranteed in aggressive modes. The algorithm is unique in that it reduces the speed of dispatched tasks exclusively, borrowing time from other available tasks, and its level of aggressiveness can be adjusted. The results show that the best solution is reached by tempering the aggressiveness of speed reductions against the expected workloads. Experimental results show that applying the dynamic reclaiming algorithm to the static optimal algorithm provides up to 50% energy savings over the static optimal algorithm for workloads that deviate significantly from their worst-case requirements. Adding aggressive techniques through the proposed speculative adjustment algorithm improves the dynamic reclaiming algorithm's results by a further 10-15%. The authors' AGR2 speculative algorithm comes within a 10% margin of the lower bound set by an optimal, theoretical "clairvoyant" algorithm for DVS scheduling.

—Reducing Energy in I/O Devices. In Gniady et al. [2004], the authors propose a technique to predict I/O utilization using program counters and to reduce energy dissipation by turning off idle devices. In the literature, this technique focuses only on disk devices; however, the idea can be extended to other devices, such as displays or Wi-Fi. Motivated by the branch-predictor behavior of CPUs, the technique, named PCAP (Program Counter Access Predictor), analyzes past activities of an I/O device and recognizes behavior patterns, thereby predicting its utilization in the near future. If the device is predicted to be idle, the predictor can decide to turn it off to reduce power consumption. A key strength of PCAP is the small energy penalty when a misprediction happens. Furthermore, the pattern recognition process not only analyzes the device's past behavior statistics but also takes into account the type of user and application.
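Like a branch predictor, a PC-based access predictor keys its table on the code path that issued the I/O. The minimal version below is our own illustration of that idea (table layout and training rule are assumptions, not PCAP's actual design):

```python
# Toy program-counter-indexed idle predictor: the path of call sites that
# led to an I/O request is used as a table key that remembers whether a
# long idle period followed last time.

class PCPredictor:
    def __init__(self):
        self.table = {}  # pc-path signature -> last observed outcome

    def predict_idle(self, pc_path):
        """True if the device is expected to go idle after an I/O issued
        from this call-site path (default: keep the device on)."""
        return self.table.get(tuple(pc_path), False)

    def train(self, pc_path, was_idle):
        """Record the actual outcome so the next occurrence of the same
        path predicts correctly."""
        self.table[tuple(pc_path)] = was_idle

p = PCPredictor()
p.train([0x4a, 0x91], True)   # this path was followed by a long idle gap
```

Defaulting to "not idle" on unseen paths keeps the misprediction penalty small: the device is only powered down for paths with observed idle history.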

—Hybrid Static and Dynamic Slack Allocation. Zhuo and Chakrabarti [2005] present a hybrid static and dynamic slack allocation approach combining feedback control with DVS schemes. They target battery-aware task scheduling, based on the assumption that all task information, such as deadlines and execution times, is given in advance, which is common in embedded environments. They propose a novel EDF algorithm for the dynamic scheduling of periodic tasks. Their algorithm adopts a dynamic version of the average-rate heuristic and delays slack incorporation, achieving better performance. Experiments show better performance than the near-optimal result in Albonesi [2002], at lower time complexity.

—Feedback-Control-Based Scheduling. Zhu and Mueller [2005] propose a PID (Proportional-Integral-Derivative) controller-based approach to modify the EDF scheduling scheme. They target hard real-time systems with dynamic workloads. With operating system support, they improve the Earliest Deadline First (EDF) scheduling algorithm by incorporating DVFS and feedback techniques. Each task is executed in two stages. In the first stage, the voltage/frequency is scaled to track the task's average execution time; the feedback mechanism helps decide the appropriate voltage/frequency for the task. In the second stage, the scaling depends on the execution status of the first stage so as to meet the deadline. Their experimental results show that up to 29% energy conservation can be achieved by the two-stage algorithm over the simple DVFS algorithm.
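A back-of-the-envelope version of the stage-one feedback loop is shown below. The gains, normalization, and speed bounds are our own illustrative assumptions, not values from the paper:

```python
# PID-style speed controller in the spirit of the two-stage scheme: nudge
# the speed so that the observed execution time tracks the target
# (average) execution time; a positive error means the task ran long,
# so the speed is raised.

class SpeedController:
    def __init__(self, kp=0.5, ki=0.1, kd=0.05):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = 0.0

    def next_speed(self, current_speed, observed_time, target_time):
        error = (observed_time - target_time) / target_time  # > 0: too slow
        self.integral += error
        derivative = error - self.prev_error
        self.prev_error = error
        adjust = (self.kp * error + self.ki * self.integral
                  + self.kd * derivative)
        return min(1.0, max(0.1, current_speed + adjust))  # clamp to DVFS range

ctl = SpeedController()
s = ctl.next_speed(0.5, observed_time=12.0, target_time=10.0)  # ran long
```

Stage two would then pick whatever clamped speed still meets the absolute deadline, independently of this controller.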

—EDF and Rate-Monotonic Scheduling. Zhang and Chatha [2007] investigate the energy efficiency obtained from the DVFS technique under the RM (Rate-Monotonic) and EDF (Earliest Deadline First) scheduling schemes in embedded systems. These two schemes are the most frequently used scheduling algorithms on single-processor systems, especially for periodic tasks. RM assigns static priorities by rate, giving the task with the shortest period the highest priority, while EDF dynamically maintains a priority queue of tasks ordered by their deadlines. The problem is formulated as choosing the appropriate CPU speed (voltage/frequency) at which to execute each task so that the total energy dissipation is minimized without violating the rules defined by RM or EDF. This problem is proved to be NP-complete. Several works give similar approaches: pseudo-polynomial time in Zhang et al. [2002], and fully polynomial time in Zhang and Chatha [2007] and Chen and Mishra [2009]. The approximation scheme in Zhang and Chatha [2007] has the lowest complexity among the three.
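The two priority rules contrasted above are standard textbook definitions and can be stated in two one-liners (our illustration, not code from the paper):

```python
# RM: static priorities by rate (shorter period = higher priority).
# EDF: dynamic choice of the job with the nearest absolute deadline.

def rm_order(tasks):
    """tasks: list of (name, period). Returns names in RM priority order."""
    return [name for name, _ in sorted(tasks, key=lambda t: t[1])]

def edf_next(ready):
    """ready: list of (name, absolute_deadline). Pick the job due soonest."""
    return min(ready, key=lambda job: job[1])[0]
```

RM's priorities never change at runtime, which is why it pairs naturally with static (offline) speed assignment, whereas EDF's re-ordering makes dynamic speed selection the natural fit.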

—PCEO on Multicore and Distributed Platforms. Table III presents selected research for the Performance-Constrained Energy Optimization (PCEO) scenario. The performance constraint can include response time, throughput, task acceptance rate, or any other quantitative representation of the output of the computational system under consideration.

Table III. Related Work for PCEO on Multicore and Distributed Platforms

Paper | Algorithmic Approach | Performance Definition | Task Model | Power Control Strategy | Platform | Components
[Schmitz and Al-Hashimi 2001] | Heuristic | Overhead | Dependent | Discrete DVFS | Multicore | CPU
[Elnozahy et al. 2002] | Algorithm | SLA (response time) | Independent | DVFS, switch ON/OFF | Cluster, homogeneous | CPU
[Yu and Prasanna 2002] | LR heuristic | Response time | Independent, periodic | Discrete DVS | Multicore | CPU
[Ge et al. 2005] | Heuristic, profile based | Execution time | Independent | DVS | Cluster, homogeneous | CPU
[Mishra et al. 2003] | Greedy, heuristic | Deadline, precedence constraints | Dependent | Continuous DVS | Multicore | CPU
[Zhu et al. 2003] | Heuristic | Deadline, overhead | Independent/dependent, periodic | DVS | Multicore | CPU
[Stout 2006] | Algorithm | Execution time | Dependent | DPM | Multicore | CPU
[Kang and Ranka 2008b] | Heuristic, optimal (using LP) | Deadline | Independent & dependent | DVS | Multicore | CPU
[Nathuji 2008] | Heuristic | Execution time and QoS | Independent | VM consolidation | Cluster | Server
[Qi and Zhu 2008] | Heuristic | Deadline | Real-time | DVFS, block partitions | CMP (multicore) | CPU
[Seo et al. 2008] | Heuristic | Deadline | Independent | DVFS | Multicore | CPU
[Srikantaiah et al. 2008] | Bin-packing heuristic | Throughput | Independent | Switch ON/OFF | Data center, homogeneous | CPU
[Ghasemazar et al. 2010] | Heuristic | Throughput | Independent | DVFS, core consolidation | CMP (multicore) | CPU
[Petrucci et al. 2010] | Heuristic for MIP | Execution time | Independent | VM consolidation | Cluster | Server

—Genetic Algorithm for Energy Reduction. Schmitz and Al-Hashimi [2001] proposed an energy-optimizing heuristic algorithm with performance constraints for the offline scheduling of dependent tasks on distributed embedded systems with multiple processing elements. The authors claim that their approach was unique at the time of writing in that it considers the power profiles and variations of DVS processing elements (referred to in the article as DVS-PEs). Voltage selection is performed for each task based on the power dissipation it causes. The algorithm accepts as inputs a task graph, the mapping of tasks onto processing elements, the schedule of tasks and communications, execution times, power dissipations, and the minimal schedule extension. The scheduling and mapping techniques are based on genetic algorithms similar to other cited approaches. The runtime of the algorithm is stated to be polynomial, and it successfully identifies refined voltage selections based on power dissipation at the task level. The authors implemented their algorithm on a Pentium-III-based Linux system and performed experiments with two and four processing elements. Their experiments showed that there is an advantage to taking the power variations of processing elements into consideration, and that their algorithm performs voltage selection successfully compared to DVS algorithms that do not take variations into account. A significant energy consumption reduction of up to 80.7% is observed in design-space exploration when a genetic algorithm adopts the proposed heuristic.

—Multiple Power Management Policies. In one of the investigations on the power efficiency of clustering, Elnozahy et al. [2002] proposed a power-aware resource management method to reduce the overhead of unnecessary operating cost in a pure Web-service environment (Web servers). The performance model is based on SLAs (Service-Level Agreements), measured in this model by response time and by the degree of load balancing for a homogeneous cluster. The power model is based on DVFS on the CPU and on switching physical nodes on and off (VOVO). Five power adaptation policies are applied in the resource management strategy: VOVO (Vary On and Vary Off), IVS (Independent Voltage Scaling), CVS (Coordinated Voltage Scaling), VOVO-IVS (combined policy), and VOVO-CVS (coordinated combined policy), among which VOVO-CVS is stated to be the most advanced. VOVO-CVS dynamically decides the CPU frequency thresholds and then determines the schedule for turning nodes on or off. The overall CPU frequency can then be estimated to compute the expected response time. Compared against the performance constraint, the optimal number of nodes and their CPU frequencies can be derived. These policies were tested in a single-application cluster environment. The results show that while the energy savings achieved by VOVO depend on workload intensity, VOVO-CVS outperforms all the tested approaches in terms of energy savings. The energy savings obtained through VOVO-CVS are shown to be up to 18.2% better than VOVO.
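The coordinated policy can be illustrated with a toy sizing computation: given an aggregate request rate, pick the fewest active nodes and the lowest common frequency whose capacity still covers demand. The capacity model and constants below are illustrative assumptions, not Elnozahy et al.'s actual controller.

```python
# Toy VOVO-CVS-style sizing: minimize active nodes first (VOVO),
# then the shared CPU frequency (CVS), subject to meeting demand.
# The linear capacity model is an illustrative simplification.

FREQS = [0.6, 0.8, 1.0]          # relative frequency levels
CAP_PER_NODE = 100.0             # requests/s per node at full speed

def size_cluster(demand, max_nodes):
    """Returns (nodes, freq) covering `demand` requests/s with the
    fewest nodes and lowest frequency, or None if infeasible."""
    for nodes in range(1, max_nodes + 1):
        for f in FREQS:
            if nodes * f * CAP_PER_NODE >= demand:
                return nodes, f
    return None

plan = size_cluster(demand=150.0, max_nodes=4)
```

For a demand of 150 requests/s, one node cannot keep up even at full speed, so the sketch settles on two nodes at the reduced 0.8 frequency rather than two nodes at full speed.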

—Integer Linear Programming. In Yu and Prasanna [2002], the authors study the problem of allocating independent periodic tasks in a heterogeneous real-time computing environment. DVFS is supported by each computing component in the system. In general, this allocation problem is NP-complete. First, an ILP (Integer Linear Programming) formulation for the allocation problem is presented [Aydin et al. 2004; Yu and Prasanna 2002]. An extended linearization heuristic (LR-Heuristic) [Aydin et al. 2004; Yu and Prasanna 2002] is then used for solving the problem. The authors then analyze the applicability of the heuristic, which works only for a limited number of tasks. The experimental results show that the greedy approach can be as much as 90% off from the optimal, while the maximum deviation of LR-Heuristic is only 15% for small problems. However, for large problems, LR-Heuristic can be at most 40% better than the greedy approach.
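The shape of the allocation problem can be seen in a tiny brute-force stand-in for the ILP: each periodic task is assigned a processor and frequency level, energy is minimized, and each processor's utilization must stay at or below 1. The instance data and the f-squared energy model are illustrative assumptions, and exhaustive search is workable only at this toy scale, which mirrors the paper's point that exact methods do not scale.

```python
from itertools import product

# Toy instance: (cycles at full speed, period) per task, and the
# relative frequency levels available on each processor.
tasks = [(2.0, 10.0), (3.0, 10.0)]
procs = [[1.0, 0.6], [1.0, 0.5]]

def best_allocation():
    """Exhaustively searches (processor, frequency) assignments,
    keeping utilization per processor <= 1, minimizing energy."""
    best = (float("inf"), None)
    options = [(p, f) for p, levels in enumerate(procs) for f in levels]
    for assign in product(options, repeat=len(tasks)):
        util = [0.0] * len(procs)
        energy = 0.0
        for (c, period), (p, f) in zip(tasks, assign):
            util[p] += (c / f) / period   # execution time / period
            energy += c * f ** 2          # E ~ cycles * f^2 (V ~ f)
        if all(u <= 1.0 for u in util) and energy < best[0]:
            best = (energy, assign)
    return best

energy, assign = best_allocation()
```

Here the optimum packs both tasks onto the processor offering the 0.5 frequency level, filling it to exactly full utilization; an ILP solver reaches the same answer without enumerating all assignments.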

—Distributed DVS Scheduling. In Ge et al. [2005], the authors propose a performance-centralized DVFS scaling mechanism for power-aware distributed high-performance computing clusters. Within the cluster, each member has multiple power-performance modes determined by scaling techniques such as DVFS. Performance (measured by execution time) and energy usage can be derived from the time spent in each mode and the time spent on transitions between modes. They further investigate distributed DVFS techniques for task scheduling in power-aware cluster systems. Both system-driven DVFS (driven by CPU speed) and user-driven DVFS (driven by command-line settings) are transparent to the applications. Moreover, DVFS can also be driven by source-code instructions with precomputed performance profiles. The presented results show that considerable energy savings can be achieved via distributed DVFS scheduling (a maximum of 36% savings) without degrading performance. However, the amount of energy conservation can differ across combinations of workloads, application types, and system configurations. The


32:18 H. F. Sheikh et al.

results also corroborate the effectiveness of external DVFS through the observation that, in most cases, costly internal scheduling does not obtain significantly greater energy savings than a straightforward external scheduling method such as user-driven DVFS.

—Combined Static and Dynamic Slack Allocation. In Mishra et al. [2003], the authors investigate the slack allocation problem within the scope of power management in distributed real-time systems. They assume a communication-intensive task model and allow dependency relationships between tasks. They propose a two-step slack allocation policy on an existing scheduling queue generated by a scheduling process [Selvakumar and Siva Ram Murthy 1994]. The first step, named P-SPM (static power management for parallelism), performs static assignment of the slack of the whole queue to parts of the scheduling queue (the queue is divided into parts by different parallelism levels). The second step dynamically allocates slack along the queue via a greedy selection strategy: always choose the first ready-to-run task to receive slack. Note that this greedy "gap filling" strategy may change the initial scheduling order. The experimental results show that the static slack assignment algorithm can achieve 10% mean energy conservation compared to an earlier slack assignment method that simply assigns slack proportional to task length. Further, the combination of the proposed static and dynamic algorithms can obtain additional energy conservation.
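The dynamic "gap filling" step can be sketched as follows: when slack accumulates, the first ready-to-run task in the queue absorbs it by running at a proportionally lower speed. The task fields and the linear stretch are illustrative assumptions, not Mishra et al.'s exact mechanics.

```python
# Sketch of greedy gap filling: hand accumulated slack to the
# first ready-to-run task, stretching its execution (i.e., lowering
# its speed) to absorb the slack. Illustrative, simplified model.

def allocate_slack(queue, slack):
    """queue: list of dicts with 'wcet' (worst-case execution time
    at full speed) and a 'ready' flag; stretches the first ready
    task and returns it, or None if no task is ready."""
    for task in queue:
        if task["ready"]:
            new_time = task["wcet"] + slack
            task["speed"] = task["wcet"] / new_time   # reduced speed
            task["time"] = new_time
            return task
    return None

queue = [{"wcet": 4.0, "ready": False},
         {"wcet": 2.0, "ready": True}]
task = allocate_slack(queue, slack=2.0)
```

Note that because the first *ready* task receives the slack even if it is not first in the queue, the resulting schedule order can differ from the initial one, as the survey points out.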

—Slack Sharing and Reclamation. In Zhu et al. [2003], the authors propose two energy-aware slack reclamation algorithms for task scheduling in multicore systems, which support both independent and dependent task models. The main idea of their slack reclamation technique is to reallocate the unused time of tasks that complete early to other running tasks so that the CPU speed can be lowered, achieving overall energy conservation. They first extend the greedy algorithm of Azevedo et al. [2001] to the scope of global scheduling defined in Albonesi [2002] and show that it may violate the deadline. Instead, they introduce shared slack reclamation, a new global slack reclamation solution. However, shared slack reclamation is not perfect either: when the algorithm is applied to list scheduling, the deadline constraint may still be violated. They therefore propose a modified solution, FLSSR, that finishes execution of tasks before the deadline. They also take into account the overhead of transitions between voltage or frequency levels because, in practice, DVFS scaling is discrete rather than continuous. The experiments show that their algorithm obtains almost the same energy conservation with discrete DVFS as with continuous DVFS.
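The core reclamation idea can be reduced to one line: when a task finishes early, the speed of the remaining workload is recomputed so its worst-case work exactly fills the time left before the deadline. This is a simplified stand-in for the shared/FLSSR algorithms, with illustrative names.

```python
# Minimal sketch of slack reclamation: stretch the remaining
# worst-case work over the time left before the deadline, never
# exceeding full speed. A simplification of the paper's schemes.

def reclaimed_speed(remaining_wcet, now, deadline):
    """Relative speed (1.0 = full) for the remaining workload."""
    return min(1.0, remaining_wcet / (deadline - now))

# Two tasks of 4 time units (worst case) each, deadline 10.
# The first finishes at t=2 instead of t=4, so its 2 units of
# slack let the second task run at half speed.
speed = reclaimed_speed(remaining_wcet=4.0, now=2.0, deadline=10.0)
```

In a real discrete-DVFS system this continuous speed would then be rounded up to the nearest available frequency level, which is exactly where the transition-overhead accounting in Zhu et al. [2003] enters.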

—Peak Power Minimization. In parallel computing systems, energy consumption may not be the primary concern, because energy is supplied by an external source and does not normally diminish over time. What matters is whether the required peak power will readily be available from that source. Stout [2006] investigates the problem of peak power minimization from the perspective of parallel algorithms running on a grid composed of many small and simple processors, each connected to its neighbors. Such a structure is common in sensor networks, cellular automata, and some supercomputer systems. Standard mesh-based parallel algorithms assume that all processors work simultaneously, but this assumption is unrealistic. Based on this observation, Stout [2006] designs near-optimal algorithms that reduce peak power for some basic problems, including labeling elements in an image and calculating the distance between them, calculating the minimum spanning tree of a graph, and determining whether a graph is biconnected. To track which processors are running simultaneously at a given time instant, the article introduces the "squirrel" model. A squirrel represents an active processor. Squirrels carry information in a limited number of words; they can track their



location and leave information in a location. The communication paths among processors can be obtained by tracking the squirrels' movement. The problem is thus converted into minimizing the number of squirrels at any given time.

—Static and Dynamic Slack Assignment. Kang and Ranka [2008b] explored novel algorithms for scheduling DAGs on DVS-enabled parallel and distributed machines in both static and dynamic environments for minimizing energy. The proposed schemes can be used for both homogeneous and heterogeneous parallel machines. The scheduling of DAG-based applications with the goal of DVS-based energy minimization broadly consists of two steps: assignment and slack allocation. They present algorithms for static slack allocation, dynamic slack allocation, static assignment, and dynamic assignment. Kang and Ranka [2008b] describe an LP-based approach for slack allocation, with the goal of minimizing total energy consumption under deadline constraints. They extend this version by considering data transfer time in the form of precedence relationships among tasks. Their scheme improves the running time and memory requirements of the LP-based approach by combining compatible-task-matrix and search-space-reduction techniques. The dynamic slack allocation algorithm builds on a k-descendant look-ahead approach. However, for readjusting a schedule it considers only the tasks directly influenced by the early or late finish time of a given task, rather than all tasks within a certain time range. This approach outperforms some of the previous approaches [Felter et al. 2005; Li 2008] in both time and memory, in both the overestimation case (i.e., the task finishes earlier than its estimated time) and the underestimation case (i.e., the task finishes later than its estimated time).

—Power Management with Virtual Machine Consolidation. Nathuji et al.
[2008] investigate the possibility of adopting virtualization techniques to reduce power consumption in distributed environments. The authors observe that, although switching between the "hard power states" of machine hardware can scale down the supply voltage and thereby reduce power consumption, the range between two neighboring hard power states can be further subdivided via software techniques. These finer-grained states can clearly improve power efficiency, and the authors find that they can be realized by utilizing virtualization techniques supported by distributed operating systems. To evaluate the power efficiency of VMs, the authors design a software framework integrated with the Xen hypervisor to coordinate VM scheduling on clustered servers. For VMs running on machines with a power-aware operating system, the authors define "local policies" that schedule VM allocation locally to optimize power efficiency under execution-time or QoS constraints. By extending power-aware task scheduling strategies for nonvirtualized machines, the local policies can be made explicit and practical. However, for operating systems without support for hard power states, the framework must provide functionality to remotely manage VM activities, such as migration or idling, to meet power requirements. These "global policies" are attractive but not clearly illustrated.

—Block Partitioning of Multicore Systems with DVFS. The work in Qi and Zhu [2008] explores core consolidation at the hardware level, where it is assumed that the cores of the system are grouped/partitioned into several blocks. Each block can then be controlled by a separate power supply and thus can change its frequency/voltage independently of other blocks. The main motivation is to avoid the excessive complexity of per-core control as well as the excessive power consumption of common frequency scaling. Similar to Ghasemazar et al. [2010], a two-phase approach was used.
In the first phase, static slack was used to minimize the number of blocks required to complete the task. Various configurations of symmetric and asymmetric partitions were evaluated in terms of their energy efficiency to find the optimal partitioning approach. As expected, having a small number of cores per block significantly improves energy consumption.



However, the difference in energy between 1 and 4 cores per block was not significant; therefore, a certain small number of cores per block is enough to guarantee good energy efficiency. Dynamic slack was distributed at runtime among all the active cores by reducing their voltage/frequency settings, thereby improving energy consumption. Synthetic workloads composed of real-time tasks were used to evaluate the efficiency of the proposed approach. A baseline approach in which all cores run at maximum frequency and are switched off when idle was used for comparison.

—Dynamic Partition and Core Scaling. Seo et al. [2008] present two algorithms, dynamic repartitioning and dynamic core scaling, as solutions to efficiently reduce clock frequencies and to reduce leakage power by managing active cores, respectively. Dynamic repartitioning for multicore processors is described by the authors as producing partitioned task sets that yield the most balanced utilization of the cores. Because demand on the cores changes at runtime, some tasks must be migrated on-the-fly in order to maintain balanced performance demand and consistently low power consumption. To implement this, it is first determined whether a task can safely be migrated from one core to another, using an analysis described by the authors. One of the safe migration conditions is ensuring that a repartitioning will not violate any deadlines of the original schedule. If the conditions are met, the dynamic repartitioning algorithm migrates the task with the lowest required utilization from the core with the maximum load to the least loaded core. Migration continues until the difference in load between the most and least utilized cores is less than the remaining dynamic utilization of the task currently running on the most utilized core. The task to be migrated is the currently scheduled task on the most utilized core. The dynamic core scaling algorithm decides the optimal number of active cores on-the-fly by taking advantage of the ability of most commercial multicore processors to transition a core to a given ACPI processor state independently. Based on these expectation functions, the power-optimal number of active cores can differ from the number of cores appropriate for a system's parameters. This problem is NP-hard, which gives rise to the need for their heuristic algorithm for finding a near-optimal solution.
Their algorithm is run at the start of a new task period and is rerun with the updated dynamic utilization after a task completes. If the optimal number of cores is less than the currently active number, the core with the lowest dynamic utilization has its tasks migrated prior to deactivation. That core is chosen because it is likely to have the fewest tasks to migrate, and migrating those tasks will have the least impact on other cores. Conversely, a core may be activated if fewer cores are active than the optimal number following an increase in dynamic utilization. If activation is needed, dynamic repartitioning is performed to migrate tasks from the core with the highest load to the newly activated core.
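One step of the repartitioning idea can be sketched as follows: move the lowest-utilization task from the most-loaded core to the least-loaded one, provided the destination remains schedulable. The EDF utilization bound of 1.0 stands in for the paper's more detailed safety condition, and the data layout is an illustrative assumption.

```python
# Sketch of one dynamic-repartitioning step. cores is a list of
# per-core task lists, each task given by its utilization.

def repartition_step(cores):
    """Moves the lowest-utilization task from the most-loaded core
    to the least-loaded core if the move keeps the destination's
    utilization <= 1.0. Returns True if a migration happened."""
    load = [sum(c) for c in cores]
    src = max(range(len(cores)), key=load.__getitem__)
    dst = min(range(len(cores)), key=load.__getitem__)
    if src == dst or not cores[src]:
        return False
    task = min(cores[src])              # lowest-utilization task
    if load[dst] + task > 1.0:          # would risk deadline misses
        return False
    cores[src].remove(task)
    cores[dst].append(task)
    return True

cores = [[0.5, 0.3, 0.1], [0.2]]
moved = repartition_step(cores)
```

Repeating this step until the loads are within the stopping threshold described above yields the balanced partition that lets all cores run at a uniformly low frequency.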

—Resource Utilization Bin Packing. Srikantaiah et al. [2008] propose dynamic consolidation of applications in a heterogeneous data center to minimize energy dissipation while satisfying performance requirements. The authors analyze workload consolidation using a metric of energy per transaction over CPU and storage utilization data. The results show that consolidation status can affect the relationship between energy consumption and resource utilization. They find that energy consumption per transaction follows a U-shape: low utilization leaves a high fraction of servers idle, so the energy-performance metric is high; conversely, high utilization increases scheduling conflicts, context switches, and cache misses, so performance is degraded and execution times are extended under high energy consumption. They observe that the optimal balance between resource utilization and the energy-performance metric resides at around 50% CPU and 70% storage usage. They propose dynamic consolidation via a bin packing algorithm to



achieve optimal resource utilization of the servers. The algorithm abstracts the servers into bins with multiple dimensions, where the dimensions represent the resources of interest (such as CPU, memory, network, and storage). The length along each dimension is computed from measurements of the corresponding resource at its optimal utilization level. The applications are represented as objects, with their estimated resource utilizations represented as proportional lengths along each dimension of the object. The problem is thus abstracted as minimizing the number of bins needed to accommodate all objects, so that total energy dissipation can be minimized by reducing the number of active nodes and turning off the idle ones. To solve this problem, a heuristic is introduced: when a new request (object) arrives, allocate it to the server (bin) that would have the minimum space left. Under this heuristic, an idle server is activated only if all active servers are fully utilized. The authors find that experiments using the heuristic achieve near-optimal energy conservation (5.4% more energy dissipation than the theoretical optimal solution).
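The best-fit heuristic can be sketched directly: servers are bins whose per-resource capacities are the optimal utilization targets (about 50% CPU and 70% storage, per the study), and a request goes to the feasible active server with the least total space left. The capacities and request sizes below are illustrative.

```python
# Sketch of the consolidation heuristic: multi-dimensional
# best-fit bin packing with per-resource capacity targets.

CAP = (0.5, 0.7)   # target CPU and storage utilization levels

def place(servers, req):
    """servers: list of [cpu_used, disk_used]; req: (cpu, disk).
    Returns the index of the chosen server, activating a new one
    only when no active server can hold the request."""
    best, best_left = None, None
    for i, used in enumerate(servers):
        if all(u + r <= c for u, r, c in zip(used, req, CAP)):
            left = sum(c - u - r for u, r, c in zip(used, req, CAP))
            if best is None or left < best_left:
                best, best_left = i, left
    if best is None:                      # activate an idle server
        servers.append([0.0, 0.0])
        best = len(servers) - 1
    for d, r in enumerate(req):
        servers[best][d] += r
    return best

servers = [[0.4, 0.2], [0.1, 0.1]]
idx = place(servers, (0.05, 0.1))
```

The request lands on the fuller server (less space left after placement), which keeps the lightly loaded server drainable; activating a new server happens only when every active one is at its targets.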

—Combination of Core Consolidation and DVFS. The idea of minimizing the energy consumed in a chip multiprocessor by combining core consolidation with DVFS is explored in Ghasemazar et al. [2010]. A mixed integer program is formulated for minimizing energy under a given throughput constraint. A two-phase heuristic that tackles the problem in a hierarchical manner is proposed to solve this optimization problem efficiently. In order to minimize additional static power, the proposed heuristic first aims to determine the optimal number of cores that should be in the "ON" state based on the throughput requirements; this is achieved via a steepest-descent approach. It then adjusts the voltage setting of each core to optimize the energy consumption. The optimal voltage/frequency setting is obtained using a simple PI controller. It is assumed that all cores run at the same frequency at any given time; hence, a single optimal voltage/frequency value is obtained at each decision step. Experiments are conducted by selecting applications uniformly from the SPEC2000 benchmark on a chip multiprocessor with each core similar to the Alpha 21264 processor. Results highlight a 17% improvement in energy savings over a dynamic power management approach that does not use core consolidation and uses an open-loop controller for DVFS.
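The closed-loop DVFS step can be sketched with a textbook PI controller that nudges the shared frequency toward the throughput target. The gains, clamping range, and the toy plant (throughput equal to frequency) are illustrative assumptions, not the paper's tuned controller.

```python
# Sketch of PI-controlled DVFS: each control step moves the shared
# frequency by a proportional and an integral term of the
# throughput error. Gains and the plant model are illustrative.

KP, KI = 0.5, 0.1

def pi_step(freq, target, measured, integral):
    """One control step; returns (new_freq, new_integral)."""
    err = target - measured
    integral += err
    freq = freq + KP * err + KI * integral
    return max(0.1, min(1.0, freq)), integral   # clamp to valid range

freq, integral = 0.5, 0.0
for _ in range(50):
    measured = freq          # toy plant: throughput == frequency
    freq, integral = pi_step(freq, 0.8, measured, integral)
```

With these gains the loop spirals into the 0.8 target after a mild overshoot; the integral term is what removes the steady-state error that a purely proportional (or open-loop) controller would leave.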

—Mixed Integer Programming for Virtualization Environments. In Petrucci et al. [2010], the authors address performance-constrained power optimization in virtualized server clusters. Their solution includes a MIP optimization model and a dynamic configuration strategy. In the static assignment stage, applications are allocated at any time to run on only one VM per physical server; the dynamic optimization mechanism then allows an application to use multiple VMs over distributed servers. The dynamic optimization control periodically selects and enforces the lowest-power configuration (derived from the MIP model) that maintains the required performance under a variable workload of multiple applications on the cluster. The whole system can thus stay near the optimal operating point under the control loop. The authors also present a framework that provides several configuration mechanisms in the form of monitoring system execution, evaluating violations of system requirements, and adapting the configuration.

As the preceding related work shows, dynamic scaling techniques like DVFS are widely used for PCEO problems. Because CMOS dynamic power grows quadratically with supply voltage (which in turn scales with clock rate), a tiny increase in execution time (or a small violation of the SLA) can achieve a substantial reduction in power dissipation. However, the efficiency of DVFS can be significantly affected by the granularity of the voltage and frequency settings of the system. In



Table IV. ECPO on Single Processor Platforms

Paper | Algorithmic Approach | Performance Definition | Task Model | Power Control Strategy | Platform | Components
[Alenawy and Aydin 2005] | Heuristic | (m,k)-firm deadline, Dynamic Failure Ratio (DFR) | Real-time, independent | DVFS | Single processor | CPU
[Devadas et al. 2009] | Heuristic / theoretical analysis | Competitive factor / value metric | Real-time, independent | Online scheduling, DVFS | Single processor | CPU
[Ranvijay et al. 2010] | Heuristic | Acceptance ratio | Real-time, independent | Energy-aware scheduling, DVFS | Single processor | CPU

most scenarios, DVFS can be applied to PCEO problems via continuous convex optimization techniques, but the slack granularity directly determines the optimality of the solution. Another general idea for PCEO problems is resource consolidation, covering CPU utilization, server resources, and virtual machines, which can reduce overall power consumption by minimizing the number of active components in multicore or distributed environments. Consolidation problems can be cast as classical optimization problems such as bin packing, knapsack, and integer programming. Motivated by the heuristics for these problems, many algorithms have been devised to find locally optimal solutions of PCEO problems, which are practical in most cases compared to the high overhead of finding a globally optimal or bounded approximate solution. However, most of these solutions are offline; hence online techniques, as well as relevant statistical and prediction methods, are attracting state-of-the-art research interest in PCEO algorithms.
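The leverage that DVFS derives from the power/speed relationship can be illustrated numerically. Taking dynamic power as roughly cubic in frequency (P proportional to V squared times f, with voltage tracking frequency) and runtime as proportional to 1/f, energy per unit of work scales roughly as f squared; this is an idealized back-of-the-envelope model, not a measured characteristic of any platform surveyed here.

```python
# Back-of-the-envelope DVFS arithmetic: with P ~ f^3 and time
# ~ 1/f, energy for a fixed workload scales as E ~ f^2, so a
# modest slowdown buys a disproportionately large energy cut.

def relative_energy(f):
    """Energy at relative frequency f (1.0 = full speed),
    normalized to the energy at full speed."""
    return f ** 2        # (f^3 power) * (1/f time)

saving = 1.0 - relative_energy(0.8)   # energy saved at 20% slowdown
```

Under this idealization, running 20% slower saves 36% of the energy while stretching runtime by only 25%, which is why small SLA relaxations translate into large power reductions.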

4.2.2. Energy-Constrained Performance Optimization (ECPO) Algorithms. The energy-constrained performance optimization problem is primarily motivated by the proliferation of portable and mobile devices. These devices rely on battery power and face strict limits on power usage. Furthermore, ECPO problems arise in many real-time systems, where the performance metrics are represented as Quality of Service (QoS) or a combination of QoS and execution time. Table IV presents some related work on ECPO for single-processor systems, while Table V outlines the related algorithms for multicore and distributed platforms.

—ECPO on Single Processor Platforms.

—Energy-Constrained Scheduling for Weakly Hard Real-Time Systems. Minimization of the dynamic failure ratio for weakly hard real-time systems under tight energy budgets using frequency selection and slack reclamation is proposed in Alenawy and Aydin [2005]. Under an (m,k)-firm deadline, the timing constraints of m instances out of every k instantiations of a particular task must be met, rather than the deadlines of all tasks. Two static frequency-setting approaches, the greedy and energy-density schemes, are designed to minimize the Dynamic Failure Ratio (DFR) while staying within the available energy budget. DFR is the weighted ratio of the number of failures of a periodic task to the dynamic failures of all tasks during a specified interval of operation. The weight is assigned based on the relative importance of a task within the whole task set. It is first proved that checking an (m,k)-firm deadline for a given periodic real-time task set under an imposed energy constraint is NP-hard. A conservative nominal speed for executing the task set, based on its computational requirement, is then presented. However,



Table V. ECPO on Multicore and Distributed Platforms

Paper | Algorithmic Approach | Performance Definition | Task Model | Platform | Power Control Strategy | Components
[Felter et al. 2005] | Heuristic (workload aware) | Execution time | Independent | Single processor | DPM | CPU, Memory
[Li 2008] | Heuristic (profile based) | Schedule length | Independent | Multiple processors | DVS | CPU
[Ahmad et al. 2009] | Heuristic | Schedule length | Dependent | Multicore | DVS | CPU
[Gandhi et al. 2009] | Queuing theory | Response time, static allocation | Independent | Server farm, heterogeneous | DVFS | CPU
[Lee and Zomaya 2009] | Heuristic | Makespan, static scheduling with dynamic adjustments | Dependent | HCS (heterogeneous computing systems) | DVS | CPU

the conservative nominal speed (also called the utilization-based speed) takes into account neither the energy constraint nor the fact that only the (m,k)-firm deadline has to be satisfied. The greedy scheme improves the nominal speed/frequency statically by considering the processor demand for completing m instances out of every k for each task. For this, the load requirements of only the mandatory instances (m for each task) are taken into consideration. To select the m mandatory instances, a simple algorithm called "deeply-red" is used, which executes the first m instances of each task and skips the next k-m. The greedy scheme with modified nominal speed selection is further improved by a dynamic slack reclamation procedure (called DRA [Aydin et al. 2004]) that computes earliness after tasks complete and adjusts the nominal speed for subsequent tasks. Under a very strict energy constraint, such that it is not possible to meet the (m,k)-firm deadline for all tasks, the greedy scheme does not perform satisfactorily. Another static heuristic, the Energy-Density Scheme (EDS), is proposed to provide better solutions under such conditions. EDS prioritizes tasks by their energy-density value, the ratio of the energy requirement of the mandatory instances to the weighted maximum dynamic failures for each task. This scheme calculates the nominal speed using the load requirements of only the currently selected tasks, rather than all mandatory instances. EDS also allows dynamic slack reclamation, and the extra energy accumulated during execution can be used to accept additional mandatory instances. Both schemes were evaluated with and without slack reclamation on randomly generated periodic real-time tasks with varying parameters, under both worst-case and actual execution scenarios.
Energy-density schemes with the nominal speed calculated from the processor demand of the selected tasks attain smaller or comparable dynamic failure ratios with respect to the greedy scheme and a modified distance-priority-based (DPB) algorithm.
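The "deeply-red" mandatory-instance selection described above has a one-line realization: within every window of k instances, the first m are mandatory and the remaining k-m are skipped.

```python
# Sketch of the "deeply-red" pattern for an (m,k)-firm task:
# execute the first m instances of every window of k, skip the
# rest. These mandatory flags feed the nominal-speed computation.

def deeply_red(m, k, n_instances):
    """Mandatory/optional flags for the first n_instances of a
    task under an (m,k)-firm constraint."""
    return [(i % k) < m for i in range(n_instances)]

pattern = deeply_red(m=2, k=3, n_instances=6)
```

For (m,k) = (2,3) this yields the repeating pattern execute-execute-skip, so exactly 2 of every 3 consecutive instances meet their deadlines, which is the minimum the (m,k)-firm constraint requires.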

—Competitive Analysis of Real-Time Scheduling Under Energy Constraints. A theoretical study evaluating the scheduling of real-time tasks under hard energy constraints is presented in Devadas et al. [2009]. Competitive analysis for online, semi-online, and semi-online-with-DVFS scheduling algorithms is illustrated using an adversary method. The paper begins by discussing the performance of the regular EDF algorithm under a given energy budget. It is shown that if the energy budget exceeds the minimum energy required to execute a task set, then EDF is optimal. However, as the energy budget becomes constrained, EDF performs worst, providing no guarantee



even of obtaining nonzero total value (the value of a job is a quantitative representation of the worth of executing a task, obtained only if the task is completely executed). Hence the authors argue that EDF cannot provide a nonzero competitive factor under tight energy constraints, followed by a proof of the upper bound on the performance of any online real-time scheduling algorithm in underloaded conditions (in terms of energy). An online real-time scheduling algorithm called EC-EDF is then presented that evaluates the possible completion of a task before admitting it for execution. It is proved that, within the given energy budget, EC-EDF always completes all admitted jobs and hence achieves nonzero total value and competitive factor. A semi-online algorithm called EC-EDF∗, equipped with additional information about the largest task size, is presented to obtain a guarantee on the competitive factor. It is shown that EC-EDF∗ achieves a constant competitive factor of 0.5 (the ideal being 1), and no other semi-online algorithm can achieve a better competitive factor than 0.5. These bounds are established for underloaded systems under the assumption that the value density of jobs is uniform. The algorithm is later evaluated under nonuniform value settings and is shown to have a competitive factor of 1/(2·kmax), where kmax is the largest value associated with any task. EC-EDF∗ is then extended to the case where the processor is equipped with DVFS, so tasks can execute at different frequencies. This approach, called β-EC-EDF, achieves a competitive factor of 1 given twice as much energy as the adversary. Important theoretical results regarding energy-constrained scheduling of real-time tasks are presented in this article, but no experimental/simulation results are given.
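The admission idea behind EC-EDF can be sketched minimally: a job is accepted only if its energy demand fits what remains of the budget, so every admitted job can run to completion and contribute its value. The energy model and loop below are illustrative placeholders, not the algorithm's full deadline machinery.

```python
# Sketch of energy-constrained admission control: admit a job
# only if the total energy demand of admitted jobs stays within
# the budget, so nothing admitted is ever abandoned mid-run.

def admit(jobs, budget, new_job):
    """jobs: energy demands of already-admitted jobs;
    new_job: energy demand of the arriving job."""
    return sum(jobs) + new_job <= budget

budget = 10.0
admitted = []
for demand in [4.0, 5.0, 3.0, 1.0]:
    if admit(admitted, budget, demand):
        admitted.append(demand)
```

Here the third job (demand 3.0) is rejected because only 1.0 unit of budget remains, but the later, smaller job is still admitted; rejected jobs yield zero value, which is exactly what the competitive analysis charges against the online algorithm.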

—Window-Based Lazy Scheduling for Real-Time Systems. A window-based scheduling approach for real-time systems is introduced in Ranvijay et al. [2010]. The authors argue that a simple greedy scheduling scheme in the presence of energy constraints may result in high rejection ratios. Specifically, scheduling a task for execution immediately on its arrival can prevent another high-priority task from completing. Under scheduling with preemption, this can result in a poor acceptance ratio. The authors propose a window-based lazy scheduling algorithm in which the scheduling decision is governed both by the energy constraints and by the task's deadline. Based on the energy budget and deadline, a task can be deferred while other tasks with earlier deadlines are scheduled. A slack reclamation approach that saves energy by slowing down tasks whose total response time is less than the window size is also proposed. Simulations are conducted with aperiodic tasks generated randomly using an exponential distribution. Results are compared for the window-based lazy approach with and without DVFS, greedy EDF with and without DVFS, and a DVFS-based algorithm for sporadic tasks. Although window-based lazy scheduling reportedly achieves better energy consumption and acceptance ratios, a detailed analysis of the algorithms is not presented, which limits the broader applicability of the proposed idea.

—ECPO on Multicore and Distributed Platforms.

We now present brief details of selected related work (Table V) on energy-constrained performance optimization.

—Power Shifting. Felter et al. [2005] discuss and demonstrate a power shifting method for controlling component power consumption while minimizing the impact on performance in server systems. Power shifting distributes power among components (including memory and processing units) dynamically using workload-sensitive policies. In turn, a minimal degradation in performance can reduce the power budgets for workloads or, alternatively, enable the system to improve performance under a given power budget. Power shifting, the authors claim, involves dynamic management of the

ACM Journal on Emerging Technologies in Computing Systems, Vol. 8, No. 4, Article 32, Pub. date: October 2012.

Page 25: askdfjnaiow

Energy- and Performance-Aware Scheduling of Tasks on Parallel/Distributed Systems 32:25

division of an entire system’s power budget among CPU and memory, citing that theseare the components of a system that consume the most energy during operation. Twoobservations behind the idea of power shifting are that components and overall systemactivity vary with workload and that components of a system are typically not fullyutilized at the same time.
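
The second observation above is what makes shifting profitable: the CPU and memory rarely peak together, so budget can flow to whichever is busier. A hedged sketch of that idea, with an entirely illustrative activity-proportional policy and made-up numbers, follows.

```python
# Hypothetical power-shifting policy (not Felter et al.'s actual controller):
# divide a fixed system budget between CPU and memory in proportion to their
# recent activity, keeping a minimum floor for each component.
def shift_power(budget_watts, cpu_activity, mem_activity, floor=10.0):
    """Return (cpu_watts, mem_watts) for the next control interval."""
    total = cpu_activity + mem_activity
    share = cpu_activity / total if total else 0.5   # idle: split evenly
    cpu = max(floor, min(budget_watts - floor, share * budget_watts))
    return cpu, budget_watts - cpu
```

A workload-sensitive controller would re-run such a split every interval as measured activity changes.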

—Energy-Constrained Combinatorial Optimization. Li [2008] provides a combinatorial optimization approach for solving the problem of task scheduling on DVFS-enabled multiprocessor systems. The author addresses two problems: energy-constrained performance (schedule length) optimization and performance-constrained (given schedule length) energy minimization. These problems are approached as a sum-of-powers problem, with task scheduling and power-supply determination taken as two subproblems. First, the study shows the equivalence of minimizing schedule length and minimizing energy consumption with the sum-of-powers problem. Given task execution requirements as the number of CPU cycles for each task, finding the minimal schedule length under an energy consumption constraint involves finding a power supply for each task and a nonpreemptive schedule for the tasks spread over multiple processors. Of these two components of the problem, scheduling tasks involves partitioning the tasks into sets to be executed on each processor, while determination of power supplies requires minimizing the schedule length under the given energy budget; this can be performed after the partitioning is completed. To begin, a uniprocessor computer with an energy constraint is considered. Minimizing the schedule length in this case only involves finding power supplies for the tasks that produce the minimal schedule while not violating the energy constraint. Next, minimizing the schedule length is extended to multiprocessors. It is shown that when a task set is partitioned and divided among processors, finding power supplies for each task that minimize schedule length is equivalent to finding the total energy consumption of each processor's assigned task set that minimizes the schedule length. Given task execution requirements, finding a schedule that consumes minimal energy under a schedule length constraint on a multiprocessor system again involves finding power supplies and a nonpreemptive schedule. First the author considers a uniprocessor system, finding power supplies that minimize energy consumption while keeping the schedule length below the constraint; this idea is then extended to multiprocessor systems using a similar method. Second, Li shows the NP-hardness of these optimization problems, followed by lower bounds for the optimal schedule length and minimal energy consumption. NP-hardness is demonstrated by showing that the problems mentioned previously are equivalent to sum-of-powers problems and using a reduction from the well-known partition problem. The lower bounds for optimal schedule length and minimal energy consumption are used to show that optimizing the power-performance product is done by fixing one factor and minimizing the other; these bounds are also used to benchmark the heuristic algorithms. Next, the author proposes variations of the list scheduling algorithm to determine the partitions (schedules or sums of powers) for each processor. Classic list scheduling approaches, including Largest-Requirement-First List Scheduling and Smallest-Requirement-First List Scheduling, and how to apply them to this problem are discussed. Finally, equal-speed algorithms are considered. An equal-speed algorithm supplies all tasks with the same power and speed and is a prepower determination algorithm (power supplies are determined before the schedule is found).
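
The two subproblems above can be sketched together in a simplified form. We assume, purely for illustration, the common dynamic-power model where power grows as speed cubed, so a task of c cycles run at speed s takes c/s time and consumes c·s² energy; the partitioning step uses Largest-Requirement-First list scheduling and the power step uses the equal-speed prepower rule described in the text.

```python
import math

# Sketch of Li's decomposition (our simplification): (1) partition tasks by
# Largest-Requirement-First list scheduling, (2) an "equal speed" prepower
# step giving every task the same speed s, chosen so total dynamic energy
# sum(c_i) * s**2 (assuming power = s**3, time = c/s) equals the budget E.
def lrf_partition(cycles, m):
    """Greedy LRF: assign each task (largest first) to the least-loaded
    processor; returns the per-processor cycle loads."""
    loads = [0.0] * m
    for c in sorted(cycles, reverse=True):
        k = loads.index(min(loads))          # least-loaded processor
        loads[k] += c
    return loads

def equal_speed_schedule_length(cycles, m, energy_budget):
    s = math.sqrt(energy_budget / sum(cycles))   # equal speed for all tasks
    return max(lrf_partition(cycles, m)) / s     # makespan at that speed
```

Because the energy of the equal-speed schedule depends only on total cycles, the partitioning and the power determination decouple exactly as the text describes.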

—Iterative Voltage Adjustments. Ahmad et al. [2009] propose an iterative static voltage adaptation (ISVA) algorithm that minimizes energy requirements while allocating voltages to the subtasks of an application represented by a DAG. An initial schedule is first generated using an existing efficient algorithm without taking the energy constraint into consideration. Next, ISVA computes each task's relative importance and the corresponding energy burden. It then adjusts the schedule to achieve the best possible schedule under the given energy budget. The inputs to ISVA are a task graph, the number of processors, the DVFS levels of the processors, and an energy budget. Using an efficient list scheduling algorithm such as DCP [Ahmad and Kwok 1998], ISVA generates a schedule with all tasks allocated the lowest available voltage level. The energy consumption of this schedule is recalculated, and if it does not exceed the energy budget, the algorithm proceeds. From the tasks that are not yet allocated the maximum voltage level, ISVA selects tasks in turn and increases their voltage levels; a task whose voltage level is incremented is called a candidate task. If there is no candidate task, the earliest starting task scheduled to run at the lowest voltage level among all the processors is selected for adjustment. ISVA stops when an increase in the voltage level of a task would exceed the allocated energy budget.
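
The core budget-bounded raising loop can be sketched as follows. This is our minimal reading of the algorithm: it ignores the DAG structure and the importance/energy-burden ordering, treats `energy(task, level)` as given, and simply raises levels round-robin while the budget allows, which is only the skeleton of ISVA.

```python
# Minimal ISVA-style sketch (our simplified reading; ordering heuristics and
# DAG structure omitted): start every task at the lowest DVFS level and raise
# levels in turn while total energy stays within the budget.
def isva_levels(tasks, levels, energy, budget):
    """tasks: list of task ids; levels: DVFS levels, lowest first;
    energy(t, l): energy of task t at level l. Returns level index per task."""
    assign = {t: 0 for t in tasks}
    total = sum(energy(t, levels[0]) for t in tasks)
    changed = True
    while changed:
        changed = False
        for t in tasks:                       # consider tasks in turn
            if assign[t] + 1 < len(levels):
                delta = (energy(t, levels[assign[t] + 1])
                         - energy(t, levels[assign[t]]))
                if total + delta <= budget:   # stop raising once budget is hit
                    assign[t] += 1
                    total += delta
                    changed = True
    return assign
```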

—Queuing Model for Optimal Power Allocation. In Gandhi et al. [2009], the authors investigate the optimization problem of minimizing the average response time of servers under a given total power budget on the platform of a heterogeneous high-performance server farm. They begin with an analysis of the relationship between CPU frequency scaling and dynamic power dissipation, applying the same CPU-bound workloads to three different frequency scaling techniques: DVFS (ACPI P-states), DFS (ACPI T-states), and mixed DFS+DVFS (fine-grained T-states on top of coarse-grained P-states). From the results, both DVFS and DFS show a linear relationship between dynamic power consumption and CPU frequency, while DFS+DVFS shows a cubic relationship. They then investigate the optimal power allocation, represented as the optimal CPU frequencies yielding the minimum mean response time. The authors assume the system can be modeled by a queuing-theoretic model, so mean response time can be predicted by taking into consideration factors such as the power-frequency relationship, arrival rate, and peak power budget; the model can then estimate the optimal power allocation under each possible combination of values of these factors. Under different workload intensities, the model yields different power allocations. The results show that the best performance under a power constraint may not come from running tasks on a small number of servers at maximum frequency; a large number of servers running at lower performance levels may be the optimal solution in some cases.
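
That last insight can be reproduced with a toy model. We assume, for illustration only, a cubic power-frequency law (power = f³), an even split of a Poisson arrival stream across servers, and an M/M/1 queue per server; this is a drastic simplification of the paper's analysis, not its actual model.

```python
# Toy reproduction of the "many slow servers can beat few fast ones" effect:
# under a fixed power budget and an assumed cubic power-frequency law, model
# each server as an M/M/1 queue and search over the server count.
def best_server_count(total_power, arrival_rate, max_servers):
    best_k, best_t = None, float("inf")
    for k in range(1, max_servers + 1):
        f = (total_power / k) ** (1.0 / 3.0)   # per-server frequency
        lam = arrival_rate / k                 # load split evenly
        if f > lam:                            # M/M/1 stability condition
            t = 1.0 / (f - lam)                # mean response time
            if t < best_t:
                best_k, best_t = k, t
    return best_k, best_t
```

With a budget of 64 power units and arrival rate 2, the toy model already prefers two half-power servers over one at full power.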

—Energy-Conscious Scheduling Heuristic with Makespan-Conservative Energy Reduction. Lee and Zomaya investigated the task scheduling problem on Heterogeneous Computing Systems (HCSs) [Lee and Zomaya 2009]. The application model is based on precedence-constrained tasks, so communication costs are accounted for in the model. They proposed an Energy-Conscious Scheduling heuristic (ECS) for loosely coupled HCSs (e.g., grids and clouds) using advance reservation and multiple sets of frequency-voltage settings. ECS is devised with the incorporation of DVS to reduce energy consumption; this means that a trade-off exists between the quality of schedules (makespan) and energy consumption. The precedence-constrained tasks are constructed as a DAG with nodes representing tasks and directed edges representing precedence relations between nodes; computation costs are represented as weights on nodes, while communication costs are represented as weights on edges. The algorithm consists of two typical phases: a static scheduling phase and an energy reduction phase. In the first phase, ECS is executed repeatedly to formulate a balanced schedule with an objective dealing with the trade-off between performance and energy, and task-to-processor (machine) mappings are constructed. In the second phase, the Makespan-Conservative Energy Reduction technique (MCER) is incorporated into ECS: the initial schedule generated in phase one is scrutinized to identify whether any changes to the schedule further reduce energy consumption without an increase in makespan.

Table VI. DEPO on Single Processor Platforms

Paper | Algorithmic Approach | Performance Definition | Task Model | Platform | Power Control Strategy | Components
----- | -------------------- | ---------------------- | ---------- | -------- | ---------------------- | ----------
[Malik et al. 2000] | Heuristic | System utilization | Benchmark (Powerstone) | Multi-core | Cache tuning | Cache
[Kremer 2000] | Heuristic, profile based, compiler directed | Task completion | Dependent | Single processor | Remote task execution | CPU
[Azevedo et al. 2001] | Heuristic, profile based, compiler directed | Hit/miss rate | Independent | Single processor | Discrete DVFS | CPU
[Zhang et al. 2005] | Heuristic, profile based | Hit/miss rate | Benchmark (Powerstone, Mediabench, SPEC2000) | Single processor | Cache tuning | Cache
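
The MCER pass described above can be sketched for the simplified case of independent tasks already mapped to processors. The time and energy models below (time = cycles/frequency, energy proportional to frequency squared) are assumptions for the sketch, not Lee and Zomaya's models, and the DAG precedence constraints are omitted.

```python
# Sketch of a makespan-conservative energy reduction pass (simplified,
# independent-task reading of MCER): lower a task's frequency whenever doing
# so does not extend the schedule's makespan.
def mcer(tasks, freqs):
    """tasks: {id: (cycles, processor, freq_index)}; freqs sorted low to high.
    Greedily decrement freq_index wherever the makespan is preserved."""
    def makespan(t):
        loads = {}
        for c, p, fi in t.values():
            loads[p] = loads.get(p, 0.0) + c / freqs[fi]
        return max(loads.values())
    base = makespan(tasks)
    for tid, (c, p, fi) in list(tasks.items()):   # snapshot of initial state
        while fi > 0:
            trial = dict(tasks)
            trial[tid] = (c, p, fi - 1)
            if makespan(trial) > base:            # change would hurt makespan
                break
            fi -= 1
            tasks = trial
    return tasks
```

Every accepted change strictly reduces energy under the assumed quadratic energy model while provably leaving the makespan unchanged.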

For ECPO problems, workload consolidation techniques are not as important as they are for PCEO problems. From the previously cited related work, it can be observed that improving power estimation or prediction techniques for tasks and computer systems can also improve system performance. Compared with static scheduling methods, which solely consider the energy constraint, dynamic scheduling approaches including slack reclamation may attract more attention for ECPO problems in the future. However, since ECPO can impose a strict energy threshold whose violation cannot be afforded in many cases, the energy overhead of dynamic allocation and migration should be carefully evaluated. Furthermore, not only the cumulative energy constraint but also the peak power threshold, especially for power-sensitive computer systems, should be taken into account.

4.2.3. Dual Energy and Performance Optimization (DEPO) Algorithms. The dual optimization of both performance and energy consumption can be considered as multiobjective optimization with multiple constraints. Trade-offs are evaluated for different scenarios, and a summary of various characteristics of research efforts on DEPO targeted at single processor systems as well as multicore and distributed systems is presented in Table VI and Table VII, respectively.

—DEPO on Single Processor Platforms.

—Compiler-Driven Optimization. Kremer, Hicks, and Rehg [2000] make use of compile-time program analysis to offload tasks involved in face recognition and detection from embedded or other battery-driven devices to servers. They aim to minimize energy consumption while also minimizing the performance penalty. During program compilation, the phases of face detection are broken down into tasks used to construct a DAG; a decision is then made about the benefit of potentially executing each task on a remote server rather than the local device. Once the compilation phase is completed, the client can be kept informed about the progress of a task executed on the server; in the case of a network disconnection, the client can continue execution on its own without the help of a server. Another optimization solution driven by compiler events is proposed by Azevedo et al. [2001], who use a DVFS technique to dynamically manage power via compiler-driven strategies. Their COPPER (compiler-controlled continuous power performance) project makes use of the GCC compiler; the Wattch simulator with an updated power profiler and power scheduler modules; and SimpleScalar, enabling analysis from code generation up to simulated execution. Compiler-generated configuration code is embedded into an application to produce different versions of the code to be selected at runtime, and a power scheduler makes choices based on the available power profile. Scheduling heuristics are used to predict the power dissipation of an application using the "ahead-of-time" power profile. To effectively address both power and performance control simultaneously, additional controls must be put in place. To achieve the additional goal of time constraints, program checkpoints are introduced at specific locations in the code, and time constraints are then set for the amount of time (acceptable upper and lower bounds) the program should take between checkpoints. A heuristic approach involving two profiling phases for time and a power scheduling phase is discussed to meet both goals.

Table VII. DEPO on Multicore and Distributed Platforms

Paper | Algorithmic Approach | Performance Definition | Task Model | Platform | Power Control Strategy | Components
----- | -------------------- | ---------------------- | ---------- | -------- | ---------------------- | ----------
[Ahmad et al. 2008] | Pareto optimal (NBS) | Makespan | Independent | Multicore, heterogeneous grid | DVFS | CPU
[Liu et al. 2008] | Heuristic | Timing constraint | Periodic DAGs | MPSoC | DVFS, task retiming | CPU
[Verma et al. 2008] | Bin-packing | SLA | Independent | Virtualized cluster | VM consolidation | Server
[Bao et al. 2009] | Heuristics for MIP | Throughput and execution time | Dependent | Multicore | DVFS | CPU
[Kusic et al. 2009] | Predictive control theory | SLA | Independent | Virtualized cluster | VM consolidation | Server
[Lammie et al. 2009] | Heuristics | Turnaround time | Independent (grid workload) | Multicore cluster | Frequency and node scaling | CPU

—Cache Tuning Methods. As caches contribute significantly to a system's power consumption, researchers have devised ways to make caches configurable for different applications. Zhang et al. [2005] introduce a configurable cache architecture in which the cache can be tuned in associativity, total size, and line size. The performance and area overhead of these configurable features is small, and by incorporating simple configurable circuits, the power savings gained from configuration can reduce overall system power by as much as 40%. Malik et al. [2000] also present cache-tuning solutions for an M-CORE M3 architecture controlled via a cache control register (CACR), with different cache-tuning features exposed as programmable modes of operation: write-mode selection (write-through and copy-back), way management control, and data caching policy adjustment. Following benchmark analysis, an optimal performance/power-consumption profile can be generated using these features.
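
The tuning step in both works amounts to a constrained search over cache configurations. A hedged sketch of that search follows; `simulate` stands in for a real cache simulator run over a benchmark, and its interface and the cost model in the test are purely hypothetical.

```python
# Illustrative cache-tuning search in the spirit of Zhang et al.: enumerate
# (size, associativity, line size) configurations and keep the lowest-energy
# one whose miss rate stays under a tolerance. simulate(s, a, l) is a
# stand-in returning (miss_rate, energy) for that configuration.
def tune_cache(sizes, assocs, lines, simulate, max_miss_rate):
    best, best_energy = None, float("inf")
    for s in sizes:
        for a in assocs:
            for l in lines:
                miss, energy = simulate(s, a, l)
                if miss <= max_miss_rate and energy < best_energy:
                    best, best_energy = (s, a, l), energy
    return best, best_energy
```

Because the configuration space is small (a handful of sizes, associativities, and line sizes), exhaustive enumeration is practical, which is part of what makes runtime cache tuning feasible.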

—DEPO on Multicore and Distributed Platforms.

—Game-Theoretical Scheduling Algorithm. Ahmad et al. [2008] solve the Energy-Aware Task Allocation (EATA) problem for a form of the third scenario by designing an algorithm, named NBS-EATA, built on concepts from cooperative game theory using the Nash Bargaining Solution (NBS). A cooperative game is played among the cores, such that they compete with each other to grab tasks in order to maximize their profit (based on makespan and power reduction). The desired outcome of such a game is a task-to-core mapping that benefits the whole processing system. The problem with the DEP-OPT function (assuming heterogeneous cores and tasks with a certain affinity to specific cores) corresponds to a min-min-max optimization problem. An algorithm is first designed that adheres to the Nash Bargaining Solution and yields Pareto-optimal solutions, but it has high complexity. The min-min-max problem is then transformed into a max-max-min optimization problem, which significantly lowers the complexity of the problem. The major advantage of this conversion, besides lower complexity, is the Pareto-optimality and fairness that come with the guarantee of always having a bargaining point for the max-max-min problem. There is a close match between the problem scenario and game theory, with the cores of a multicore processor competing to optimize multiple goals from multiple perspectives. The scheduler can easily modify the rules to adjust the strategy of the games to suit various scheduling scenarios and policies. Game theory allows multiple objective functions for cores and memory modules, and these objectives can be dynamically adjusted and tuned.

—Overhead-Aware Joint Energy and Performance Optimization on MPSoCs. A heuristic approach to jointly optimizing energy and performance while executing applications represented by periodic DAGs is explored in Liu et al. [2008]. The goal is to explore the mutual trade-off between the two quantities by exploiting software pipelining as well as DVFS and DPM. The proposed approach has two phases: in the first, a given periodic DAG is transformed into a set of independent tasks, and in the second, DVFS and DPM are applied to the scheduled DAG to obtain energy savings. The task-level software pipelining approach uses a technique called retiming to remove the intra-period dependencies among the tasks; the result is a periodic DAG that can be executed as a set of independent task sets with only inter-set dependencies. The paper elaborates the advantage of working with the pipelined version of the DAG as opposed to the original by showing that the former can be scheduled under significantly tighter timing constraints. In the second phase, a "shrinking" and "extending" approach (SpringS) is used to adjust the initial schedule: the schedule is adjusted if the timing constraint is violated and is afterwards "extended" if the adjustment has reduced the schedule length below the constraint. The adjustments continue until no further reduction in energy can be obtained or the timing constraint is violated. Overhead is accounted for in the form of communication overhead as well as the transition overhead incurred when switching among voltage levels and the off and on states of the cores. Experiments are performed for both randomly generated and practical task graphs, with each core considered equivalent to an AMD Mobile Athlon processor. The proposed approach is compared with a non-DVFS list scheduling approach as well as with a static DVFS approach for DAG scheduling that does not perform software pipelining (DAGwoSP). The proposed technique achieves 69.9% and 49.8% energy savings on average compared to the non-DVFS approach and DAGwoSP, respectively. Additionally, it is shown that SpringS can satisfy much tighter timing constraints than DAGwoSP for the same number of processors (cores). Trade-off surfaces among energy consumption and timing constraints are also presented for different numbers of processors.
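
The "extending" half of SpringS can be illustrated in a much-reduced form. We assume, only for the sketch, a uniform slowdown over a discrete set of speeds and a schedule length that scales inversely with speed; SpringS itself adjusts individual tasks, not the whole schedule.

```python
# Hedged sketch of the "extend" step over discrete speeds (our simplification
# of SpringS): starting at full speed, keep stepping the schedule down to the
# next lower speed while the timing constraint still holds, and stop just
# before violating it.
def spring_extend(base_len, deadline, speeds=(1.0, 0.8, 0.6, 0.5, 0.4, 0.25)):
    """base_len: schedule length at full speed. Returns (speed, length)."""
    speed = speeds[0]
    for nxt in speeds[1:]:                  # candidate lower speeds, descending
        if base_len / nxt > deadline:       # would violate the constraint
            break
        speed = nxt                         # extend: accept the slower setting
    return speed, base_len / speed
```

Under a convex power-speed curve, each accepted extension saves energy, which is why the algorithm stretches the schedule as close to the deadline as the speed grid allows.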

—Trade-Offs among Performance, Power, and VM Migration Cost. Verma et al. [2008] investigate the power-aware application placement problem in a virtualized environment. They propose an application placement controller, pMapper, which can dynamically place applications to achieve different trade-offs such as performance (SLA) optimization and power-efficiency optimization. pMapper categorizes all allocation activities into three groups, communicated and monitored by three software components: a performance manager covers performance-related activities, including VM resizing and idling; a power manager covers all power management activities at the hardware layer; and a virtualization manager covers consolidation activities via live VM migration. All three managers report to an arbitrator, which analyzes all the system information and makes decisions from a global view. To handle the different objectives, the authors abstract the power-aware VM allocation problems as bin-packing problems with different bin sizes, in which object volumes represent VM parameters and bin sizes represent server parameters. Motivated by the first-fit solution for the classic bin-packing problem, the authors propose the min Power Parity (mPP) algorithm to solve power-aware optimization problems such as the power minimization problem. The mPP algorithm rests on two assumptions: first, that VMs are independent of servers, that is, there are no extra power increments for specific VM-server pairings; second, that the power slope between two servers for a given VM keeps the same direction for all VMs. Under these assumptions, mPP can be shown to be globally optimal. However, mPP can result in frequent VM migration in some cases, so the mPPH algorithm is proposed, which is migration-aware by keeping track of previous allocation information; it achieves, however, only a suboptimal solution to the original optimization problem. Thus a third algorithm, pMaP, is proposed, which considers the ratio of the power increment due to a migration to the migration cost; by choosing the lowest ratio before deciding any migration, pMaP can be shown to be locally optimal. Their results show that pMaP is better than mPP on average, and the gap between the locally optimal solution and the globally optimal solution is acceptable in the best case; both algorithms outperform load-balancing algorithms. The proposed method shows good applicability in both theoretical and practical domains. However, the assumptions may not hold in a heterogeneous environment, and the proposed algorithms are all offline, meaning the arbitrator must know the required information about all applications, VMs, and servers in advance; thus the method may fail for real-time application/task models.
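
The first-fit intuition behind mPP can be sketched as follows. This is an illustrative stand-in, not the paper's algorithm: we assume a simple per-server power-efficiency figure, one-dimensional VM sizes, and first-fit-decreasing packing onto the most efficient servers.

```python
# Illustrative first-fit placement in the spirit of mPP (details hypothetical):
# order servers by power cost per unit of capacity and pack VMs first-fit-
# decreasing onto the most power-efficient servers, so less efficient servers
# stay empty (and can be powered off) whenever possible.
def mpp_place(vm_sizes, servers):
    """servers: list of (capacity, watts_per_unit) tuples.
    Returns {server_index: [placed vm sizes]} for non-empty servers."""
    order = sorted(range(len(servers)), key=lambda i: servers[i][1])
    free = {i: servers[i][0] for i in order}
    placement = {i: [] for i in order}
    for vm in sorted(vm_sizes, reverse=True):
        for i in order:                      # first fit on efficient servers
            if free[i] >= vm:
                free[i] -= vm
                placement[i].append(vm)
                break
    return {i: p for i, p in placement.items() if p}
```

Note how the independence assumptions surface here: the per-unit wattage of a server is taken to be the same regardless of which VM lands on it.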

—Trade-Offs among Performance, Power, and Temperature. The work in Bao et al. [2009] proposes a scheme to address the problems of suboptimality and computational overhead when applying DVFS for task allocation, using a task model of an MPEG-2 decoder on homogeneous cores. The DVFS schemes normally used to optimize processor energy assume the maximum allowable temperature (Tmax) when calculating the leakage current and fmax, which results in suboptimal solutions. Additionally, many DVFS approaches assume that applications complete in their WNC (worst number of cycles), which is not always true. The article elaborates that using the actual temperature of the core when calculating fmax, and taking into consideration the leakage current and the temperature-frequency dependence, can improve the overall efficiency of DVFS. To apply such a scheme at runtime, a two-phase approach with an offline stage for precomputing the voltage/frequency settings for all tasks is proposed. The offline stage computes the actual temperature for an application/task iteratively, starting with Tmax and then repeating the computation of voltage/frequency (required for energy optimization) until convergence is achieved.
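
The offline iteration just described is a fixed-point computation, which can be sketched generically. The thermal and frequency models passed in below are arbitrary stand-ins, not Bao et al.'s equations; the structure (start at Tmax, alternate frequency and temperature updates until they agree) is the point.

```python
# Sketch of the offline temperature/frequency fixed-point iteration: start
# from Tmax, compute the frequency implied by the current temperature, then
# the steady-state temperature that frequency produces, until consistent.
def settle(f_of_T, T_of_f, T_max, tol=1e-6, max_iter=100):
    """f_of_T: max safe frequency at temperature T (model assumed given);
    T_of_f: steady-state temperature at frequency f. Returns (f, T)."""
    T = T_max
    for _ in range(max_iter):
        f = f_of_T(T)
        T_new = T_of_f(f)
        if abs(T_new - T) < tol:
            return f, T_new
        T = T_new
    return f, T
```

When the combined map is a contraction (as in the linear toy models in the test), the iteration converges geometrically, so a handful of offline iterations per task suffices.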

—Prediction-Based Limited Look-Ahead Control. In Kusic et al. [2009], the authors propose a dynamic workload allocation strategy for virtualized clustered servers. The optimization objective is the overall processing rate of the VMs, so as to achieve the best SLA satisfaction. The trade-off between power consumption (number of active VMs, number of active servers, and the CPU share given to each VM) and performance (SLA violation) is also considered, via rules whereby SLA violations incur a profit penalty while satisfied SLAs yield rewards. The strategy applies to homogeneous environments, that is, the VMs are all identical, so the power consumption of any server can easily be calculated from the number of VMs. The proposed approach is offline, that is, when any task comes in, the costs of all possible allocations are already known. Based on this idea, all states are organized by prediction into horizons of limited length, so any workload input has state-prediction chains covering all possible states following each candidate allocation at the head of a chain. The head of the optimal chain among all options is then chosen as the allocation for this workload, and when the next input comes, the selection procedure is repeated. This is called the limited look-ahead control strategy, and its semi-globally optimal solution can attain more profit than the accumulation of simple locally optimal solutions at each step. To trade off between different SLA requirements, a risk variable is defined that decides how aggressively the optimization is pursued; if the variable is zero, prediction misses are ignored, which suits scenarios with low penalties for SLA violations. However, the scalability of this approach is clearly not ideal, since a linear increase in the number of VMs or servers leads to an exponential increase in the number of possible state chains. To address this, the authors give an alternative strategy for large clusters, in which a neural network is built to record the historic chosen state chains of inputs. When the neural network is well trained, it can provide an estimated state chain for any incoming workload instead of choosing from all the chain options, reducing the complexity to polynomial. Although this approach is simulated and the results outperform a simple greedy workload allocation algorithm, a comparison with other VM consolidation approaches is lacking.
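
The chain-enumeration step, and its exponential cost, can be made concrete with a generic sketch. The scoring function and the option set are placeholders we invented; Kusic et al.'s controller scores chains with a predictive profit/penalty model rather than the toy function used in the test.

```python
from itertools import product

# Hedged sketch of limited look-ahead control: enumerate every chain of
# control choices over a short horizon, score each chain, and commit only
# the head of the best chain; the procedure repeats at the next step.
def lookahead_choice(options, horizon, score_step):
    """options: possible per-step choices (e.g., active-VM counts);
    score_step(step_index, choice): predicted reward of that choice."""
    best_chain, best = None, float("-inf")
    for chain in product(options, repeat=horizon):   # |options|**horizon chains
        total = sum(score_step(i, c) for i, c in enumerate(chain))
        if total > best:
            best_chain, best = chain, total
    return best_chain[0]                 # commit the head; re-plan next step
```

The `product(options, repeat=horizon)` enumeration is exactly the exponential blow-up the paper's neural-network shortcut is meant to avoid.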

—Scheduling Parallel Workloads on Multicore Clusters for Joint Optimization. Several cluster management policies leveraging frequency scaling and node scaling to improve the energy consumption and turnaround time of jobs submitted to a system are proposed in Lammie et al. [2009]. The proposed approach designs a multilayer joint optimization scheme that combines determining the optimal number of nodes to keep in the "ON" state, based on the parallel workload, with a DVFS and task allocation scheme; the aim is to take advantage of the bursty nature of the workload. The paper compares cluster management schemes that do not perform node scaling, that perform node scaling but not frequency scaling, and that perform both. In the second approach, a task in the queue is either assigned to an idle node, or a new node is turned on when no idle node is available. In the third case, frequencies are first scaled up to determine whether they allow completing the new task or the task in the queue; otherwise, a new machine is turned on to fulfill the task completion requirements. The third policy, called SWQ, is then combined with three heuristics to improve its task assignment procedure: SWQI allocates the next job in the queue to a core in the machine scheduled to run the longest; SWQN assigns the pending job to the machine with the most recently submitted tasks; and SWQO maps the task to the machine with the oldest running job. Among these adaptive schemes, SWQI and SWQN consume the least energy except when the number of nodes is very small. The experiments are conducted on workload data from four different parallel systems. In terms of turnaround times, SWQI and SWQO achieve good results under variable load compared to the best round-trip time. The efficiency of DVFS-equipped machines, in terms of cycles per joule, was also investigated: the authors observe that machines at full capacity have better cycles/joule values than machines running at higher frequency, mainly due to the wasted cycles. The proposed approach is adaptive, yet its runtime behavior may need further investigation.
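
The three-tier decision (idle node, then frequency scale-up, then node power-on) behind the SWQ family can be sketched as follows. The node representation, capacity model, and frequency levels are entirely our assumptions; this shows only the policy ordering, not the paper's actual heuristics.

```python
# Illustrative three-tier SWQ-style assignment (hypothetical model): try an
# already-idle node first, then raise a busy node's frequency if that lets it
# absorb the task within capacity, and only then power on a new node.
def swq_assign(task_len, nodes, f_levels=(1.0, 1.5, 2.0), capacity=10.0):
    """nodes: list of dicts {'on': bool, 'load': float, 'f': float}.
    Returns which tier handled the task; mutates nodes in place."""
    for n in nodes:                              # 1) an already-idle node
        if n["on"] and n["load"] == 0:
            n["load"] = task_len
            return "idle-node"
    for n in nodes:                              # 2) scale a frequency up
        for f in f_levels:
            if n["on"] and f > n["f"] and (n["load"] + task_len) / f <= capacity:
                n["f"], n["load"] = f, n["load"] + task_len
                return "freq-scaled"
    for n in nodes:                              # 3) power on a new node
        if not n["on"]:
            n["on"], n["load"] = True, task_len
            return "node-on"
    return "queued"
```

Preferring frequency scaling over powering on nodes matches the bursty-workload rationale: a short burst is cheaper to absorb by running existing nodes faster than by paying a node's wake-up and idle costs.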

The combined optimization of performance and power consumption under different constraints can be approached in different ways. Furthermore, comprehensive trade-offs need to be evaluated for complex optimizations, such as those among power, temperature, and performance, using different objectives.


5. DISCUSSION

The discussion in Section 4 of various scenarios for keeping energy in perspective for modern computational devices presents an overview of the current possibilities in that direction. At the same time, however, we observe that there are still limitations on the possibilities for energy management in computational resources. Many research efforts make significant assumptions about the workload, the nature of the system, or the type of environment in which their scheme can be used; therefore, no solution can be considered the principal or general approach for energy minimization. It can also be observed that as the size of a system grows (at the infrastructure level), more opportunities usually become available for optimizing energy. For example, in an HPC facility, the designers and managers have to select the best method for minimizing energy consumption by keeping in mind the various models and conditions applicable to their system.

6. CONCLUSION AND RESEARCH DIRECTIONS

A recent surge of research on energy-efficient computing techniques has brought new opportunities and challenges. This survey summarized previous research on energy-efficiency problems under different objectives, such as PCEO, ECPO, and DEPO, and different target platforms, such as single processors, multicore processors, and parallel and distributed systems.

To extend existing work and foster innovation in energy-efficient techniques, especially in distributed systems, several directions that are rarely or incompletely investigated in the current literature are discussed next.

Support resources that are heterogeneous and dynamic. A major advantage of anetwork-based high-performance computing environment such as grids and cloudsis the heterogeneity of the computing machines, that requires assigning the tasks tomachines that are best suited to the tasks’ nature and requirements but exacerbatesthe energy-aware scheduling problem [Foster and Kesselman 1997]. In such systems,resources also become dynamically available, not only compounding the complexity ofscheduling but also requiring dynamic monitoring of resources’ energy usage. The sameis going to happen as multicore systems become heterogeneous and each core is bestsuited to certain type of load. Most scheduling algorithms assume a fixed availabilityof resources or specific type of hardware, but in a distributed environment, resourcesmay be added to or removed from the shared pool dynamically. Dynamic changes inavailable resources can significantly impact the energy and time requirements andshould be carefully incorporated in scheduling. It is desirable to empower grids andclouds with fast, dynamic, scalable, and adaptive governing mechanisms instead ofstatic and inflexible static manual solutions. Following this direction, novel algorithmscan be developed that exploit dependencies among different tasks for slack alloca-tion. Some nodes in the DAG (e.g., critical path nodes) may benefit more than othersfrom slack allocation. Moreover, these priorities may change as the execution proceeds,making the problem more interesting because the profits and losses of players varyin a game that dynamically changes. This affects the objective functions and gamingstrategies, that can be captured by cooperative gaming algorithms [Khan and Ahmad2006, 2007b]. 
By involving bargaining and cooperative games, existing critical-path analysis methods [Kwok and Ahmad 1996] can be utilized, and new methods can be developed, to take advantage of the precedence relationships between tasks for slack allocation and assignment. Dynamic algorithms will also benefit from energy monitoring: to determine the energy consumption of different processes at runtime, time-driven statistical techniques can be used [Chang et al. 2002; Gniady et al. 2004], and power monitoring tools can be integrated with the resource management framework, such as those in the Vista operating system [Microsoft 2007a, 2007b].
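The slack-allocation idea above can be illustrated with a minimal sketch. The task set, execution times, deadline, and the uniform-stretch policy below are invented for illustration; the critical-path algorithms cited in the text are far more sophisticated than this.

```python
# Toy slack allocation on a DAG: stretch task execution times to absorb the
# global slack (deadline minus makespan). Critical-path tasks consume exactly
# the global slack; off-critical-path tasks keep their local slack on top.

def longest_path_lengths(tasks, deps):
    """Earliest finish time of each task, assuming unlimited processors."""
    finish = {}
    def eft(t):
        if t not in finish:
            finish[t] = tasks[t] + max((eft(p) for p in deps.get(t, [])), default=0.0)
        return finish[t]
    for t in tasks:
        eft(t)
    return finish

def allocate_slack(tasks, deps, deadline):
    """Return stretched execution times that just meet the deadline."""
    finish = longest_path_lengths(tasks, deps)
    makespan = max(finish.values())
    slack = max(0.0, deadline - makespan)
    scale = (makespan + slack) / makespan  # uniform slowdown factor
    return {t: tasks[t] * scale for t in tasks}

# Example: 'c' depends on 'a' and 'b'; critical path is b -> c (length 4).
# With a deadline of 6, every task is stretched by a factor of 1.5.
stretched = allocate_slack({'a': 2.0, 'b': 3.0, 'c': 1.0}, {'c': ['a', 'b']}, 6.0)
```

With DVFS, each stretched time would then be mapped to the lowest discrete frequency that still meets it; a game-theoretic variant would renegotiate the shares as priorities change at runtime.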

Support a wide variety of trade-offs beyond energy and performance. In a distributed environment, requests represent different trade-offs between energy and performance. One request may be time critical, so minimizing time is the key goal and the scheduler may have to sacrifice some energy to meet it. Another request may have loose timing goals, allowing its energy consumption to be kept within a given quota. The current literature investigates the trade-off between energy and performance in terms of execution time, SLAs, QoS, and so on, but only a few studies consider optimization against other objectives. Thermal management, for example, is closely related to energy management: whereas energy reflects accumulated power consumption over time, temperature is more closely tied to peak power. A natural requirement is to save energy under both performance and temperature constraints, so exploring optimization problems that jointly consider energy, temperature, and performance is both necessary and attractive.
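As a toy illustration of why energy and temperature constraints can pull in opposite directions, consider a simple power model. All constants below are invented, dynamic power is modeled as C·f³ on the assumption that voltage tracks frequency, and sleep power is approximated as zero; real processors require measured models.

```python
# Compare two deadline-meeting policies: running slowly for the whole interval
# versus "race-to-idle" (full speed, then sleep). With static (leakage) power
# dominating, racing consumes less total energy but has a higher peak power,
# so the energy-optimal schedule may violate a temperature constraint.

C, P_S, W = 1.0, 2.0, 1.0   # switching constant, static power, work in cycles

def steady(f):
    """Run the whole job at frequency f, then stop."""
    time = W / f
    power = C * f**3 + P_S    # dynamic + static power while running
    return {"energy": power * time, "peak_power": power}

def race_to_idle():
    """Run at full speed, then sleep (sleep power approximated as zero)."""
    return steady(1.0)

slow, fast = steady(0.5), race_to_idle()
assert fast["energy"] < slow["energy"]          # racing saves energy here...
assert fast["peak_power"] > slow["peak_power"]  # ...but raises peak power
```

The sign of the trade-off flips when static power is small relative to C, which is exactly why a scheduler optimizing energy, temperature, and performance together cannot reuse an energy-only policy unchanged.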

REFERENCES

ABDELZAHER, T. AND LU, C. 2001. Schedulability analysis and utilization bounds for highly scalable real-time services. In Proceedings of the IEEE Real-Time Technology and Applications Symposium.

ACPI. 1999. Advanced configuration and power interface specification revision 4.0a. http://www.acpi.info/DOWNLOADS/ACPIspec40a.pdf.

AEA. 2008. American electronics association report cybernation. http://www.aeanet.org.

AHMAD, I. AND KWOK, Y. 1998. On exploiting task duplication in parallel program scheduling. IEEE Trans. Parallel Distrib. Syst. 9, 9, 872–892.

AHMAD, I. AND LUO, J. 2006. On using game theory for perceptually tuned rate control algorithm for video coding. IEEE Trans. Circ. Syst. Video Technol. 16, 2, 202–208.

AHMAD, I., KHAN, S., AND RANKA, S. 2008. Using game theory for scheduling tasks on multi-core processors for simultaneous optimization of performance and energy. In Proceedings of the Workshop on NSF Next Generation Software Program in Conjunction with the International Parallel and Distributed Processing Symposium.

AHMAD, I., ARORA, R., WHITE, D., METSIS, V., AND INGRAM, R. 2009. Energy-constrained scheduling of DAGs on multiprocessors. In Proceedings of the 1st International Conference on Contemporary Computing.

ALBONESI, D. 2002. Selective cache ways: On demand cache resource allocation. J. Instruct.-Level Parall.

ALENAWY, T. AND AYDIN, H. 2005. Energy-constrained scheduling for weakly-hard real-time systems. In Proceedings of the 26th IEEE International Real-Time Systems Symposium. 376–385.

AMD. 2008. AMD FireStream 9170 stream processor. http://ati.amd.com/technology/streamcomputing/product_firestream_9170.html.

ANDRAE, M. 1991. Biomass burning: Its history, use, and distribution and its impacts on the environmental quality and global change. In Global Biomass Burning: Atmospheric, Climatic, and Biosphere Implications, J. S. Levine, Ed., MIT Press, Cambridge, MA, 3–21.

ATLAS COLLABORATION. 1999. ATLAS physics and detector performance. Tech. des. rep. LHCC.

AYDIN, H., MELHEM, R., MOSSE, D., AND MEJIA-ALVAREZ, P. 2001. Optimal reward-based scheduling for periodic real-time tasks. IEEE Trans. Comput. 50, 111–130.

AYDIN, H., MELHEM, R., MOSSE, D., AND MEJIA-ALVAREZ, P. 2004. Power-aware scheduling for periodic real-time tasks. IEEE Trans. Comput. 53, 5, 584–600.

AZEVEDO, A., CORNEA, R., ISSENIN, I., GUPTA, R., DUTT, N., NICOLAU, A., AND VEIDENBAUM, A. 2001. Architectural and compiler strategies for dynamic power management in the COPPER project. In Proceedings of the International Workshop on Innovative Architecture.

BADER, D., LI, Y., LI, T., AND SACHDEVA, V. 2005. BioPerf: A benchmark suite to evaluate high-performance computer architecture on bioinformatics applications. In Proceedings of the IEEE International Symposium on Workload Characterization.

BAEK, W. AND CHILIMBI, T. 2010. Green: A framework for supporting energy-conscious programming using controlled approximation. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation.


BAO, M., ANDREI, A., ELES, P., AND PENG, Z. 2009. On-line thermal aware dynamic voltage scaling for energy optimization with frequency/temperature dependency consideration. In Proceedings of the 46th ACM/IEEE Design Automation Conference (DAC'09). 490–495.

BLAND, B. 2006. Leadership computing facility. Presented at The Fall Creek Falls Workshop.

BORKAR, S. 1999. Design challenges of technology scaling. IEEE Micro 19, 4, 23–29.

BROOK, B. AND RAJAMANI, K. 2003. Dynamic power management for embedded systems. In Proceedings of the IEEE International Systems-on-Chip (SOC) Conference.

BROOKS, D., BOSE, P., SCHUSTER, S., JACOBSON, H., KUDVA, P., BUYUKTOSUNOGLU, A., WELLMAN, J., ZYBAN, V., GUPTA, M., AND COOK, P. 2000. Power aware microarchitecture: Design and modeling challenges for next-generation microprocessors. IEEE Micro 20, 6, 26–44.

BURD, T., PERING, T., STRATAKOS, A., AND BRODERSEN, R. 2000. Dynamic voltage scaled microprocessor system. IEEE J. Solid-State Circ. 35, 11, 1571–1580.

BUTTAZZO, G. 2005. Hard Real-Time Computing Systems: Predictable Scheduling Algorithms and Applications. Springer.

CAVIUM NETWORKS. 2008. OCTEON Plus CN58XX multi-core MIPS64 based SoC processors. http://www.caviumnetworks.com/OCTEON-Plus_CN58XX.html.

CHANDRAKASAN, A., SHENG, S., AND BRODERSEN, R. 1992. Low-power CMOS digital design. IEEE J. Solid-State Circ. 27, 4, 473–484.

CHANG, F., FARKAS, K., AND RANGANATHAN, P. 2002. Energy-driven statistical profiling: Detecting software hotspots. In Proceedings of the Workshop on Power Aware Computing Systems.

CHEN, M. AND MISHRA, P. 2009. Efficient techniques for directed test generation using incremental satisfiability. In Proceedings of the 22nd International Conference on VLSI Design. IEEE Computer Society, Los Alamitos, CA, 65–70.

CHUNG, E., BENINI, L., AND MICHELI, G. 1999. Dynamic power management using adaptive learning tree. In Proceedings of the International Conference on Computer-Aided Design. 274–279.

CMS COLLABORATION. 2012. CMS data grid system overview and requirements. CMS note 037.

DAREMA, F. 2005. Grid computing and beyond: The context of dynamic data driven applications systems. Proc. IEEE 93, 3, 692–697.

DATAQUEST. 1992. http://data1.cde.ca.gov/dataquest/.

DEVADAS, V., LI, L., AND AYDIN, H. 2009. Competitive analysis of energy-constrained real-time scheduling. In Proceedings of the 21st Euromicro Conference on Real-Time Systems. 217–226.

ELNOZAHY, E., KISTLER, M., AND RAJAMONY, R. 2002. Energy-efficient server clusters. In Proceedings of the PACS Conference. 179–196.

FELTER, W., RAJAMANI, K., KELLER, T., AND RUSU, C. 2005. A performance-conserving approach for reducing peak power consumption in server systems. In Proceedings of the International Conference on Supercomputing. 293–302.

FENG, W. AND CAMERON, K. 2007. The Green500 list: Encouraging sustainable supercomputing. IEEE Comput. 40, 12, 50–55.

FLINN, J. AND SATYANARAYANAN, M. 2004. Managing battery lifetime with energy-aware adaptation. ACM Trans. Comput. Syst. 22, 2, 179.

FOSTER, I. AND KESSELMAN, C. 1997. Globus: A metacomputing infrastructure toolkit. Int. J. Supercomput. Appl. 11, 2, 115–128.

GANDHI, A., HARCHOL-BALTER, M., DAS, R., AND LEFURGY, C. 2009. Optimal power allocation in server farms. In Proceedings of the 11th International Joint Conference on Measurement and Modeling of Computer Systems. ACM, New York, 157–168.

GE, R., FENG, X., AND CAMERON, K. 2005. Performance-constrained distributed DVS scheduling for scientific applications on power-aware clusters. In Proceedings of the 17th IEEE/ACM High Performance Computing, Networking and Storage Conference. 11.

GE, R., FENG, X., SONG, S., CHANG, H., LI, D., AND CAMERON, K. 2010. PowerPack: Energy profiling and analysis of high-performance systems and applications. IEEE Trans. Parall. Distrib. Syst. 21.

GHASEMAZAR, M., PAKBAZNIA, E., AND PEDRAM, M. 2010. Minimizing energy consumption of a chip multiprocessor through simultaneous core consolidation and DVFS. In Proceedings of the IEEE International Symposium on Circuits and Systems. 49–52.

GHAZAALEH, N., MOSSE, D., CHILDERS, B., MELHEM, R., AND CRAVEN, M. 2003. Collaborative operating system and compiler power management for real-time applications. In Proceedings of the 9th IEEE Real-Time and Embedded Technology and Applications Symposium.


GNIADY, C., HU, Y., AND LU, Y. 2004. Program counter based techniques for dynamic power management. In Proceedings of the 10th International Symposium on High Performance Computer Architecture.

GONZALEZ, R. AND HOROWITZ, M. 1996. Energy dissipation in general-purpose microprocessors. IEEE J. Solid-State Circ. 31, 9, 1277–1284.

GREEN GRID. 2012. http://www.thegreengrid.org/home.

HOFFMANN, H., SIDIROGLOU, S., CARBIN, M., MISAILOVIC, S., AGARWAL, A., AND RINARD, M. 2011. Dynamic knobs for responsive power-aware computing. SIGPLAN Not. 46, 3.

HUANG, Z. AND MALIK, S. 2001. Managing dynamic reconfiguration overhead in systems-on-a-chip design using reconfigurable datapaths and optimized interconnection networks. In Proceedings of the Design, Automation and Test in Europe Conference and Exhibition. 735–740.

JEJURIKAR, R., PEREIRA, C., AND GUPTA, R. 2004. Leakage aware dynamic voltage scaling for real-time embedded systems. In Proceedings of the Design Automation Conference. 275–280.

JEJURIKAR, R. AND GUPTA, R. 2005a. Dynamic slack reclamation with procrastination scheduling in real-time embedded systems. In Proceedings of the Design Automation Conference. 111–116.

JEJURIKAR, R. AND GUPTA, R. 2005b. Energy aware non-preemptive scheduling for hard real-time systems. In Proceedings of the 17th Euromicro Conference on Real-Time Systems. 21–30.

JERGER, N., VANTREASE, D., AND LIPASTI, M. 2007. An evaluation of server consolidation workloads for multi-core designs. In Proceedings of the IEEE 10th International Symposium on Workload Characterization. 47–56.

KAMIL, S., SHALF, J., AND STROHMAIER, E. 2008. Power efficiency in high performance computing. In Proceedings of the IEEE International Symposium on Distributed Processing (IPDPS'08). 1–8.

KANG, J. AND RANKA, S. 2008a. DVS based energy minimization algorithm for parallel machines. In Proceedings of the IEEE International Symposium on Distributed Processing (IPDPS'08). 1–12.

KANG, J. AND RANKA, S. 2008b. Dynamic algorithms for energy minimization on parallel machines. In Proceedings of the 16th Euromicro Conference on Parallel, Distributed and Network-Based Processing (PDP'08). 399–406.

KHAN, S. U. AND AHMAD, I. 2006. Non-cooperative, semi-cooperative and cooperative games-based grid resource allocation. In Proceedings of the 20th International Parallel and Distributed Processing Symposium (IPDPS'06).

KHAN, S. U. AND AHMAD, I. 2007. A cooperative game theoretical replica placement technique. In Proceedings of the International Conference on Parallel and Distributed Systems. 1–8.

KHANNA, G., BEATY, K., KAR, G., AND KOCHUT, A. 2006. Application performance management in virtualized server environments. In Proceedings of the 10th IEEE/IFIP Network Operations and Management Symposium (NOMS'06). 373–381.

KIM, K. H., BUYYA, R., AND KIM, J. 2007. Power aware scheduling of bag-of-tasks applications with deadline constraints on DVS-enabled clusters. In Proceedings of the 7th IEEE International Symposium on Cluster Computing and the Grid (CCGrid'07). 541–548.

KREMER, U., HICKS, J., AND REHG, J. M. 2000. Compiler-directed remote task execution for power management. In Proceedings of the Workshop on Compilers and Operating Systems for Low Power.

KUSIC, D., KEPHART, J. O., HANSON, J. E., KANDASAMY, N., AND JIANG, G. 2009. Power and performance management of virtualized computing environments via lookahead control. Cluster Comput. 12, 1–15.

KWOK, Y. AND AHMAD, I. 1996. Dynamic critical-path scheduling: An effective technique for allocating task graphs to multiprocessors. IEEE Trans. Parall. Distrib. Syst. 7, 506–521.

LAMMIE, M., BRENNER, P., AND THAIN, D. 2009. Scheduling grid workloads on multicore clusters to minimize energy and maximize performance. In Proceedings of the 10th IEEE/ACM International Conference on Grid Computing. 145–152.

LEE, Y. C. AND ZOMAYA, A. Y. 2007. Practical scheduling of bag-of-tasks applications on grids with dynamic resilience. IEEE Trans. Comput. 56, 815–825.

LEE, Y. C. AND ZOMAYA, A. Y. 2009. Minimizing energy consumption for precedence-constrained applications using dynamic voltage scaling. In Proceedings of the 9th IEEE/ACM International Symposium on Cluster Computing and the Grid (CCGrid'09). 92–99.

LI, K. 2008. Performance analysis of power-aware task scheduling algorithms on multiprocessor computers with dynamic voltage and speed. IEEE Trans. Parall. Distrib. Syst. 19, 1484–1497.

LIANG, Y. AND AHMAD, I. 2006. Power and distortion optimization for ubiquitous video coding. In Proceedings of the International Conference on Image Processing (ICIP'06).

LIU, H., SHAO, Z., WANG, M., AND CHEN, P. 2008. Overhead-aware system-level joint energy and performance optimization for streaming applications on multiprocessor systems-on-chip. In Proceedings of the Euromicro Conference on Real-Time Systems (ECRTS'08). 92–101.


LOVEDAY, J. 2002. The Sloan Digital Sky Survey. Contemp. Phys. 43.

LU, Y., BENINI, L., AND DE MICHELI, G. 2000. Low-power task scheduling for multiple devices. In Proceedings of the 8th International Workshop on Hardware/Software Codesign. ACM, New York, 39–43.

LUO, J. AND JHA, N. K. 2000. Power-conscious joint scheduling of periodic task graphs and aperiodic tasks in distributed real-time embedded systems. In Proceedings of the IEEE/ACM International Conference on Computer-Aided Design. 357–364.

MALIK, A., MOYER, B., AND CERMAK, D. 2000. A low power unified cache architecture providing power and performance flexibility (poster session). In Proceedings of the International Symposium on Low Power Electronics and Design. ACM, New York, 241–243.

MICROSOFT. 2007a. Microsoft whitepaper, application power management best practices for Windows Vista. http://www.microsoft.com/whdc/system/pnppwr/powermgmt/PM_apps.mspx.

MICROSOFT. 2007b. Microsoft whitepaper, processor power management in Windows Vista and Windows Server 2008. http://www.microsoft.com/whdc/system/pnppwr/powermgmt/ProcPowerMgmt.mspx.

MISHRA, R., RASTOGI, N., ZHU, D., MOSSE, D., AND MELHEM, R. 2003. Energy aware scheduling for distributed real-time systems. In Proceedings of the International Parallel and Distributed Processing Symposium.

MOCHOCKI, B., HU, X. S., AND QUAN, G. 2007. Transition-overhead-aware voltage scheduling for fixed-priority real-time systems. ACM Trans. Des. Autom. Electron. Syst. 12. http://doi.acm.org/10.1145/1230800.1230803.

MONTET, C. AND SERRA, D. 2003. Game Theory and Economics. Palgrave Macmillan.

MPI-FORUM. 2008. MPI: A message-passing interface standard. http://www.mpi-forum.org/docs/mpi-1.3/mpi-report-1.3-2008-05-30.pdf.

NASAES. 2012. NASA earth science. http://science.nasa.gov/earth-science/.

NATHUJI, R., ISCI, C., GORBATOV, E., AND SCHWAN, K. 2008. Providing platform heterogeneity-awareness for data center power management. Cluster Comput. 11, 259–271.

NEWEGG. 2008. AMD Phenom 9850 specifications. http://www.newegg.com/Product/Product.aspx?Item=N82#16819103249.

OIKAWA, S. AND RAJKUMAR, R. 1999. Portable RK: A portable resource kernel for guaranteed and enforced timing behavior. In Proceedings of the 5th IEEE Real-Time Technology and Applications Symposium. 111–120.

ORNL. 2012. OLCF jaguar. http://www.olcf.ornl.gov/computing-resources/jaguar/.

PERING, T., BURD, T., AND BRODERSEN, R. 2000. Voltage scheduling in the lpARM microprocessor system. In Proceedings of the International Symposium on Low-Power Electronics and Design (ISLPED'00). 96–101.

PERING, T., AGARWAL, Y., GUPTA, R., AND WANT, R. 2006. CoolSpots: Reducing the power consumption of wireless mobile devices with multiple radio interfaces. In Proceedings of the 4th International Conference on Mobile Systems, Applications and Services. ACM, New York, 220–232.

PETRUCCI, V., LOQUES, O., AND MOSSE, D. 2010. Dynamic optimization of power and performance for virtualized server clusters. In Proceedings of the ACM Symposium on Applied Computing. ACM, New York, 263–264.

QI, X. AND ZHU, D. 2008. Power management for real-time embedded systems on block-partitioned multicore platforms. In Proceedings of the International Conference on Embedded Software and Systems (ICESS'08). 110–117.

RANVIJAY, Y. R. S. AND AGRAWAL, S. 2010. Efficient energy constrained scheduling approach for dynamic real-time system. In Proceedings of the 1st International Conference on Parallel Distributed and Grid Computing (PDGC'10). 284–289.

SCHMITZ, M. T. AND AL-HASHIMI, B. M. 2001. Considering power variations of DVS processing elements for energy minimisation in distributed systems. In Proceedings of the 14th International Symposium on System Synthesis. 250–255.

SELVAKUMAR, S. AND SIVARAMMURTHY, C. 1994. Scheduling precedence constrained task graphs with non-negligible intertask communication onto multiprocessors. IEEE Trans. Parall. Distrib. Syst. 5, 328–336.

SEO, E., JEONG, J., PARK, S., AND LEE, J. 2008. Energy efficient scheduling of real-time tasks on multicore processors. IEEE Trans. Parall. Distrib. Syst. 19, 1540–1552.

SHIN, Y. AND CHOI, K. 1999. Power conscious fixed priority scheduling for hard real-time systems. In Proceedings of the 36th Annual ACM/IEEE Design Automation Conference. ACM, New York, 134–139.

SRIKANTAIAH, S., KANSAL, A., AND ZHAO, F. 2008. Energy aware consolidation for cloud computing. In Proceedings of the Conference on Power Aware Computing and Systems. USENIX Association, 10.

STOUT, Q. F. 2006. Minimizing peak energy on mesh connected systems. In Proceedings of the 18th Annual ACM Symposium on Parallelism in Algorithms and Architectures. ACM, New York, 331.


SUBRATA, R., ZOMAYA, A. Y., AND LANDFELDT, B. 2008. A cooperative game framework for QoS guided job allocation schemes in grids. IEEE Trans. Comput. 57, 1413–1422.

SWAMINATHAN, V. AND CHAKRABARTY, K. 2005. Pruning-based energy-optimal deterministic I/O device scheduling for hard real-time systems. ACM Trans. Embed. Comput. Syst. 4, 141–167.

TOMIYAMA, H., ISHIHARA, T., INOUE, A., AND YASUURA, H. 1998. Instruction scheduling for power reduction in processor-based system design. In Proceedings of the Conference on Design, Automation and Test in Europe. IEEE Computer Society, 855–860.

UPTIME. 2012. Uptime institute. http://uptimeinstitute.org/.

USAMI, K. AND HOROWITZ, M. 1995. Clustered voltage scaling technique for low-power design. In Proceedings of the International Symposium on Low Power Design. ACM, New York, 3–8.

USEPA. 2007. U.S. environmental protection agency "report to congress on server and data center energy efficiency" public law 109-431, Energy Star Program.

VENKATACHALAM, V. AND FRANZ, M. 2005. Power reduction techniques for microprocessor systems. ACM Comput. Surv. 37, 195–237.

VERMA, A., AHUJA, P., AND NEOGI, A. 2008. Power-aware dynamic placement of HPC applications. In Proceedings of the 22nd Annual International Conference on Supercomputing. ACM, New York, 175–184.

WANG, Y., LIU, H., LIU, D., QIN, Z., SHAO, Z., AND SHA, E. H. 2011. Overhead-aware energy optimization for real-time streaming applications on multiprocessor system-on-chip. ACM Trans. Des. Autom. Electron. Syst. 16, 14:1–14:32.

YU, Y. AND PRASANNA, V. K. 2002. Power-aware resource allocation for independent tasks in heterogeneous real-time systems. In Proceedings of the 9th International Conference on Parallel and Distributed Systems. 341–348.

ZHANG, Y., HU, X., AND CHEN, D. Z. 2002. Task scheduling and voltage selection for energy minimization. In Proceedings of the 39th Design Automation Conference. 183–188.

ZHANG, C., VAHID, F., AND NAJJAR, W. 2005. A highly configurable cache for low energy embedded systems. ACM Trans. Embed. Comput. Syst. 4, 363–387.

ZHANG, S. AND CHATHA, K. S. 2007. Approximation algorithm for the temperature-aware scheduling problem. In Proceedings of the IEEE/ACM International Conference on Computer-Aided Design (ICCAD'07). 281–288.

ZHANG, S., CHATHA, K. S., AND KONJEVOD, G. 2007. Approximation algorithms for power minimization of earliest deadline first and rate monotonic schedules. In Proceedings of the ACM/IEEE International Symposium on Low Power Electronics and Design (ISLPED'07). 225–230.

ZHU, D., MELHEM, R., AND CHILDERS, B. R. 2003. Scheduling with dynamic voltage/speed adjustment using slack reclamation in multiprocessor real-time systems. IEEE Trans. Parall. Distrib. Syst. 14, 686–700.

ZHU, Y. AND MUELLER, F. 2004. Feedback EDF scheduling exploiting dynamic voltage scaling. In Proceedings of the 10th IEEE Symposium on Embedded Technology and Applications (RTAS'04). 84–93.

ZHUO, J. AND CHAKRABARTI, C. 2005. An efficient dynamic task scheduling algorithm for battery-powered DVS systems. In Proceedings of the Asia and South Pacific Design Automation Conference (ASP-DAC'05). 846–849.

Received July 2011; revised October 2011; accepted October 2011

ACM Journal on Emerging Technologies in Computing Systems, Vol. 8, No. 4, Article 32, Pub. date: October 2012.