
Policies for Dynamic Clock Scheduling

Dirk Grunwald, Philip Levis, Charles B. Morrey III, and Michael Neufeld

{grunwald,levis,cbmorrey,neufeldm}@cs.colorado.edu
Department of Computer Science

University of Colorado, Boulder, Colorado

Keith I. Farkas

[email protected]
Compaq Computer Corp.

Western Research Lab, Palo Alto, California

Abstract

Pocket computers are beginning to emerge that provide sufficient processing capability and memory capacity to run traditional desktop applications and operating systems on them. The increasing demand placed on these systems by software is competing against the continuing trend in the design of low-power microprocessors towards increasing the amount of computation per unit of energy. Consequently, in spite of advances in low-power circuit design, the microprocessor is likely to continue to account for a significant portion of the overall power consumption of pocket computers.

This paper investigates clock scaling algorithms on the Itsy, an experimental pocket computer that runs a complete, functional multitasking operating system (a version of Linux 2.0.30). We implemented a number of clock scaling algorithms that adjust the processor speed to reduce the power used by the processor. After testing these algorithms, we conclude that currently proposed algorithms consistently fail to achieve their goal of saving power while not causing user applications to change their interactive behavior.

1 Introduction

Dynamic clock frequency scaling and voltage scaling are two mechanisms that can reduce the power consumed by a computer. Both voltage scaling and frequency scaling are important; the power consumed by a component implemented in CMOS varies linearly with frequency and quadratically with voltage.

To evaluate the relative importance and the situations in which either is useful, it is necessary to consider energy, the integral of power over time. By reducing the frequency at which a component operates, a specific operation will consume less power but may take longer to complete. Although reducing the frequency alone will reduce the average power used by a processor over that period of time, it may not deliver a reduction in energy consumption overall, because the power savings are linearly dependent on the increased time. While greater energy reductions can be obtained with slower clocks and lower voltages, operations take longer; this exposes a fundamental tradeoff between energy and delay.

Many systems allow the processor clock to be varied. More recently, a number of processors also allow the processor voltage to be changed. For example, the StrongARM SA-2 processor, currently being designed by Intel, is estimated to dissipate 500 mW at 600 MHz, but only 40 mW when running at 150 MHz – a 12-fold power reduction for a 4-fold performance reduction [1]. Likewise, the Pentium-III processor with SpeedStep technology dissipates 9 W at 500 MHz but 22 W at 650 MHz [2]. AMD has added clock and voltage scaling to the AMD Mobile K6 Plus processor family, and Transmeta has also developed processors with voltage scaling. Because of this tradeoff in speed vs. power, the decision of when to change the frequency, or the voltage and frequency, of such processors must be made judiciously while taking into account application demand and quality of user experience.

We believe that the decision to change processor speed and voltage must be controlled by the operating system. The operating system or similar system software is the only entity with a global view of resource usage and demand. Although it is clear that the operating system should control the scheduling mechanism, it is not clear what inputs are necessary to formulate the scheduling policy. There are two possible sources of information for policies. The application can estimate activity, providing information to the operating system about computation rates or deadlines, or the operating system can attempt to infer some policy for the applications from their behavior. These can be used separately or in concert to control voltage and processor speed.

A number of studies have investigated policies to automatically infer computation demands and adjust the processor accordingly. We have implemented those previously described algorithms; this paper describes our experience.

In the next section, we present some background material. We discuss related work in Section 3. In Section 4 we describe the schedulers we examine, our workload and our measurement methodology. We then discuss our results in Section 5.

2 Background

To better understand the importance of voltage and clock scheduling, we begin by reviewing energy-consumption concepts, then present an overview of scheduling algorithms. Lastly, we give an overview of our test platform, the Itsy Pocket Computer.

2.1 Energy

The energy E, measured in Joules (J), consumed by a computer over T seconds is equal to the integral of the instantaneous power, measured in Watts (W). The instantaneous power consumed by components implemented in CMOS, such as microprocessors and DRAM, is proportional to V^2 * F, where V is the voltage supplying the component, and F is the frequency of the clock driving the component. Thus, the power consumed by a computer to, say, search an electronic phone book, may be reduced by reducing V, F, or both. However, for tasks that require a fixed amount of work, reducing the frequency may result in the system taking more time to complete the work. Thus, little or no energy will be saved (the idealized derivation below makes this explicit). There are techniques that can result in energy savings when the processor is idle, typically through clock gating, which avoids powering unused devices.
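The fixed-work argument can be made explicit with an idealized derivation, assuming an ideal CMOS component with no idle or leakage power:

    E = integral from 0 to T of P(t) dt,    P proportional to V^2 * F.

    For a task needing a fixed number of cycles C, the run time is T = C / F, so

    E proportional to (V^2 * F) * (C / F) = C * V^2.

Lowering F alone therefore leaves the ideal energy for a fixed task unchanged; only lowering V reduces it.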

In normal usage pocket computers run on batteries, which contain a limited supply of energy. However, as discussed in [3], in practice, the amount of energy a battery can deliver (i.e., its capacity) is reduced with increased power consumption. As an illustration of this effect, consider the Itsy pocket computer that was used in this study (described in Section 2.3). When the system is idle, the integrated power manager disables the processor core but the devices remain active. If the system clock is 206 MHz, a typical pair of alkaline batteries will power the system for about 2 hours; if the system clock is set to 59 MHz, those same batteries will last for about 18 hours. Although the battery lifetime increased by a factor of 9, the processor speed was only decreased by a factor of 3.5. The capacity of the battery can also be increased by interspersing periods of high power demand with much longer periods of low power demand, resulting in a "pulsed power" system [4]. The extent to which these two non-ideal properties can be exploited is highly dependent on the chemical properties and the construction of a battery as well as the conditions under which the battery is used. In general, the former effect (minimizing peak demand) is more important than the latter for the domain of pocket computers because pulsed power systems need a significant period of time to recharge the battery, and most computer applications place a more constant demand on the battery.

If a system allows the voltage to be reduced when the clock speed is reduced (i.e., it supports voltage scaling), it is better to reduce the clock speed to the minimum needed rather than running at peak speed and then being idle. For example, consider a computation that normally takes 600 million instructions to complete. That application would take one second on a StrongARM SA-2 at 600 MHz and would consume 500 mJoules. At 150 MHz, the application would take four seconds to complete, but would only consume 160 mJoules, roughly a three-fold savings assuming that an idle computer consumes no energy (the arithmetic is shown below). There is obviously a significant benefit to running slower when the application can tolerate additional delay. Pering [5] used the term voltage scheduling for scheduling policies that seek to adjust both clock speed and voltage. The goal of voltage scheduling is to reduce the clock speed such that all work on the processor can be completed "on time" and then reduce the voltage to the minimum needed to ensure stability at that frequency.
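Using the SA-2 figures quoted above (500 mW at 600 MHz, 40 mW at 150 MHz, and assuming one instruction per cycle), the arithmetic behind this example is:

    E_600 = 500 mW x 1 s = 500 mJ
    E_150 =  40 mW x 4 s = 160 mJ
    E_600 / E_150 = 500 / 160 is approximately 3.1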

2.2 Clock Scheduling Algorithms

In scheduling the voltage at which a system operates and the frequency at which it runs, a scheduler faces two tasks: to predict what the future system load will be (given past behavior) and to scale the voltage and clock frequency accordingly. These two tasks are referred to as prediction and speed-setting [6]. We consider one scheduler better than another if it meets the same deadlines (or has the same behavior) as another policy but reduces the clock speed for longer periods of time.

The schedulers we implemented are interval schedulers, so called because the prediction and scaling tasks are performed at fixed intervals as the system runs [7]. At each interval, the processor utilization for the interval is predicted, using the utilization of the processor over one or more preceding intervals. We consider two prediction algorithms originally proposed by Weiser et al. [7]: PAST and AVG_N. Under PAST, the current interval is predicted to be as busy as the immediately preceding interval, while under AVG_N, an exponential moving average with decay N of the previous intervals is used. That is, at each interval, we compute a "weighted utilization" at time t, W_t, as a function of the utilization of the previous interval, U_{t-1}, and the previous weighted utilization, W_{t-1}. The AVG_N policy sets

    W_t = (N * W_{t-1} + U_{t-1}) / (N + 1).

The PAST policy is simply the AVG_0 policy, and assumes the current interval will have the same resource demands as the previous interval.
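A minimal C sketch of this update, assuming a fixed-point utilization representation (10000 = 100% busy) and illustrative names rather than the actual kernel code:

    /* AVG_N weighted-utilization update: W_t = (N*W_{t-1} + U_{t-1}) / (N+1).
     * Utilization is fixed point: 10000 == 100% busy (an assumption of this
     * sketch). N == 0 degenerates to the PAST policy. */
    struct avg_state {
        unsigned int n;        /* the N in AVG_N */
        unsigned int weighted; /* W_t, the smoothed utilization */
    };

    static unsigned int avg_update(struct avg_state *s, unsigned int util)
    {
        s->weighted = (s->n * s->weighted + util) / (s->n + 1);
        return s->weighted;
    }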

The decision of whether to scale the clock and/or voltage is determined by a pair of boundary values used to provide hysteresis to the scheduling policy. If the utilization drops below the lower value, the clock is scaled down; similarly, if the utilization rises above the higher value, the clock is scaled up. Pering et al. [8] set these values at 50% and 70%. We used those values as a starting point but, as we discuss in Section 5.3, we found that the specific values are very sensitive to application behavior.

Deciding how much to scale the processor clock is separate from the decision of when to scale the clock up (or down). The SA-1100 processor used in the Itsy supports 11 different clock rates or "clock steps". Thus, our algorithms must select one of the discrete clock steps. We use three algorithms for scaling: one, double, and peg. The one policy increments (or decrements) the clock value by one step. The peg policy sets the clock to the highest (or lowest) value. The double policy tries to double (or halve) the clock step. Since the lowest clock step on the Itsy is zero, we increment the clock index value before doubling it. Separate policies may be used for scaling upwards and downwards.
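A sketch of how the hysteresis bounds and the three speed-setting policies fit together is shown below. The clock steps are those listed later in Table 3; the rounding used for the double policy is one plausible reading of the description above, and the names are illustrative:

    /* Clock steps (kHz); index 0 is the lowest step. */
    static const unsigned int step_khz[] = {
        59000, 73700, 88500, 103200, 118000, 132700,
        147500, 162200, 176900, 191700, 206400
    };
    #define NUM_STEPS (sizeof(step_khz) / sizeof(step_khz[0]))

    enum scale_policy { SCALE_ONE, SCALE_DOUBLE, SCALE_PEG };

    /* Move the clock-step index according to the chosen policy. */
    static int scale_index(int idx, int up, enum scale_policy p)
    {
        switch (p) {
        case SCALE_ONE:                    /* one step at a time */
            idx += up ? 1 : -1;
            break;
        case SCALE_DOUBLE:                 /* double or halve the index;   */
            idx = up ? (idx + 1) * 2       /* "+1" avoids sticking at zero */
                     : idx / 2;
            break;
        case SCALE_PEG:                    /* jump to an extreme */
            idx = up ? (int)NUM_STEPS - 1 : 0;
            break;
        }
        if (idx < 0) idx = 0;
        if (idx > (int)NUM_STEPS - 1) idx = (int)NUM_STEPS - 1;
        return idx;
    }

    /* Hysteresis: scale up above 'hi', down below 'lo' (e.g. 7000/5000). */
    static int speed_set(int idx, unsigned int weighted,
                         unsigned int lo, unsigned int hi,
                         enum scale_policy up_p, enum scale_policy down_p)
    {
        if (weighted > hi)
            return scale_index(idx, 1, up_p);
        if (weighted < lo)
            return scale_index(idx, 0, down_p);
        return idx;
    }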

Figure 1: Equipment setups used to measure power.

2.3 The Itsy Pocket Computer

The Itsy Pocket Computer is a flexible research platform, developed to enable hardware and software research in pocket computing. It is a small, low-power, high-performance handheld device with a highly flexible interface, designed to encourage the development of innovative research projects, such as novel user interfaces, new applications, power management techniques, and hardware extensions. There are several versions of the basic Itsy design, with varying amounts of RAM, flash memory and I/O devices. We used several units for this study that were modified by Compaq Computer Corporation's Western Research Lab to include instrumentation leads for power measurement. Figure 1 shows the units along with the measurement equipment we used. We investigate the energy and power consumption of the Itsy Pocket Computer when it is run at clock speeds between 59 MHz and 206 MHz, and when its StrongARM SA-1100 [9, 10] processor is powered at two different voltage levels.

All versions of the Itsy are based on the low-power StrongARM SA-1100 microprocessor. All versions have a small, high-resolution display, which offers 320 x 200 pixels on a 0.18 mm pixel pitch, and 15 levels of greyscale. All versions also include a touchscreen, a microphone, a speaker, and serial and IrDA communication ports. The Itsy architecture can support up to 128 Mbytes of both DRAM and flash memory. The flash memory provides persistent storage for the operating system, the root file system, and other file systems and data. Finally, the Itsy also provides a "daughter card" interface that allows the base hardware to be easily extended. The Itsy uses two voltage supplies powered by the same power source. The processor core is driven by a 1.5 V supply while the peripherals are driven by a 3.3 V supply.

Figure 2: Itsy System Architecture (block diagram showing the StrongARM SA-1100 processor, DRAM and flash memory, LCD and touch screen, analog interface with A/D, power supply, daughter-card interface, and connectivity through the RS-232 UART, IrDA, USB, SDLC, SSP, PCMCIA, and GPIO pins).

Both power supplies are driven by a single 3.1 V supply connected to the electrical mains.

The Itsy version 1.5 units used as the basis for this work have 64 Mbytes of DRAM and 32 Mbytes of flash memory. These units were modified to allow us to run the StrongARM SA-1100 at either 1.5 V or 1.23 V. Although 1.23 V is below the manufacturer's specification, it can be safely used at moderate clock speeds, and our measurements indicate the voltage reduction yields about a 15% reduction in the power consumed by the processor; the percentage of power reduction for the system may be less than this (depending on workload) because voltage scaling only reduces the power used by the processor. The Itsy can be powered either by an external supply or by two size AAA batteries. Figure 2 shows a schematic of the Itsy architecture.

The system software of the Itsy includes a monitor and a port of version 2.0.30 of the Linux operating system. The Linux system was configured to provide support for networking, file systems and multi-user management. Applications can be developed using a number of programming environments, including C, X-Windows, SmallTalk and Java. Applications can also take advantage of available speech synthesis and speech recognition libraries.

3 Related Work

We believe our evaluation of dynamic speed and voltage setting algorithms to be the first such empirical evaluation – to our knowledge, all previous work from different groups has relied on simulators [7, 6, 5, 11, 12]; none modeled a complete pocket computer or the workload likely to be run on it.

Weiser et al. [7] proposed three algorithms, OPT, FUTURE, and PAST, and evaluated them using traces gathered from UNIX-based workstations running engineering applications. These algorithms use an interval-based approach that determines the clock frequency for each interval. Of the algorithms they propose, only PAST is feasible because it does not make decisions using future information that would not be available to an actual implementation. Even so, the actual version of PAST proposed by Weiser et al. is not implementable because it requires that the scheduler know the amount of work that had to be performed in the preceding intervals. This information was used by the scheduler to choose a clock speed that allows this delayed work to be completed in the next interval, if possible. For example, suppose post-processing of a trace revealed that the processor was busy 80% of the cycles while running at full speed. If, during re-play of the trace, the scheduler opted to run the processor at 50% speed for the interval, then 30% of the work could not be completed in that interval. Consequently, in the next interval, the scheduler would adjust the speed in an effort to at least complete the 30% "unfinished" work. Without additional information from the application, the scheduler can simply observe that the application executed until the end of the scheduling quantum, and does not know the amount of "unfinished" computing left. Because most pocket computer applications do not provide a means for the processor to know how much work should be done in a given interval, the PAST algorithm is not tractable for such systems.

The early work of Weiser et al. has been extended by several groups, including [6, 12]. Both of these groups employed the same assumptions and the same traces used by Weiser. Govil et al. [6] considered a large number of algorithms, while Martin [12] revised Weiser's PAST algorithm to account for the non-ideal properties of batteries and the non-linear relationship between system power and clock frequency. Martin argues that the lower bound on clock frequency should be chosen such that the number of computations per battery lifetime is maximized. While Martin correctly assumed a non-zero energy cost for idling the processor and changing clock speed, neither Govil nor Weiser did.

Both our work and that of Pering et al. [5, 11] address some of the limitations of the earlier work noted above. In particular, we both evaluate implementable algorithms using workloads that are representative of those that might be run on pocket computers. We assess the success of our algorithms under the assumption that our applications have inelastic performance constraints and that the user should see no visible changes induced by the scheduling algorithms. By comparison, Pering et al. assume that frames of an MPEG video, for instance, can be dropped, and present results that combine energy savings and frame rates. Our goal was to understand the performance of the different scheduling algorithms without introducing the complexity of comparing multi-dimensional performance metrics such as the percentage of dropped frames vs. power savings.

Pering et al. use intervals of 10-50 ms for their scheduling calculations. In comparison to the earlier approaches presented in [7, 6, 12], in which work was considered overdue if it was not completed within an interval, both Pering et al. and our study consider an event to have occurred on time if delaying its completion did not adversely affect the user. However, a number of important differences exist between our work and that of Pering et al. First, Pering et al. model only the power consumed by the microprocessor and the memory, thus ignoring other system components whose power is not reduced by changes in clock frequency. Second, by virtue of our work using an actual implementation, we are able to evaluate longer running and more complex applications (e.g., Java). By virtue of their size, our applications exhibit more significant memory behavior, and thus expose the non-linear relationship between power and clock speed noted by Martin. Lastly, by using an actual system, our scheduling implementations were exposed to periodic behaviors that are not captured by traces; for example, the Java implementation uses a 30 ms polling loop to check for I/O events. This periodic polling adds additional variation to the clock setting algorithms, inducing the sort of instability we will explain in Section 5.3.

4 Methodology

Before describing the implementation of the clock and voltage scheduling algorithms we used, it is important to understand how we did our measurements. Section 4.1 describes how we measure power and energy. We then describe the implementation of the schedulers and the workloads we used to assess their performance.

4.1 Measuring Power and Total Energy

To measure the instantaneous power consumed by the Itsy, we use a data acquisition (DAQ) system to record the current drawn by the Itsy as it is connected to an external voltage supply, and the voltage provided by this supply. Figure 1 presents a picture of our setup along with the wires connected to the Itsy to facilitate measuring the supply current¹ and voltage. We configured the DAQ system to read the voltage 5000 times per second, and convert these readings to 16-bit values. These values were then forwarded to a host computer, which stored them for subsequent analysis. From these measurements, we can compute a time profile of the power used by an application as it runs on the Itsy.

To determine the relevant part of the power-usage profile of a workload, we measure the time required to execute the workload and then select the relevant set of measurements from the data collected by the DAQ system. For each benchmark, we used the gettimeofday system call to time its execution; this interface uses the 3.6 MHz clock available on the processor to provide accurate timing information. To synchronize the collection of the voltages with the start of execution of a workload, as the workload begins executing, we toggle one of the SA-1100's general-purpose input-output (GPIO) pins. This pin is connected to the external trigger of the DAQ system; toggling the GPIO causes the DAQ system to begin recording measurements. As our measurement technique is very similar to that which we used in [13], we refer the reader to this reference for a more in-depth description.

Once the relevant part of the profile has been determined, we use it to calculate the average power and the total energy consumed by the Itsy during the corresponding time interval. To compute the energy, we make the assumption that the power measured at time t represents the average power of the Itsy for the interval t to t + 0.0002 seconds, where 0.0002 seconds is the time between successive power measurements. Thus, the energy E is equal to

    E = sum over i = 1..n of p_i(t) * 0.0002,

where p_1(t), ..., p_n(t) are the n power readings of interest.

In making our power measurements, we used an approach similar to the one used in [13] to reduce a number of sources of possible measurement error.

¹The supply current was measured by measuring the voltage drop across a high-precision, small-valued resistor of known resistance (0.02 Ω). The current was then calculated by dividing the voltage by the resistance.


We measured multiple runs of each workload; in general, we found the 95% confidence interval of the energy to be less than 0.7% of the mean energy. This implies that the runs were very repeatable, despite the possible variation that would arise from interactions between application threads, other processes and system daemons.

4.2 Workload

We used a varied workload to assess the performance of the different clock scaling algorithms. Since it is not clear what applications will be common on pocket computers, we used some obvious applications (web browsing, text reading) and other less obvious applications (chess, MPEG video and audio). The applications ran either directly on top of the Linux operating system or within a Java virtual machine [14]. To capture repeatable behavior for the interactive applications, we used a tracing mechanism that recorded timestamped input events and then allowed us to replay those events with millisecond accuracy. We did not trace the MPEG playback because there is no user interaction, and we found little inter-run variance. We used the following applications:

MPEG: We played a 320x200 color MPEG-1 video and audio clip at 15 frames a second. The MPEG video was rendered as a greyscale image on the Itsy. Audio was rendered by sending the audio stream as a WAV file to an audio player, which ran as a separate process forked from the video player. There is no explicit synchronization between the audio and video sequences, but both are sequenced to remain synchronized at 15 frames/second. The clip is 14 seconds long and was played in a loop to provide 60 seconds of playback.

Web: We used a Javabean version of the IceWeb browser to view content stored on the Itsy. We selected a file containing a stored article from www.news.com concerning the Itsy. We scrolled down the page, reading the full article. We then went back to the root menu and opened a file containing an HTML version of WRL technical report TN-56, which has many tables describing characteristics of power usage in Itsy components. The overall trace was 190 seconds of activity.

Chess: We used a Java interface to version 16.10 of the Crafty chess playing program. Crafty was run as a separate process. Crafty uses a play book for opening moves, plays for specific periods of time in later stages of the game, and plays the best move available when time expires. The 218 second trace includes a complete game of Crafty playing against a novice player (who lost, badly).

TalkingEditor: We used a version of the "mpedit" Java text editor that had been modified to read text files aloud using the DECtalk speech synthesis system (which is run in a separate process). The input trace records the user selecting a file to be opened using the file dialogue (i.e., moving to the directory of the short text file and selecting the file), then having it spoken aloud, and finally opening and having another text file read aloud. The trace took 70 seconds.

The Kaffe Java system [14] uses a JIT, makes extensive use of dynamic shared libraries and supports a threading model using setjmp/longjmp. The graphics library used by Java is a modified version of the publicly available GRX graphics library and uses a polling I/O model to check for new input every 30 milliseconds. The MPEG player renders directly to the display.

4.3 Implementing the Scheduling Algorithms

We made two modifications to the Linux kernel to support our clock scheduling algorithms and data recording. The first modification provides a log of the process scheduler activity. This component is implemented as a kernel module with small code modifications to the scheduler that allow the logging to be turned on and off. For each scheduling decision, we record the process identifier of the process being scheduled, the time at which it was scheduled (with microsecond resolution) and the current clock rate.

We also implemented an extensible clock scaling policy module as a kernel module. We modified the clock interrupt handler to call the clock scheduling mechanism if it has been installed, and the Linux scheduler to keep track of CPU utilization. In Linux, the idle process always uses the zero process identifier. The idle process enters a low-power "nap" mode that stalls the processor pipeline until the next scheduling interval. If the previous process was not the idle process, the kernel adds the execution time to a running total. On every clock interrupt, this total is examined by the clock scaling module and then cleared. The CPU utilization can be calculated by comparing the time spent non-idle to the length of a quantum. Our time quantum was set to 10 msec, the default scheduling period in Linux; Pering et al. [5, 11] used similar values for their calculations.
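A minimal sketch of this bookkeeping, with illustrative names rather than the actual kernel patch:

    /* Per-quantum CPU utilization bookkeeping (sketch). PID 0 is the
     * Linux idle process; everything else counts as busy time. */
    #define QUANTUM_US 10000                 /* 10 ms scheduling quantum */

    static unsigned long busy_us;            /* non-idle time this quantum */

    /* Called from the scheduler when a task is switched out. */
    static void account_slice(int pid, unsigned long ran_us)
    {
        if (pid != 0)                        /* skip the idle process */
            busy_us += ran_us;
    }

    /* Called by the clock-scaling module once per quantum; returns the
     * utilization in fixed point (10000 == 100%) and clears the total. */
    static unsigned int quantum_utilization(void)
    {
        unsigned long u = (busy_us * 10000UL) / QUANTUM_US;
        busy_us = 0;
        return u > 10000 ? 10000 : (unsigned int)u;
    }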

Normally, a process can run for several quanta before the scheduler is called. The executing process is interrupted by the 100 Hz system clock; the O/S decrements and examines a counter in the process control block at each interrupt, and when that counter reaches zero, the scheduler is called. We set the counter to one each time we schedule a process, forcing the scheduler to be called every 10 ms. While this modification adds overhead to the execution of an application, it allows us to control the clock scaling more rapidly. We measured the execution overhead and found it to be very small (about 6 microseconds for each 10 ms interval, or 0.06%).

5 Results

The purpose of our study is to determine if the heuristics developed in prior studies can be practically applied to actual pocket computers. We examined a number of policies, most of which are variants of the AVG_N policy. As described in Section 4.3, we used three different speed-setting policies. Our intent was to focus on systems that could be implemented in an actual O/S and that did not require modifications to the applications (such as requiring information about deadlines or schedules). We assumed that our workloads had inelastic constraints; in other words, we assumed the applications had no way to accommodate "missed deadlines".

We split the discussion of our results into three parts. The first section describes aspects of the applications and how they differ from those used in prior work, and the second section discusses the performance of the different clock scheduling algorithms. Finally, we examine the benefit of the limited voltage scaling available on the Itsy and summarize the results.

5.1 Application Characteristics

Figure 3 presents plots of the processor utilization over time for each of the benchmark applications. This information was gathered using the on-line process logging facility that we added to the kernel. Due to kernel memory limitations, we could only capture a subset of the process behavior. Each application was able to run at 132 MHz and still meet any user interaction constraints (i.e., the application did not appear to behave any differently).

Figure 3: Utilization using 10 ms Moving Average for 30 to 40 Second Intervals Using the 206 MHz Frequency Setting. (a) MPEG Program; (b) Web Program; (c) Chess Program; (d) TalkingEditor Program.

The utilization is computed for each 10 ms scheduling quantum. We used the same 10 ms interval for logging that is used for scheduling within Linux. Since most processes compute for several quanta before yielding, the system is usually either completely idle or completely busy during a given quantum. Some processes execute for only a short time and then yield the processor prior to the end of their scheduling quantum; for example, the Java implementation we used has a 30 ms I/O polling loop – thus, when the Java system is "idle," there is a constant polling action every 30 ms that takes about a millisecond to complete.

The behavior of the applications is difficult to predict, even for applications that should have very predictable behavior, and each application appears to run at a different time-scale. The MPEG application renders at 15 frames/sec; there are 450 frames in the 30 second interval shown in Figure 3. Each frame is rendered in 67 ms, or just under 7 scheduling quanta. Any scheduling mechanism attempting to use information from a single frame (as opposed to a single quantum) would need to examine at least 7 quanta. Other applications have much coarser behavior. For example, the TalkingEditor application consumes varying amounts of CPU time until the text is loaded for speech synthesis. The bursty behavior prior to the speech synthesis results from dragging images, JIT'ing applications and opening files. Following this are long bursts of computation as the text is actually synthesized and sent to the OSS-compatible sound driver. Finally, more cycles are taken by the sound driver. Thus, this application is bursty at a higher level.

For most applications, patterns in the utilization are easier to see if the utilization is plotted using a 100 ms moving average, as shown in Figure 4. The MPEG application, in Figure 4(a), is still very sporadic because of inter-frame variation; for MPEG, there is even significant variance in CPU utilization (60-80%) when considering a 1 second moving average (not shown). The Chess and TalkingEditor applications show patterns influenced by user interaction. It is clear from Figure 4(c) that utilization is low when the user is thinking or making a move and that utilization reaches 100% when Crafty is planning moves. Likewise, Figure 4(d) shows the aforementioned pattern of synthesis and sound rendering.

Figure 4: Utilization using 100 ms Moving Average for 30 to 40 Second Intervals Using the 206 MHz Frequency Setting. (a) MPEG Application; (b) Web Application; (c) Chess Application; (d) TalkingEditor Application.


5.2 Clock Scheduling Comparison

The goal of a clock scheduling algorithm is to try to predict or recognize a CPU usage pattern and then set the CPU clock speed sufficiently high to meet the (predicted) needs of that application. Although patterns in the utilization are more evident when using a 100 ms sliding average for utilization, we found that averaging over such a long period of time caused us to miss our "deadline". In other words, the MPEG audio and video became unsynchronized, and some other applications, such as the speech synthesis engine, had noticeable delays. This occurs because it takes longer for the system to realize it is becoming busy.

This delay is the reason that the studies of Govil et al. [6] and Weiser [7] argued that clock adjustment should examine a 10-50 ms interval when predicting future speed settings. However, as Figure 3 shows, it is difficult to find any discernible pattern at the smaller time-scales. Like Govil et al., we also allowed speed setting to occur at any interval; Weiser et al. did not model having the scheduler interrupted while an application was running, but rather deferred clock speed changes to occur only when a process yielded or began executing in a quantum.

There are a number of possible speed-setting heuristics we could examine; since we were focusing on implementable policies, we primarily used the policies explored by Pering et al. [5]. We also explored other alternatives. One simple policy would determine the number of "busy" instructions during the previous N 10 ms scheduling quanta and predict that activity in the next quantum would have the same percentage of busy cycles. The clock speed would then be set to ensure enough busy cycles.

This policy sounds simple, but it results in exceptionally poor responsiveness, as illustrated in Figure 5. Figure 5(a) shows the speed changes that would occur when the application is moving from a period of high CPU utilization to one of low utilization; the speed changes to 59 MHz relatively quickly because we are adding in a large number of idle cycles each quantum. By comparison, when the application moves from an idle period to a fully utilized period, the simple speed-setting policy makes very slow changes to the estimated utilization and thus the processor speed increases very slowly (a small simulation of this case follows Figure 5). This occurs because the total number of non-idle instructions across the four scheduling intervals grows very slowly.

(a) Going to Idle (time advances downward):
    206/1  206/1  206/1   206/1      Avg = 206,   Speed = 206
    206/1  206/1  206/1   206/0      Avg = 154.5, Speed = 162.5
    206/1  206/1  206/0   162.5/0    Avg = 103,   Speed = 103.2
    206/1  206/0  162.5/0 103.2/0    Avg = 51,    Speed = 59

(b) Speeding Up (time advances downward):
    59/0   59/0   59/0    59/0       Avg = 0,     Speed = 59
    59/0   59/0   59/0    59/1       Avg = 14.75, Speed = 59
    59/0   59/0   59/1    59/1       Avg = 29.5,  Speed = 59
    59/0   59/1   59/1    59/1       Avg = 44.25, Speed = 59

Figure 5: Simple averaging behavior results in poor policies. Each box represents a single scheduling interval, and the scheduling policy averages the number of non-idle instructions over the four scheduling quanta to select the minimum processor speed. To simplify the example, we assume each interval is either fully utilized or idle. The notation "206/0" means the CPU is set to 206 MHz and the quantum is idle, while "206/1" means the CPU is fully utilized.
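The "speeding up" half of the example can be reproduced with a few lines of C. This sketch assumes the demand estimate is the busy megahertz averaged over the last four quanta and that the next speed is the smallest clock step (from Table 3) covering that average:

    /* Simulate the "speeding up" case of Figure 5. */
    #include <stdio.h>

    static const double step_mhz[] = { 59.0, 73.7, 88.5, 103.2, 118.0, 132.7,
                                       147.5, 162.2, 176.9, 191.7, 206.4 };
    #define NSTEPS (sizeof(step_mhz) / sizeof(step_mhz[0]))

    static double smallest_step_at_least(double mhz)
    {
        for (unsigned int i = 0; i < NSTEPS; i++)
            if (step_mhz[i] >= mhz)
                return step_mhz[i];
        return step_mhz[NSTEPS - 1];
    }

    int main(void)
    {
        double busy[4] = { 0.0, 0.0, 0.0, 0.0 };   /* busy MHz per quantum */
        double speed = 59.0;                       /* start fully idle */

        for (int q = 1; q <= 4; q++) {
            /* Each new, fully utilized quantum contributes only 'speed'
             * busy MHz, so the four-quantum average rises very slowly.  */
            busy[0] = busy[1]; busy[1] = busy[2]; busy[2] = busy[3];
            busy[3] = speed;
            double avg = (busy[0] + busy[1] + busy[2] + busy[3]) / 4.0;
            speed = smallest_step_at_least(avg);
            printf("quantum %d: avg %.2f MHz -> speed %.1f MHz\n", q, avg, speed);
        }
        return 0;
    }

With this literal reading, the average can never exceed the speed of the saturated quanta, so the clock stays at 59 MHz; the broader point is the same as in the figure: a saturated processor gives the policy no signal about how much faster it should run.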

5.3 The AVG_N Scheduler

We had initially thought that a policy targeting the necessary number of non-idle cycles would result in good behavior, but the previous example highlights why we use the speed-setting policies described in Section 4.3. We used the same AVG_N scheduler proposed by Govil [6] and examined by Pering et al. [5]; Pering's later paper [11] did not examine scheduler heuristics and only used real-time scheduling with application-specified scheduling goals.

Our findings indicate that the AVG_N algorithm cannot settle on the clock speed that maximizes CPU utilization. Although a given set of parameters can result in optimal performance for a single application, these tuned parameters will probably not work for other applications, or even the same application with different input. The variance inherent in many deadline-based applications prevents an accurate assessment of the computational needs of an application. The AVG_N policy can be easily designed to ensure that very few deadlines will be missed, but this results in minimal energy savings. We use an MPEG player as a running example in this section, as it best exemplifies behavior that illustrates the multitude of problems in past-based interval algorithms. Our intuition is that if there is a single application that illustrates simple, easy-to-predict behavior, it should be MPEG. Our measurements showed that the MPEG application can run at 132 MHz without dropping frames and still maintain synchronization between the audio and video. An ideal clock scheduling policy would therefore target a speed of 132 MHz.

However, without information from the user-level application, a kernel cannot accurately determine what deadlines an application operates under. First, an application may have different deadline requirements depending on its input; for example, an MPEG player displaying a movie at 30 fps has a shorter deadline than one running at 15 fps. Although the deadlines for an application with a given input may be regular, the computation required in each deadline interval can vary widely. Again, MPEG players demonstrate this behavior; I-frames (key or reference frames) require much more computation than P-frames (predicted frames), and do not necessarily occur at predictable intervals.

One method of dealing with this variance is to look at lengthy intervals, which will, by averaging, reduce the variance of the computational observations. Our utilization plots showed that significant variance is exhibited even when using 100 ms intervals. In addition to interval length, the number of intervals over which we average (N) in the AVG_N policy can also be manipulated. We conducted a comprehensive study and varied the value of N from 0 (the PAST policy) to 10 with each combination of the speed-setting policies (i.e., using "peg" to set the CPU speed to the highest point, or "one" to increment or decrement the speed).

Our conclusion from the results with our benchmarks is that the weighted average has undesirable behavior. The number of intervals not only represents the length of the history to be considered; it also represents the lag before the system responds, much like the simple averaging example described above. Unlike that simple policy, once AVG_N starts responding, it will do so quickly. For example, consider a system using an AVG_9 mechanism with an upper boundary of 70% utilization and "one" as the algorithm used to increment or decrement the clock speed. Starting from an idle state, the clock will not scale to 206 MHz for 120 ms (12 quanta). Once it scales up, the system will continue to do so (as the average utilization will remain above 70%) unless the next quantum is partially idle. This occurs because the previous history is still considered with equal weight even when the system is running at a new clock value.

The boundary conditions used by Pering in [5] result in a system that scales down more rapidly than it scales up. Table 1 illustrates how this occurs. If the weighted average is 70%, a fully active quantum will only increase the average to 73%, while a fully idle quantum will reduce it to 63%; thus, there is a tendency to reduce the processor speed.

    Time (ms)   Idle/Active   AVG_9   Notes
       10       Active         1000
       20       Active         1900
       30       Active         2710
       40       Active         3439
       50       Active         4095
       60       Active         4685
       70       Active         5217
       80       Active         5695
       90       Active         6125
      100       Active         6513
      110       Active         6861
      120       Active         7175   Scale up
      130       Active         7458   Scale up
      140       Active         7712   Scale up
      150       Active         7941   Scale up
      160       Idle           7146   Scale up
      170       Idle           6432
      180       Idle           5789
      190       Idle           5210
      200       Idle           4689   Scale down

Table 1: Scheduling Actions for the AVG_9 Policy (weighted utilization scaled so that 10000 corresponds to 100%).
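The trajectory in Table 1 follows directly from the AVG_9 update; a small C sketch (ours, using floating point, the same 10000 = 100% scaling, and the 70%/50% hysteresis bounds) reproduces essentially the same values and scale-up/scale-down points, up to rounding:

    /* Reproduce Table 1: AVG_9 smoothing of a workload that is fully
     * active for 15 quanta and idle afterwards. */
    #include <stdio.h>

    int main(void)
    {
        double w = 0.0;                              /* weighted utilization */

        for (int t = 1; t <= 20; t++) {
            double u = (t <= 15) ? 10000.0 : 0.0;    /* active, then idle */
            w = (9.0 * w + u) / 10.0;                /* AVG_9 update */
            printf("%3d ms  %-6s  %5.0f  %s\n", t * 10,
                   u > 0 ? "Active" : "Idle", w,
                   w > 7000 ? "scale up" : (w < 5000 ? "scale down" : ""));
        }
        return 0;
    }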

The job of the scheduler is made even more difficult by applications that attempt to make their own scheduling decisions. For example, the default MPEG player in the Itsy software distribution uses a heuristic to decide whether it should sleep before computing the next frame. If the rendering of a frame completes and the time until that frame is needed is less than 12 ms, the player enters a spin loop; if it is greater than 12 ms, the player relinquishes the processor by sleeping. Therefore, if the player is well ahead of schedule, it will show significant idle times; once the clock is scaled close to the optimal value to complete the necessary work, the work seemingly increases. The kernel has no method of determining that this is wasteful work.

Furthermore, there is some mathematical justification for our assertion that AVG_N fundamentally exhibits undesirable behavior and will not stabilize on an optimal clock speed, even for simple and predictable workloads. Our analysis only examines the "smoothing" portion of AVG_N, not the clock setting policy. Nevertheless, it works well enough to highlight the instability issues with AVG_N by showing that, even if the system is started out at the ideal clock speed, AVG_N smoothing will still result in undesirable oscillation.


Figure 6: Fourier Transform of a Decaying Exponential.

A processor workload over time may be treated as a mathematical function, taking on a value of 1 when the processor is busy, and 0 when idling. Borrowing techniques from signal processing allows us to characterize the effect of AVG_N on workloads in general as well as on specific instances. AVG_N filters its input using a decaying exponential weighting function. For our implementation, we used a recursive definition in terms of both the previous actual (U_{t-1}) and weighted (W_{t-1}) utilizations:

    W_t = (N * W_{t-1} + U_{t-1}) / (N + 1).

For the analysis, however, it is useful to transform this into a less computationally practical representation, purely in terms of earlier unweighted utilizations. By recursively expanding the W_{t-1} term and performing a bit of algebra, this representation emerges:

    W_t = (1 / (N + 1)) * Σ_{k=0}^{t-1} (N / (N + 1))^{(t-1)-k} * U_k.

This equation explicitly shows the dependency of each W_t on all previous U_t, and makes it more evident that the weighted output may also be expressed as the result of discretely convolving a decaying exponential function with the raw input. This allows us to examine specific types of workloads by artificially generating a representative workload and then numerically convolving the weighting function with it. We can also get a qualitative feel for the general effects AVG_N has by moving to continuous space and looking at the Fourier transform of a decaying exponential, since convolving two functions in the time domain is equivalent to multiplying their corresponding Fourier transforms.

Let us begin by examining the Fourier transform of a decaying exponential: x(t) = e^{-λt} u(t), where u(t) is the unit step function, 0 for all t < 0 and 1 for t ≥ 0. This captures the general shape of the AVG_N weighting function, shown in Figure 6. Its Fourier transform is X(ω) = 1 / (iω + λ). The transform attenuates, but does not eliminate, higher frequency elements. If the input signal oscillates, the output will oscillate as well. As λ gets smaller, the higher frequencies are attenuated to a greater degree, but this corresponds to picking a larger value for N in AVG_N and comes at the expense of greater lag in response to changing processor load.

Figure 7: Result of AVG_3 Filtering on the Processor Utilization for a Periodic Workload Over Time.

For a specific workload example, we'll use a simple repeating rectangular wave, busy for 9 cycles and then idle for 1 cycle. This is an idealized version of our MPEG player running roughly at an optimal speed, i.e., just idle enough to indicate that the system isn't saturated. Ideally, a policy should be stable when it has the system running at an optimal speed. This implies that the weighted utilization should remain in a range that would prevent the processor speed from changing. However, as was foreshadowed by our initial qualitative discussion, this is not the case. A rectangular wave has many high frequency components, and these result in a processor utilization as shown in Figure 7. This figure shows the oscillation for this example, and shows that oscillation occurs over a surprisingly wide range of the processor utilization. As discussed earlier, our experimental results with the MPEG player on the Itsy also exhibit this oscillation because that application exhibits the same step-function resource demands exhibited by our example.
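The oscillation is easy to reproduce numerically; the following C sketch applies the AVG_3 update to the 9-busy/1-idle square wave (utilization on a 0-1 scale) and shows the weighted utilization swinging between roughly 0.73 and 0.98 rather than settling near 0.9:

    /* Apply AVG_3 smoothing to an idealized 9-busy/1-idle workload. */
    #include <stdio.h>

    int main(void)
    {
        double w = 0.0;                           /* weighted utilization */

        for (int t = 0; t < 100; t++) {
            double u = (t % 10 == 9) ? 0.0 : 1.0; /* 9 busy quanta, 1 idle */
            w = (3.0 * w + u) / 4.0;              /* AVG_3 update */
            printf("%3d %.3f\n", t, w);           /* w oscillates; it never
                                                     settles near 0.9 */
        }
        return 0;
    }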

We also simulated interval-based averaging policies that used a pure average rather than an exponentially decaying weighting function, but our simulations indicated that that policy would perform no better than the weighted averaging policy. Simple averaging suffers from the same problems experienced by weighted averaging if the averaging period is not chosen appropriately.


5.4 Summary of Results

We are omitting a detailed exposition of the scheduling behavior of each scheduling policy, primarily because most of them resulted in equivalent (and poor) behavior. Recall that the best possible scheduling goal for MPEG would be to switch to a 132 MHz speed and continue to render all the frames at that speed. No heuristic policy that we examined achieved this goal. Figure 8 shows the clock setting behavior of the best policy we found. That policy uses the PAST heuristic (i.e., AVG_0) and "pegs" the CPU speed either to 206 MHz or 59 MHz depending on the weight metric. The bounds on the hysteresis were that a CPU utilization greater than 98% would cause the CPU to increase the clock speed and a CPU utilization less than 93% would decrease the clock speed.

This policy is "best" because it never misses any deadline (across all the applications) and it also saves a small but significant amount of energy. This last point is illustrated in Table 2. The table shows the 95% confidence interval for the average energy needed to run the MPEG application. The reduction in energy between 206 MHz and 132 MHz occurs because the application wastes fewer cycles in the application idle loop used to meet the frame delays for the MPEG clip. An approximately 8% energy reduction occurs when we drop the processor voltage to 1.23 V – this is less than the 15% maximum reduction we measured because the application uses resources (e.g., audio) that are not affected by voltage scaling.

Figure 8: Clock frequency for the MPEG application using the best scheduling policy from our empirical study – the scheduling policy only selects the 59 MHz or 206 MHz clock settings and changes clock settings frequently. This scheduling policy results in suboptimal energy savings but avoids noticeable application slowdown.

The PAST policy we described results in a small but statistically significant reduction in energy for the MPEG application. Allowing the processor to scale the voltage when the clock speed drops below 162.2 MHz results in no statistically significant further decrease.

We initially surmised that there is no improvement because the cost of voltage and clock scaling on our platform outweighs any gains. We measured the cost of clock and voltage scaling using the DAQ. To measure clock scaling, we coded a tight loop that switched the processor clock as quickly as possible.

Before each clock change, we inverted the state of a specific GPIO pin and used the DAQ to measure the interval with high precision. We took measurements when the clock changed across many different clock settings (e.g., from 59 to 206 MHz, from 191 to 206 MHz, and so on).

Clock scaling took approximately 200 microseconds, independent of the starting or target speed. During that time, the processor cannot execute instructions. Thus, frequency changing costs between 11,200 clock periods at 59 MHz and 40,000 clock periods at 200 MHz.

We measured the time for the voltage to settle following a voltage change. It takes approximately 250 microseconds to reduce the voltage from 1.5 V to 1.23 V; in fact, the voltage slowly decreases, drops below 1.23 V and then rapidly settles at 1.23 V. Voltage increases were effectively instantaneous. We suspect the slow decay occurs because of capacitance; many processors use external decoupling capacitors to provide sufficient current sourcing for processors that have widely varying current demands.

These measurements indicate that the time needed for a clock or voltage change is roughly 2% of the scheduling interval; thus, we would be able to change the clock or voltage on every scheduling decision with only a few percent overhead. The fact that we see little energy reduction is related to the limited energy savings possible with the voltage scaling available on this platform and the efficacy of the policies we explored.

6 Conclusions and Future Work

Our implementation results were disappointing to us – we had hoped to be able to identify a prediction heuristic that resulted in significant energy savings, and we thought that the claims made by previous studies would be borne out by experimentation.


    Algorithm                                                                Energy (Joules)
    -------------------------------------------------------------------------------------
    Constant Speed @ 206.4 MHz, 1.5 Volts                                    85.59 - 86.49
    Constant Speed @ 132.7 MHz, 1.5 Volts                                    79.59 - 80.94
    Constant Speed @ 132.7 MHz, 1.23 Volts                                   73.76 - 74.41
    PAST, Peg-Peg, Thresholds: >98% scales up, <93% scales down, 1.5 Volts   85.03 - 85.47
    PAST, Peg-Peg, Thresholds: >98% scales up, <93% scales down,
        Voltage Scaling @ 162.2 MHz                                          84.60 - 85.45

Table 2: Summary of Performance of Best Clock Scaling Algorithms

Figure 9: Non-linear change in Processor Utilization with Clock Frequency (in MHz).

    Processor Freq. (MHz)   Cycles / Mem. Reference   Cycles / Cache Reference
     59.0                   11                        39
     73.7                   11                        39
     88.5                   11                        39
    103.2                   11                        39
    118.0                   13                        41
    132.7                   14                        42
    147.5                   14                        49
    162.2                   15                        50
    176.9                   18                        60
    191.7                   19                        61
    206.4                   20                        69

Table 3: Memory access time in cycles for reading individual words as well as full cache lines.

Although we have found a policy that saves some energy, that policy leaves much to be desired. The policy causes many voltage and clock changes, which may incur unnecessary overhead; this will be less of a problem as processors are better designed to accommodate those changes. However, the policy did result in both the most responsive system behavior and the most significant energy reduction of all the policies we examined.

As with all empirical studies, there are anomalies in our system that we cannot explain and that may have influenced our results. We found that the processor utilization does not always vary linearly with clock frequency. Figure 9 shows the processor utilization vs. clock frequency for the MPEG benchmark. There is a distinct "plateau" between 162 MHz and 176.9 MHz. We believe that this behavior may be induced by the varying number of clock cycles needed for memory accesses as the processor frequency changes, as shown in Table 3. That table shows the memory access time for EDO DRAM for reading individual words or a full cache line; there is an obvious non-linear increase between 162 MHz and 176.9 MHz. The potential speed mismatch between processor and memory has been noted by others [12], but we have not devised a way to verify that this is the only factor causing the non-linear behavior we noted.

This paper is the first step in an effort to provide robust support for voltage and clock scheduling within the Linux operating system. Although our initial results are disappointing, we feel that they serve to stop us from attempting to devise ever more clever heuristics for clock scheduling. It may well be that Pering [11] reached a similar conclusion, since their later publications discontinued the use of heuristics, but their publications do not describe the implementation of their operating system design or the rationale behind the policies used. Furthermore, they do not describe how deadlines are to be "synthesized" for applications such as Web and TalkingEditor, where there is no clear "deadline".


Our immediate future work is to provide "deadline" mechanisms in Linux. These deadlines are not precisely the same mechanism needed in a true real-time O/S – in an RTOS, the application does not care if the deadline is reached early, while energy scheduling would prefer for the deadline to be met as late as possible. A further challenge we face will be to find a way to automatically synthesize those deadlines for complex applications.

7 Acknowledgments

We would like to thank the members of Compaq's Palo Alto Research Labs who created the Itsy Pocket Computer and built the infrastructure that made this work possible. We would also like to thank Wayne Mack for adapting the Itsy to fit our needs.

This work was supported in part by NSF grant CCR-9988548, a DARPA contract, and an equipment donation from Compaq Computer.

References

[1] Jay Heeb. The next generation of StrongARM. In Embedded Processor Forum. MDR, May 1999.

[2] Intel Corporation. Mobile Pentium III processor in BGA2 and micro-PGA2 packages. Datasheet Order #245302-002, 2000.

[3] David Linden (editor). Handbook of Batteries, 2nd ed. McGraw-Hill, New York, 1995.

[4] Carla-Fabiana Chiasserini and Ramesh R. Rao. Pulsed Battery Discharge in Communication Devices. In The Fifth Annual ACM/IEEE International Conference on Mobile Computing and Networking, pages 88-95, August 1999.

[5] Trevor Pering, Tom Burd, and Robert Brodersen. The simulation of dynamic voltage scaling algorithms. In IEEE Symposium on Low Power Electronics, 1995.

[6] Kinshuk Govil, Edwin Chan, and Hal Wasserman. Comparing algorithms for dynamic speed-setting of a low-power CPU. In Proceedings of the First ACM International Conference on Mobile Computing and Networking, Berkeley, CA, November 1995.

[7] M. Weiser, B. Welch, A. Demers, and S. Shenker. Scheduling for reduced CPU energy. In First Symposium on Operating Systems Design and Implementation, pages 13-23, November 1994.

[8] Trevor Pering. Private communication.

[9] J. Montanaro et al. A 160-MHz, 32-b, 0.5-W CMOS RISC Microprocessor. In Digital Technical Journal, volume 9. Digital Equipment Corporation, 1997.

[10] Dan Dobberpuhl. The Design of a High Performance Low Power Microprocessor. In International Symposium on Low Power Electronics and Design, pages 11-16, August 1996.

[11] Trevor Pering, Tom Burd, and Robert Brodersen. Voltage scheduling in the lpARM microprocessor system. In Proceedings of the 2000 International Symposium on Low Power Design, August 2000.

[12] Thomas L. Martin. Balancing Batteries, Power, and Performance: System Issues in CPU Speed-Setting for Mobile Computers. PhD thesis, Carnegie Mellon University, 1999.

[13] Keith I. Farkas, Jason Flinn, Godmar Back, Dirk Grunwald, and Jennifer Anderson. Quantifying the energy consumption of a pocket computer and a Java virtual machine. In Proceedings of the ACM SIGMETRICS '00 International Conference on Measurement and Modeling of Computer Systems, 2000. (to appear).

[14] Transvirtual Technologies Inc. Kaffe Java Virtual Machine. http://www.transvirtual.com.