Using Utility Prediction Models to Dynamically
Choose Program Thread Counts
Ryan W. Moore
Computer Science Department
University of Pittsburgh
Pittsburgh, USA
Email: [email protected]
Bruce R. Childers
Computer Science Department
University of Pittsburgh
Pittsburgh, USA
Email: [email protected]
Abstract—Multithreaded applications can simultaneously execute on a chip multiprocessor computer, starting and stopping without warning or pattern. The behavior of each program can be different, interacting in unexpected ways, including causing competition for CPU cycles, which harms performance.
To maximize program performance in this type of dynamic execution environment, the interactions among applications must be controlled. These interactions can be controlled by carefully choosing the number of threads for each multithreaded application (i.e., a system configuration). To choose a configuration, we advocate using program utility models to predict application behavior. Only a system that is capable of predicting and analyzing performance under multiple configurations can choose the best configuration and thus robustly meet its performance goals. In this paper, we present such a system.
Our approach first gathers profile data. The profile data is used by multiple linear regression to build a utility model. The model takes into account program scalability, susceptibility to interference, and any inherent leveling off of performance as a program's thread count is increased. A utility model is constructed for each application. When the system workload changes, the utility models are consulted to find the new configuration that maximizes system performance while meeting each program's quality of service goals. We use multithreaded applications from PARSEC to evaluate our approach. Compared to the best traditional policy, which does not consider variances in the dynamic workload, our approach simultaneously improves system throughput by 19.3% while meeting user performance constraints 28% more often.
Keywords—CMP; multicore; multiprocessor; multithreaded; performance; prediction
I. INTRODUCTION
Chip multiprocessors (CMP) are ubiquitous for high per-
formance computing due to their efficiency [1]. As programs
launch and terminate, demand for CPUs changes, perhaps
radically. Application demand for CPU time can often be
configured on a per program basis by, for example, having
a command line parameter determine the number of com-
pute threads. Too many threads in an application can cause
harmful contention with other application threads (e.g., thread
migrations, poor cache locality, stealing CPU time from other
threads). Too few threads may underutilize the system.
To maximize performance, the choice of how many threads
to use for a program, i.e., thread count, should be influenced by
the number of CPUs and number of threads in other programs.
Coordination of thread count across programs ensures that
applications do not unnecessarily compete or underutilize the
system.
Manual coordination of the overall thread count (among all
programs) is burdensome and too slow to react to quickly
changing workloads. Even automatic methods to coordinate
the creation of threads when a program is launched have
limitations: as other applications come and go, the ideal
number of compute threads for a program may vary.
To better understand the performance implications of setting
the number of threads in multithreaded programs, two sensi-
tivity studies were performed. Figure 1 shows the aggregate
normalized throughput of a multicore system as the number
of threads in two programs, blackscholes and streamcluster,
is varied from 0 to 24 [2]. The two programs share 24 CPUs.
Both programs are from PARSEC. The throughput of each
program is normalized to its single-threaded throughput. Ag-
gregate normalized throughput is a system-wide performance
metric.
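Concretely, if Tᵢ(c) denotes program i's throughput under system configuration c and Tᵢ(1) its single-threaded throughput on an idle system, the metric (in our notation; the paper uses it without giving a formula) is

    ANT(c) = Σᵢ Tᵢ(c) / Tᵢ(1),

summed over all running programs. Configurations are thus scored by how many "single-threaded equivalents" of work the whole system completes.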
The figure shows that setting the number of threads greatly
affects system throughput. In this experiment the best con-
figuration gives 21 threads to blackscholes and 3 threads to
streamcluster. This configuration has an aggregate normalized
throughput of 14.34 (the bright white spot in the figure).
Because the best configuration directs blackscholes to create
far more threads than streamcluster, it can be inferred that
blackscholes more effectively uses multiple CPUs.
What if streamcluster finished its work and an instance
of swaptions was started [2]? Figure 2 shows the aggregate
normalized throughput as the number of threads in blacksc-
holes and swaptions is varied. In this experiment, the best
configuration gives 1 thread to blackscholes and 24 threads
to swaptions. This configuration has an aggregate normalized
throughput of 22.5. It can be inferred that swaptions scales
better than blackscholes.
The results in Figures 1 and 2 show that, when maximizing
system throughput, it is important to properly choose, i.e.,
configure, the number of threads that each program will use.
Furthermore, the best choice for a program depends on other
executing programs. As programs enter and exit the system,
the number of threads that each one should use will change,
requiring dynamic reconfiguration.
Fig. 1. Aggregate normalized throughput of blackscholes and streamcluster as the number of threads in each program varies from 0 to 24

Fig. 2. Aggregate normalized throughput of blackscholes and swaptions as the number of threads in each program varies from 0 to 24

However, the configuration that maximizes system through-
put may starve an individual program. For example, the best
configuration to maximize system throughput when executing
blackscholes and swaptions (Figure 2) gave only a single
thread to blackscholes. It is possible that blackscholes will
take a very long time to finish its work, even though it is a
multithreaded application in a system with 24 CPUs.
As seen in the blackscholes and swaptions experiment, max-
imizing system throughput may starve individual programs.
Introducing quality of service (QoS) constraints solves this
problem. QoS constraints can avoid starvation by guaranteeing
that a program meets a particular level of performance. For
example, a QoS constraint might dictate that a program
achieve a throughput twice as large as when the application is
run with a single thread.
Each application should be configured so that it can meet
its QoS constraints. Any additional computational capability
of the system could be allocated to maximize system aggregate
throughput. This goal ensures that programs are not starved for
CPU time, while attempting to improve overall throughput.
In this paper, we consider the problem of choosing the
number of threads for each program to meet the above goal.
The number of threads used by an application is referred to
as its thread configuration. Choosing a configuration requires
considering each program’s scalability, in addition to the
effects of contention and interaction. We consider non-IO-
bound multithreaded programs (i.e., CPU-bound programs or
memory-bound programs). These programs are an important
class whose performance is easily affected by CPU contention
and configuration.
Program utility models can be used dynamically to choose
thread configurations. A utility model predicts the performance
of an application in a particular system configuration. Our
approach automatically builds program utility models offline
with multidimensional linear regression. The models also use
an automatically selected performance plateau. The perfor-
mance plateau accounts for when a program’s performance no
longer increases even if it is allowed to create more threads.
The models consider inter-process contention, without a priori
knowledge of the workload. Our approach consults the utility
models to find a configuration that maximizes system-wide
performance while meeting QoS constraints.
This paper makes the following contributions:
1) Techniques to automatically build utility models for mul-
tithreaded CPU-bound or memory-bound programs,
2) Utility models to maximize system throughput while
meeting program performance goals, and
3) Experimental evaluation of our techniques and compar-
isons to conventional policies which do not consider CPU
contention.
II. OVERVIEW OF APPROACH
Our approach to maximizing system throughput under QoS
constraints consists of two phases: an offline phase and an
online phase. These phases are shown in Figure 3. In the offline
phase (the left side of Figure 3), profiling is performed to
record a multithreaded program’s performance as its thread
count is changed. The profile data is used to generate a utility
model for a program.
In the online phase (the right side of Figure 3), the models of
executing programs are consulted. Programs are not admitted
into the system unless they have a utility model. The number
of threads to be used by each program is chosen. In the figure,
program 1 is configured with 6 compute threads and program
2 is configured with 4 compute threads. In this example, the
models predicted this configuration as the best one when the
two programs are co-runners.
The models are consulted when the workload changes.
When a new program enters the system, it will create threads to
obtain CPU time. Existing programs may respond to the arrival
of the new program (e.g., by decreasing their thread count).
Similarly, when a program finishes, CPU time may become
available.

Fig. 3. Phases to build and use predictive models (left: Offline Model Construction; right: Dynamic Configuration of Multithreaded Applications)

As a result, the number of threads in the remaining
programs might be increased. The actual thread configurations
chosen will depend on the model predictions and the system
policy. In this paper, our system policy aims to meet program
QoS constraints and maximize aggregate throughput.
Thread configuration changes need to be done dynamically.
Previous work explains how existing, inflexible programs can
be modified to support dynamic online thread creation and
destruction [3]. For example, an application which uses work
stealing can simply spawn new threads to receive tasks from
a global work queue. A work stealing application can also
shutdown existing threads while preserving program correct-
ness, so long as one thread is left executing. Additionally,
automatic methods exist to combine program threads into a
single instruction stream [4].
III. OFFLINE PROFILING AND MODEL GENERATION
Profiling is used to observe program performance as the
number of compute threads is changed. We also consider a
second parameter, the number of other CPU-bound threads
in the system,¹ in order to model a program’s performance
changes due to CPU contention. To vary CPU contention,
we use an appropriate synthetic benchmark. The profiling
space is three dimensional. One dimension is the number of
CPU-bound threads (x) in the program being profiled, another
dimension is the number of other CPU-bound threads (y), and
the last dimension (z) is the performance of the program when
run with x compute threads on a system executing y other
CPU-bound threads. z is determined via profiling. The model
is a function of the number of threads in the program and
the number of other threads. The model predicts a program’s
performance relative to its single-threaded performance.
Ideally, the offline profiling step should examine program
performance at every setting of x and y. However, such
an exploration is prohibitively expensive. If a program is
allowed to use between 1 and n threads on an n CPU system,
a complete profiling run would require n² data points (n
program threads · n other CPU-bound threads).
We constrain the set of profiling points by introducing a
parameter k, which determines the set of necessary profile
points. Decreasing k results in fewer experiments, at the cost
of less profiling information. Along with the number of CPUs, the set
of points, P, to be profiled is:

    P = { ⟨x, y, profilePerformance(x, y)⟩ }   where   x, y ∈ {1, n/k, 2n/k, 3n/k, ..., n}

¹In this paper, we consider memory-bound threads to be CPU-bound.
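For concreteness, the point set P can be enumerated directly. The following is a minimal Python sketch (our rendering, not the authors' code); profile_performance is a hypothetical stand-in for an actual profiling run, and k is assumed to divide n evenly, as in the experiments later in the paper.

def profile_points(n, k, profile_performance):
    # Thread-count settings to sample: {1, n/k, 2n/k, ..., n}.
    settings = [1] + [i * n // k for i in range(1, k + 1)]
    points = []
    for x in settings:        # compute threads in the profiled program
        for y in settings:    # other CPU-bound threads (synthetic load)
            points.append((x, y, profile_performance(x, y)))
    return points

For n = 24 and k = 4, this samples thread counts {1, 6, 12, 18, 24} and yields the 25 profile points used in Section V.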
After profiling, a model is built to fit the observed data
points. As more threads than CPUs are created, program
performance drops asymptotically due to the lack of access to
CPUs. Predicting a curve (e.g., by using polynomials of degree
2 or higher) is possible, but prone to overfitting and potentially
requires significant amounts of training data. Predicting a line
(i.e., with linear regression) requires less training data and
is less inclined to overfitting. Therefore, linear regression is
preferable if the data can be made linear. To account for
asymptotic growth in performance and to enable the use of
linear regression, we introduce a shift transformation:
shift(perf) = perf                                           if progThreads + otherThreads ≤ n
shift(perf) = perf · (progThreads + otherThreads) / n        if progThreads + otherThreads > n
The shift transformation accounts for the fact that if more
threads than CPUs are created, the threads will compete for
CPU time. The transformation compensates for the contention
by scaling the performance of a program in such situations.
For example, if two programs both create 24 threads on a 24
CPU system, each program effectively has access to only 12
CPUs. The shift transformation compensates for the program’s
diminished CPU access.
After the transformation, the profile data is better suited to
linear regression. At runtime, when the model is consulted to
make a prediction about performance, the transformation is
undone to account for the expected asymptotic behavior.
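A minimal Python sketch of the transformation and its inverse (our rendering; the variable names are ours) is:

def shift(perf, prog_threads, other_threads, n):
    # When the CPUs are oversubscribed, scale observed performance up
    # by the oversubscription factor so the data becomes roughly linear.
    total = prog_threads + other_threads
    return perf if total <= n else perf * total / n

def unshift(predicted, prog_threads, other_threads, n):
    # Inverse of shift, applied to model predictions at runtime.
    total = prog_threads + other_threads
    return predicted if total <= n else predicted * n / total

In the example above (two programs with 24 threads each on 24 CPUs, so total = 48), shift doubles each program's observed performance before regression, and unshift halves the model's prediction when that point is queried at runtime.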
Given the set of three-dimensional profiling points after
applying the shift transformation, multivariate linear regression
is used to find the three-dimensional plane that best predicts
the performance of the program given the profile data. Based
on our experience with program scalability, programs often
have linear growth as their thread count increases (at least for
a while).
Although programs may initially scale linearly, they do not
always continue to do so; performance eventually levels off.
We call this leveling-off point the performance plateau.
FINDPLATEAUMODEL(trainingData, plateauPoints)
    // Try each possible performance plateau point;
    // return the model with the lowest training error.
    bestModel = ∅
    bestModelError = ∞
    for p ∈ plateauPoints
        model = BUILDMODEL(trainingData, p)
        error = TESTMODEL(model, trainingData)
        if error < bestModelError
            bestModel = model
            bestModelError = error
    return bestModel
Fig. 4. Algorithm for determining a performance plateau model
A program’s utility model must incorporate the scalabil-
ity limitations of the program itself. Therefore, the model
must account for the performance plateau. The plateau can
be automatically determined by building many models, each
assuming a different performance plateau point.² The model
whose predictions best match the observed profile data will be
selected. Procedure FINDPLATEAUMODEL in Figure 4 shows
this process.
Figure 5 shows the BUILDMODEL procedure used to con-
struct a model with a specified performance plateau. The
procedure ignores any profile data where the program’s thread
count is strictly greater than the value of the performance
plateau. Linear regression is used to find the plane that best
fits the remaining profile data.
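The two procedures reduce to a short NumPy sketch (our rendering, not the authors' code): a plain least-squares plane fit stands in for the paper's multivariate regression step, float("inf") encodes a plateau of ∞, and TESTMODEL's error metric, which the paper does not specify, is assumed here to be mean squared error over the full profile set.

import numpy as np

def build_model(training_data, plateau_pt):
    # training_data: list of ((prog_threads, other_threads), shifted_perf).
    # Ignore points whose program thread count exceeds the candidate plateau.
    kept = [(c, p) for (c, p) in training_data if c[0] <= plateau_pt]
    A = np.array([[1.0, c[0], c[1]] for (c, _) in kept])  # offset + two slopes
    z = np.array([p for (_, p) in kept])
    coeffs, *_ = np.linalg.lstsq(A, z, rcond=None)        # least-squares plane
    return {"plateau": plateau_pt, "coeffs": coeffs}

def predict(model, conf):
    # Clamp the thread count at the plateau, then evaluate the plane.
    x = min(conf[0], model["plateau"])
    offset, scal, intf = model["coeffs"]
    return offset + scal * x + intf * conf[1]

def test_model(model, training_data):
    # Mean squared error over all profile points, including those past the
    # plateau, so over-optimistic plateau candidates are penalized.
    errs = [(predict(model, c) - p) ** 2 for (c, p) in training_data]
    return sum(errs) / len(errs)

def find_plateau_model(training_data, plateau_points):
    # Candidate plateaus are drawn from the sampled thread counts, plus ∞.
    models = [build_model(training_data, p) for p in plateau_points]
    return min(models, key=lambda m: test_model(m, training_data))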
Figure 6 shows canneal’s utility model. Canneal is from
PARSEC [2]. A flat region can be seen when 18 or more
threads are used. This region is canneal’s performance plateau.
The noticeable reduction in height as the number of other
threads in the system increases shows that canneal is suscep-
tible to contention from other threads. Nevertheless, canneal
scales well; when run with 18 threads in an otherwise idle sys-
tem, it achieves a maximum throughput approximately twelve
times its single-threaded throughput.
At runtime, a program’s utility model can be queried to
obtain a performance prediction. CONSULTMODEL (Figure 7)
shows the steps to query a model. If the model is queried about
a program’s performance when executing with 0 threads, the
expected performance is intuitively 0. If the model is being
queried about a point beyond the performance plateau, the
procedure instead makes a prediction about the program’s
performance at the start of the plateau. To make a prediction,
the procedure uses the model’s multivariate linear regression
model. The result from CONSULTMODEL is then scaled by
performing the inverse of the shift function.
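Assuming the model representation from the earlier sketch, CONSULTMODEL plus the inverse shift amounts to:

def consult_model(model, prog_threads, other_threads, n):
    if prog_threads == 0:
        return 0.0                  # no threads, no throughput
    # Past the plateau, predict performance at the start of the plateau.
    x = min(prog_threads, model["plateau"])
    offset, scal, intf = model["coeffs"]
    prediction = offset + scal * x + intf * other_threads
    # Undo the shift transformation when the CPUs are oversubscribed.
    total = prog_threads + other_threads
    return prediction if total <= n else prediction * n / total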
Our presented techniques are not intended to be used with
IO-bound programs. We envision two approaches to handle
IO-bound programs: an implicit and an explicit approach. An
implicit approach could use the performance plateaus to model
²A performance plateau at ∞ indicates continued scaling.
BUILDMODEL(trainingData, plateauPt)
    // Build a utility model with a particular performance plateau
    // point. Training points beyond plateauPt are ignored.
    confPts = ∅
    perfPts = ∅
    for point ∈ trainingData
        if point.threadConf.progThreads ≤ plateauPt
            // The point is not past the plateau point.
            confPts.append(point.threadConf)
            perfPts.append(point.result)
    model.plateauPt = plateauPt
    model.linearModel = REGRESSION(confPts, perfPts)
    return model
Fig. 5. Algorithm for building a model
Fig. 6. Visualization of Canneal’s Utility Model
performance limits due to IO-boundedness. This approach is
readily incorporated into our current approach.
An explicit approach could make use of additional inputs
to the model (e.g., amount of IO-bandwidth available). These
additional inputs would increase the dimensionality of the
model, likely resulting in increased profiling costs. However,
by explicitly modeling IO-related inputs, the models may
become more powerful and/or accurate.
Incorporating techniques to support IO-bound programs is
future work. We next describe how to use a model to pick
the number of threads for an application and make runtime
configuration decisions.
IV. ONLINE CONFIGURATION
With a utility model for each program, it is possible to
use the models to simultaneously consider both individual
program performance and system-wide performance. With n
CPUs, programs can use 1 to n threads. With m programs
there are nᵐ possible configurations.
CONSULTMODEL(model, conf)
    if conf.progThreads == 0
        return 0
    if conf.progThreads > model.plateauPt
        conf.progThreads = model.plateauPt
    return model.linearModel(conf.progThreads, conf.otherThreads)
Fig. 7. Algorithm for consulting a model
A program’s QoS constraint is given at launch. A program’s
QoS is relative to its single-threaded performance. Thus a
QoS equal to 2 means that the program should be executed
so as to achieve twice its single-threaded performance
on an otherwise idle system. A QoS of 2 may require
significantly more than 2 compute threads (e.g., if the program
scales poorly).
Figure 8 shows the function to choose the configuration. In
the figure, C is the system configuration (number of threads
in each program). Ci is the number of threads that program i
will create. The function Pi is the predictive utility model for
program i and corresponds to the CONSULTMODEL algorithm.
Pi has two parameters: the number of threads that program i
will use, and the number of other CPU-bound threads. Qi is
program i’s QoS constraint and n is the number of CPUs.
The QoS constraints are soft demands due to the use of
predictive models. Program utility models may, at times, make
mispredictions.
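Under the same assumptions as the earlier sketches (and reusing their consult_model helper), the maximization, including the fallback to pure throughput maximization described in Section V, can be written as an exhaustive search:

import itertools

def choose_configuration(models, qos, n):
    # models[i] is program i's utility model; qos[i] its QoS constraint.
    best_conf, best_key = None, (False, float("-inf"))
    for conf in itertools.product(range(1, n + 1), repeat=len(models)):
        total = sum(conf)
        preds = [consult_model(m, c, total - c, n)
                 for m, c in zip(models, conf)]
        feasible = all(p >= q for p, q in zip(preds, qos))
        # Prefer QoS-feasible configurations; among equals, maximize the
        # predicted aggregate normalized throughput.
        key = (feasible, sum(preds))
        if key > best_key:
            best_key, best_conf = key, conf
    return best_conf

For two programs on 24 CPUs this enumerates 24² = 576 configurations, matching the search-space size reported in Section V.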
V. EXPERIMENTAL EVALUATION
We evaluated our methods to create utility models and to
choose the number of threads for a workload. We perform the
evaluation on a subset of PARSEC: blackscholes, bodytrack,
canneal, streamcluster, and swaptions [2]. The programs were
modified to track their throughput. Some programs from
PARSEC are not used due to limitations in available inputs
and/or thread configuration flexibility.
All experiments were run on an AMD 6124 HE machine with
48 CPUs, running Linux 2.6.39.1. Conducting full experiments on a
48 CPU machine is extremely time intensive. Using Linux's
setaffinity facility, we used two of the four processor
sockets, limiting experiments to 24 CPUs.
A higher value of k usually leads to models that better
capture their program’s performance, but requires more pro-
filing experiments. As more profiling points are used, the
model accuracy improvements dwindle. Increasing k’s value
by 1 approximately doubles the number of profiling points as
compared to the previous value of k.
Using the fewest number of profiling points (k = 1), the
average mean squared error (MSE) of the models is 1.68. With
k = 2 the MSE improves dramatically to 1.03. Using k = 3
results in continued improvement (MSE of 0.93). Using k = 4
the MSE is 0.83, while using k = 6 results in an MSE of
0.76.³

³k = 5 is not used because the number of cores (24) is not evenly divisible by 5.
Some values of k lead to different models, particularly with
regards to performance plateau locations. For example, when
k = 4 the blackscholes model’s performance plateau is at ∞,
indicating continued linear scaling. However, when k = 8 the
model’s performance plateau is at 22 threads, suggesting that
blackscholes does not scale linearly past 22 threads. Only with
the additional profiling data was this fact evident.
For all training runs, we use k = 4. Therefore, to build each
program model, 25 profiling points are gathered.⁴ This choice
of k gave a good trade-off between number of experiments
performed and model accuracy. Furthermore, k = 4 allowed
building each program’s model relatively quickly (62.5 min-
utes) due to stopping the profiled process after a minute of
parallel execution.
The parameters of the generated models for k = 4 are
shown in Figure 9. The models are planar in three dimensions.
The models feature an offset and two slopes. A high scalability
sensitivity value indicates that the program scales well as
additional threads are used. Based on the models in the
figure, swaptions scales the best. For each additional thread, its
throughput increases by 0.79. Streamcluster, however, scales
the worst. For each additional thread, its throughput increases
by 0.26. The other programs (blackscholes, bodytrack, and
canneal) gain approximately 0.5 normalized throughput for
each additional thread.
A large interference sensitivity value indicates that a pro-
gram is easily affected by the presence of other compute
threads. Whether a program is positively or negatively affected
by the presence of other compute threads is determined by
the sign of the interference sensitivity value. Blackscholes
and streamcluster are insensitive to interference from other
compute threads. Their interference sensitivity values are very
close to 0. The other programs (bodytrack, canneal, and swap-
tions) lose approximately 0.1 throughput for each additional
compute thread from another application.
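Putting the parameters of Figure 9 together, each utility model is (in our notation) the clamped plane

    predictedPerf(x, y) = offset + scalabilitySensitivity · min(x, plateau) + interferenceSensitivity · y,

followed by the inverse shift of Section III. For example, canneal with x = 18 threads on an otherwise idle system (y = 0) is predicted to achieve 2.51 + 0.52 · 18 = 11.87 times its single-threaded throughput, consistent with the roughly twelve-fold figure quoted for its model in Section III.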
Several factors determine how a program is negatively
affected by the presence of other compute threads. As the
number of threads on the system increases, the operating
system may be forced to context switch out active threads. If a
switched out thread held a mutex, other threads may be forced
to wait for that thread to first be rescheduled, and then for it
to release the mutex. Another cause of interference may come
from the use of barriers. A barrier cannot be passed through
until all threads arrive at the barrier. Therefore, a thread which
reaches the barrier later than other threads (e.g., due to being
context switched out) may delay the progress of many other
threads.
A. Utility Model Accuracy
We examined the accuracy of a subset of the generated
models. We considered the same pairs of programs as seen
in Figures 1 and 2 (blackscholes with streamcluster, and
blackscholes with swaptions). Blackscholes, streamcluster, and
⁴All combinations of the program and the synthetic CPU-bound program having a thread count in {1, 6, 12, 18, 24}.
C = argmax_C  Σ_{i=1..|C|} P_i(C_i, Σ_{j=1..|C|} C_j − C_i)
    such that  ∀i:  P_i(C_i, Σ_{j=1..|C|} C_j − C_i) ≥ Q_i  and  C_i ≤ n
Fig. 8. Equation for determining the number of threads for the programs, C, given utility models P and goals Q
Program         Offset   Interference   Scalability   Performance
                         Sensitivity    Sensitivity   Plateau
blackscholes     0.65        0.00          0.48          ∞
bodytrack        2.55       -0.11          0.58          ∞
canneal          2.51       -0.13          0.52          18
streamcluster    0.92        0.01          0.26          12
swaptions        2.05       -0.13          0.79          ∞
Fig. 9. Generated program models
swaptions were chosen due to their varied behavior. Blacksc-
holes scales well and performs no communication between
threads. Swaptions also scales well, but performs inter-thread
communication. Streamcluster scales poorly. Additional ex-
haustive accuracy experiments were not performed due to the
time-intensive nature of executing pairs of programs in every
possible combination of thread settings (25² experiments per
program pair).
For each program in a pair, we varied the number of
threads from 0 to 24. The aggregate normalized throughput
of each configuration was recorded. The aggregate throughput
was compared to the generated models' predicted aggregate
throughput. The difference between the achieved throughput
and the predicted throughput is the prediction error. Figures 10
and 11 show the error.
In Figure 10, the models perform well when the numbers of
threads in blackscholes and streamcluster are approximately
equal. If one of the programs has significantly fewer threads
than the other program (the top left and bottom right corners)
then the error becomes significant.
When using the generated models to predict blackscholes’s
and swaptions’s aggregate throughput, the results are more
encouraging (Figure 11). If the number of threads in blacksc-
holes and swaptions is approximately the same, the error is
low (≤ 10%). As with the previous experiment, if the number
of threads in each program differs greatly the prediction
error increases. Overall, when predicting the throughput of
blackscholes with swaptions, the models are significantly more
accurate than when predicting the throughput of blackscholes
with streamcluster.
Although we did not exhaustively examine every pairwise
combination, the results from these experiments suggest that
our models are reasonably accurate and can successfully
predict aggregate throughput. We also compared the predicted
versus actual aggregate normalized throughput for 500 ran-
domly chosen configurations (50 random configurations per
pair of programs). The average error was 19%. These results
match the trends presented in Figures 10 and 11.
Fig. 10. Prediction error of the blackscholes and streamcluster models as compared to actual aggregate throughput

Fig. 11. Prediction error of the blackscholes and swaptions models as compared to actual aggregate throughput
Program         QoS          Free      Uniform    Per Application   Cooperative
                Constraints  for All   Partition  Maximum           Dynamic Allocation
blackscholes    Q=<2, 8>     46.7%     80.0%      46.7%             66.7%
bodytrack       Q=<2, 8>     60.0%     53.3%      60.0%             66.7%
canneal         Q=<2, 8>     73.3%     53.3%      80.0%             73.3%
streamcluster   Q=<2, 8>     60.0%     46.7%      73.3%             80.0%
swaptions       Q=<2, 8>     20.0%     60.0%      20.0%             73.3%
blackscholes    Q=<8, 2>     73.3%      0.0%      80.0%             80.0%
bodytrack       Q=<8, 2>     60.0%     80.0%      66.7%             73.3%
canneal         Q=<8, 2>     20.0%     73.3%      26.7%             93.3%
streamcluster   Q=<8, 2>      0.0%      0.0%       0.0%              0.0%
swaptions       Q=<8, 2>     86.7%    100.0%      86.7%             93.3%
Average                      50.0%     54.7%      54.0%             70.0%
Fig. 12. Percentage of the time that each policy was able to meet program QoS constraints
Program         QoS          Free      Uniform    Per Application   Cooperative
                Constraints  for All   Partition  Maximum           Dynamic Allocation
blackscholes    Q=<2, 8>     14.7      13.6       14.7              16.1
bodytrack       Q=<2, 8>     17.2      16.0       17.9              17.5
canneal         Q=<2, 8>     16.0      14.7       15.8              16.9
streamcluster   Q=<2, 8>     10.9       9.7       11.8              14.3
swaptions       Q=<2, 8>     20.4      18.0       20.3              21.1
blackscholes    Q=<8, 2>     14.7      13.6       14.7              15.2
bodytrack       Q=<8, 2>     17.2      16.0       17.9              17.5
canneal         Q=<8, 2>     16.0      14.7       15.8              16.9
streamcluster   Q=<8, 2>     10.9       9.7       11.8              14.4
swaptions       Q=<8, 2>     20.4      18.0       20.3              21.9
Average                      15.8      14.4       16.1              17.2
Fig. 13. Average aggregate normalized throughput for each policy
Fig. 14. Blackscholes Q=< 2, 8 >
B. Comparisons to Static Policies
We performed experiments to determine the ability of our models
to maximize system throughput while meeting pairs of pro-
grams’ QoS constraints. The effectiveness of our models was
compared to three static policies: 1) “free for all”, 2) uniform
partition, and 3) per-application maximum. The “free for all”
policy (policy 1) grants each program as many threads as there
are CPUs. This policy ensures that the system will be fully
utilized, at the risk of overcommitting CPUs. The uniform
partition policy (policy 2) considers the fact that pairs of
Fig. 15. Blackscholes Q=< 8, 2 >
programs will be executed. It grants each application an equal
share of the CPUs, allowing each program to create half as
many threads as CPUs. This policy avoids overcommitting the
system.
The per-application maximum policy (policy 3) takes into
account limits in program scalability. Each application is
granted as many threads as necessary to ensure that it will
reach its maximum throughput on an idle system (i.e., enough
threads to reach its performance plateau). If the program scales
continuously (i.e., a performance plateau of ∞), it is granted
as many threads as CPUs. For this policy, each program’s
performance plateau was determined in the same way as our
model generation algorithms. Therefore, this policy requires a
profiling step, whereas the other static policies do not. Figure 9
lists each program’s performance plateau. This policy may
overcommit the system.
Our policy uses the maximization function and generated
models when the system workload changes (Figures 8 and
9 respectively). We call this policy the cooperative dynamic
allocation policy. If the cooperative dynamic allocation policy
cannot find a configuration that meets QoS constraints, it uses
a configuration that, according to its utility models, maximizes
aggregate normalized throughput (i.e., the QoS constraints are
ignored).
The cooperative dynamic allocation policy performs an
exhaustive exploration of the configuration space. The space
is small enough (24² points) to permit such an approach. If
the number of programs is less than or equal to five (24⁵
points), an exhaustive approach is still feasible and takes less
than 0.05 seconds to choose a configuration. For more than
five programs, the time required to choose a configuration
is more than one second and motivates the use of different
configuration search strategies (e.g., hill climbing).
To evaluate the policies, we performed five experiments
using the policies and two sets of QoS constraints. In each
experiment, a different program from among the five programs
was chosen. The selected program was executed during the en-
tire length of the experiment. Each experiment was broken up
into five one minute periods. In the first period, there was only
the selected program executing. In each of the later periods,
the first program was executed with a different program from
among the remaining four programs. The execution of each
program’s CPU intensive parallel region was overlapped.
Every 20 seconds, the number of work units completed
by each program was obtained. The number of work units
completed was normalized to a program’s single threaded
performance. The result is the throughput of the program over
every 20 second period.
We varied each experiment’s QoS constraints. We present
results for Q =< 2, 8 > and Q =< 8, 2 >. A constraint of
Q =< X, Y > dictates that the first program (i.e., the long
running one) in the pair should execute X times as fast as its
single-threaded performance. The second program in the pair
should execute Y times as fast. We purposefully selected Q
values that are difficult to meet.
Figure 12 shows the percentage of the time that each
policy meets the QoS constraints. Higher is better. The best
result for each experiment (row) is in bold. The streamcluster
Q =< 8, 2 > experiment is of note because no policy met its
QoS constraints. Streamcluster scales poorly and is unable to
achieve a normalized throughput above 5. Therefore, no policy
can achieve a QoS of 8.
One particularly interesting case happens in the canneal
Q =< 8, 2 > experiment. This is the eighth row of Figure 12.
Here, the “free for all” policy meets the QoS constraints only
20% of the time, while the per-application maximum policy meets QoS
constraints only 26.7% of the time. The uniform partition
policy does better, meeting QoS constraints 73.3% of the
time. The cooperative dynamic allocation policy does the best:
it satisfies the QoS constraints 93.3% of the time.
In general, the cooperative policy chooses better configu-
rations, resulting in QoS constraints being met more often.
This policy, on average, meets the QoS constraints 70.0% of
the time. The next best policy, uniform partition, meets the
constraints an average of 54.7% of the time.
We also examined the throughput achieved by each policy,
regardless of whether QoS constraints were met. By examining
the achieved throughputs and how often the policies were able
to meet QoS constraints, we can observe how conservative the
chosen configurations are.
Figure 13 shows the average aggregate normalized through-
put of each experiment. A higher value indicates a larger
throughput. The cooperative dynamic allocation policy con-
sistently outperforms the other policies. In eight out of ten
experiments, the dynamic policy has the highest average
aggregate throughput. In the other two experiments, the policy
achieves an aggregate throughput that is only 2% from the
best policy’s throughput. On average, our policy obtained a
throughput of 17.2. A conservative policy that meets the QoS
constraints would achieve an aggregate throughput of only 10
(Q₁ + Q₂ = 8 + 2 = 10).
To better understand the policies’ behavior, we closely
examined some representative experiments. Figures 14 to 16
show these selected results. Each graph shows a different
experiment. Because uniform partition was the best static
policy, we show a comparison between it and our cooperative
policy. The y-axis shows the aggregate normalized throughput
and the x-axis shows time. The x-axis is annotated with the
workload in each of the five periods of the experiment. If
during a 20 second interval a configuration did not result
in both programs meeting the QoS constraints, the period’s
aggregate throughput is marked as “Failed QoS.”
In Figure 14 (blackscholes Q =< 2, 8 >), the uniform
partition policy nearly ties our dynamic policy. Indeed, in two
20 second intervals our policy fails to meet the QoS constraints
even though the uniform partition policy does. One such
interval happens when executing blackscholes with bodytrack
and the other interval occurs when executing blackscholes
with swaptions. However, in the blackscholes Q =< 8, 2 >
experiment, the static uniform partition policy fails—never
meeting the QoS constraints (Figure 15). Because the uniform
partition policy uses the same configuration regardless of
the workload, it cannot adapt to different QoS constraints
or exploit program behavior. Our policy does a good job
choosing an appropriate configuration, and thus, meets the
QoS constraints 80% of the time.
Another interesting result occurs in the streamcluster Q =<
2, 8 > experiment (Figure 16). The QoS constraints dictate
that streamcluster execute twice as fast as its single-threaded
performance, and the other program (if any) run eight times
as fast (Q =< 2, 8 >). Although the dynamic policy does not
always select a configuration that meets both QoS constraints,
Fig. 16. Streamcluster Q=< 2, 8 >
Fig. 17. Blackscholes Q=< 2, 8 > Per Application QoS Results
it does a good job. Only during a single 20 second period
does the uniform partition policy outperform it (i.e., the first
20 second interval of streamcluster with bodytrack). When
the workload consists of streamcluster with swaptions (the
last 60 second period), both the static uniform partition policy
and our policy meet the QoS constraints. However, our policy
chooses a significantly better configuration, resulting in higher
aggregate throughput.
To analyze the weaknesses of our policy, we closely exam-
ined some experiments to see the reason(s) for failing to meet
QoS constraints. Figure 17 shows an analysis of blackscholes
Q =< 2, 8 >. It can be seen that the uniform partition policy
consistently meets blackscholes’s QoS constraint. However, in
two 20 second intervals our policy did not. This result suggests
that our policy overestimates blackscholes’s scalability and that
blackscholes should be allowed to create more threads. Addi-
tional threads for blackscholes likely will result in increased
throughput. Both policies failed to meet streamcluster’s QoS
constraint of 8. As discussed previously, it is impossible for
streamcluster to achieve such a throughput.
Figure 18 shows an analysis of streamcluster Q =< 2, 8 >.
Fig. 18. Streamcluster Q=< 2, 8 > Per Application QoS Results
It can be seen that when streamcluster is executed with body-
track, our policy fails to meet streamcluster’s QoS constraint
but does meet bodytrack’s constraint. The uniform partition
policy also fails to meet streamcluster’s QoS constraint two
out of three times during that period. Additionally, the uniform
partition policy fails once to meet bodytrack’s constraint dur-
ing that period. Because both policies failed to meet stream-
cluster’s QoS during that period, additional investigation may
be warranted (e.g., to find out if those programs negatively
interact in unexpected ways). These results also show that
in this experiment, our policy was able to always meet at
least one program’s QoS constraints. Conversely, the uniform
partition policy failed to meet both programs’ QoS constraints
during an interval.
In summary, our cooperative dynamic allocation policy is
effective. It meets QoS constraints more often than any of
the static policies. It meets QoS constraints 28% more often
than the next best static policy (uniform partition). Our policy
simultaneously improves system throughput by an average of
19.3% over the next best policy (uniform partition).
VI. RELATED WORK
Techniques have been proposed to perform CPU allocation
in a variety of situations [5]–[7]. CPU allocation may be done
with or without the aid of application knowledge. Application
knowledge may be embedded in the source code (potentially
extractable later on) [8]–[11]. Knowledge may also come in
the form of programmer inserted hints [12] or gathered at run-
time, or via profiling [13]–[15]. Our work obtains application
knowledge through profiling.
Application knowledge can be used to build utility mod-
els and to make configuration decisions. Machine learning
approaches and statistical approaches are a natural fit for
building predictive and reactive algorithms that respond to
application knowledge [16]. Barnes et al., Ipek et al., and
Lee et al. used a variety of machine learning techniques (e.g.,
regression, neural networks) to predict program performance
in large dedicated clusters [17]–[19]. Unlike those authors,
our utility models are designed to predict performance on
a shared, non-dedicated CMP system, where contention for
processor time between multiple programs cannot be ignored.
Moseley et al. used linear regression and recursive partitioning
to make performance predictions about co-scheduled threads
on SMT contexts (2 contexts) [20]. Our work considers many
more contexts (24 CPUs) and makes decisions about how to
configure multiple programs simultaneously, without requiring
hardware performance counters.
VII. CONCLUSION
We described a new approach to build program utility
models. The models take into account the configuration of
both the program itself, as well as the configuration of the
system (number of other CPU-bound threads in the system,
number of CPUs in the system). We evaluated the accuracy
of several models generated by our method. We compared
the models’ ability to choose configurations that maximized
system aggregate throughput while meeting program QoS
constraints. As compared to static policies which do not
take into consideration the dynamic workload, our models
and dynamic configuration policy better met program QoS
constraints (28% more often) and increased system throughput
(19.3% higher). Therefore, dynamically using program utility
models is a viable solution to configuring application thread
usage on shared CMP machines.
ACKNOWLEDGEMENTS
This research was supported in part by the National Science
Foundation through awards CNS-1012070, CCF-0811295, and
CCF-0811352.
REFERENCES
[1] K. Olukotun, B. A. Nayfeh, L. Hammond, K. Wilson, and K. Chang, “The case for a single-chip multiprocessor,” SIGPLAN Not., vol. 31, no. 9, pp. 2–11, Sep. 1996. [Online]. Available: http://dx.doi.org/10.1145/248209.237140
[2] C. Bienia, S. Kumar, J. P. Singh, and K. Li, “The PARSEC benchmark suite: characterization and architectural implications,” in Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques, ser. PACT ’08. New York, NY, USA: ACM, 2008, pp. 72–81. [Online]. Available: http://dx.doi.org/10.1145/1454115.1454128
[3] R. W. Moore and B. R. Childers, “Inflation and deflation of self-adaptive applications,” in Proceedings of the 6th International Symposium on Software Engineering for Adaptive and Self-Managing Systems, ser. SEAMS ’11. New York, NY, USA: ACM, 2011, pp. 228–237. [Online]. Available: http://dx.doi.org/10.1145/1988008.1988041
[4] J. Lee, H. Wu, M. Ravichandran, and N. Clark, “Thread tailor: dynamically weaving threads together for efficient, adaptive parallel applications,” in Proceedings of the 37th Annual International Symposium on Computer Architecture, ser. ISCA ’10. New York, NY, USA: ACM, 2010, pp. 270–279. [Online]. Available: http://dx.doi.org/10.1145/1815961.1815996
[5] S. B. Ahmad, “On improved processor allocation in 2D mesh-based multicomputers: controlled splitting of parallel requests,” in Proceedings of the 2011 International Conference on Communication, Computing & Security, ser. ICCCS ’11. New York, NY, USA: ACM, 2011, pp. 204–209. [Online]. Available: http://dx.doi.org/10.1145/1947940.1947984
[6] L. F. Leung, C. Y. Tsui, and W. H. Ki, “Minimizing energy consumption of multiple-processors-core systems with simultaneous task allocation, scheduling and voltage assignment,” in Proceedings of the 2004 Asia and South Pacific Design Automation Conference, ser. ASP-DAC ’04. Piscataway, NJ, USA: IEEE Press, 2004, pp. 647–652. [Online]. Available: http://portal.acm.org/citation.cfm?id=1015090.1015267
[7] M. Kandemir, S. P. Muralidhara, S. H. K. Narayanan, Y. Zhang, and O. Ozturk, “Optimizing shared cache behavior of chip multiprocessors,” in Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture, ser. MICRO 42. New York, NY, USA: ACM, 2009, pp. 505–516. [Online]. Available: http://dx.doi.org/10.1145/1669112.1669176
[8] P. Charles, C. Grothoff, V. Saraswat, C. Donawa, A. Kielstra, K. Ebcioglu, C. von Praun, and V. Sarkar, “X10: an object-oriented approach to non-uniform cluster computing,” SIGPLAN Not., vol. 40, pp. 519–538, Oct. 2005. [Online]. Available: http://dx.doi.org/10.1145/1094811.1094852
[9] C. E. Leiserson, “The Cilk++ concurrency platform,” in Proceedings of the 46th Annual Design Automation Conference, ser. DAC ’09. New York, NY, USA: ACM, 2009, pp. 522–527. [Online]. Available: http://dx.doi.org/10.1145/1629911.1630048
[10] OpenMP Architecture Review Board, “OpenMP application program interface v3.0,” http://www.openmp.org/mp-documents/spec30.pdf.
[11] Message Passing Interface Forum, “MPI: A message-passing interface standard version 2.2,” http://www.mpi-forum.org/docs/mpi-2.2/mpi22-report.pdf.
[12] G. Chen and M. Kandemir, “Optimizing embedded applications using programmer-inserted hints,” in Proceedings of the 2005 Asia and South Pacific Design Automation Conference, ser. ASP-DAC ’05. New York, NY, USA: ACM, 2005, pp. 157–160. [Online]. Available: http://dx.doi.org/10.1145/1120725.1120794
[13] T. Suganuma, T. Yasue, M. Kawahito, H. Komatsu, and T. Nakatani, “Design and evaluation of dynamic optimizations for a Java just-in-time compiler,” ACM Trans. Program. Lang. Syst., vol. 27, no. 4, pp. 732–785, Jul. 2005. [Online]. Available: http://dx.doi.org/10.1145/1075382.1075386
[14] M. C. Maury, J. Dzierwa, C. D. Antonopoulos, and D. S. Nikolopoulos, “Online power-performance adaptation of multithreaded programs using hardware event-based prediction,” in Proceedings of the 20th Annual International Conference on Supercomputing, ser. ICS ’06. New York, NY, USA: ACM, 2006, pp. 157–166. [Online]. Available: http://dx.doi.org/10.1145/1183401.1183426
[15] M. Itzkowitz and Y. Maruyama, “HPC profiling with the Sun Studio performance tools,” in Tools for High Performance Computing 2009, M. S. Muller, M. M. Resch, A. Schulz, and W. E. Nagel, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 2010, ch. 6, pp. 67–93. [Online]. Available: http://dx.doi.org/10.1007/978-3-642-11261-4_6
[16] J. F. Martinez and E. Ipek, “Dynamic multicore resource management: A machine learning approach,” IEEE Micro, vol. 29, no. 5, pp. 8–17, Sep. 2009. [Online]. Available: http://dx.doi.org/10.1109/MM.2009.77
[17] B. J. Barnes, B. Rountree, D. K. Lowenthal, J. Reeves, B. de Supinski, and M. Schulz, “A regression-based approach to scalability prediction,” in Proceedings of the 22nd Annual International Conference on Supercomputing, ser. ICS ’08. New York, NY, USA: ACM, 2008, pp. 368–377. [Online]. Available: http://dx.doi.org/10.1145/1375527.1375580
[18] E. Ipek, B. de Supinski, M. Schulz, and S. McKee, “An approach to performance prediction for parallel applications,” in Euro-Par 2005 Parallel Processing, ser. Lecture Notes in Computer Science, J. Cunha and P. Medeiros, Eds. Berlin, Heidelberg: Springer Berlin/Heidelberg, 2005, vol. 3648, ch. 24, pp. 627–628. [Online]. Available: http://dx.doi.org/10.1007/11549468_24
[19] B. C. Lee, D. M. Brooks, B. R. de Supinski, M. Schulz, K. Singh, and S. A. McKee, “Methods of inference and learning for performance modeling of parallel applications,” in Proceedings of the 12th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, ser. PPoPP ’07. New York, NY, USA: ACM, 2007, pp. 249–258. [Online]. Available: http://dx.doi.org/10.1145/1229428.1229479
[20] T. Moseley, D. Grunwald, J. L. Kihm, and D. A. Connors, “Methods for modeling resource contention on simultaneous multithreading processors,” in Proceedings of the International Conference on Computer Design (ICCD), 2005, pp. 373–380. [Online]. Available: http://dx.doi.org/10.1109/ICCD.2005.74