Using Utility Prediction Models to Dynamically
Choose Program Thread Counts
Ryan W. Moore
Computer Science Department
University of Pittsburgh
Pittsburgh, USA
Email: [email protected]
Bruce R. Childers
Computer Science Department
University of Pittsburgh
Pittsburgh, USA
Email: [email protected]
Abstract—Multithreaded applications can simultaneously execute on a chip multiprocessor computer, starting and stopping without warning or pattern. The behavior of each program can be different, interacting in unexpected ways, including causing competition for CPU cycles, which harms performance.
To maximize program performance in this type of dynamic execution environment, the interactions among applications must be controlled. These interactions can be controlled by carefully choosing the number of threads for each multithreaded application (i.e., a system configuration). To choose a configuration, we advocate using program utility models to predict application behavior. Only a system that is capable of predicting and analyzing performance under multiple configurations can choose the best configuration and thus robustly meet its performance goals. In this paper, we present such a system.
Our approach first gathers profile data. The profile data is used by multiple linear regression to build a utility model. The model takes into account program scalability, susceptibility to interference, and any inherent leveling off of performance as a program's thread count is increased. A utility model is constructed for each application. When the system workload changes, the utility models are consulted to find the new configuration that maximizes system performance while meeting each program's quality of service goals. We use multithreaded applications from PARSEC to evaluate our approach. Compared to the best traditional policy, which does not consider variances in the dynamic workload, our approach simultaneously improves system throughput by 19.3% while meeting user performance constraints 28% more often.
Keywords—CMP; multicore; multiprocessor; multithreaded; performance; prediction
I. INTRODUCTION
Chip multiprocessors (CMP) are ubiquitous for high per-
formance computing due to their efficiency [1]. As programs
launch and terminate, demand for CPUs changes, perhaps
radically. Application demand for CPU time can often be
configured on a per program basis by, for example, having
a command line parameter determine the number of com-
pute threads. Too many threads in an application can cause
harmful contention with other application threads (e.g., thread
migrations, poor cache locality, stealing CPU time from other
threads). Too few threads may underutilize the system.
To maximize performance, the choice of how many threads
to use for a program, i.e., thread count, should be influenced by
the number of CPUs and number of threads in other programs.
Coordination of thread count across programs ensures that
applications do not unnecessarily compete or underutilize the
system.
Manual coordination of the overall thread count (among all
programs) is burdensome and too slow to react to quickly
changing workloads. Even automatic methods to coordinate
the creation of threads when a program is launched have
limitations: as other applications come and go, the ideal
number of compute threads for a program may vary.
To better understand the performance implications of setting
the number of threads in multithreaded programs, two sensi-
tivity studies were performed. Figure 1 shows the aggregate
normalized throughput of a multicore system as the number
of threads in two programs, blackscholes and streamcluster,
is varied from 0 to 24 [2]. The two programs share 24 CPUs.
Both programs are from PARSEC. The throughput of each
program is normalized to its single-threaded throughput. Ag-
gregate normalized throughput is a system-wide performance
metric.
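Concretely, if Tᵢ(c) denotes program i's throughput under system configuration c and Tᵢ(1) its single-threaded throughput on an idle system, the metric (in our notation; the paper uses it without giving a formula) is

    ANT(c) = Σᵢ Tᵢ(c) / Tᵢ(1),

summed over all running programs. Configurations are thus scored by how many "single-threaded equivalents" of work the whole system completes.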
The figure shows that setting the number of threads greatly
affects system throughput. In this experiment the best con-
figuration gives 21 threads to blackscholes and 3 threads to
streamcluster. This configuration has an aggregate normalized
throughput of 14.34 (the bright white spot in the figure).
Because the best configuration directs blackscholes to create
far more threads than streamcluster, it can be inferred that
blackscholes more effectively uses multiple CPUs.
What if streamcluster finished its work and an instance
of swaptions was started [2]? Figure 2 shows the aggregate
normalized throughput as the number of threads in blacksc-
holes and swaptions is varied. In this experiment, the best
configuration gives 1 thread to blackscholes and 24 threads
to swaptions. This configuration has an aggregate normalized
throughput of 22.5. It can be inferred that swaptions scales
better than blackscholes.
The results in Figures 1 and 2 show that, when maximizing
system throughput, it is important to properly choose, i.e.,
configure, the number of threads that each program will use.
Furthermore, the best choice for a program depends on other
executing programs. As programs enter and exit the system,
the number of threads that each one should use will change,
requiring dynamic reconfiguration.
Fig. 1. Aggregate normalized throughput of blackscholes and streamcluster as the number of threads in each program varies from 0 to 24

Fig. 2. Aggregate normalized throughput of blackscholes and swaptions as the number of threads in each program varies from 0 to 24

However, the configuration that maximizes system through-
put may starve an individual program. For example, the best
configuration to maximize system throughput when executing
blackscholes and swaptions (Figure 2) gave only a single
thread to blackscholes. It is possible that blackscholes will
take a very long time to finish its work, even though it is a
multithreaded application in a system with 24 CPUs.
As seen in the blackscholes and swaptions experiment, max-
imizing system throughput may starve individual programs.
Introducing quality of service (QoS) constraints solves this
problem. QoS constraints can avoid starvation by guaranteeing
that a program meets a particular level of performance. For
example, a QoS constraint might dictate that a program
achieve a throughput twice as large as when the application is
run with a single thread.
Each application should be configured so that it can meet
its QoS constraints. Any additional computational capability
of the system could be allocated to maximize system aggregate
throughput. This goal ensures that programs are not starved for
CPU time, while attempting to improve overall throughput.
In this paper, we consider the problem of choosing the
number of threads for each program to meet the above goal.
The number of threads used by an application is referred to
as its thread configuration. Choosing a configuration requires
considering each program’s scalability, in addition to the
effects of contention and interaction. We consider non-IO-
bound multithreaded programs (i.e., CPU-bound programs or
memory-bound programs). These programs are an important
class whose performance is easily affected by CPU contention
and configuration.
Program utility models can be used dynamically to choose
thread configurations. A utility model predicts the performance
of an application in a particular system configuration. Our
approach automatically builds program utility models offline
with multidimensional linear regression. The models also use
an automatically selected performance plateau. The perfor-
mance plateau accounts for when a program’s performance no
longer increases even if it is allowed to create more threads.
The models consider inter-process contention, without a priori
knowledge of the workload. Our approach consults the utility
models to find a configuration that maximizes system-wide
performance while meeting QoS constraints.
This paper makes the following contributions:
1) Techniques to automatically build utility models for mul-
tithreaded CPU-bound or memory-bound programs,
2) Utility models to maximize system throughput while
meeting program performance goals, and
3) Experimental evaluation of our techniques and compar-
isons to conventional policies which do not consider CPU
contention.
II. OVERVIEW OF APPROACH
Our approach to maximizing system throughput under QoS
constraints consists of two phases: an offline phase and an
online phase. These phases are shown in Figure 3. In the offline
phase (the left side of Figure 3), profiling is performed to
record a multithreaded program’s performance as its thread
count is changed. The profile data is used to generate a utility
model for a program.
In the online phase (the right side of Figure 3), the models of
executing programs are consulted. Programs are not admitted
into the system unless they have a utility model. The number
of threads to be used by each program is chosen. In the figure,
program 1 is configured with 6 compute threads and program
2 is configured with 4 compute threads. In this example, the
models predicted this configuration as the best one when the
two programs are co-runners.
The models are consulted when the workload changes.
When a new program enters the system, it will create threads to
obtain CPU time. Existing programs may respond to the arrival
of the new program (e.g., by decreasing their thread count).
Similarly, when a program finishes, CPU time may become
available.

Fig. 3. Phases to build and use predictive models (left: Offline Model Construction; right: Dynamic Configuration of Multithreaded Applications)

As a result, the number of threads in the remaining
programs might be increased. The actual thread configurations
chosen will depend on the model predictions and the system
policy. In this paper, our system policy aims to meet program
QoS constraints and maximize aggregate throughput.
Thread configuration changes need to be done dynamically.
Previous work explains how existing, inflexible programs can
be modified to support dynamic online thread creation and
destruction [3]. For example, an application which uses work
stealing can simply spawn new threads to receive tasks from
a global work queue. A work stealing application can also
shutdown existing threads while preserving program correct-
ness, so long as one thread is left executing. Additionally,
automatic methods exist to combine program threads into a
single instruction stream [4].
III. OFFLINE PROFILING AND MODEL GENERATION
Profiling is used to observe program performance as the
number of compute threads is changed. We also consider a
second parameter, the number of other CPU-bound threads
in the system,¹ in order to model a program’s performance
changes due to CPU contention. To vary CPU contention,
we use an appropriate synthetic benchmark. The profiling
space is three dimensional. One dimension is the number of
CPU-bound threads (x) in the program being profiled, another
dimension is the number of other CPU-bound threads (y), and
the last dimension (z) is the performance of the program when
run with x compute threads on a system executing y other
CPU-bound threads. z is determined via profiling. The model
is a function of the number of threads in the program and
the number of other threads. The model predicts a program’s
performance relative to its single-threaded performance.
Ideally, the offline profiling step should examine program
performance at every setting of x and y. However, such
an exploration is prohibitively expensive. If a program is
allowed to use between 1 and n threads on an n CPU system,
a complete profiling run would require n² data points (n
program threads · n other CPU-bound threads).
We constrain the set of profiling points by introducing a
parameter k, which determines the set of necessary profile
points. Decreasing k results in fewer experiments, at the cost
of less profiling information. Along with the number of CPUs, the set
of points, P, to be profiled is:

    P = { ⟨x, y, profilePerformance(x, y)⟩ }   where   x, y ∈ {1, n/k, 2n/k, 3n/k, ..., n}

¹In this paper, we consider memory-bound threads to be CPU-bound.
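For concreteness, the point set P can be enumerated directly. The following is a minimal Python sketch (our rendering, not the authors' code); profile_performance is a hypothetical stand-in for an actual profiling run, and k is assumed to divide n evenly, as in the experiments later in the paper.

def profile_points(n, k, profile_performance):
    # Thread-count settings to sample: {1, n/k, 2n/k, ..., n}.
    settings = [1] + [i * n // k for i in range(1, k + 1)]
    points = []
    for x in settings:        # compute threads in the profiled program
        for y in settings:    # other CPU-bound threads (synthetic load)
            points.append((x, y, profile_performance(x, y)))
    return points

For n = 24 and k = 4, this samples thread counts {1, 6, 12, 18, 24} and yields the 25 profile points used in Section V.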
After profiling, a model is built to fit the observed data
points. As more threads than CPUs are created, program
performance drops asymptotically due to the lack of access to
CPUs. Predicting a curve (e.g., by using polynomials of degree
2 or higher) is possible, but prone to overfitting and potentially
requires significant amounts of training data. Predicting a line
(i.e., with linear regression) requires less training data and
is less inclined to overfitting. Therefore, linear regression is
preferable if the data can be made linear. To account for
asymptotic growth in performance and to enable the use of
linear regression, we introduce a shift transformation:
shift(perf) = perf                                           if progThreads + otherThreads ≤ n
shift(perf) = perf · (progThreads + otherThreads) / n        if progThreads + otherThreads > n
The shift transformation accounts for the fact that if more
threads than CPUs are created, the threads will compete for
CPU time. The transformation compensates for the contention
by scaling the performance of a program in such situations.
For example, if two programs both create 24 threads on a 24
CPU system, each program effectively has access to only 12
CPUs. The shift transformation compensates for the program’s
diminished CPU access.
After the transformation, the profile data is better suited to
linear regression. At runtime, when the model is consulted to
make a prediction about performance, the transformation is
undone to account for the expected asymptotic behavior.
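A minimal Python sketch of the transformation and its inverse (our rendering; the variable names are ours) is:

def shift(perf, prog_threads, other_threads, n):
    # When the CPUs are oversubscribed, scale observed performance up
    # by the oversubscription factor so the data becomes roughly linear.
    total = prog_threads + other_threads
    return perf if total <= n else perf * total / n

def unshift(predicted, prog_threads, other_threads, n):
    # Inverse of shift, applied to model predictions at runtime.
    total = prog_threads + other_threads
    return predicted if total <= n else predicted * n / total

In the example above (two programs with 24 threads each on 24 CPUs, so total = 48), shift doubles each program's observed performance before regression, and unshift halves the model's prediction when that point is queried at runtime.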
Given the set of three-dimensional profiling points after
applying the shift transformation, multivariate linear regression
is used to find the three-dimensional plane that best predicts
the performance of the program given the profile data. Based
on our experience with program scalability, programs often
have linear growth as their thread count increases (at least for
a while).
Although programs may initially scale linearly, they do not
always continue to do so; performance eventually levels off.
We call this leveling-off point the performance plateau.
FINDPLATEAUMODEL(trainingData, plateauPoints)
    // Try each possible performance plateau point;
    // return the model with the lowest training error.
    bestModel = ∅
    bestModelError = ∞
    for p ∈ plateauPoints
        model = BUILDMODEL(trainingData, p)
        error = TESTMODEL(model, trainingData)
        if error < bestModelError
            bestModel = model
            bestModelError = error
    return bestModel
Fig. 4. Algorithm for determining a performance plateau model
A program’s utility model must incorporate the scalabil-
ity limitations of the program itself. Therefore, the model
must account for the performance plateau. The plateau can
be automatically determined by building many models, each
assuming a different performance plateau point.² The model
whose predictions best match the observed profile data will be
selected. Procedure FINDPLATEAUMODEL in Figure 4 shows
this process.
Figure 5 shows the BUILDMODEL procedure used to con-
struct a model with a specified performance plateau. The
procedure ignores any profile data where the program’s thread
count is strictly greater than the value of the performance
plateau. Linear regression is used to find the plane that best
fits the remaining profile data.
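The two procedures reduce to a short NumPy sketch (our rendering, not the authors' code): a plain least-squares plane fit stands in for the paper's multivariate regression step, float("inf") encodes a plateau of ∞, and TESTMODEL's error metric, which the paper does not specify, is assumed here to be mean squared error over the full profile set.

import numpy as np

def build_model(training_data, plateau_pt):
    # training_data: list of ((prog_threads, other_threads), shifted_perf).
    # Ignore points whose program thread count exceeds the candidate plateau.
    kept = [(c, p) for (c, p) in training_data if c[0] <= plateau_pt]
    A = np.array([[1.0, c[0], c[1]] for (c, _) in kept])  # offset + two slopes
    z = np.array([p for (_, p) in kept])
    coeffs, *_ = np.linalg.lstsq(A, z, rcond=None)        # least-squares plane
    return {"plateau": plateau_pt, "coeffs": coeffs}

def predict(model, conf):
    # Clamp the thread count at the plateau, then evaluate the plane.
    x = min(conf[0], model["plateau"])
    offset, scal, intf = model["coeffs"]
    return offset + scal * x + intf * conf[1]

def test_model(model, training_data):
    # Mean squared error over all profile points, including those past the
    # plateau, so over-optimistic plateau candidates are penalized.
    errs = [(predict(model, c) - p) ** 2 for (c, p) in training_data]
    return sum(errs) / len(errs)

def find_plateau_model(training_data, plateau_points):
    # Candidate plateaus are drawn from the sampled thread counts, plus ∞.
    models = [build_model(training_data, p) for p in plateau_points]
    return min(models, key=lambda m: test_model(m, training_data))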
Figure 6 shows canneal’s utility model. Canneal is from
PARSEC [2]. A flat region can be seen when 18 or more
threads are used. This region is canneal’s performance plateau.
The noticeable reduction in height as the number of other
threads in the system increases shows that canneal is suscep-
tible to contention from other threads. Nevertheless, canneal
scales well; when run with 18 threads in an otherwise idle sys-
tem, it achieves a maximum throughput approximately twelve
times its single-threaded throughput.
At runtime, a program’s utility model can be queried to
obtain a performance prediction. CONSULTMODEL (Figure 7)
shows the steps to query a model. If the model is queried about
a program’s performance when executing with 0 threads, the
expected performance is intuitively 0. If the model is being
queried about a point beyond the performance plateau, the
procedure instead makes a prediction about the program’s
performance at the start of the plateau. To make a prediction,
the procedure uses the model’s multivariate linear regression
model. The result from CONSULTMODEL is then scaled by
performing the inverse of the shift function.
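Assuming the model representation from the earlier sketch, CONSULTMODEL plus the inverse shift amounts to:

def consult_model(model, prog_threads, other_threads, n):
    if prog_threads == 0:
        return 0.0                  # no threads, no throughput
    # Past the plateau, predict performance at the start of the plateau.
    x = min(prog_threads, model["plateau"])
    offset, scal, intf = model["coeffs"]
    prediction = offset + scal * x + intf * other_threads
    # Undo the shift transformation when the CPUs are oversubscribed.
    total = prog_threads + other_threads
    return prediction if total <= n else prediction * n / total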
Our presented techniques are not intended to be used with
IO-bound programs. We envision two approaches to handle
IO-bound programs: an implicit and an explicit approach. An
implicit approach could use the performance plateaus to model
²A performance plateau at ∞ indicates continued scaling.
BUILDMODEL(trainingData, plateauPt)
    // Build a utility model with a particular performance plateau
    // point. Training points beyond plateauPt are ignored.
    confPts = ∅
    perfPts = ∅
    for point ∈ trainingData
        if point.threadConf.progThreads ≤ plateauPt
            // The point is not past the plateau point.
            confPts.append(point.threadConf)
            perfPts.append(point.result)
    model.plateauPt = plateauPt
    model.linearModel = REGRESSION(confPts, perfPts)
    return model
Fig. 5. Algorithm for building a model
Fig. 6. Visualization of Canneal’s Utility Model
performance limits due to IO-boundedness. This approach is
readily incorporated into our current approach.
An explicit approach could make use of additional inputs
to the model (e.g., amount of IO-bandwidth available). These
additional inputs would increase the dimensionality of the
model, likely resulting in increased profiling costs. However,
by explicitly modeling IO-related inputs, the models may
become more powerful and/or accurate.
Incorporating techniques to support IO-bound programs is
future work. We next describe how to use a model to pick
the number of threads for an application and make runtime
configuration decisions.
IV. ONLINE CONFIGURATION
With a utility model for each program, it is possible to
use the models to simultaneously consider both individual
program performance and system-wide performance. With n
CPUs, programs can use 1 to n threads. With m programs
there are nᵐ possible configurations.
CONSULTMODEL(model, conf)
    if conf.progThreads == 0
        return 0
    if conf.progThreads > model.plateauPt
        conf.progThreads = model.plateauPt
    return model.linearModel(conf.progThreads, conf.otherThreads)
Fig. 7. Algorithm for consulting a model
A program’s QoS constraint is given at launch. A program’s
QoS is relative to its single-threaded performance. Thus a
QoS equal to 2 means that the program should be executed
so as to achieve twice its single-threaded performance
on an otherwise idle system. A QoS of 2 may require
significantly more than 2 compute threads (e.g., if the program
scales poorly).
Figure 8 shows the function to choose the configuration. In
the figure, C is the system configuration (number of threads
in each program). Ci is the number of threads that program i
will create. The function Pi is the predictive utility model for
program i and corresponds to the CONSULTMODEL algorithm.
Pi has two parameters: the number of threads that program i
will use, and the number of other CPU-bound threads. Qi is
program i’s QoS constraint and n is the number of CPUs.
The QoS constraints are soft demands due to the use of
predictive models. Program utility models may, at times, make
mispredictions.
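Under the same assumptions as the earlier sketches (and reusing their consult_model helper), the maximization, including the fallback to pure throughput maximization described in Section V, can be written as an exhaustive search:

import itertools

def choose_configuration(models, qos, n):
    # models[i] is program i's utility model; qos[i] its QoS constraint.
    best_conf, best_key = None, (False, float("-inf"))
    for conf in itertools.product(range(1, n + 1), repeat=len(models)):
        total = sum(conf)
        preds = [consult_model(m, c, total - c, n)
                 for m, c in zip(models, conf)]
        feasible = all(p >= q for p, q in zip(preds, qos))
        # Prefer QoS-feasible configurations; among equals, maximize the
        # predicted aggregate normalized throughput.
        key = (feasible, sum(preds))
        if key > best_key:
            best_key, best_conf = key, conf
    return best_conf

For two programs on 24 CPUs this enumerates 24² = 576 configurations, matching the search-space size reported in Section V.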
V. EXPERIMENTAL EVALUATION
We evaluated our methods to create utility models and to
choose the number of threads for a workload. We perform the
evaluation on a subset of PARSEC: blackscholes, bodytrack,
canneal, streamcluster, and swaptions [2]. The programs were
modified to track their throughput. Some programs from
PARSEC are not used due to limitations in available inputs
and/or thread configuration flexibility.
All experiments were run on an AMD 6124 HE machine with
48 CPUs, running Linux 2.6.39.1. Conducting full experiments on a
48 CPU machine is extremely time intensive. Using Linux's
setaffinity facility, we used two of the four processor
sockets, limiting experiments to 24 CPUs.
A higher value of k usually leads to models that better
capture their program’s performance, but requires more pro-
filing experiments. As more profiling points are used, the
model accuracy improvements dwindle. Increasing k’s value
by 1 approximately doubles the number of profiling points as
compared to the previous value of k.
Using the fewest number of profiling points (k = 1), the
average mean squared error (MSE) of the models is 1.68. With
k = 2 the MSE improves dramatically to 1.03. Using k = 3
results in continued improvement (MSE of 0.93). Using k = 4
the MSE is 0.83, while using k = 6 results in an MSE of
0.76.³

³k = 5 is not used because the number of cores (24) is not evenly divisible by 5.
Some values of k lead to different models, particularly with
regards to performance plateau locations. For example, when
k = 4 the blackscholes model’s performance plateau is at ∞,
indicating continued linear scaling. However, when k = 8 the
model’s performance plateau is at 22 threads, suggesting that
blackscholes does not scale linearly past 22 threads. Only with
the additional profiling data was this fact evident.
For all training runs, we use k = 4. Therefore, to build each
program model, 25 profiling points are gathered.⁴ This choice
of k gave a good trade-off between number of experiments
performed and model accuracy. Furthermore, k = 4 allowed
building each program’s model relatively quickly (62.5 min-
utes) due to stopping the profiled process after a minute of
parallel execution.
The parameters of the generated models for k = 4 are
shown in Figure 9. The models are planar in three dimensions.
The models feature an offset and two slopes. A high scalability
sensitivity value indicates that the program scales well as
additional threads are used. Based on the models in the
figure, swaptions scales the best. For each additional thread, its
throughput increases by 0.79. Streamcluster, however, scales
the worst. For each additional thread, its throughput increases
by 0.26. The other programs (blackscholes, bodytrack, and
canneal) gain approximately 0.5 normalized throughput for
each additional thread.
A large interference sensitivity value indicates that a pro-
gram is easily affected by the presence of other compute
threads. Whether a program is positively or negatively affected
by the presence of other compute threads is determined by
the sign of the interference sensitivity value. Blackscholes
and streamcluster are insensitive to interference from other
compute threads. Their interference sensitivity values are very
close to 0. The other programs (bodytrack, canneal, and swap-
tions) lose approximately 0.1 throughput for each additional
compute thread from another application.
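Putting the parameters of Figure 9 together, each utility model is (in our notation) the clamped plane

    predictedPerf(x, y) = offset + scalabilitySensitivity · min(x, plateau) + interferenceSensitivity · y,

followed by the inverse shift of Section III. For example, canneal with x = 18 threads on an otherwise idle system (y = 0) is predicted to achieve 2.51 + 0.52 · 18 = 11.87 times its single-threaded throughput, consistent with the roughly twelve-fold figure quoted for its model in Section III.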
Several factors determine how a program is negatively
affected by the presence of other compute threads. As the
number of threads on the system increases, the operating
system may be forced to context switch out active threads. If a
switched out thread held a mutex, other threads may be forced
to wait for that thread to first be rescheduled, and then for it
to release the mutex. Another cause of interference may come
from the use of barriers. A barrier cannot be passed through
until all threads arrive at the barrier. Therefore, a thread which
reaches the barrier later than other threads (e.g., due to being
context switched out) may delay the progress of many other
threads.
A. Utility Model Accuracy
We examined the accuracy of a subset of the generated
models. We considered the same pairs of programs as seen
in Figures 1 and 2 (blackscholes with streamcluster, and
blackscholes with swaptions). Blackscholes, streamcluster, and
⁴All combinations of the program and the synthetic CPU-bound program having a thread count in {1, 6, 12, 18, 24}.
C = argmax_C  Σ_{i=1..|C|} P_i(C_i, Σ_{j=1..|C|} C_j − C_i)
    such that  ∀i:  P_i(C_i, Σ_{j=1..|C|} C_j − C_i) ≥ Q_i  and  C_i ≤ n
Fig. 8. Equation for determining the number of threads for the programs, C, given utility models P and goals Q
Program         Offset   Interference   Scalability   Performance
                         Sensitivity    Sensitivity   Plateau
blackscholes     0.65        0.00          0.48          ∞
bodytrack        2.55       -0.11          0.58          ∞
canneal          2.51       -0.13          0.52          18
streamcluster    0.92        0.01          0.26          12
swaptions        2.05       -0.13          0.79          ∞
Fig. 9. Generated program models
swaptions were chosen due to their varied behavior. Blacksc-
holes scales well and performs no communication between
threads. Swaptions also scales well, but performs inter-thread
communication. Streamcluster scales poorly. Additional ex-
haustive accuracy experiments were not performed due to the
time-intensive nature of executing pairs of programs in every
possible combination of thread settings (25² experiments per
program pair).
For each program in a pair, we varied the number of
threads from 0 to 24. The aggregate normalized throughput
of each configuration was recorded. The aggregate throughput
was compared to the generated models' predicted aggregate
throughput. The difference between the achieved throughput
and the predicted throughput is the prediction error. Figures 10
and 11 show the error.
In Figure 10, the models perform well when the numbers of
threads in blackscholes and streamcluster are approximately
equal. If one of the programs has significantly fewer threads
than the other program (the top left and bottom right corners)
then the error becomes significant.
When using the generated models to predict blackscholes’s
and swaptions’s aggregate throughput, the results are more
encouraging (Figure 11). If the number of threads in blacksc-
holes and swaptions is approximately the same, the error is
low (≤ 10%). As with the previous experiment, if the number
of threads in each program differs greatly the prediction
error increases. Overall, when predicting the throughput of
blackscholes with swaptions, the models are significantly more
accurate than when predicting the throughput of blackscholes
with streamcluster.
Although we did not exhaustively examine every pairwise
combination, the results from these experiments suggest that
our models are reasonably accurate and can successfully
predict aggregate throughput. We also compared the predicted
versus actual aggregate normalized throughput for 500 ran-
domly chosen configurations (50 random configurations per
pair of programs). The average error was 19%. These results
match the trends presented in Figures 10 and 11.
Fig. 10. Prediction error of the blackscholes and streamcluster models as compared to actual aggregate throughput

Fig. 11. Prediction error of the blackscholes and swaptions models as compared to actual aggregate throughput
Program         QoS          Free      Uniform    Per Application   Cooperative
                Constraints  for All   Partition  Maximum           Dynamic Allocation
blackscholes    Q=<2, 8>     46.7%     80.0%      46.7%             66.7%
bodytrack       Q=<2, 8>     60.0%     53.3%      60.0%             66.7%
canneal         Q=<2, 8>     73.3%     53.3%      80.0%             73.3%
streamcluster   Q=<2, 8>     60.0%     46.7%      73.3%             80.0%
swaptions       Q=<2, 8>     20.0%     60.0%      20.0%             73.3%
blackscholes    Q=<8, 2>     73.3%      0.0%      80.0%             80.0%
bodytrack       Q=<8, 2>     60.0%     80.0%      66.7%             73.3%
canneal         Q=<8, 2>     20.0%     73.3%      26.7%             93.3%
streamcluster   Q=<8, 2>      0.0%      0.0%       0.0%              0.0%
swaptions       Q=<8, 2>     86.7%    100.0%      86.7%             93.3%
Average                      50.0%     54.7%      54.0%             70.0%
Fig. 12. Percentage of the time that each policy was able to meet program QoS constraints
Program         QoS          Free      Uniform    Per Application   Cooperative
                Constraints  for All   Partition  Maximum           Dynamic Allocation
blackscholes    Q=<2, 8>     14.7      13.6       14.7              16.1
bodytrack       Q=<2, 8>     17.2      16.0       17.9              17.5
canneal         Q=<2, 8>     16.0      14.7       15.8              16.9
streamcluster   Q=<2, 8>     10.9       9.7       11.8              14.3
swaptions       Q=<2, 8>     20.4      18.0       20.3              21.1
blackscholes    Q=<8, 2>     14.7      13.6       14.7              15.2
bodytrack       Q=<8, 2>     17.2      16.0       17.9              17.5
canneal         Q=<8, 2>     16.0      14.7       15.8              16.9
streamcluster   Q=<8, 2>     10.9       9.7       11.8              14.4
swaptions       Q=<8, 2>     20.4      18.0       20.3              21.9
Average                      15.8      14.4       16.1              17.2
Fig. 13. Average aggregate normalized throughput for each policy
Fig. 14. Blackscholes Q=< 2, 8 >
B. Comparisons to Static Policies
We performed experiments to determine the ability of our models
to maximize system throughput while meeting pairs of pro-
grams’ QoS constraints. The effectiveness of our models was
compared to three static policies: 1) “free for all”, 2) uniform
partition, and 3) per-application maximum. The “free for all”
policy (policy 1) grants each program as many threads as there
are CPUs. This policy ensures that the system will be fully
utilized, at the risk of overcommitting CPUs. The uniform
partition policy (policy 2) considers the fact that pairs of
Fig. 15. Blackscholes Q=< 8, 2 >
programs will be executed. It grants each application an equal
share of the CPUs, allowing each program to create half as
many threads as CPUs. This policy avoids overcommitting the
system.
The per-application maximum policy (policy 3) takes into
account limits in program scalability. Each application is
granted as many threads as necessary to ensure that it will
reach its maximum throughput on an idle system (i.e., enough
threads to reach its performance plateau). If the program scales
continuously (i.e., a performance plateau of ∞), it is granted
as many threads as CPUs. For this policy, each program’s
performance plateau was determined in the same way as our
model generation algorithms. Therefore, this policy requires a
profiling step, whereas the other static policies do not. Figure 9
lists each program’s performance plateau. This policy may
overcommit the system.
Our policy uses the maximization function and generated
models when the system workload changes (Figures 8 and
9 respectively). We call this policy the cooperative dynamic
allocation policy. If the cooperative dynamic allocation policy
cannot find a configuration that meets QoS constraints, it uses
a configuration that, according to its utility models, maximizes
aggregate normalized throughput (i.e., the QoS constraints are
ignored).
The cooperative dynamic allocation policy performs an
exhaustive exploration of the configuration space. The space
is small enough (24² points) to permit such an approach. If
the number of programs is less than or equal to five (24⁵
points), an exhaustive approach is still feasible and takes less
than 0.05 seconds to choose a configuration. For more than
five programs, the time required to choose a configuration
is more than one second and motivates the use of different
configuration search strategies (e.g., hill climbing).
To evaluate the policies, we performed five experiments
using the policies and two sets of QoS constraints. In each
experiment, a different program from among the five programs
was chosen. The selected program was executed during the en-
tire length of the experiment. Each experiment was broken up
into five one minute periods. In the first period, there was only
the selected program executing. In each of the later periods,
the first program was executed with a different program from
among the remaining four programs. The execution of each
program’s CPU intensive parallel region was overlapped.
Every 20 seconds, the number of work units completed
by each program was obtained. The number of work units
completed was normalized to a program’s single threaded
performance. The result is the throughput of the program over
every 20 second period.
We varied each experiment’s QoS constraints. We present
results for Q =< 2, 8 > and Q =< 8, 2 >. A constraint of
Q =< X, Y > dictates that the first program (i.e., the long
running one) in the pair should execute X times as fast as its
single-threaded performance. The second program in the pair
should execute Y times as fast. We purposefully selected Q
values that are difficult to meet.
Figure 12 shows the percentage of the time that each
policy meets the QoS constraints. Higher is better. The best
result for each experiment (row) is in bold. The streamcluster
Q =< 8, 2 > experiment is of note because no policy met its
QoS constraints. Streamcluster scales poorly and is unable to
achieve a normalized throughput above 5. Therefore, no policy
can achieve a QoS of 8.
One particularly interesting case happens in the canneal
Q =< 8, 2 > experiment. This is the eighth row of Figure 12.
Here, the “free for all” policy meets the QoS constraints only
20% of the time, while the per-application maximum policy meets QoS
constraints only 26.7% of the time. The uniform partition
policy does better, meeting QoS constraints 73.3% of the
time. The cooperative dynamic allocation policy does the best:
it satisfies the QoS constraints 93.3% of the time.
In general, the cooperative policy chooses better configu-
rations, resulting in QoS constraints being met more often.
This policy, on average, meets the QoS constraints 70.0% of
the time. The next best policy, uniform partition, meets the
constraints an average of 54.7% of the time.
We also examined the throughput achieved by each policy,
regardless of whether QoS constraints were met. By examining
the achieved throughputs and how often the policies were able
to meet QoS constraints, we can observe how conservative the
chosen configurations are.
Figure 13 shows the average aggregate normalized through-
put of each experiment. A higher value indicates a larger
throughput. The cooperative dynamic allocation policy con-
sistently outperforms the other policies. In eight out of ten
experiments, the dynamic policy has the highest average
aggregate throughput. In the other two experiments, the policy
achieves an aggregate throughput that is only 2% from the
best policy’s throughput. On average, our policy obtained a
throughput of 17.2. A conservative policy that meets the QoS
constraints would achieve an aggregate throughput of only 10
(Q₁ + Q₂ = 8 + 2 = 10).
To better understand the policies’ behavior, we closely
examined some representative experiments. Figures 14 to 16
show these selected results. Each graph shows a different
experiment. Because uniform partition was the best static
policy, we show a comparison between it and our cooperative
policy. The y-axis shows the aggregate normalized throughput
and the x-axis shows time. The x-axis is annotated with the
workload in each of the five periods of the experiment. If
during a 20 second interval a configuration did not result
in both programs meeting the QoS constraints, the period’s
aggregate throughput is marked as “Failed QoS.”
In Figure 14 (blackscholes Q =< 2, 8 >), the uniform
partition policy nearly ties our dynamic policy. Indeed, in two
20 second intervals our policy fails to meet the QoS constraints
even though the uniform partition policy does. One such
interval happens when executing blackscholes with bodytrack
and the other interval occurs when executing blackscholes
with swaptions. However, in the blackscholes Q =< 8, 2 >
experiment, the static uniform partition policy fails—never
meeting the QoS constraints (Figure 15). Because the uniform
partition policy uses the same configuration regardless of
the workload, it cannot adapt to different QoS constraints
or exploit program behavior. Our policy does a good job
choosing an appropriate configuration, and thus, meets the
QoS constraints 80% of the time.
Another interesting result occurs in the streamcluster Q =<
2, 8 > experiment (Figure 16). The QoS constraints dictate
that streamcluster execute twice as fast as its single-threaded
performance, and the other program (if any) run eight times
as fast (Q =< 2, 8 >). Although the dynamic policy does not
always select a configuration that meets both QoS constraints,
Fig. 16. Streamcluster Q=< 2, 8 >
Fig. 17. Blackscholes Q=< 2, 8 > Per Application QoS Results
it does a good job. Only during a single 20 second period
does the uniform partition policy outperform it (i.e., the first
20 second interval of streamcluster with bodytrack). When
the workload consists of streamcluster with swaptions (the
last 60 second period), both the static uniform partition policy
and our policy meet the QoS constraints. However, our policy
chooses a significantly better configuration, resulting in higher
aggregate throughput.
To analyze the weaknesses of our policy, we closely exam-
ined some experiments to see the reason(s) for failing to meet
QoS constraints. Figure 17 shows an analysis of blackscholes
Q =< 2, 8 >. It can be seen that the uniform partition policy
consistently meets blackscholes’s QoS constraint. However, in
two 20 second intervals our policy did not. This result suggests
that our policy overestimates blackscholes’s scalability and that
blackscholes should be allowed to create more threads. Addi-
tional threads for blackscholes likely will result in increased
throughput. Both policies failed to meet streamcluster’s QoS
constraint of 8. As discussed previously, it is impossible for
streamcluster to achieve such a throughput.
Figure 18 shows an analysis of streamcluster Q =< 2, 8 >.
Fig. 18. Streamcluster Q=< 2, 8 > Per Application QoS Results
It can be seen that when streamcluster is executed with body-
track, our policy fails to meet streamcluster’s QoS constraint
but does meet bodytrack’s constraint. The uniform partition
policy also fails to meet streamcluster’s QoS constraint two
out of three times during that period. Additionally, the uniform
partition policy fails once to meet bodytrack’s constraint dur-
ing that period. Because both policies failed to meet stream-
cluster’s QoS during that period, additional investigation may
be warranted (e.g., to find out if those programs negatively
interact in unexpected ways). These results also show that
in this experiment, our policy was able to always meet at
least one program’s QoS constraints. Conversely, the uniform
partition policy failed to meet both programs’ QoS constraints
during an interval.
In summary, our cooperative dynamic allocation policy is
effective. It meets QoS constraints more often than any of
the static policies. It meets QoS constraints 28% more often
than the next best static policy (uniform partition). Our policy
simultaneously improves system throughput by an average of
19.3% over the next best policy (uniform partition).
VI. RELATED WORK
Techniques have been proposed to perform CPU allocation
in a variety of situations [5]–[7]. CPU allocation may be done
with or without the aid of application knowledge. Application
knowledge may be embedded in the source code (potentially
extractable later on) [8]–[11]. Knowledge may also come in
the form of programmer inserted hints [12] or gathered at run-
time, or via profiling [13]–[15]. Our work obtains application
knowledge through profiling.
Application knowledge can be used to build utility mod-
els and to make configuration decisions. Machine learning
approaches and statistical approaches are a natural fit for
building predictive and reactive algorithms that respond to
application knowledge [16]. Barnes et al., Ipek et al., and
Lee et al. used a variety of machine learning techniques (e.g.,
regression, neural networks) to predict program performance
in large dedicated clusters [17]–[19]. Unlike those authors,
our utility models are designed to predict performance on
a shared, non-dedicated CMP system, where contention for
processor time between multiple programs cannot be ignored.
Moseley et al. used linear regression and recursive partitioning
to make performance predictions about co-scheduled threads
on SMT contexts (2 contexts) [20]. Our work considers many
more contexts (24 CPUs) and makes decisions about how to
configure multiple programs simultaneously, without requiring
hardware performance counters.
VII. CONCLUSION
We described a new approach to build program utility
models. The models take into account the configuration of
both the program itself, as well as the configuration of the
system (number of other CPU-bound threads in the system,
number of CPUs in the system). We evaluated the accuracy
of several models generated by our method. We compared
the models’ ability to choose configurations that maximized
system aggregate throughput while meeting program QoS
constraints. As compared to static policies which do not
take into consideration the dynamic workload, our models
and dynamic configuration policy better met program QoS
constraints (28% more often) and increased system throughput
(19.3% higher). Therefore, dynamically using program utility
models is a viable solution to configuring application thread
usage on shared CMP machines.
ACKNOWLEDGEMENTS
This research was supported in part by the National Science
Foundation through awards CNS-1012070, CCF-0811295, and
CCF-0811352.
REFERENCES
[1] K. Olukotun, B. A. Nayfeh, L. Hammond, K. Wilson, and K. Chang, “The case for a single-chip multiprocessor,” SIGPLAN Not., vol. 31, no. 9, pp. 2–11, Sep. 1996. [Online]. Available: http://dx.doi.org/10.1145/248209.237140
[2] C. Bienia, S. Kumar, J. P. Singh, and K. Li, “The PARSEC benchmark suite: characterization and architectural implications,” in Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques, ser. PACT ’08. New York, NY, USA: ACM, 2008, pp. 72–81. [Online]. Available: http://dx.doi.org/10.1145/1454115.1454128
[3] R. W. Moore and B. R. Childers, “Inflation and deflation of self-adaptive applications,” in Proceedings of the 6th International Symposium on Software Engineering for Adaptive and Self-Managing Systems, ser. SEAMS ’11. New York, NY, USA: ACM, 2011, pp. 228–237. [Online]. Available: http://dx.doi.org/10.1145/1988008.1988041
[4] J. Lee, H. Wu, M. Ravichandran, and N. Clark, “Thread tailor: dynamically weaving threads together for efficient, adaptive parallel applications,” in Proceedings of the 37th Annual International Symposium on Computer Architecture, ser. ISCA ’10. New York, NY, USA: ACM, 2010, pp. 270–279. [Online]. Available: http://dx.doi.org/10.1145/1815961.1815996
[5] S. B. Ahmad, “On improved processor allocation in 2D mesh-based multicomputers: controlled splitting of parallel requests,” in Proceedings of the 2011 International Conference on Communication, Computing & Security, ser. ICCCS ’11. New York, NY, USA: ACM, 2011, pp. 204–209. [Online]. Available: http://dx.doi.org/10.1145/1947940.1947984
[6] L. F. Leung, C. Y. Tsui, and W. H. Ki, “Minimizing energy consumption of multiple-processors-core systems with simultaneous task allocation, scheduling and voltage assignment,” in Proceedings of the 2004 Asia and South Pacific Design Automation Conference, ser. ASP-DAC ’04. Piscataway, NJ, USA: IEEE Press, 2004, pp. 647–652. [Online]. Available: http://portal.acm.org/citation.cfm?id=1015090.1015267
[7] M. Kandemir, S. P. Muralidhara, S. H. K. Narayanan, Y. Zhang, and O. Ozturk, “Optimizing shared cache behavior of chip multiprocessors,” in Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture, ser. MICRO 42. New York, NY, USA: ACM, 2009, pp. 505–516. [Online]. Available: http://dx.doi.org/10.1145/1669112.1669176
[8] P. Charles, C. Grothoff, V. Saraswat, C. Donawa, A. Kielstra, K. Ebcioglu, C. von Praun, and V. Sarkar, “X10: an object-oriented approach to non-uniform cluster computing,” SIGPLAN Not., vol. 40, pp. 519–538, Oct. 2005. [Online]. Available: http://dx.doi.org/10.1145/1094811.1094852
[9] C. E. Leiserson, “The Cilk++ concurrency platform,” in Proceedings of the 46th Annual Design Automation Conference, ser. DAC ’09. New York, NY, USA: ACM, 2009, pp. 522–527. [Online]. Available: http://dx.doi.org/10.1145/1629911.1630048
[10] OpenMP Architecture Review Board, “OpenMP application program interface v3.0,” http://www.openmp.org/mp-documents/spec30.pdf.
[11] Message Passing Interface Forum, “MPI: A message-passing interface standard version 2.2,” http://www.mpi-forum.org/docs/mpi-2.2/mpi22-report.pdf.
[12] G. Chen and M. Kandemir, “Optimizing embedded applications using programmer-inserted hints,” in Proceedings of the 2005 Asia and South Pacific Design Automation Conference, ser. ASP-DAC ’05. New York, NY, USA: ACM, 2005, pp. 157–160. [Online]. Available: http://dx.doi.org/10.1145/1120725.1120794
[13] T. Suganuma, T. Yasue, M. Kawahito, H. Komatsu, and T. Nakatani, “Design and evaluation of dynamic optimizations for a Java just-in-time compiler,” ACM Trans. Program. Lang. Syst., vol. 27, no. 4, pp. 732–785, Jul. 2005. [Online]. Available: http://dx.doi.org/10.1145/1075382.1075386
[14] M. C. Maury, J. Dzierwa, C. D. Antonopoulos, and D. S. Nikolopoulos, “Online power-performance adaptation of multithreaded programs using hardware event-based prediction,” in Proceedings of the 20th Annual International Conference on Supercomputing, ser. ICS ’06. New York, NY, USA: ACM, 2006, pp. 157–166. [Online]. Available: http://dx.doi.org/10.1145/1183401.1183426
[15] M. Itzkowitz and Y. Maruyama, “HPC profiling with the Sun Studio performance tools,” in Tools for High Performance Computing 2009, M. S. Muller, M. M. Resch, A. Schulz, and W. E. Nagel, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 2010, ch. 6, pp. 67–93. [Online]. Available: http://dx.doi.org/10.1007/978-3-642-11261-4_6
[16] J. F. Martinez and E. Ipek, “Dynamic multicore resource management: A machine learning approach,” IEEE Micro, vol. 29, no. 5, pp. 8–17, Sep. 2009. [Online]. Available: http://dx.doi.org/10.1109/MM.2009.77
[17] B. J. Barnes, B. Rountree, D. K. Lowenthal, J. Reeves, B. de Supinski, and M. Schulz, “A regression-based approach to scalability prediction,” in Proceedings of the 22nd Annual International Conference on Supercomputing, ser. ICS ’08. New York, NY, USA: ACM, 2008, pp. 368–377. [Online]. Available: http://dx.doi.org/10.1145/1375527.1375580
[18] E. Ipek, B. de Supinski, M. Schulz, and S. McKee, “An approach to performance prediction for parallel applications,” in Euro-Par 2005 Parallel Processing, ser. Lecture Notes in Computer Science, J. Cunha and P. Medeiros, Eds. Berlin, Heidelberg: Springer Berlin/Heidelberg, 2005, vol. 3648, ch. 24, pp. 627–628. [Online]. Available: http://dx.doi.org/10.1007/11549468_24
[19] B. C. Lee, D. M. Brooks, B. R. de Supinski, M. Schulz, K. Singh, and S. A. McKee, “Methods of inference and learning for performance modeling of parallel applications,” in Proceedings of the 12th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, ser. PPoPP ’07. New York, NY, USA: ACM, 2007, pp. 249–258. [Online]. Available: http://dx.doi.org/10.1145/1229428.1229479
[20] T. Moseley, D. Grunwald, J. L. Kihm, and D. A. Connors, “Methods for modeling resource contention on simultaneous multithreading processors,” in Proceedings of the International Conference on Computer Design (ICCD), 2005, pp. 373–380. [Online]. Available: http://dx.doi.org/10.1109/ICCD.2005.74