
Probabilistic Auto-Tuning for Architectures with Complex Constraints

Benjamin Ylvisaker∗ (GrammaTech, Inc.) [email protected]
Scott Hauck (University of Washington) [email protected]

ABSTRACT

It is hard to optimize applications for coprocessor accelerator architectures, like FPGAs and GPUs, because application parameters must be tuned carefully to the size of the target architecture. Moreover, some combinations of parameters simply do not work, because they lead to overuse of a constrained resource. Applying auto-tuning—the use of search algorithms and empirical feedback to optimize programs—is an attractive solution, but tuning in the presence of unpredictable failures is not addressed well by existing auto-tuning methods.

This paper describes a new auto-tuning method that is based on probabilistic predictions of multiple program features (run time, memory consumption, etc.). During configuration selection, these predictions are combined to balance the preference for trying configurations that are likely to be high quality against the preference for trying configurations that are likely to satisfy all constraints. In our experiments, our new auto-tuning method performed substantially better than the simpler approach of treating all failed configurations as if they succeeded with a "very low" quality. In many cases, the simpler strategy required more than twice as many trials to reach the same quality level.

Categories and Subject Descriptors

D.3.4 [Programming Languages]: Processors

General Terms

Optimization, probabilistic

Keywords

Auto-tuning, accelerator architectures

1. INTRODUCTION

∗ Benjamin was at the University of Washington when he did the work described in this paper.


Tuning is the process of adapting a program to a particular target architecture or class of target architectures. Automatic and semi-automatic tuning has been a topic of interest in the high performance computing community for many years, and tuning for embedded systems and even general purpose processors has been growing in importance recently. Cohen et al. [5] argue that one technology trend driving interest in auto-tuning is the widening gap between true peak performance and what is typically achieved by conventional compiler optimization. The central problem is that architectures are so complex that for most intents and purposes it is impossible to accurately model the effects of optimizations or configuration changes on the performance of a program.

This paper focuses on parallel coprocessor accelerators, like field programmable gate arrays (FPGAs) and general purpose graphics processing units (GPGPUs),¹ for which tuning is extremely important for achieving high performance. Accelerators have many explicitly managed resources, like distributed memories, non-uniform local networks and memory interfaces, that applications must use well to achieve good performance. Adjusting algorithm-level parameters, like loop blocking factors, is an important part of this tuning process, and it is hard and tedious to do by hand. Nevertheless, manual tuning is still common.

Figure 1 gives an intuitive sense for the complex optimization spaces that all auto-tuning methods have to contend with. The plateaus, sharp cliffs and chaotic behavior make simple search strategies like hill climbing ineffective.

Accelerators present an additional challenge for automatic tuning: they have relatively poor architectural support for graceful performance degradation. If a particular configuration of a program needs to use more local data or instruction memory than is available, the program will simply fail to compile or run properly. Thus, tuning for accelerators combines searching for high values of a quality function (e.g. performance) with satisfying a number of resource constraints. Conventional approaches to auto-tuning focus on just the quality function. Quality-only methods can be adapted by giving a default "very bad" score to configurations that violate some constraint. However, our results provide evidence that this is not an effective strategy.

This paper describes a new auto-tuning method that is designed to address the constraint satisfaction problem.

¹ Our experiments are geared to FPGA-like architectures, though we believe the auto-tuning methods described in this paper are applicable to a wider class of architectures.


[Figure 1 image: a 3D run time surface over tile size and unroll amount for a matrix multiplication kernel (N=800), reproduced from the source cited in the caption below.]

Figure 1: A tuning space for a matrix multiplication kernel. The run time function is not generally smooth, which makes tuning a challenge. In particular, there are flat "plateaus", sharp "cliffs" and multiple "troughs" with near-optimal values. (Graphic borrowed from [18].)

The primary novel feature of our search algorithm is that it uses probabilistic predictions of several program features, then combines these predictions to calculate an overall score for untested configurations. We assume that the programmer (or some high-level program generator) has declared a number of tuning knobs, for which the auto-tuner will discover good values. We sometimes call tuning knobs parameters, and a complete set of tuning knob values for an application is a configuration. A number of programming language-level interfaces to auto-tuning systems have been proposed recently [1, 2, 11, 16, 24, 25, 27].
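
As a concrete (and purely illustrative) picture of these terms, a configuration can be thought of as a complete assignment of values to the declared tuning knobs; the sketch below uses the knob names from the FIR filter example developed later, not any particular tuning interface.

    /* Purely illustrative: a configuration is a complete assignment of values
     * to the declared tuning knobs.  These knob names come from the FIR filter
     * example in Sections 3 and 8, not from a real tuning interface. */
    typedef struct {
        int banks;              /* knob: number of coefficient banks */
        int accesses_per_bank;  /* knob: parallel accesses per bank per iteration */
    } Configuration;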

Failures (constraint violations) make auto-tuning more challenging because it is no longer sufficient to optimize a single quality function. It is possible to define the quality of all failing configurations to be "very low". However, there are two important weaknesses to this simple approach to failures:

• If a large portion of all possible configurations fail, it can be hard for a tuning search to find any regions of the space where there are successes, because all failures are equally bad.

• The highest quality configurations are often very close to failing configurations, because it is usually best to use up all the available resources without oversubscribing them. Thus it is likely that smart auto-tuning algorithms for accelerators will spend a lot of time exploring the border between successful and failing configurations. Understanding why some configurations fail can help the search choose better sequences of configurations to test.

Using a probabilistic framework for tuning has some additional benefits that are not necessarily limited to accelerators. We will discuss these throughout the paper. There is a great deal of prior related work, both in tuning application-level parameters, as well as tuning compilers and architectures. We discuss the relationships between this paper and prior work in Section 9.

2. OVERVIEW OF THE TUNING KNOBS METHOD

There are two basic ingredients required to use the tuning knob system:

• A real-valued optimization formula written by the programmer, with program features as the variables and simple arithmetic like addition and multiplication.

• A set of Boolean-valued constraint formulas, some of which are written by the programmer (e.g., energy consumed less than some application-defined limit) and some of which are provided as part of the system implementation (e.g., memory usage less than capacity of the target architecture).

    void fir1(int *I, int *O, int *C, int N, int NC) {
        for (int j = 0; j < N; j++) {
            O[j] = 0;
            for (int k = 0; k < NC; k++) {
                O[j] += I[j + k] * C[k];
            }
        }
    }

Figure 2: A simple generic FIR filter.

The tuning process involves iteratively selecting and testing configurations until some stopping criterion is met. The search algorithm has to make predictions about which untested point is most likely to both have a good value for the optimization formula and satisfy all the constraint formulas. To our knowledge, this paper describes the first application-level auto-tuning method that uses probabilities and probability distributions to represent predictions about the values of program features, the likelihood of meeting constraints, and the likelihood of having a "good" value for the optimization formula. Casting the problem in probabilistic terms is useful because we can use rich statistical math to combine many competing factors.

3. AN EXAMPLE APPLICATION

To see how tuning knobs are used in the applications we experimented with, consider the finite impulse response (FIR) filter in Figure 2. This is a simple sequential implementation of the algorithm, which assigns to each output location the sum of values in a window of the input array, scaled by the values in a coefficient array.

There is abundant potential parallelism in this application. All N × NC multiplications are completely independent, and the N × NC additions are N independent reductions of size NC.

Adapting a FIR filter for high-performance execution on an accelerator requires making a number of implementation choices. The inner loop should be parallelized, but complete parallelization is unrealistic if NC is large. We assume that NC is large enough that the coefficient array will have to be broken up into banks and distributed around the local memories of the accelerator. Different accelerators have memory structures that support different access patterns. Thus we assume that the loop is partially parallelized, controlled by a tuning knob that we will call "Banks".
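
To make the "Banks" knob concrete, here is a minimal C sketch of one way the loop might be restructured around it. The BANKS constant, the function name, and the serial bank loop are illustrative assumptions rather than the paper's actual tuning interface; on a real accelerator each bank's partial sums would come from a separate local memory and be computed in parallel.

    #define BANKS 4   /* tuning knob: number of coefficient banks */

    void fir_banked(int *I, int *O, int *C, int N, int NC) {
        int bank_len = (NC + BANKS - 1) / BANKS;   /* coefficients per bank */
        for (int j = 0; j < N; j++) {
            int acc = 0;
            /* On an accelerator, each bank would sit in its own local memory
             * and the bank loop bodies would execute in parallel. */
            for (int b = 0; b < BANKS; b++) {
                for (int k = b * bank_len; k < (b + 1) * bank_len && k < NC; k++) {
                    acc += I[j + k] * C[k];
                }
            }
            O[j] = acc;
        }
    }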

For the purpose of the tuning method presented in this paper, it is not important whether these high-level structural changes are performed by a human or a domain-specific program generator. For our experiments we wrote code with explicit tuning knobs by hand. A more detailed development of the example can be found in [25].

In this small example, the optimization formula is simply the run time, which should be minimized. The constraints are all architecture-specific, and would be provided by the compiler. The constraints are described in Section 8.

4. PROBABILISTIC AUTO-TUNING

In this section, we describe the most important features of our auto-tuning method in a top-down style. There are many implementation details in the system, and many of them have a number of reasonable alternatives. We will explicitly point out what parts are essential to the functioning of the system and what parts could be replaced without fundamentally altering it.

In theory, at each step in the search process the algorithm is trying to find the untested configuration (or candidate) c that maximizes the following joint probability formula:

P(c is the highest quality² configuration ∩ c satisfies all constraints)

Analyzing the interdependence of the quality and constraint features of a tuning space is hard given the relatively small amount of training data that auto-tuners typically have to work with. There certainly are such interdependences, but we found in our experimentation that it works reasonably well to assume that the two factors are completely independent. With this independence assumption we can factor the joint probability into the product of two simpler probabilities.

P(c is the highest quality configuration) × P(c satisfies all constraints)

A successful tuning algorithm must maintain a balance between focusing the search in the areas that seem most promising versus testing a more evenly distributed set of configurations to get a good sampling of the space. We discovered a nice technical trick that helps maintain this balance: instead of predicting the probability that a candidate is the very highest quality, we predict the probability that the quality of a candidate is higher than an adjustable target quality (qt). Selection of the target quality is addressed in Section 7.1; the quality of the best configuration tested so far is a good starting point.

P(quality(c) > qt) × P(c satisfies all constraints)

Next we consider how to compute the joint probability that a configuration satisfies all constraints. Ideally, the system would be able to model the correlation between different constraints and use them to predict the joint probability of satisfying all constraints. Unfortunately, the small number of configurations that auto-tuning systems generally test provide very little training data for these kinds of sophisticated statistical analyses. However, there are often strong correlations between different constraints, since most of them relate to overuse of some resource, and a knob that correlates with one kind of resource consumption often correlates with consumption of other kinds of resources. In our experiments we found that using the minimum probability of success across all constraints worked well. This is an optimistic simplification; if we assume instead that all constraints are completely independent, using the product of the individual probabilities would be appropriate.

² For simplicity of presentation we assume that the optimization formula specifies that high values are good.

P(quality(c) > qt) × min_{N ∈ constraints} P(c satisfies constraint N)

At each step in the tuning process, the system attempts to find the untested configuration that maximizes this formula. This formula is complex enough that it is not clear that there is an efficient way to solve precisely for the maximum. Instead our tuning algorithm chooses a pseudorandom set of candidates, evaluates the formula on each one, and tests the one with the highest probability.

To evaluate this formula, we need probabilistic predictions for the program features that determine quality and constraint satisfaction. We call raw features like the run time or energy consumption of the program sensors. Sensors can be combined with arithmetic operations to make derived features, like the run time-energy product.

For each candidate point and each sensor, the tuning knob search algorithm produces a predicted value distribution for that sensor at the given point. We use normal (Gaussian) distributions, because as long as we assume that the features are independent, we can combine normal distributions with arithmetic operations to produce predictions for the derived features that are also normal. Derived features are discussed in more detail in Section 6.

Aside. In our experiments, we made the simplifying assumption that a particular configuration will always have the same values for its features (run time, memory use, ...) if it is compiled and tested multiple times. In other words, we assume that the system we are tuning behaves deterministically. This is clearly a major simplification, and accommodating random system variation is an interesting and important direction for future work. It is possible that probabilistic predictions, as implemented in the tuning knob search, will be a useful framework for addressing the random variation problem as well.

4.1 Hard to predict constraints

One of the trickiest issues left to deal with is deciding what the constraint formula should be for some failure modes. The easy failures are those for which some program feature can be used to predict failure or success, and it is possible to get a value for that feature for both failed and successful tests. For example, the programmer can impose the constraint that the program cannot use more than a certain amount of energy. Every test can report its energy use, and these reported values can be used to predict the energy use of untested configurations.

The harder failures are those for which the most logical feature for predicting the failure does not have a defined value at all for failing tests. For example, consider an application where some tuning knobs have an impact on dynamic memory allocation, and a non-trivial range of configurations require more memory than is available in the system. It is possible to record the amount of memory used as a program feature, but for those configurations that run out of memory it is not easy to get the value we really want, which is how much memory this configuration would have used if it had succeeded.

Another constraint of the nastier variety is compile time. Full compilation for FPGAs and other accelerators can take hours or even days.


[Figure 3 image: two example plots, (a) and (b), of a constraint metric for tested configurations, marking values for configurations that did and did not satisfy the constraint and the set of values whose mean and variance define the cutoff distribution.]

Figure 3: Two examples of setting the cutoff value for some constraint. The portion of a candidate's predicted value that is less than the cutoff determines what its predicted probability of passing this constraint will be. Values of this metric for tested configurations are represented as red x's and blue o's. Tested configurations that cannot be said to have passed or failed this constraint (because they failed for some other reason) are not represented here at all. Note that if there is some overlap in the constraint metric between cases that passed and cases that failed, the cutoff will be a distribution, not a scalar value; this is fine: the result of comparing two normal distributions with a greater-than or less-than operator can still be interpreted as a simple probability.

Compile time can be especially high when the resource requirements of the application are very close to the limits of the architecture. For this reason it is common to impose a time limit on compilation. Like the memory usage example, failing configurations do not tell us how much of the relevant resource (compile time) would have been required to make the application work.

To compute predictions for the probability of satisfying the harder constraints, we use proxy constraints that the programmer or compiler writer believes correlate reasonably well with the "real" constraint, but for which it is possible to measure a value in both successful and failed cases. An example of a proxy constraint from our experiments is the size of the intermediate representation (IR) of a kernel as a proxy for hitting the time limit during compilation. This is not a perfect proxy in the sense that some configurations that succeed will be larger than some that cause a time limit failure. This imperfection raises the question of what the IR size limit should be for predicting a time limit failure.

To set the limit for constraint f with proxy metric p, we examine all tested points. If a configuration failed constraint f, its value for metric p is recorded as a failed value. If a configuration succeeded, or even got far enough to prove that it will not fail constraint f, its value for metric p is recorded as a successful value. Configurations that failed in some other way that does not determine whether they would have passed f or not are not recorded.

The failed and successful values are sorted, as indicated in Figure 3. The cutoff region is considered to be everything from the lowest failed value up to the lowest failed value that is higher than the highest successful value. In the special case of no overlap between successful and failed values, the cutoff region is a single value. The cutoff is then computed as the mean and variance of the values in the cutoff region. Since the system is already using normal distributions to model the predictions for all real values, it is completely natural to compare this distribution with the IR size prediction distribution to compute the probability of hitting the compiler time limit.
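
The comparison at the end of this procedure reduces to a single closed-form probability. The following C sketch (our own illustrative code, not the paper's) shows how a candidate's predicted proxy-metric distribution can be compared against the cutoff distribution, assuming the two normals are independent.

    #include <math.h>

    typedef struct { double mean, stddev; } Normal;

    /* P(X < Y) for independent X ~ N(mx, sx^2) and Y ~ N(my, sy^2):
     * X - Y ~ N(mx - my, sx^2 + sy^2), so the answer is one CDF evaluation. */
    static double prob_less_than(Normal x, Normal y) {
        double sd = sqrt(x.stddev * x.stddev + y.stddev * y.stddev);
        return 0.5 * erfc((x.mean - y.mean) / (sd * sqrt(2.0)));
    }

    /* Probability that a candidate passes the compile-time-limit constraint,
     * given its predicted IR-size distribution and the cutoff distribution. */
    static double prob_pass_proxy(Normal predicted_ir_size, Normal cutoff) {
        return prob_less_than(predicted_ir_size, cutoff);
    }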

There are many other strategies that could be used to predict the probability of a candidate satisfying all constraints. For example, classification methods like support vector machines (SVMs [17]) or neural networks could prove effective. The classical versions of these methods produce absolute predictions instead of probabilistic predictions, but they have been extended to produce probabilistic predictions in a variety of ways. Also, the intermingling of successful and failing configurations (as opposed to a clean separation between the classes) is a challenge for some conventional classification methods.

5. PROBABILISTIC REGRESSION ANALYSIS

At the heart of our tuning knob search method is probabilistic regression analysis. Regression analysis is the prediction of values of a real-valued function at unknown points given a set of known points. Probabilistic regression analysis produces a predicted distribution, instead of a single value. Classical regression analysis is a very well-studied problem. Probabilistic regression analysis has received less attention, but there are a number of existing approaches in the applied statistics and machine learning literature. One of the most actively studied methods in recent years is Gaussian processes (GPs [15]).

Probabilistic regression analysis has been used to solve a number of problems (e.g., choosing sites for mineral extraction), but we are not aware of any auto-tuning algorithms that use it. The regression analysis needed for auto-tuning is somewhat different from the conventional approaches. Most existing approaches to probabilistic regression require a prior distribution, which is an assumption about the shape of the function before any training data have been observed. These assumptions are usually based on some formal or informal model of the system being measured. Auto-tuners generally have no a priori way to know the shape of the function for some program feature.

We must, however, make some assumptions about the characteristics of the underlying functions. Without any assumptions, it is impossible to make predictions; any value would be equally likely. We designed our own relatively simple probabilistic regression method based on the assumption that local linear averaging and linear projections from local slope estimates are good guides for predicting values of untested configurations. The effect of these assumptions is that our regression analysis produces more accurate predictions for features that can be modeled well by piecewise linear functions.

To keep it as clear as possible, the initial description of the complete tuning knob search method uses simplistic implementations for some subcomponents. More sophisticated alternatives are described in Section 7.

Throughout this section we use one-dimensional visualizations to illustrate the mathematical concepts. The math itself generalizes to an arbitrary number of dimensions.


[Figure 4 image: value versus knob setting for a set of tested configurations, showing a local linear interpolation curve and local linear derivative projections.]

Figure 4: A comparison of local linear averaging and derivative projection. Averaging is "safe" in the sense that it never produces predictions outside the range of observed values. However, plain averaging does not follow the trends in the data, and so produces predictions that do not seem intuitively right when the tested values seem to be "pointing" in some direction.

5.1 Averaging tested values

The first step in calculating a candidate's distribution is a local linear averaging. For some candidate point ~p, interp(~p) is the weighted average³ of the values of the points that are neighbors of ~p, where the weight for neighbor ~n is the inverse of the distance between ~p and ~n. This model has two components that require definition: distance and neighbors.

5.1.1 Distance

The distance between two points is composed of individual distances between two settings on each knob, and a method for combining those individual distances. For continuous and discrete range knobs, we initially assume that the distance between two settings is just the absolute difference between their numerical values.

We combine the individual distances by summing them (i.e., we use the Manhattan distance). Euclidean distance can be used as well; in our experiments, the impact of the difference between Manhattan and Euclidean distances on final search effectiveness was small.

5.1.2 Neighbors

There are many reasonable methods for deciding which points should be considered neighbors of a given point. For the initial description, we will assume that a point's neighbors in some set are the k nearest points in that set, where we choose k to be 2 times the number of tuning knobs (i.e. dimensions) in the application. The intuition for this k value is that if the tested configurations are evenly distributed, most points will have one neighbor in each direction. This definition of neighbors performed reasonably well in preliminary experiments, but has some weaknesses. In Section 7.2 we give a more sophisticated alternative.
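
Putting the distance, neighbor, and averaging definitions together, a minimal C sketch of the averaging model might look like the following. The function names and the brute-force neighbor selection are our own illustrative choices; the paper's implementation details differ (and Section 7.2 replaces the k-nearest rule entirely).

    #include <math.h>
    #include <stdlib.h>

    /* Manhattan distance between two configurations over `dims` knobs. */
    static double manhattan(const double *a, const double *b, int dims) {
        double d = 0.0;
        for (int i = 0; i < dims; i++) d += fabs(a[i] - b[i]);
        return d;
    }

    /* Inverse-distance weighted average of the k nearest tested values.
     * `tested` holds n configurations of `dims` knob settings each, `values`
     * the corresponding measured feature values; k would normally be 2 * dims. */
    static double interp(const double *candidate, const double *tested,
                         const double *values, int n, int dims, int k) {
        double num = 0.0, den = 0.0;
        char *used = calloc(n, 1);   /* marks points already picked as neighbors */
        for (int picked = 0; picked < k && picked < n; picked++) {
            int best = -1;
            double best_d = INFINITY;
            for (int i = 0; i < n; i++) {
                if (used[i]) continue;
                double d = manhattan(candidate, &tested[i * dims], dims);
                if (d < best_d) { best_d = d; best = i; }
            }
            used[best] = 1;
            double w = 1.0 / (best_d > 1e-12 ? best_d : 1e-12);  /* guard exact matches */
            num += w * values[best];
            den += w;
        }
        free(used);
        return num / den;
    }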

5.2 Derivative-based projection

Averaging is important for predicting the value of a function, but it does not take trends in the training data into account at all, as illustrated in Figure 4.

³ Any weighted averaging method works (arithmetic, geometric, etc.); we used the arithmetic mean.

[Figure 5 image: value versus knob setting for two neighboring tested points and a candidate point, showing the interpolation line, the extrapolation lines from each neighbor, and the values used to compute the candidate's mean and standard deviation.]

Figure 5: The basic ingredients that go into the probabilistic regression analysis used in the tuning knob search. The distribution for a given candidate configuration is the weighted mean and standard deviation of the averaged value between neighboring tested points (black line) and projected values from the slope at the neighbors (dashed blue lines).

In order to take trends in the data into account, we add a derivative-based projection component to the regression analysis. In a sense, projection is actually serving two roles: (1) it helps make the predictions follow our assumption that functions are piecewise linear; (2) it helps identify the regions where there is a lot of uncertainty about the value of the function.

For each candidate point ~c we produce a separate projection from each of the neighbors of ~c. We do this by estimating the derivative in the direction of ~c at each neighbor ~n. The derivative estimate is made by using the averaging model to estimate the value of the point ε distance from ~n in the opposite direction from ~c.

~d = ~n + (ε / dist(~c, ~n)) (~n − ~c)

We use the averaging model to calculate a value for ~d, which gives us a predicted derivative at ~n.

derivative(~n, ~c) = (value(~n) − interp(~d)) / ε

Finally, to get the value for ~c predicted from ~n, we project the derivative back at ~c.

extrap(~c, ~n) = value(~n) + dist(~c, ~n) × derivative(~n, ~c)

A useful property of this projection method is that it takes into account the values of points farther from ~c than its immediate neighborhood; in a sense expanding the set of local points that influence the prediction.

Figure 5 illustrates averaging between two tested points and derivative projection from the same points. Three different values are generated for the candidate configuration (the dotted vertical line); the mean and variance of these values become the predicted distribution for this candidate for whatever feature we are currently working with.
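
The projection step itself is a couple of lines of arithmetic. A hedged C sketch, assuming the averaged value at the probe point ~d has already been computed with the averaging model:

    /* Given the value at neighbor ~n, the averaged value at the probe point ~d,
     * the candidate-to-neighbor distance, and the probe distance epsilon,
     * produce the value extrapolated to the candidate ~c. */
    static double extrap_from_neighbor(double value_n,   /* value(~n) */
                                       double interp_d,  /* interp(~d), from the averaging model */
                                       double dist_cn,   /* dist(~c, ~n) */
                                       double epsilon) {
        double derivative = (value_n - interp_d) / epsilon;  /* slope at ~n toward ~c */
        return value_n + dist_cn * derivative;               /* project back to ~c */
    }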

5.3 Predicted distribution

The final predicted distribution for each candidate point ~c is computed as a weighted mean and variance over the distance-weighted average at ~c and the values projected to ~c from all of its neighbors.


[Figure 6 image: value versus knob setting, showing the predicted mean and the ±1 and ±2 standard deviation bands between two tested points.]

Figure 6: An illustration of the distributions that are produced by the regression method presented here. Notice that the variance is much higher farther away from tested points. Also the slopes at the neighbors (blue lines) "pull" the mean line up or down. Finally, the "best" configuration to test next depends on the relative importance given to high mean quality versus large variance.

The selection of the weights is important, and we use a "gravitational" model, where the weight on each projected value is the square of the inverse distance from the neighbor that we used to calculate that projection (w_~n = 1 / dist(~p, ~n)²). The weight on the averaged value is equal to the sum of the weights on all the projected values. In other words, the averaging is as important as all of the projections combined. So the complete set of weights and values used to compute the predicted distribution is:

{ ( Σ_{n ∈ neighbors} w_n , interp(~c) ) } ∪ { ( w_n , extrap(~c, ~n) ) | n ∈ neighbors }

Observe that the variance of this set will be large when the values projected from each of the neighbors are different from each other and/or the averaged value.
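
A small C sketch of this weighted combination, under the gravitational-weight assumption described above (the variable names are ours):

    #include <math.h>

    typedef struct { double mean, stddev; } Normal;

    /* Combine the averaged value and the per-neighbor projected values into a
     * predicted distribution, using gravitational (inverse-square-distance)
     * weights; the averaged value gets as much weight as all projections combined. */
    static Normal predicted_distribution(double interp_val, const double *extrap_vals,
                                         const double *dist_cn, int k) {
        double w_sum = 0.0;
        for (int i = 0; i < k; i++) w_sum += 1.0 / (dist_cn[i] * dist_cn[i]);
        double total_w = 2.0 * w_sum;

        double mean = w_sum * interp_val;
        for (int i = 0; i < k; i++)
            mean += (1.0 / (dist_cn[i] * dist_cn[i])) * extrap_vals[i];
        mean /= total_w;

        double var = w_sum * (interp_val - mean) * (interp_val - mean);
        for (int i = 0; i < k; i++) {
            double w = 1.0 / (dist_cn[i] * dist_cn[i]);
            var += w * (extrap_vals[i] - mean) * (extrap_vals[i] - mean);
        }
        var /= total_w;

        Normal out = { mean, sqrt(var) };
        return out;
    }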

Figure 6 shows what the predicted distributions would look like for the whole range of candidates between two tested points.

5.4 Target quality

Initially we assume that the target is simply the quality of the best configuration found so far; the high-water mark. Figure 7 illustrates how candidates' quality predictions are compared with the target. In this example, the highest quality tested point is not shown.

6. DERIVED FEATURES

An important part of the tuning knob search method is that predictions for sensors (features for which raw data is collected during configuration testing) can be combined in a variety of ways to make derived feature predictions. A very simple example of a derived feature is the product of program run time and energy consumption. It is possible to compute a run time-energy product value for each tested point, and then run the regression analysis directly on those values.

[Figure 7 image: score versus knob setting, showing the distributions of interpolated and extrapolated values for several candidates, the target quality line, and the probability mass above the target.]

Figure 7: An illustration of randomly selected candidate configurations (dotted vertical lines), quality predictions for those configurations (normal distributions), the current target (dashed line), and the probability of a candidate being better than the target (dark portions of the distributions).

However, by running the regression analysis on the constituent functions (run time, energy), then combining the predicted distributions mathematically, we can sometimes make significantly more accurate predictions.

A slightly more complex example derived feature is an application with two sequenced loops nested within an outer loop. We can measure the run time of the whole loop nest as a single sensor, or we can measure the run time of the inner loops separately and combine them into a derived feature for the run time of the whole loop nest. If the adjustment of the tuning knobs in this application trades off run time of the two inner loops in some non-trivial way, it is more likely that we will get good predictions for the individual inner loops. The individual effects of the knobs on the inner loops are conflated together in the run time of the whole loop nest, which makes prediction harder.

An example of a derived feature that is used in our experiments is the proxy metric for compiler time limit violations. The proxy metric combines a number of measures of intermediate representation size and complexity; each measure is predicted in isolation, then the predictions are combined using derived features.

Each mathematical operator that we would like to use to build derived features needs to be defined for basic values (usually simple) and distributions of values. The simplest operators are addition and subtraction. The sum of two normal distributions (for example the predicted run times for the two inner loops in our example above) is a new normal distribution whose mean is the sum of the means of the input distributions and whose standard deviation is the sum of the input standard deviations. This definition assumes that the input distributions are independent, which is a simplifying assumption we make for all derived features.
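
In code, the sum operator as stated here is a one-liner. The sketch below follows the text exactly (the means add and the standard deviations add); note that adding standard deviations is a conservative choice, since for truly independent normals the variances, rather than the deviations, would add.

    typedef struct { double mean, stddev; } Normal;

    /* Derived-feature sum, following the definition given in the text. */
    static Normal normal_add(Normal a, Normal b) {
        Normal out = { a.mean + b.mean, a.stddev + b.stddev };
        return out;
    }

For instance, if the two inner loops of the loop-nest example were predicted at 10±1 ms and 15±2 ms, the derived whole-nest run time under this rule would be 25±3 ms.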

Multiplication and division are also supported operators for derived features. Unfortunately, multiplying and dividing normal distributions does not result in distributions that are precisely normal. However, as long as the input distributions are relatively far from zero, the output distribution can be closely approximated with a normal distribution.


    Basic tuning knob search
      Initialize T, the set of results, to empty
      Select a config. at random; test it; add results to T
      while (termination criteria not met)
        Compute quality target, qt
        Do pre-computation for regression analysis (e.g. build neighborhood)
        Repeat N times (100 in our experiments)
          Select an untested config. c at random
          Perform regression analysis at point c for all sensors
          Evaluate derived features at point c
          pSuccess ← 1
          ∀f ∈ failure modes
            pSuccess ← min(pSuccess, P(feature linked to f))
          score(c) = P(quality(c) > qt) × pSuccess
        Test best candidate, add results to T

Figure 8: The complete basic tuning knob search algorithm.

We use such an approximation, and it is currently left to the user (either the programmer or the compiler writer) to decide when this might be a problem. For details of all supported derived feature operations, see [25].

The complete basic tuning knob algorithm is shown in Figure 8.

7. ENHANCEMENTS

As described so far, the search method worked reasonably well in our preliminary experiments, but we found a number of ways to improve its overall performance and robustness. The main quantitative result in the evaluation section will show the large difference between our search method with all the refinements versus using the approach that treats all failing configurations as having a very low quality. Compared to the large difference between sophisticated and simplistic failure handling, the refinements in the following sections have a relatively small impact.

Due to space constraints, we cannot fully cover all of the topics explored in [25]. In particular, we only mention the following items briefly.

• Constraint scaling. We developed a method for dynamically adjusting constraint violation prediction probabilities based on observed failure rates.

• Termination criteria. In our experiments all searches tested 50 configurations.

• Concurrent testing. It is relatively easy to accommodate concurrent testing in our tuning framework.

• Boundary conditions. The edges of the configuration space are treated specially to avoid boundary effects.

• Distance scaling. It is useful to scale the relative distance of different parameters according to how big of an impact they have on program features.

7.1 Target quality

Given a predicted quality distribution for several candidates, it is not immediately obvious which is the best to test next. This is true even if we ignore the issue of failures entirely. Some candidates have a higher predicted mean and smaller variance, whereas some have a larger variance and lower mean. This is a question of how much "risk" the search algorithm should take, and there is no simple best answer. The strategy we use is to compute the probability that each candidate's quality is greater than some target, which is dynamically adjusted as the search progresses.

The simplest method for choosing the target that worked well in our experiments is using the maximum quality over all the successful tested configurations. There is no reason that the target has to be exactly this high-water mark, though, and adjusting the target is a very effective way of controlling how evenly distributed the set of tested points is. The evenness of the distribution of tested points is an interesting metric because the ideal distribution of tested points is neither perfectly evenly distributed nor too narrowly focused. Distributions that are too even waste many searches in regions of the configuration space that are unlikely to contain good points. Distributions that are too uneven run a high risk of missing good points by completely ignoring entire regions of the space.

To keep the set of tested points somewhat evenly distributed, the target is adjusted up (higher than the high-water mark) when the distribution is getting too even and adjusted down when the distribution is getting too uneven. Higher targets tend to favor candidates that have larger variance, which are usually points that are farther from any tested point, and testing points in "empty space" tends to even out the distribution.

There are many ways to measure the evenness of the distribution of a set of points, including [8] and [12]. See [25] for the details of our target quality adjustment heuristic.

7.2 Neighborhoods

When the distribution of a set of points is fairly uneven, the simple k-nearest and radius δ hypersphere definitions of neighbor pairs (illustrated in Figure 9) do not capture some important connections. In particular, points that are relatively far apart but do not have any points in between them might not be considered neighbors, because there are enough close points in other directions. To get a better neighbor connection graph, we use a method similar to the elliptical Gabriel graph described in [13]. Two points are considered neighbors as long as there does not exist a third point in a "region of influence" between them. For details, see [25].

8. EVALUATION

Comparing the tuning knob search against existing auto-tuning approaches is problematic because the main motivation for developing a new search method was handling hard-to-predict failures, and we are not aware of any other auto-tuning methods that address this issue. As evidence that our failure handling is effective, we compare the complete tuning knob algorithm against the same algorithm, but with the constraint/failure prediction mechanisms turned off. Points that fail are simply given the quality value of the lowest quality configuration found so far.⁴

As a basic verification that the tuning knob search algorithm selects a good sequence of configurations to test, we also compare against pseudorandom configuration selection.

⁴ We also tried making the score for failing points lower than the lowest found so far. The results were not significantly different.



Figure 9: The mathematically simple methods for defining neighbor graphs can produce unsatisfactory results. (a) illustrates the radius δ hypersphere method; (b) illustrates the k-nearest method (with k=5). In both cases the set of neighbors is highly skewed. (c) shows a more desirable neighbor graph. Graphic borrowed from [13].

Pseudorandom searching is a common baseline in the auto-tuning literature. Other baselines that are used in some published results include hill-climbing style searches and simulated annealing-style searches. These kinds of algorithms could be combined with trivial failure handling, but we chose not to perform these comparisons because they would not shed light on the central issue of the importance of handling failures intelligently.

We performed our experiments in the context of a research architecture and toolflow developed at the University of Washington, called Mosaic [9, 20]. Mosaic architectures are FPGA-like, but with coarse-grained interconnect and ALUs. We generated four concrete architectures by varying two parameters: number of clusters and number of memories per cluster. All architectures had 4 ALUs per cluster. The smaller architectures had 16 clusters, or 64 ALUs, and the larger architectures had 64 clusters, or 256 ALUs. The low memory architectures had one data memory per cluster, and the high memory architectures had two. All architectures were assumed to have 16 instruction slots per ALU.⁵

8.1 The applications

To give a sense for the shapes of the tuning spaces, Figure 10 shows plots of all the performance and failure data gathered during all of our experimentation for one application (the FIR filter). Data for the other applications can be found in [25]. These plots include data from experiments with many different search methods, so the distribution of tested configurations is not meaningful. There is one plot for each architecture. The meanings of the symbols in the plots are given in the following table.

Symbol               Meaning
Black dot            Not tested
Green star           Max instruction memory failure
Empty blue square    Data memory or I/O failure
Red X                Compiler timeout failure
Filled square        Color indicates normalized quality

⁵ The amount of instruction memory is small compared to conventional processors, but fairly typical of architectures in the time-multiplexed FPGA family.

[Figure 10 image: four panels of FIR performance data, one per architecture (small/few memories, small/many memories, large/few memories, large/many memories). Each panel plots Banks (0-30) on the horizontal axis against Accesses per bank (0-30) on the vertical axis, with color/shading indicating normalized quality from 0 (low) to 1 (high).]

Figure 10: Normalized performance and failure modes for the FIR filter. Orange/1 is the highest performance setting; Black/0 is the lowest performance setting. Small/Large and Few/Many refer to the architecture variants that we experimented with.

The finite impulse response (FIR) filter application, which was discussed in detail earlier, has a knob that controls the number of banks into which the coefficient and input buffer arrays are broken. The more banks, the more distributed memories that can be used in parallel. The second knob controls the number of accesses to each bank per iteration. Performance should increase with increasing values of both knobs, because more parallel arithmetic operations are available.

As you can see in Figure 10, both data memory and instruction memory failures are common, more so in the smaller architectures. The number of memories clearly limits the banks knob, which results in the large regions of failures on the right side of the plots for the smaller architectures. The number of parallel accesses to an array is limited by the instruction memory of the architecture, which creates the large regions of failure toward the top of the plots. The larger architectures show an interesting effect: when both knobs are turned up relatively high, the compiler begins to hit the time limit because the application is simply too large to compile. This creates the jagged diagonal line of failures.

The dense matrix multiplication implementation has one level of blocking in both dimensions, in the SUMMA style [19]. SUMMA-style matrix multiplication involves reading stripes of the input matrices in order to compute the results for small rectangular blocks of the output matrix, one at a time. The tuning knobs control the size of the blocking in each dimension.
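The following minimal Python sketch illustrates the SUMMA-style blocking structure and its two knobs; it is an illustration of the loop nest only, not the accelerator implementation.

    # block_rows and block_cols play the role of the two tuning knobs.
    def summa_matmul(A, B, block_rows, block_cols):
        n, k, m = len(A), len(A[0]), len(B[0])
        C = [[0.0] * m for _ in range(n)]
        for i0 in range(0, n, block_rows):        # one output block at a time
            for j0 in range(0, m, block_cols):
                # Read the stripe of A rows and B columns feeding this block.
                for i in range(i0, min(i0 + block_rows, n)):
                    for j in range(j0, min(j0 + block_cols, m)):
                        C[i][j] = sum(A[i][p] * B[p][j] for p in range(k))
        return C

    print(summa_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]], block_rows=1, block_cols=2))
    # [[19, 22], [43, 50]]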

The Smith-Waterman (S-W) implementation follows the common strategy of parallelizing a vertical swath of some width. One of the tuning knobs controls the width of the swath, and the other controls the number of individual columns that share a single array for the table lookup that is used to compare individual letters.

The 2D convolution implementation has one knob that controls the number of pixels it attempts to compute in parallel, and another that controls the width of the row buffer (assuming that all of the rows do not fit in memory simultaneously). Of the applications and configurations that we have tested, the convolution setup had one of the most challenging shapes.

For all the applications, the highest quality configurations are directly adjacent to, and sometimes surrounded by, configurations that fail. This supports the assertion that, in order to have any hope of finding the highest quality configurations, a tuning method for accelerators needs an approach to failure handling that is at least somewhat sophisticated. For example, a search that had a strong preference for testing points far away from failures would clearly not perform particularly well.

8.2 Results

The experimental validation of our tuning strategy involved performing tuning searches for each of the application/architecture combinations with a particular candidate selection method. The three main methods compared were the full tuning knob search, a purely random search, and the tuning knob search with the trivial approach to failures. In all cases the termination criterion was 50 tested configurations. All the search methods have some randomness, so we ran each experiment with 11 different initial seeds; the reported results are averages and ranges across all initial seeds.

Figure 11 shows the summary of the search performance results across all applications and architectures. To produce this visualization, the data for each application/architecture combination are first normalized to the highest quality achieved for that combination across all experiments. For each individual search we keep a running maximum, which represents the best configuration found so far by that particular search. Finally, we take the average and 10th/90th percentile range across all application/architecture/initial seed combinations.
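As a rough sketch of how these best-so-far curves and percentile bands could be computed, consider the following Python fragment. It is illustrative only: the function names and the data layout (one list of 50 normalized quality values per search, with failed configurations assumed to contribute a quality of 0) are our assumptions, not the actual analysis scripts.

    import statistics

    def best_so_far(qualities):
        # Running maximum of normalized quality over one search's tests.
        best, curve = 0.0, []
        for q in qualities:
            best = max(best, q)
            curve.append(best)
        return curve

    def summarize(searches, lo=0.10, hi=0.90):
        curves = [best_so_far(s) for s in searches]
        summary = []
        for t in range(len(curves[0])):
            column = sorted(curve[t] for curve in curves)
            summary.append({
                "mean": statistics.mean(column),
                "p10": column[int(lo * (len(column) - 1))],
                "p90": column[int(hi * (len(column) - 1))],
            })
        return summary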

The headline result is that the tuning knob search method is significantly better than the other two methods. For almost the entire range of tests, the 10th percentile quality for the tuning knob search is higher than the mean for either of the other two methods.

Interestingly, it seems that the purely random search does better than the tuning knob search with the trivial approach to failures.

[Figure 11 plot: quality of the best-performing configuration tested so far (0–1) versus number of configurations tested (5–50), averaged across all applications and architectures. Mean curves and 10/90 percentile regions are shown for the three search methods: tuning knob, trivial failures, and random.]

Figure 11: Given a particular number of tests, the complete tuning knob strategy clearly finds higher quality configurations on average than either the pseudorandom method or the method without failure prediction.

[Figure 12 plot: number of tests (0–50) required to achieve a given fraction of peak performance (0–0.9), with curves and percentile regions for the tuning knob, trivial failures, and random search methods.]

Figure 12: Number of tests required to achieve a specific fraction of peak performance, averaged across all application/architecture combinations.

Our intuition for this result is that the tuning knob search that assigns a constant low quality to all failing points does not “understand” the underlying cause of the failures, and chooses to test too many points that end up failing. To reemphasize, without failures in the picture at all, some completely different search strategy might be better than the tuning knob search. However, it is very interesting that, for these applications with a large fraction of failing configurations, there is a very large gap between a reasonably good search method that makes smart predictions about failure probability and one that uses essentially the same predictions but treats failures trivially.
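To make the contrast concrete, the sketch below shows one simple way that quality and failure predictions could be combined when scoring untested configurations. The multiplicative combination and all names here are illustrative stand-ins; this is not the exact formulation used by the tuning knob search.

    # Score = predicted quality x predicted probability that every constraint
    # (instruction memory, data memory, compile time, ...) is satisfied.
    def selection_score(predicted_quality, constraint_success_probs):
        p_all_ok = 1.0
        for p in constraint_success_probs:
            p_all_ok *= p
        return predicted_quality * p_all_ok

    def choose_next(candidates, quality_model, constraint_models):
        # Pick the untested configuration with the best combined score.
        return max(candidates,
                   key=lambda cfg: selection_score(
                       quality_model(cfg),
                       [model(cfg) for model in constraint_models]))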

Figure 12 shows the same search results summarized in a different way. This figure has the axes swapped compared to Figure 11: it shows the number of tests required to achieve a specific fraction of peak performance. To produce this plot, for each individual search (application/architecture/initial seed) we calculated how many tested configurations it took to achieve a certain quality level. We then aggregated across all the applications, architectures, and initial seeds for each search strategy and computed the 30th percentile, median, and 70th percentile⁶ at each quality level. The essential difference between these two plots is the dimension in which the averaging is done.

⁶ The reason that this visualization has a 30/70 range while Figure 11 has a 10/90 range is that the data has more “spread” in one dimension than the other. Look at Figure 11 and imagine a horizontal line at any point, and notice how much wider a slice of the shaded region it is than a vertical slice.
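A sketch of this aggregation, under the same illustrative assumptions as above (one best-so-far quality curve per search), might look like the following. Note that searches which never reach a given level within the budget are simply dropped here; the paper's actual handling of that case is not specified, and the percentile indexing is deliberately simple.

    def tests_to_reach(curve, target):
        # Number of tested configurations before a search's best-so-far
        # quality first reaches `target`; None if it never does within budget.
        for tests, best in enumerate(curve, start=1):
            if best >= target:
                return tests
        return None

    def figure12_style_summary(curves, levels=tuple(i / 10 for i in range(1, 10))):
        rows = []
        for level in levels:
            counts = sorted(t for t in (tests_to_reach(c, level) for c in curves)
                            if t is not None)
            if counts:
                rows.append({
                    "level": level,
                    "p30": counts[int(0.3 * (len(counts) - 1))],
                    "median": counts[len(counts) // 2],
                    "p70": counts[int(0.7 * (len(counts) - 1))],
                })
        return rows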

The important result that this plot shows clearly is that, for the interesting quality range from about 50% to 90% of peak, the random and trivial failure strategies take at least twice as many tests on average to achieve the same level of quality. After 50 tests, the median search for both of the weaker strategies is still reasonably far from the peak, and without running many more tests it is hard to know how many it would take to approach peak performance.

It is interesting to observe that the mean performance after 50 tests for the trivial failure strategy is below the mean performance for the random strategy (Figure 11). However, when we look at Figure 12, we see that the performance level at which the median number of tests required is 50 is higher for the trivial failure strategy than for the random strategy. The reason for this apparent contradiction is the difference between mean and median. The trivial failure strategy had a relatively small number of searches that ended with a very low quality configuration, which drags down the mean more than it drags down the median.

As mentioned earlier, we have run some preliminary experiments on the contribution of individual features of the tuning knob search algorithm, like the target quality adjustment and failure probability scaling. Turning these features off has a small negative impact on search quality, but the effect is much smaller than using the trivial approach to failures. We leave a more detailed analysis and tuning of the tuning algorithm to future work.

9. RELATED WORK

There are several methods for tuning applications to particular target architectures:

• Mostly manual human effort
• Mostly automatic optimization by an aggressive transforming compiler
• Using domain knowledge to calculate parameter settings from an architecture/system description of some kind
• Empirical auto-tuning

All of these methods are appropriate under certain circumstances. The strengths of empirical auto-tuning make it a good choice for compiling high-level languages to accelerators. We will consider some of the important weaknesses of the other methods for this task.

9.1 Mostly manual human effort

It is clearly possible for a programmer to tune an application to a particular architecture by hand. Manual programmer effort has the obvious cost of requiring lots of human time, which can be quite expensive. In particular, when programs have to be retuned by a human for each architecture, it is not possible just to recompile an application when a new architecture is released. So if portability is a concern, fully manual tuning is generally not the best choice.

9.2 Mostly automatic compiler optimization

Fully automatic optimization has the extremely desirable feature of not requiring any extra tuning effort from the programmer. Purely static compiler optimization faces the daunting challenge of not only adjusting the size of various loop bounds and buffers, but also deciding what higher-level transformations should be applied. There is little to no opportunity for human guidance in this situation. The space of all possible transformations of even a simple program can be intractably large, and decades of research on auto-parallelization have so far not led to compilers that can reliably navigate this space well.

9.3 Using deterministic models

Deriving application-level parameters from architecture-level parameters with formulas like “use half of the available memory for buffer X” is sometimes a good tradeoff between human effort and application performance. However, the more complex the application and architecture in question, the more complex these relationships become. Even the interactions between two levels in a memory hierarchy can be complex enough to produce relationships that are hard to capture with formal models [26].

Another approach that fits in this category, and is reasonably common in the FPGA space, is writing a “generator” (usually in some scripting language) that takes a few parameters and produces an implementation tuned to the given parameters. This can be effective, but has the obvious limitation that each generator only works for a single application.

9.4 Empirical tuning

The space of empirical auto-tuners includes: (1) self-tuning libraries like ATLAS [23], PHiPAC [3], OSKI [22], FFTW [10], and SPIRAL [14]; (2) compiler-based auto-tuners that automatically extract tuning parameters from a source program; and (3) application-level auto-tuners that rely on the programmer to identify interesting parameters. Our tuning knob search fits primarily into category 3, though our search methods could certainly be used in either of the other contexts.

One previous application of statistical methods in the auto-tuning space is Vuduc, et al. [21]. That paper used statistical models for different purposes: specifically, to decide when a search has reached the point of diminishing returns and should be terminated, and to decide which of a set of pre-computed variants of an algorithm should be selected for a particular data set.

9.5 Compiler tuning

In this work we are tuning application-specific parameters. Many of the techniques used are similar to work on tuning compiler optimizations and runtime systems, either to improve the performance of a single application or of a set of applications. Some of the most closely related work of this kind is [7, 6, 4].

10. CONCLUSIONS

Adapting applications to architectures is critical for parallel coprocessor accelerators, and tuning sizing parameters to fit specific processors is an important part of the overall adaptation process. Accelerators have hard constraints (like the sizes of instruction and data memories) that are related to application-level tuning parameters in complex ways.

We proposed a probabilistic tuning method that dynamically adjusts the predicted likelihood of untested points (a) being high quality and (b) satisfying all the constraints, then combines these predictions in a smart way to choose a good sequence of configurations to test. We demonstrated that simply treating all failing configurations as “low quality” and ignoring the causes of the failures leads to inferior search performance. There is still much research to be done on auto-tuning methods, but we believe that the use of probabilistic math to combine multiple program features in a sophisticated way is a clear step forward.

11. REFERENCES

[1] J. Ansel, C. Chan, Y. L. Wong, M. Olszewski, Q. Zhao, A. Edelman, and S. Amarasinghe. PetaBricks: a language and compiler for algorithmic choice. In Proceedings of the 2009 ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI '09, pages 38–49, New York, NY, USA, 2009. ACM.

[2] J. Ansel, Y. L. Wong, C. Chan, M. Olszewski, A. Edelman, and S. Amarasinghe. Language and compiler support for auto-tuning variable-accuracy algorithms. Technical Report MIT-CSAIL-TR-2010-032, Computer Science and Artificial Intelligence Laboratory, MIT, July 2010.

[3] J. Bilmes, K. Asanovic, C.-W. Chin, and J. Demmel. Optimizing matrix multiply using PHiPAC: a portable, high-performance, ANSI C coding methodology. In ICS '97: Proceedings of the 11th International Conference on Supercomputing, pages 340–347, New York, NY, USA, 1997. ACM.

[4] J. Cavazos and M. F. P. O'Boyle. Method-specific dynamic compilation using logistic regression. In Proceedings of the 21st Annual ACM SIGPLAN Conference on Object-Oriented Programming Systems, Languages, and Applications, OOPSLA '06, pages 229–240, New York, NY, USA, 2006. ACM.

[5] A. Cohen, S. Donadio, M.-J. Garzaran, C. Herrmann, O. Kiselyov, and D. Padua. In search of a program generator to implement generic transformations for high-performance computing. Science of Computer Programming, 62(1):25–46, September 2006.

[6] K. D. Cooper, A. Grosul, T. J. Harvey, S. Reeves, D. Subramanian, L. Torczon, and T. Waterman. ACME: adaptive compilation made efficient. SIGPLAN Not., 40(7):69–77, 2005.

[7] K. D. Cooper, D. Subramanian, and L. Torczon. Adaptive optimizing compilers for the 21st century. J. Supercomput., 23(1):7–22, 2002.

[8] M. Forina, S. Lanteri, and C. Casolino. Cluster analysis: significance, empty space, clustering tendency, non-uniformity. II - empty space index. Annali di Chimica, 93(5-6):489–498, May-June 2003.

[9] S. Friedman, A. Carroll, B. Van Essen, B. Ylvisaker, C. Ebeling, and S. Hauck. SPR: an architecture-adaptive CGRA mapping tool. In ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pages 191–200, New York, NY, USA, 2009. ACM.

[10] M. Frigo and S. G. Johnson. The design and implementation of FFTW3. Proceedings of the IEEE, 93(2):216–231, 2005.

[11] A. Hartono, B. Norris, and P. Sadayappan. Annotation-based empirical performance tuning using Orio. In IPDPS '09: Proceedings of the 2009 IEEE International Symposium on Parallel & Distributed Processing, pages 1–11, Washington, DC, USA, 2009. IEEE Computer Society.

[12] A. Jain, X. Xu, T. K. Ho, and F. Xiao. Uniformity testing using minimal spanning tree. In Proceedings of the 16th International Conference on Pattern Recognition, volume 4, pages 281–284, 2002.

[13] J. C. Park, H. Shin, and B. K. Choi. Elliptic Gabriel graph for finding neighbors in a point set and its application to normal vector estimation. Comput. Aided Des., 38(6):619–626, 2006.

[14] M. Puschel, J. M. F. Moura, J. Johnson, D. Padua, M. Veloso, B. W. Singer, J. Xiong, F. Franchetti, A. Gacic, Y. Voronenko, K. Chen, R. W. Johnson, and N. Rizzolo. SPIRAL: code generation for DSP transforms. Proceedings of the IEEE, special issue on "Program Generation, Optimization, and Adaptation", 93(2):232–275, 2005.

[15] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. Massachusetts Institute of Technology Press, 2006.

[16] C. A. Schaefer, V. Pankratius, and W. F. Tichy. ATune-IL: an instrumentation language for auto-tuning parallel applications. In Proceedings of the 15th International Euro-Par Conference on Parallel Processing, LNCS, pages 9–20. Springer Berlin Heidelberg, Aug. 2009.

[17] B. Schölkopf and A. J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. The MIT Press, 2001.

[18] A. Tiwari, C. Chen, J. Chame, M. Hall, and J. Hollingsworth. A scalable auto-tuning framework for compiler optimization. In IEEE International Symposium on Parallel & Distributed Processing (IPDPS 2009), pages 1–12, May 2009.

[19] R. A. van de Geijn and J. Watts. SUMMA: scalable universal matrix multiplication algorithm. Concurrency: Practice and Experience, 9(4):255–274, 1997.

[20] B. Van Essen, A. Wood, A. Carroll, S. Friedman, R. Panda, B. Ylvisaker, C. Ebeling, and S. Hauck. Static versus scheduled interconnect in Coarse-Grained Reconfigurable Arrays. In International Conference on Field-Programmable Logic and Applications, pages 268–275, Aug.–Sept. 2009.

[21] R. Vuduc, J. W. Demmel, and J. A. Bilmes. Statistical models for empirical search-based performance tuning. Int. J. High Perform. Comput. Appl., 18:65–94, February 2004.

[22] R. Vuduc, J. W. Demmel, and K. A. Yelick. OSKI: a library of automatically tuned sparse matrix kernels. In Proceedings of SciDAC 2005, Journal of Physics: Conference Series, San Francisco, CA, USA, June 2005. Institute of Physics Publishing.

[23] R. C. Whaley, A. Petitet, and J. J. Dongarra. Automated empirical optimization of software and the ATLAS project. Parallel Computing, 27(1–2):3–35, 2001. Also available as University of Tennessee LAPACK Working Note #147, UT-CS-00-448, 2000 (www.netlib.org/lapack/lawns/lawn147.ps).

[24] Q. Yi, K. Seymour, H. You, R. Vuduc, and D. Quinlan. POET: parameterized optimizations for empirical tuning. In IEEE International Parallel and Distributed Processing Symposium (IPDPS), pages 1–8, March 2007.

[25] B. Ylvisaker. "C-Level" Programming of Parallel Coprocessor Accelerators. PhD thesis, University of Washington, 2010.

[26] K. Yotov, K. Pingali, and P. Stodghill. Think globally, search locally. In ICS '05: Proceedings of the 19th Annual International Conference on Supercomputing, pages 141–150, New York, NY, USA, 2005. ACM Press.

[27] H. Zima, M. Hall, C. Chen, and J. Chame. Model-guided autotuning of high-productivity languages for petascale computing. In HPDC '09: Proceedings of the 18th ACM International Symposium on High Performance Distributed Computing, pages 151–166, New York, NY, USA, 2009. ACM.