A Survey of Methods for Collective Communication Optimization and Tuning

Udayanga Wickramasinghe
Indiana University
Bloomington, Indiana
[email protected]

Andrew Lumsdaine
Pacific Northwest National Laboratory
Richland, Washington
[email protected]

ABSTRACT

New developments in HPC technology, in terms of increasing computing power on multi/many core processors, high bandwidth memory/IO subsystems, and communication interconnects, have a direct impact on software and runtime system development. These advancements have become useful in producing high-performance collective communication interfaces that integrate efficiently on a wide variety of platforms and environments. However, the number of optimization options that show up with each new technology or software framework has resulted in a combinatorial explosion in the feature space for tuning collective parameters, such that finding the optimal set has become a nearly impossible task. The applicability of the algorithmic choices available for optimizing collective communication depends largely on the scalability requirement of a particular use case. This problem can be further exacerbated by any requirement to run collective problems at very large scales, such as in the case of exascale computing, at which impractical brute-force tuning may require many months of resources.

Therefore, the application of statistical, data mining, and artificial intelligence methods, or more general hybrid learning models, seems essential in many collective parameter optimization problems. We hope to explore the current state and the cutting edge of collective communication optimization and tuning methods, and to culminate with possible future directions for this problem.

CCS Concepts

• Computing methodologies → Distributed programming languages; Shared memory algorithms; Parallel programming languages; • Computer systems organization → Single instruction, multiple data; Multicore architectures;

1. INTRODUCTION

Collective operations have a prominent usage in communication-bound applications in the shared and distributed memory parallel paradigms, where they are often coined under the term group communication. A collective operation is a synchronized operation, requiring all processes involved in the communication to co-ordinate and work together to achieve some useful function. Thus the performance of collective operations often directly and significantly affects the efficiency of these parallel applications. Much research work has focused on designing efficient algorithms and optimized implementations for various collective operations such as barriers, broadcast, reduction, etc., found in many practical applications. In the literature many possible optimal algorithms and implementations are found for a respective collective operation. However, optimal performance under any given condition cannot be expected from all such algorithms. Thus best-case performance depends on factors intrinsic to a particular communication operation or extrinsic to it in the underlying environment. Furthermore, producing a generalized collective operation, or operations, that works under all contexts (i.e., a priori) has become an elusive goal.

Due to the unavailability of any such generalized collective operation, most users resort to tuning collectives by a handful of parameters which they believe would suit their performance context. Nevertheless, figuring out the relevant set of tuning parameters is not a straightforward task. For example, as we will explore in later sections, the tuning parameter set can become a highly correlated subset of a large feature space, which may also vary depending on various runtime execution contexts such as collective algorithms, communication library, runtime, compiler, etc. Even if the most relevant parameter subset is found for the collective program, figuring out the optimal values can become an even harder problem due to a number of factors such as the combinatorial search space, scalability, and performance.

The newest version of the Message Passing Interface (MPI) standard [51], the standard in effect for distributed-memory parallel programming, offers a set of commonly-used collective communications. These operations cover most use cases discovered in the last two decades and we thus use them as a representative sample for our analyses. In general, collective patterns reflect key characteristics of parallel algorithms at large numbers of processing elements; for example, parallel reductions are used to implement parallel summation, and alltoall is a key part of many parallel sorting algorithms and linear transformations. Depending on the communication data flow, each collective can be either rooted or non-rooted.

• Rooted Collectives - data is communicated from, or converged into, one node by the other participating nodes in the collective. Example collective operations include Broadcast, Gather, Scatter, Reduce and Scan.

• Non Rooted Collectives - data is communicated between many nodes at the same time. These collective operations do not originate from, nor are they destined towards, one particular node. Example collective operations include Allgather, AllScatter, Allreduce, Barrier, etc. (the sketch below illustrates how the two classes appear at the MPI interface level).
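
As a concrete illustration of the two classes at the interface level, the short sketch below issues one rooted and one non-rooted collective. It assumes the mpi4py Python bindings purely for readability; the payload and the choice of reduction operator are arbitrary.

```python
# Minimal sketch of rooted vs. non-rooted collectives (assumes mpi4py).
# Run with, e.g.:  mpiexec -n 4 python rooted_vs_nonrooted.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Rooted: data flows from a single root (rank 0) to every other process.
config = comm.bcast({"algorithm": "binomial"} if rank == 0 else None, root=0)

# Non-rooted: every process contributes and every process receives the result.
total = comm.allreduce(rank, op=MPI.SUM)

if rank == 0:
    print(config, total)
```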

The parallel algorithms and their properties utilized in these two contexts most often differ significantly. Thus the performance characteristics of a collective routine will be unique to the underlying algorithm. Given that no collective implementation offers a silver bullet providing the best performance for every scenario, the implementer is left with the problem of figuring out the best possible algorithmic choice for a collective under a given set of constraints. This is known as the algorithm selection problem for collectives. As far as the existing work is concerned, many have focused on selecting a set of parameters that can dictate the performance of the collective operation and performing a parameter sweep over a search space to identify the best possible candidate for a given context.

In order to address the inherent difficulty of figuring out the exact feature space that affects the performance of a particular collective algorithm or operation, implementations have often relied on cues that mathematical/analytical models provide in terms of run-time characteristics. These models are able to express succinctly the parameters that may affect the throughput or latency of a collective. However, as the wide variety of literature in this area shows, architecture, network, application-specific and other considerations can greatly impact the performance of a collective operation. Table 1 summarizes a number of these factors that could directly or indirectly influence the performance of a collective operation.

Many tuning patterns, algorithms and methodologies have been employed to search for the optimal parameter values that can optimize the performance of a collective operation. Both static and dynamic tuning methods are known to incur a modest penalty in terms of sub-optimal decision generation, time taken for the decision function, and memory and I/O bandwidth usage for heuristic functions, structures and storage/retrieval, etc. A multitude of methods have been employed for optimizing collectives for problems such as the algorithm selection problem. Methods such as geometric/non-geometric mathematical modeling and parameter search, empirical and statistical methods, heuristic search, machine learning/data mining methods, and static/dynamic compiler-based optimization have been prominently utilized. More importantly, collective tuning is also based on network, runtime and library specific details. These include topology awareness and network-specific configurations such as blocking, non-blocking, RDMA (one-sided) and offload semantics.

The main aim of this paper is to explore the breadth as well as the state of the art of techniques used for the collective optimization problem and to gain insights into the limitations of their applicability. We will shed some light on static and dynamic collective tuning methods and their use in the context of applications that use collectives. We also intend to provide both a microscopic and a macroscopic view of collective optimization. In the microscopic view we emphasize the ability to enhance a specific standalone collective operation for latency, while the macroscopic view underlines the importance of collectives to an application or program in the midst of other computation and communication. Finally we propose a practical and unified architecture, UMTAC (Unified Multidimensional Tuning Architecture), for the collective tuning problem that tries to combine the best of the existing methods as well as circumvent some of the issues discussed throughout this paper.

In the next section we start our analysis by introducing the different collective algorithms that will be important in this discussion and reporting a systematic classification of collective operations and algorithms as a proper framework to build upon. In Section 3 we discuss the algorithm selection problem in detail and elaborate further on microscopic optimization of collectives. Section 4 reports the macroscopic optimization view of collective operations, with a number of different static/compile-time and dynamic/run-time specific tuning techniques discussed under different performance contexts. Finally, in Section 5 we conclude with the discussion and the proposed UMTAC architecture.

2. COLLECTIVE ALGORITHMS AND IMPLEMENTATION

An implementation of a collective operation is most likely to depend upon more than one parallel algorithm, for the reasons we stated earlier. The unavailability of a unified, or even a generic, set of algorithms that fits all purposes is one of the biggest motivations for collective tuning efforts. Thus initial tuning work [80, 79] for collective communication operations was based on enumerating through a few specific parallel algorithms that showcase the best performance and hard-coding these algorithms into the underlying runtime implementations. Initial implementations of MPICH, OpenMPI and other MPI implementations have followed this approach. It is rare to find a common set of algorithms that suits most run-time system implementations; thus each collective operation may carry a number of different algorithms suited to different conditions and architectures, such as non-uniform/distributed memory (i.e., RDMA), SMP (symmetric multiprocessing) memory, different physical (network) topologies, offload architectures, and execution models (i.e., runtime/energy/memory models), etc.

A few researchers have tried to report a set of algorithms [41] that better suits collective communication analysis in terms of effectiveness in performance, scale, energy efficiency, etc. But it is important to note that this is a simplification, since each collective operation, even in the best case, has variations in its implementation, which may make a homogeneous approach impossible. Therefore the assumption is that the best possible candidate algorithm for a collective is determined only by tuning on a case-by-case basis. In the immediately following sections we briefly describe the different algorithms possible for different collective operations, for the sake of completeness of our analysis. Additionally, Table 2 summarizes the most widely used set of algorithms in the literature. We have specifically categorized algorithms into two sections, 'small' and 'large' messages, to highlight the different applications of algorithms. Generally for 'large' messages, a common technique called "segmentation" is applied by dividing the message into sub-parts and sending the sub-parts to the respective processes instead of the whole message. While segmentation incurs some overhead for managing multiple messages, it also enables higher bandwidth utilization, mainly due to the increased number of concurrent messages. Segmentation also provides an opportunity to overlap multiple communication epochs with computation cycles, enabling better utilization of resources (a short sketch follows Table 1 below).

Collective performance factor | Description
Architecture specific | Processing speed, Memory subsystem, Cache architecture, Storage, Offloading
Communication network | Network bandwidth, Latency, Network topology, Buffer/Queue capacity, Protocols, Link saturation, Congestion
Operating system | Context switching, Kernel noise, Memory allocation/Paging, TLB hit/miss ratio, Cache line size, Prefetching
Collective interface specific | Collective algorithm, Segment size, Blocking vs. non-blocking (i.e., non-blocking wait time, progression interval)
Application specific | Communication/computation ratio, Loop fusion, Collective synchronization

Table 1: List of platform, architecture, network, and other factors that could affect collectives.
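
The segmentation technique described above amounts to chunking a buffer and pipelining the chunks through the collective's communication schedule. A minimal sketch follows; the 8 KB segment size is an arbitrary illustrative value, not a recommendation.

```python
# Minimal sketch of message segmentation: a large buffer is cut into
# fixed-size segments that can be pipelined through a collective schedule
# instead of being sent as one monolithic message.
def segments(message: bytes, segment_size: int = 8 * 1024):
    """Yield (offset, segment) pairs covering the whole message."""
    for offset in range(0, len(message), segment_size):
        yield offset, message[offset:offset + segment_size]

payload = bytes(1 << 20)                 # a 1 MB message
parts = list(segments(payload))          # 128 segments of 8 KB each
print(len(parts), len(parts[0][1]))
```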

2.1 Collective Algorithms

The algorithms reported below are generalized parallel algorithms that are based on linear, tree and dissemination based communication.

2.1.1 Broadcast

Broadcast is the most common collective operation, found in many applications where a root process communicates some source data to all other processes.

• Flat Tree - A single-level tree topology where data is distributed from the root directly to all leaves.

• Binary Tree - Instead of a single tree level, this topology gives each intermediate node two children, and data is distributed from the root down to all leaves.

• Binomial Tree - similar to a binary tree, but the node distribution is determined according to the binomial tree definition [81]. Because the binomial tree topology offers more pairwise parallel communication than binary trees, this algorithm usually performs better (see the schedule sketch after this list).

• Pipelined Tree - A tree with some topology (i.e., one of the above), but the message transfer is pipelined by dividing the message into segments of a certain size.

• Split Binary Tree - This algorithm has two phases, split and gather. The split binary tree has the same virtual topology as a binary tree, but the message transfer is pipelined by dividing the message into two parts and pushing each half down the tree, which results in each intermediate node and leaf node holding an m/(i + 1) part of the message. In the gather phase, processes at the same level exchange parts and complete the broadcast.

• Double Tree - Tree-based topologies suffer from leaves not using the full bandwidth available to them. A double tree tries to mitigate that by mapping processes onto two virtual tree topologies (with different leaf sets) such that each node contributes some data towards the broadcast routine while utilizing the full bisectional bandwidth.

• Chain - Each process i receives data from i − 1 and forwards it to i + 1. Even though the last process must wait p − 1 steps until it gets the broadcast message, for large messages a pipelined strategy can yield better throughput.

• Van de Geijn Algorithm - The message is first divided up and scattered among the participating processes. The second step involves an Allgather operation (i.e., a ring) in which the broadcast message is reconstructed. This method is generally used for very long messages with a large number of processes, where bandwidth can be utilized better with the ring-like scatter and allgather operations combined.
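
To make the binomial-tree structure referenced above concrete, the pure-Python sketch below computes the send schedule of a binomial-tree broadcast rooted at rank 0. It is an illustrative schedule only, not the code of any particular MPI implementation.

```python
# Sketch of a binomial-tree broadcast schedule rooted at rank 0.
# In round k, every rank that already holds the data forwards it to the rank
# 2**k away, so the set of informed ranks doubles each round.
def binomial_bcast_schedule(p: int):
    schedule, have_data, k = [], {0}, 0
    while (1 << k) < p:
        sends = [(src, src + (1 << k)) for src in sorted(have_data)
                 if src + (1 << k) < p]
        schedule.append(sends)
        have_data.update(dst for _, dst in sends)
        k += 1
    return schedule

for rnd, sends in enumerate(binomial_bcast_schedule(8)):
    print(f"round {rnd}: {sends}")
# round 0: [(0, 1)]
# round 1: [(0, 2), (1, 3)]
# round 2: [(0, 4), (1, 5), (2, 6), (3, 7)]
```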

2.1.2 Reduce/Scatter/Gather

All these operations implement close variations of the broadcast algorithms to fulfill the collective function. For example, Gather can use either a characteristic tree or a chained algorithm to communicate distributed data upstream towards a root process via a tree or a chain topology. An operation such as Reduce has an additional reduction step that uses the computational power of the processor before moving on to the next communication step.

2.1.3 Barrier

The Barrier operation is most useful when an application needs a synchronization guarantee that all processes have completed past a certain checkpoint. A wide variety of algorithms [40] are used to implement a barrier operation, which can be either rooted or non-rooted.

• Linear barrier - A centralized barrier algorithm where each participating process signals its arrival to a designated root process. Once all processes have arrived at the root, it signals the exit from the barrier to all processes in the group.

• Tree based barrier - A hierarchical barrier driven by a tree topology where the arrival signal of each process is pushed up the tree. Once all arrival signals are collected, the root pushes the barrier exit signal down the tree. This algorithm scales well with the number of processes because of the increased parallelism.

• Tournament Algorithm - Another tree based barrier.

• Butterfly/Dissemination Algorithm - An iterative algorithm where the signaling distance to a neighbor is increased as 2^r for each round r. Thus each participating process builds its own view of which other processes have arrived at the barrier. The algorithm terminates in log(p) rounds, by which point all processes know they may exit the barrier (a round-by-round sketch follows this list).
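
The dissemination pattern referenced above can be sketched by listing who signals whom in each round; the process count below is an arbitrary example and no real notification traffic is generated.

```python
# Sketch of the dissemination (butterfly) barrier signalling pattern:
# in round r every process i signals process (i + 2**r) mod p, so after
# ceil(log2(p)) rounds every process has (transitively) heard from all others.
from math import ceil, log2

def dissemination_rounds(p: int):
    return [[(i, (i + (1 << r)) % p) for i in range(p)]
            for r in range(ceil(log2(p)))]

for r, signals in enumerate(dissemination_rounds(6)):
    print(f"round {r}: {signals}")
```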

2.1.4 Allgather

Allgather is a gather operation in which the data contributed by each rank/process is gathered on all participating processes. This operation is inherently a non-rooted collective, thus many different types of algorithms can generally be used for it. Some of the most commonly used ones are described briefly below.

• Ring Algorithm - the data from each process is sent around a virtual ring overlay. In the first step, each process i sends its contribution to process (i + 1) mod p (i.e., with wrap-around). This continues, with each process forwarding the data it received in the previous step to process (i + 1) mod p. The entire algorithm therefore takes p − 1 steps (see the schedule sketch after this list).

• Recursive Doubling - Here data is communicated between all processes in log p steps. Each pair of processes starts by exchanging their data with their corresponding peer at distance 1; at each subsequent step i this distance doubles, i.e., becomes 2^i, until all steps are completed.

• Bruck Algorithm - here too the data exchange is completed on all processes in log p steps. However, instead of each pair of processes exchanging their data, each process sends data to a process at some positive distance while receiving from a process at the corresponding negative distance. For example, at each step i, each process forwards its data (including anything received from other processes during previous steps) to the process at distance +2^i (with wrap-around) and receives data from the process at distance −2^i.

• Gather followed by Broadcast Algorithm - sometimes the algorithm can be a combination of others. An Allgather operation is equivalent to a Gather to a root followed by a Broadcast among all participating processes.
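
The ring variant referenced above has a very regular schedule, which the pure-Python sketch below simulates on local lists (no real communication); it simply makes the p − 1 step structure explicit.

```python
# Sketch of the ring allgather: in step s, process i forwards the block it
# received in the previous step to (i + 1) mod p, so after p - 1 steps every
# process holds the contributions of all p processes.
def ring_allgather(contributions):
    p = len(contributions)
    blocks = [[c] for c in contributions]    # each rank starts with its own block
    for step in range(p - 1):
        # the block rank i forwards in this step is the one it received last step
        moving = [contributions[(i - step) % p] for i in range(p)]
        for i in range(p):
            blocks[(i + 1) % p].append(moving[i])
    return blocks

print(ring_allgather(["a", "b", "c", "d"]))
# every rank ends up holding all of 'a', 'b', 'c', 'd'
```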

2.1.5 Allreduce

Allreduce is a non-rooted collective operation. It has an additional reduction step in which the data contributed by each process is reduced accordingly and the result is distributed among all participating processes.

• Ring Algorithm - Similar to the ring algorithm used in Allgather; however, the reduction is executed at each step before proceeding to the next step.

• Recursive Doubling - Similar to the recursive doubling algorithm used in Allgather, except that in each step the respective reduction operation is carried out on the data. Owing to its logarithmic latency term, this algorithm is widely used for small messages, and for long messages with user-defined reduction functions (a sketch follows this list).

• Vector Halving with Distance Doubling - The algorithm starts with a reduce-scatter type phase in which each pair of processes exchanges half of the message and reduces it. At each of the log p steps the exchanged message size is halved while the pairing distance is doubled. Therefore at the end of the reduce-scatter phase, after log p steps, each process holds a reduced 1/p part of the total message. An Allgather operation then accumulates the results across the participating processes; this is normally achieved by a mirror-image 'Distance Halving and Vector Doubling' procedure.

• Rabenseifner’s Algorithm - This is widely used for longmessage transfer with predefined reduction operationsince this method is more efficient interms of utiliz-ing bandwidth. The algorithm completes in 2 stages ,first it does a reduce-scatter operation which is similarto reduce scatter phase in ”Vector halving” algorithm(rank r and rank rXORr2 ∗ k) and then distribute re-duced segment (ie:- 1/p of total message) among allprocesses. A Final an Allgather operation makes allparts of the reduced message available for participatingprocesses. For user defined reduction operations it istricky to use reduce-scatter operation , hence recursivedoubling will be usually preferred.

• Binary blocks - Similar to the 'Vector Halving with Distance Doubling' algorithm, but uses a binary block decomposition for the reduce-scatter phase.

• Allgather followed by Reduce - this is a combined operation.

• Reduce followed by Broadcast - this is a combined operation.
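
The recursive-doubling variant referenced earlier in this list can be sketched as follows; the code operates on a plain Python list of per-rank values instead of exchanging messages, so it only illustrates the exchange/reduce pattern.

```python
# Sketch of recursive-doubling allreduce for a power-of-two process count:
# in step k, rank i exchanges its partial result with rank i XOR 2**k and both
# apply the reduction, so after log2(p) steps every rank holds the full result.
def recursive_doubling_allreduce(values, op=lambda a, b: a + b):
    p = len(values)
    assert p & (p - 1) == 0, "sketch assumes p is a power of two"
    partial = list(values)
    k = 1
    while k < p:
        partial = [op(partial[i], partial[i ^ k]) for i in range(p)]
        k <<= 1
    return partial

print(recursive_doubling_allreduce([1, 2, 3, 4]))   # -> [10, 10, 10, 10]
```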

3. COLLECTIVE TUNING

A collective operation typically has a considerably large number of algorithmic choices for a particular implementation. We indicated earlier that choosing between them can be an extremely difficult task for many reasons, the primary one being the inherent performance characteristics each choice possesses under different contexts. This problem is compounded by the large parameter space that each collective operation comprises, which can be influenced by the factors highlighted in Table 1.

However, for a given execution environment, the simplest parameter space consists of a 2-tuple {algorithm, segment size}. Many previous experiments [81, 62, 59] were purely based on searching this two-dimensional space for the parameter values that produce the optimum performance. Parameters were typically searched through a three-dimensional grid with axes {number of processes, operation, message size}. While this approach is limited to a microscopic view of a collective operation, it provided useful insight into performance optimization of collective communication. The foundation for such methods is primarily the mathematical models of parallel communication [34, 14, 2] that have been widely studied in the past. The following sections are explained with the use of such models which, as we will see, can be effectively utilized to predict and evaluate the performance of collective operations.

3.1 Collective Analytical Models

Parallel communication models are the principal method for the formal design and analysis of parallel algorithms. A good model should be succinct (as few parameters as possible) and coherent (describing every possible scenario in a unified manner) for analysis, while being able to capture many complex details of the underlying communication/run-time system.

Operation | Personalized? | Small messages | Large messages (segmented)
Broadcast | no | Flat/Binary/N-ary/Binomial Tree | Pipelined/Double/Split Binary Tree, Chain, Van de Geijn Algorithm, HW specific multicast
Barrier | no | Flat/Binary/Binomial Tree, Dissemination (butterfly), Tournament |
Reduce | no | Flat/Binary/N-ary/Binomial Tree, Gather + Reduce | Pipelined/Double/Split Binary Tree, Chain, Gather + Reduce, Vector halving + distance doubling + Binomial Tree
Scatter | yes | Flat/Binary/N-ary/Binomial Tree | Pipelined/Double Tree, Chain
Gather | no | Flat/Binary/N-ary/Binomial Tree | Pipelined/Double Tree, Chain
Allgather | no | Recursive Doubling, Gather + Broadcast, Bruck | Ring
Allreduce | no | Recursive Doubling, Bruck (with reduce), Allgather followed by Reduce, Reduce followed by Broadcast | Ring, Rabenseifner Algorithm, Recursive Doubling, Vector Halving with Distance Doubling, Binary blocks
AlltoAll | yes | |

Table 2: List of algorithm implementations best suited for each collective operation for small and large messages

Among the most popular and widely studied parallel communication models are Hockney [34], LogP [14], LogGP [2] and PLogP [45]. Each of these models has the ability to sufficiently describe the communication and computation primitives of an underlying execution environment, and is thus able to provide a basis for performance analysis of collective operations. A brief description of each model's formulation follows. In all cases T refers to the elapsed time to send a message of size m to its destination.

• Hockney model - T = α + β·m. Here α refers to the message startup time or latency term, while β is the time for one byte of message transfer (the reciprocal of the network bandwidth). One limitation of the Hockney model is that network traffic cannot be modeled.

• LogP model - T = L + 2·o. Similar to the Hockney model above, L refers to the startup latency. The communication overhead o describes the additional time taken for processing network buffers, copying, etc. The gap parameter g tries to model network congestion and other communication penalties not captured by the Hockney model, and thus provides an upper bound on the number of in-flight messages possible, L/g in this case.

• LogGP model - T = L + 2·o + (m − 1)·G. This is an extension of the LogP model. LogP assumes a constant penalty for any message size; LogGP makes no such assumption. It models a gap-per-byte parameter G which captures the overhead of transferring large messages.

• PLogP model - T = L + g(m), a further extension of the LogP/LogGP models. The latency L is an end-to-end term that captures both request start-up times and overheads. The important difference is that each algebraic term is modeled as a function (i.e., f(m)) of the message size m. This model therefore allows capturing the complexities of non-linear networks and systems (a short sketch of the Hockney and LogGP predictors follows this list).
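
To show how such models become executable cost predictors, the sketch below encodes the single-message Hockney and LogGP formulas above as plain Python functions; the parameter values in the example calls are made-up placeholders, not measured constants.

```python
# Completion-time predictors for a single m-byte message under the Hockney
# and LogGP models, following the formulas given above.
def hockney_time(m, alpha, beta):
    return alpha + beta * m                  # T = alpha + beta * m

def loggp_time(m, L, o, G):
    return L + 2 * o + (m - 1) * G           # T = L + 2o + (m - 1)G

# Placeholder parameters (seconds and seconds per byte); real values would be
# fitted from measurements, e.g. NetPIPE or logp_mpi style benchmarks.
print(hockney_time(64 * 1024, alpha=2e-6, beta=1e-9))
print(loggp_time(64 * 1024, L=1.5e-6, o=0.5e-6, G=1e-9))
```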

3.1.1 Tuning Using Analytical Models in Collectives

Thakur et al. [80] used the LogGP model to analyze many rooted and non-rooted collective communication patterns and algorithms towards optimizing an MPI implementation. Furthermore, the Hockney [25] and LogP [41] families of performance models can be used to describe the characteristics of collectives in detail. A thorough summary of all these models in the face of many different collective operations [57], and their implications, is also available in the literature. Our focus in this paper is not to describe all of these communication models in detail. Rather, we intend to portray these communication models in terms of parameter tuning and their implications for optimizing collectives.

A general approach to tuning collectives may start with formulating the respective communication pattern using the desired model. Table 3 shows the evaluation of some of the reduce collective algorithms in terms of the Hockney and LogGP models and the optimal segment sizes that can be calculated. The predicted optimal segment sizes are calculated by taking derivatives of the developed models with respect to the segment size term m_s, where n_s = m/m_s. The most important task for an accurate prediction function is to figure out the model parameters by careful experimentation. For Hockney this means finding the α and β parameters, while for the LogP family of models it would be L, o, G, γ and g. These experiments are usually materialized by parameter fitting on results obtained by benchmarking and profiling software such as PAPI and NetPIPE [73], which can easily be used to calculate the parameters of models such as Hockney. Other software includes the logp_mpi library [46], which can calculate the parameters of the LogP family of models. Most often these parameters are fitted by regression, using a number of experiments for different communicator and message sizes and taking steady-state values.

Predicting the performance of an algorithm is trivial once all the respective parameters are figured out and the optimal segment sizes are determined. Indeed, the best algorithm is found by taking the algorithm with the minimum time to completion for a given message size and number of processes. If the respective algorithm allows segmentation, then the optimal segment size is calculated first and substituted into the formula in order to evaluate the best-case parameters for the equation.

Algorithm + model | Formulation | Optimal segment size
Ring + Hockney | T = 2(P − 1)(α + β⌈m/P⌉) + (P − 1)γ⌈m/P⌉ | NA
Ring + LogGP | T = 2(P − 1)(L + 2o + (⌈m/P⌉ − 1)G) + (P − 1)γ⌈m/P⌉ | NA
Ring with seg. + Hockney | T = (P + n_s − 2)(α + β·m_s + γ·m_s) + (P − 1)(α + β⌈m/P⌉) | m_s = sqrt( m·α / ((P − 2)(β + γ)) )
Ring with seg. + LogGP | T = (P − 1)(L + 2o + (m_s − 1)G) + (n_s − 1)(max(g, γ·m_s + o) + (m_s − 1)G) + (P − 1)(L + 2o + (⌈m/P⌉ − 1)G) | m_s = sqrt( m(g − G) / ((P − 2)G) ) if g ≥ o + γ·m_s; m_s = sqrt( m(o − G) / ((P − 2)G − γ) ) otherwise
Recursive Doubling + Hockney | T = log(P)(α + β·m + γ·m) | NA
Recursive Doubling + LogGP | T = log(P)(L + 2o + (m − 1)G) + log(P)γ·m | NA

Table 3: Analytical models and predicted optimal segment sizes for the parallel communication in the Allreduce collective under the Hockney and LogGP models

Analytical models are one of the best methods for predicting performance from sparse data [58]. This is especially useful for large-scale systems where it is impossible to perform an experimental evaluation on the whole system.
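
As a worked example of this derivative-based procedure, the sketch below evaluates the segmented-ring Hockney row of Table 3: it computes the predicted optimal segment size and substitutes it back into the completion-time formula. The parameter values are placeholders, not measurements.

```python
# Segmented ring allreduce under the Hockney model (cf. Table 3):
#   T(m_s) = (P + n_s - 2)(alpha + (beta + gamma) * m_s)
#            + (P - 1)(alpha + beta * ceil(m / P)),    with n_s = m / m_s.
# Setting dT/dm_s = 0 gives  m_s* = sqrt(m * alpha / ((P - 2) * (beta + gamma))).
from math import ceil, sqrt

def ring_seg_hockney_time(m, m_s, P, alpha, beta, gamma):
    n_s = m / m_s
    return (P + n_s - 2) * (alpha + (beta + gamma) * m_s) \
           + (P - 1) * (alpha + beta * ceil(m / P))

def optimal_segment_size(m, P, alpha, beta, gamma):
    return sqrt(m * alpha / ((P - 2) * (beta + gamma)))

m, P = 1 << 20, 64                         # 1 MB message on 64 processes
alpha, beta, gamma = 2e-6, 1e-9, 5e-10     # placeholder model parameters
m_s = optimal_segment_size(m, P, alpha, beta, gamma)
print(m_s, ring_seg_hockney_time(m, m_s, P, alpha, beta, gamma))
```

In practice the predicted m_s would still have to be rounded to a segment size the runtime actually supports, which is precisely one of the limitations discussed next.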

3.1.2 Limitations

One of the simplest mechanisms for tuning collective operations using analytical models is to select one particular model, such as LogP, and compare its predictions against experimental results as stated above. However, any single model can end up over-complicating or over-simplifying the actual communication, and would therefore most often underestimate or overestimate the actual experimental results. Thus selecting the best model among a number of different models could be the optimal strategy, i.e., querying all available models [58] and selecting the method with the best prediction success rate. In cases where a tie occurs between model predictions, a weighted preference can be attached to a particular model; for example, LogGP is known to produce better results than the Hockney model in heavily congested networks.

Even though this method of tuning seems fairly straightforward and simple to apply, much research has raised a number of concerns over the applicability of such theoretical models. Some of the issues raised are listed below.

• Over-fitting or under-fitting of model parameters - All of the performance models discussed above have their own weaknesses. Experiments have shown that the characteristics of the network and environment play a large role in selecting the best-fit model. Networks with non-linear characteristics, such as those described by PLogP, may have a better chance of predicting the behavior compared to other models. Some analyses [58] have shown that the linear assumption present in models such as Hockney and LogP/LogGP most often results in underestimation. Furthermore, networks that may not provide full bisectional bandwidth and/or may not allow full-duplex communication will disrupt models that are largely biased towards such assumptions.

• Difficulty of parameter estimation - this usually requires a considerable amount of experimentation and some rigorous statistical work to derive best-fit parameters, which may take time and effort. Models such as PLogP describe their parameters as functions of message size to capture the non-linearity of a system. Even though PLogP has been shown in practice to provide better results, its analytical treatment and formula simplification can be much harder than in the case of the LogP/LogGP and Hockney models. Furthermore, models such as PLogP require the extra effort of fitting a smooth curve under a non-linear assumption, which can be a hard problem due to the uncertainty about the actual complexity of the fitted curve (over-fitting/under-fitting).

• Predicting the optimal segment size - optimizing segment size applies only to segmented algorithms. Even with a segmented algorithm, a theoretical analysis can produce an optimal value that is not feasible for the underlying runtime. This is particularly true when the predicted segment sizes are not a multiple of a particular data type or a power of two, or even approximate values are not available to the communication runtime. In such cases the closest segment size may perform sub-optimally.

• Difficulty of implementation - A fully fledged automated tuner for collectives based on analytical models can be difficult to implement for several reasons. Such an implementation either requires an expression parser that builds an object model of the analytical equations internally and provides functionality for parameter estimation, or a hard-coded function that encapsulates all current models and generates decision functions. Furthermore, an in-depth analysis of the underlying algorithm is essential to model the parameters.

3.2 Statistical Techniques

A statistical approach to "algorithm selection" may produce an alternative but efficient solution to this problem. An empirical estimation technique, also commonly known as "Automated Empirical Optimization of Software" (AEOS) [86], has been applied to tuning collective operations. Input data for AEOS is collected by a series of experiments driven by exhaustive search or heuristics. These are then used in AEOS tools to generate an optimal decision function for a respective collective operation [81, 25]. Similar methods have been tested and reported to be successful in math libraries geared towards matrix algebra, such as ATLAS [85, 8], and in FFTW [29]. While some have focused on applying AEOS for optimizing a generalized (virtual-topology) collective that is unaware of any physical topology [81], others [25] have used the technique to tune collective algorithms to a specific network topology. The latter is provided as an extension to LAM/MPI [70], with support for both topology-aware and generic algorithm routines and routine generators for network-specific algorithm generation, e.g., automated collective generation for Ethernet-switched clusters. However, this approach requires a tuning driver component that drives all the required experiments extensively, regardless of whether the topology-aware or topology-unaware methodology is used.

3.2.1 Tuning by Empirical Estimation Techniques

Experimentation plays a key role in AEOS-based statistical learning of the optimal collective algorithm and segment size. Such experimentation generally implies the need for a dense data set that can be fed into a decision-generation function to produce accurate results. In order to achieve this, more than one stage of carefully planned experimentation is necessary for the system at hand. The first phase consists of experiments searching for algorithm-dependent optimal parameters such as segment size (cf. Section 2) for a given number of processes N and a message size m. Segment sizes can include the range of multiples of basic data types or powers of two of primitive data types (i.e., 4 B, 8 B, 16 B, 32 B, ..., 512 KB, 1 MB). These experiments are repeated for all possible operations and algorithms.

A second phase typically consists of experiments to find the best possible algorithm (including segmented versions) for a given number of processes and operation. Message sizes for these experiments are similarly swept from a basic type up to some upper bound. The final phase is completed by repeating phases 1 and 2 for all possible numbers of processes. Reducing the large experimental data set is a major factor in the success of empirical estimation techniques. Primary experiments focus on shrinking the total number of tests in the 3-dimensional space {processes, message size, operation}. This can be achieved via interpolation along one or two axes, for example reducing the message size space from {8, 16, 32, 64, ..., 1 MB} to {8, 1024, 8192, ..., 1 MB} [81, 25]. Furthermore, applications can be instrumented to build a result table or cache of only those collective operations that are required. Some focus has also been on using black-box solvers with a reduced set of experiments, such that complex non-linear relationships between points can still be correctly predicted.
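
A highly simplified skeleton of this phased sweep is sketched below. The algorithm names, candidate grids, and the timing function are all synthetic stand-ins for whatever benchmark harness an AEOS tool would actually use.

```python
# Skeleton of an AEOS-style sweep: for every point of the {processes,
# message size} grid, time every {algorithm, segment size} candidate and
# record the fastest one in a decision map.
from itertools import product
import random

ALGORITHMS     = ["binomial", "pipelined_chain", "van_de_geijn"]   # hypothetical labels
SEGMENT_SIZES  = [0] + [1 << k for k in range(10, 21)]             # 0 = unsegmented
PROCESS_COUNTS = [8, 16, 32, 64]
MESSAGE_SIZES  = [1 << k for k in range(3, 21, 3)]                 # reduced grid, 8 B .. 1 MB

def run_collective_benchmark(op, p, msg, algo, seg):
    """Synthetic stand-in for a real timing run of one candidate."""
    base = 2e-6 * p + 1e-9 * msg * (1.0 if algo == "van_de_geijn" else 1.3)
    seg_overhead = 1e-6 * (msg // seg) if seg else 0.0
    return base + seg_overhead + random.uniform(0.0, 1e-7)   # noise = run-to-run jitter

def build_decision_map(op="allreduce"):
    decisions = {}
    for p, msg in product(PROCESS_COUNTS, MESSAGE_SIZES):
        best = min(product(ALGORITHMS, SEGMENT_SIZES),
                   key=lambda cand: run_collective_benchmark(op, p, msg, *cand))
        decisions[(p, msg)] = best           # best {algorithm, segment size} here
    return decisions

print(build_decision_map()[(64, 1 << 18)])
```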

Additionally, AEOS-based tools such as OTPO [13] and MPI-Advisor [30] mainly operate as external tools to tune application runtimes and, in the worst case, require a single application run to generate optimal parameter decisions or recommendations. These external tools perform a more general form of tuning that is not limited to collectives, but also covers other aspects such as shared-memory performance (task pinning, etc.), point-to-point performance and one-sided transports such as InfiniBand. They usually employ a general benchmarking stage at install time which measures different aspects of the system architecture, topology and related aspects. This stage is followed by a single-application tuning phase in which a) information about the application is collected using the existing MPI profiling interface (PMPI) and its extensions [65, 71] and the MPI Tool Information Interface (MPI_T), b) a detailed AEOS analysis translates the collected data into performance metrics which identify specific performance-degradation factors, and c) finally, the necessary parameter optimization recommendations are produced for the selected and supported categories (i.e., collectives, point-to-point, etc.).

3.2.2 Limitations

A linear or exhaustive search to find the optimal time t for each change in the method combination {algorithm, segment size} may take significant time, depending on the number of data points in the result set. Thus the limitations of this approach stem from the fact that a large quantity of experiments needs to be conducted and an exhaustive-style parameter search has to be performed. The following lists some of the limitations of the empirical approach.

• Analysis requires a dense result set - A significant amount of time is spent on experimentation if the applications need to run with many processor sets and message sizes. Many interpolation techniques have been applied, but their success is largely system and application dependent. Reducing the experiment set has generally been shown to degrade the performance of the decision functions.

• Large search space - A dense result set requires decreasing the time taken to search for the optimal time for a particular case, which in turn means navigating fewer data points. Since exhaustive search is out of the picture, heuristic-based optimization is evident in experiments [86]. However, regular optimization techniques are not suitable because the time per iteration for each algorithm over a range of segment sizes does not commonly converge to a constant. Therefore modified hill-descent optimization techniques, Modified Gradient Descent (MGD) and Scanning Modified Gradient Descent (SMGD) based heuristics [81], have emerged, and acceptable speedups have been shown (a minimal hill-descent sketch follows this list). Still, the success rate of such search algorithms depends highly on the dataset (i.e., whether it can be fitted by a smooth curve) and the function being followed (i.e., one without many saddle points or local optima; ridges/alleys can increase the iteration count). Better heuristics for conducting fewer experiments while still obtaining optimal performance for a given message size and number of processors have yet to be developed.

• Data collection interface support - Tools such as OTPO and MPI-Advisor depend tightly on the ability to collect application performance data non-obtrusively from interfaces such as PMPI and MPI_T, and on other low-level hardware interface support such as PAPI (Performance Application Programming Interface). If any of the required interfaces is unavailable in the underlying runtime environment, the accuracy of the generated recommendations is greatly affected.
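
The hill-descent idea referenced in the list above can be reduced to a few lines: starting from some candidate segment size, the search keeps moving only while a neighbouring candidate measures faster, so far fewer points are benchmarked than in an exhaustive sweep. This is a one-dimensional simplification of the MGD/SMGD heuristics, and the measurement function is a synthetic stand-in.

```python
# Minimal hill-descent over a sorted list of candidate segment sizes.
# measure() is a synthetic stand-in for timing one candidate; a real tuner
# would run the collective benchmark here instead.
def measure(seg):
    return abs(seg - 64 * 1024) / 1e6 + 1.0       # synthetic: optimum at 64 KB

def hill_descent(candidates):
    idx = len(candidates) // 2                    # arbitrary starting point
    best_t = measure(candidates[idx])
    while True:
        neighbours = [i for i in (idx - 1, idx + 1) if 0 <= i < len(candidates)]
        times = {i: measure(candidates[i]) for i in neighbours}
        i_best = min(times, key=times.get)
        if times[i_best] >= best_t:               # no neighbour improves: stop
            return candidates[idx], best_t
        idx, best_t = i_best, times[i_best]

segment_sizes = [1 << k for k in range(10, 21)]   # 1 KB .. 1 MB
print(hill_descent(segment_sizes))                # -> (65536, 1.0)
```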

3.2.3 Tuning by Dynamic Automated Empirical Optimization

Dynamic, self-adapting tuning techniques in STAR-MPI [26], built on top of classic AEOS methods, have showcased their usefulness in many applications such as FFTW, LAMPS and NBODY. Unlike static techniques, which force the user to tune a collective before an application run, dynamic tuning allows adaptation to different network, platform and architecture specific conditions during the execution of the application itself. Secondly, such methods can account for, and eventually adapt to, application-specific factors such as noise and load imbalance that can vary significantly across application phases. Such functionality can be of paramount importance in environments where static tuning is prohibitively expensive, for example a large-scale application running on a supercomputing cluster with many partitions. The STAR-MPI implementation of MPI realizes such a dynamic design, searching for the best-performing algorithms during runtime using a proposed technique called "delayed finalization of MPI collective communication routines (DF)".

The STAR-MPI runtime system alternates between two stages: a) an initial measure-select stage, where it evaluates the performance of the collective algorithms in an algorithm repository and chooses the best-performing version; and b) a monitor-adapt stage, where the runtime continuously monitors the performance of the selected algorithm and revises the algorithm choice when the performance of the selected algorithm deteriorates. This monitoring stage is critical to ensure that STAR-MPI eventually converges to the most efficient algorithm for a given execution environment.
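
The control flow of such a dynamic tuner can be sketched as below. This is a schematic reconstruction of the measure-select / monitor-adapt idea, not STAR-MPI's actual code; `run_with` is a hypothetical hook that performs one invocation of the collective with a given algorithm and returns the observed time.

```python
# Schematic measure-select / monitor-adapt loop in the spirit of STAR-MPI.
import random

class DynamicTuner:
    def __init__(self, algorithms, run_with, slowdown_threshold=1.5):
        self.algorithms = algorithms
        self.run_with = run_with                  # hypothetical measurement hook
        self.threshold = slowdown_threshold
        self.best_algo, self.best_time = None, float("inf")

    def call(self):
        if self.best_algo is None:                # measure-select stage
            for algo in self.algorithms:          # time each candidate once
                t = self.run_with(algo)
                if t < self.best_time:
                    self.best_algo, self.best_time = algo, t
            return
        t = self.run_with(self.best_algo)         # monitor-adapt stage
        if t > self.threshold * self.best_time:   # performance deteriorated:
            self.best_algo, self.best_time = None, float("inf")   # re-measure

tuner = DynamicTuner(["ring", "recursive_doubling", "rabenseifner"],
                     run_with=lambda algo: random.uniform(1.0, 2.0))
for _ in range(10):
    tuner.call()
print(tuner.best_algo)
```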

3.2.4 Limitations

For dynamic automated tuning systems to work, it is clear that the overhead of the continuous tuning routine must be kept to a minimum. Some runtime states of such systems, for example "measure-select" in STAR-MPI, carry the highest penalty, since they have to enumerate all possible algorithms to reach an optimal decision. The following lists a number of limitations that may inhibit the applicability of this method.

• Overhead of tuning - dynamic tuning imparts large overheads in the initial stage (due to the large combinatorial space, even in the trivial 2-D space {algorithm, segment size}) and, depending on the application and environment, significant overheads during the monitoring stage as well. Results have shown that dynamic tuning can amortize these costs over long application runs by selecting the optimal algorithm as early as possible, combined with techniques such as "algorithm grouping" [26]. However, many concerns remain for short-running applications and irregular conditions, where the adaptation stage may take too long to converge to a stable state (i.e., an optimal algorithm).

• Limitations in optimization techniques - One of the goals of dynamic tuning systems is to find the optimal algorithmic choice in minimum time. Many of the optimization methods currently employed to achieve this are ad hoc at best. For example, the "algorithm grouping" [26] technique, which reduces the parameter space for experimentation, relies on manual inspection/analysis of a large set of algorithms based on some performance cost model in order to group them. Grouping without any clear criteria can therefore result in the selection of sub-optimal algorithms, inducing a heavier penalty on the system.

3.3 Graphical Encoding

Empirical methods for tuning collectives present a formidable challenge in terms of the density of the input data set. Encoding methods may provide a solution for empirical-data-based tuning, where the relevant input data can be subjected to some form of compression until it is used at decision time. A naive decision map data structure would store all information about the optimal collective algorithm (and related parameters); standard compression algorithms can then be applied to reduce it to a manageable size while maintaining an acceptable predictor function (with sufficient accuracy). As one such solution, a quad tree [28] based encoding scheme for storing, analyzing, and retrieving the optimal algorithm and/or segmentation information for a collective was introduced by [61].

3.3.1 Implementing Quad Trees for Decision Maps

A quad tree requires the creation of a decision map for a particular platform/system and collective operation. A decision map is commonly a matrix over the 2-dimensional space {number of processes, message size}. A higher-dimensional map is also possible; however, this would result in an oct-tree/hypercube instead of a quad tree for decision function generation. In order to materialize a quad tree, the N data points need to be mapped onto a 2^k × 2^k square grid. A naive replication can fill the missing data points of a decision map with unequal dimensions n × m (i.e., n and m distinct values of the number of processors and message sizes). Even though replication does not affect the accuracy, it can impact encoding efficiency by generating bigger trees [61].

An exact quad tree can be built from the aforementioned matrix by using all measured data points without any loss of information. The depth of an exact tree is given by k = log_4 N [61]. This is the upper bound on the search depth of the quad tree; however, the goal of such encoding schemes is generally to limit the size of the tree and/or the query depth while keeping the accuracy of the prediction within a certain bound. A depth-limited quad tree is one such technique, where the tree is built by ignoring all data points beyond some pre-determined depth limit. Alternatively, an accuracy-threshold-limited tree can be built with a pre-determined accuracy lower bound; for example, if a region has 70% of the same color (i.e., algorithm/segment-size index), then further splitting of that quad tree region can be skipped. Understandably, the mean performance penalty increases whenever the restricted depth or the accuracy threshold is decreased [61]. However, both types of trees have exhibited acceptable performance results, with less than 10% penalty for quad trees whose mean depth is as low as 3 levels or less [61].

An efficient implementation of the quad tree can be either an encoded in-memory structure or a compiled decision function that can be queried at runtime. Performance results from [61] show that the average decision time of a compiled function is better than that of the in-memory version. However, they also report that their in-memory implementation is a non-optimized version. Therefore there is no consensus on a preference for either method, and the choice can be left to the implementer, who would be responsible for the efficiency of the respective method.
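
A minimal sketch of the depth-limited encoding follows: the decision map is a square grid whose cells hold an index into a method list ({algorithm, segment size} pairs), a region stops splitting once it is uniform or the depth limit is reached, and a lossy region is summarized by its majority label. This illustrates the idea behind [61] rather than reproducing their implementation; the 4×4 map is a toy example.

```python
# Depth-limited quad tree over a square decision map whose cells hold the
# index of the best {algorithm, segment size} method for that
# (process count, message size) region.
class QuadNode:
    def __init__(self, label=None, children=None):
        self.label, self.children = label, children      # leaf if children is None

def build(grid, x, y, size, max_depth):
    cells = [grid[y + j][x + i] for j in range(size) for i in range(size)]
    if len(set(cells)) == 1 or max_depth == 0 or size == 1:
        # the region's majority label becomes the (possibly lossy) decision
        return QuadNode(label=max(set(cells), key=cells.count))
    h = size // 2
    return QuadNode(children=[build(grid, x,     y,     h, max_depth - 1),
                              build(grid, x + h, y,     h, max_depth - 1),
                              build(grid, x,     y + h, h, max_depth - 1),
                              build(grid, x + h, y + h, h, max_depth - 1)])

def query(node, x, y, size):
    while node.children is not None:
        h = size // 2
        quadrant = (1 if x >= h else 0) + (2 if y >= h else 0)
        x, y, size, node = x % h, y % h, h, node.children[quadrant]
    return node.label

# 4x4 toy decision map: cell value = index of the best method for that region.
decision_map = [[0, 0, 1, 1],
                [0, 0, 1, 1],
                [2, 2, 1, 1],
                [2, 2, 1, 1]]
tree = build(decision_map, 0, 0, 4, max_depth=2)
print(query(tree, 3, 0, 4))   # -> 1
```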

3.3.2 Limitations

Encoding schemes such as quad trees have shown comparable or better promise in terms of accuracy and mean performance with respect to the earlier techniques. Compared to statistical estimation, quad trees operate at a fraction of the storage/retrieval and decision-time cost. However, these compression schemes are also known to possess some structural weaknesses [57]. Many of the disadvantages of the quad tree encoding scheme stem from the limitations of the data structure itself; they are listed below.

• Decision querying fails to capture specialized cases - quad trees are a data structure tailored for 2-dimensional data, and therefore are not capable of generating singular rules or decisions. For example, a quad tree will fail to capture a collective algorithm decision if the actual rule holds for all numbers of processes that are powers of two.

• Sparse data - the quad tree structure acts like a low-pass filter which can cut off some high-frequency information from the decision function, and thus works best only on dense data sets. Any decision made from a sparse region of the tree has too few data points or measurements to predict with a sufficient level of accuracy.

• Dimensionality of input data - quad tree encoding does not work for input data with more than two dimensions. Other encoding schemes such as oct-trees may work in this scenario; however, efficient implementation of such structures is largely an unknown, given the required compromise between fast decision making and accuracy.

• Data reshaping - is a problem for a quad tree when the data set has a large, uneven decision map. This greatly affects the encoding efficiency due to the additional refilling required to satisfy the square-region constraint.

3.4 Machine Learning Models

Data mining is an alternative technique for the algorithm selection problem. Data mining makes use of a classification function, instead of some analytical model, estimation technique or decision map, to predict the selection. The measurement points of the resultant 3-dimensional space {number of processes, operation, message size} are well suited for a supervised or unsupervised learning function to accurately predict the optimal collective method {algorithm, segment size}. Similar methods, such as parametric and non-parametric model-based mining techniques, have been used for other problems such as matrix-matrix multiplication [83] to construct boundaries or switching points between algorithms based on experimental data.

Although unsupervised learning methods such as clustering can be used to discover optimal methods, evaluation of the results can be computationally intensive at runtime and can therefore steal a significant portion of compute cycle time. Supervised training techniques can be less taxing on the system, since prediction based on a trained model requires less computation because the number of outcomes or classes is known and the decision model is built a priori. Supervised learning methods such as regression/classification trees, for example ID3, CART, C4.5, SLIQ and SPRINT [3], support vector machines (SVM) [82], and neural networks, are therefore a natural fit for the "algorithm selection" problem.

3.4.1 Decision/Regression Trees

A decision tree is a predictive model which maps observations about an item to conclusions about the item's target value. When the target variable takes a finite set of value labels it is called a classification tree. The C4.5 classifier builds decision trees from a set of training data (in the same way as the ID3 decision tree), using the information gain ratio criterion (Hunt's method). At each internal node of the C4.5 tree, the algorithm chooses an attribute from the set {number of processes, operation, message size} that most effectively splits its subset of data points toward a decision at the terminal node. The C4.5 tree is generally pruned by tweaking its parameters (i.e., confidence level, weight, windowing) to decrease the memory footprint and improve decision time, while keeping any incurred performance penalty within acceptable limits [60].

A detailed study in [60] reports a C4.5 based exact decision tree approach and compares its performance with pruned versions of the decision tree, obtained by changing the confidence level c and weight m parameters. Increasing the weight decreases the size of the C4.5 tree and the number of leaves, thus limiting the number of fine grained splits. The same effect can be achieved by decreasing the confidence level, resulting in more aggressive pruning. Both situations lead to coarser grained decision making (under-fitting) and thus a higher misclassification error. It is important to note that the main objective of such pruning criteria is to obtain a sufficiently small decision tree that is still equipped with an acceptable accuracy function to predict the optimal performance method for as many collective operations as possible. Experiments have reported [60] that the generated decision trees had a low performance penalty even when heavily pruned.
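The following is a minimal sketch of this kind of classifier, assuming scikit-learn as the learning toolkit. Note that scikit-learn implements CART rather than C4.5, so min_samples_leaf and ccp_alpha are only rough analogues of the weight (m) and confidence-level (c) pruning knobs described above; the measurement data and class labels are illustrative.

```python
# Sketch: algorithm selection with a pruned decision tree (CART as a C4.5 stand-in).
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical measurements: (num_processes, message_size_bytes, operation_id),
# labelled with the index of the fastest {algorithm, segment size} method.
X = np.array([[8, 1024, 0], [8, 1 << 20, 0], [64, 1024, 1], [64, 1 << 20, 1]])
y = np.array([0, 2, 1, 3])          # class = best (algorithm, segment size) combination

clf = DecisionTreeClassifier(min_samples_leaf=2, ccp_alpha=0.01)   # pruning knobs
clf.fit(X, y)

# At runtime, query the tree for an unseen (procs, msg size, op) point.
best_method = clf.predict([[32, 65536, 0]])[0]
```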

As described earlier, a more generalized approach to optimizing runtime parameter configurations, not limited to collective operations, is also possible via regression tree learning. Frameworks such as OTPO [13] and OpenMPI allow extensions to use specific knowledge of the underlying system (acquired during an off-line training phase) to build a decision function [56] capable of estimating the optimal parameter configuration. These extensions can learn the features of the application by static and dynamic analysis of the code using various tools for offline learning, and then profile the application at runtime to make optimal decisions. In [56] REPTree, a fast tree learner, was used to build a regression tree that trains a predictor from a large repository of feature, configuration and measurement data (of the form (Fi, Ci, speedup)), yielding a decision tree dt such that speedup = dt(Fi, Ci). At runtime the predictor is queried several times to obtain the best possible configuration Cbest for the given feature set Fk of the application, which satisfies speeduphighest = dt(Fk, Cbest). Experiments on 2 separate applications (ie:- Jacobi Solver and Integer Sort) have shown favorable results, demonstrating that the predicted optimal settings of runtime parameters achieve on average 90% [56] of the maximum performance gain.
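A compact sketch of the dt(F, C) -> speedup predictor and its runtime query is shown below. scikit-learn's DecisionTreeRegressor stands in for the REPTree learner used in [56]; the feature/configuration encodings and candidate set are illustrative only.

```python
# Sketch: train a regression tree on (features, configuration) -> speedup,
# then query it over candidate configurations to pick C_best.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Offline: rows are (application features F_i ++ runtime configuration C_i),
# target is the measured speedup over the default configuration.
train_X = np.array([[0.6, 0.4, 64, 0], [0.6, 0.4, 64, 1],
                    [0.2, 0.8, 256, 0], [0.2, 0.8, 256, 1]])
train_y = np.array([1.1, 1.7, 1.4, 0.9])
dt = DecisionTreeRegressor().fit(train_X, train_y)

# Online: for the application's feature vector F_k, evaluate every candidate
# configuration and keep the one with the highest predicted speedup.
F_k = [0.5, 0.5, 128]
candidates = [0, 1]                                   # hypothetical configuration ids
C_best = max(candidates, key=lambda c: dt.predict([F_k + [c]])[0])
```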

3.4.2 Limitations


Unlike quad tree methods, decision trees are oblivious to the dimensionality of the input data, allowing them to be used for multi-dimensional input and similar collectives. Decision trees have also shown higher accuracy, and thus the lowest average performance penalty [57], of any method studied in the literature. However, they suffer from several major weaknesses which are highlighted below.

• Difficulty in controlling decision trees - It is difficult to manipulate classification trees, unlike other methods we discussed earlier (ie:- quad trees). Even though tweaking parameters allows some form of control over decision trees, the depth and size of the tree can never be predicted apriori. The ad-hoc nature of the decision tree heuristic (for example, information gain does not rely on any statistical/probabilistic framework) results in more variance in the decision path and eventually impacts the performance of the tree.

• Limited to rectangular hyper-planes - C4.5 and similar classifiers split the space into well defined regions. Thus they are unable to capture borders which are functions of composite attributes such as "total message size", "even communicator size", and "power-of-two communicator size". This problem can be addressed by a technique called constructive induction, but such an approach requires the user to have prior knowledge about the data, which is not always preferable.

• Randomness/bias in input - If the class distribution is close to random, the classification algorithm will be unable to produce accurate decision trees. The same is true for training sets that are highly biased towards one or two major class labels, producing a higher misclassification error.

• Weak learner - Decision trees are generally considered weak learners due to over-fitting and susceptibility to small perturbations in the input data [23]. Thus decision trees in general lack prediction power and perform poorly on unseen data.

• Runtime overhead - Regression tree predictors that search multi-dimensional data for an optimal decision, as in the case of searching for the best parameter configuration, require several iterations at application runtime to converge. This is a form of dynamic tuning and thus can be a source of significant overhead depending on the platform and application.

3.4.3 Dynamic Tuning with Neural Networks

Artificial Neural Networks (ANN) are a class of machine learning models that map a set of input values to a set of output values and use optimization techniques such as "back propagation" [33] to successively learn the input data for accurate predictions on unseen input. Earlier studies have reported ANNs as predictors for finding the optimal parameter configuration for distributed applications [56], by training a model that captures a number of application/system tuning features such as the ratio of collective communication, the ratio of point to point communication, the number of processes, data size, etc. The ANN is constructed with the feature vector as input and the configuration vector as output, forming a model such that Cbest = ann(Fk) for a given feature set Fk. A three layer feed-forward back propagation network, with a 10 neuron hidden layer and sigmoid/logarithmic-sigmoid input/output functions, was able to achieve a maximum performance gain of 95% on 2 popular applications [56].
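A minimal sketch of such a Cbest = ann(Fk) predictor is given below, using scikit-learn's MLPClassifier as a stand-in for the network used in [56] (the exact framework is not specified there); the 10-neuron hidden layer and sigmoid activation mirror the reported setup, while the feature vectors and configuration labels are illustrative.

```python
# Sketch: three-layer feed-forward network mapping feature vectors to a
# runtime configuration label.
import numpy as np
from sklearn.neural_network import MLPClassifier

# Feature vectors F_i: e.g. (collective ratio, point-to-point ratio,
# number of processes, data size); labels: index of a runtime configuration.
F = np.array([[0.7, 0.3, 64, 1e6], [0.2, 0.8, 64, 1e6],
              [0.7, 0.3, 256, 1e7], [0.2, 0.8, 256, 1e7]])
C = np.array([3, 1, 2, 0])

ann = MLPClassifier(hidden_layer_sizes=(10,), activation='logistic',
                    solver='lbfgs', max_iter=2000)
ann.fit(F, C)

C_best = ann.predict([[0.5, 0.5, 128, 5e6]])[0]   # configuration label for F_k
```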

3.4.4 Limitations

• Input bias - A robust training set is paramount to the success of an ANN. If the training data is biased then the trained model can overfit, making it less usable for unseen data.

• Training time - ANNs (even with few hidden layers) are known to take a very long time to train effectively with traditional back propagation optimization techniques. This is especially true when the input feature vector Fk is long and there are many classification labels.

• Implementation difficulty for classification - Tuning distributed applications over a large number of parameters, ranging from algorithm index and segment size to architecture specific ones like MPI affinity and eager threshold, can quickly become a hard classification problem due to the explosion in connections to each of the output layers of an ANN. More than 80 class labels were used in the study [56], but it is not clear how effective ANNs could be for a wider range of applications and instances. More importantly, static or manual labeling of a set of handpicked runtime configurations would not generate the best possible configuration for a given feature vector, which was also evident from some of the predicted results on the Jacobi and IS applications [56].

3.4.5 Rule based Dynamic Feedback Control

Many of the machine learning techniques discussed thus far use supervised learning to predict the best possible selection for an optimal collective operation by training on a dataset offline. However, guided learning takes time and effort on the relevant experimentation necessary to produce an effective labeled data set for building a predictor model. The ultimate level of control is therefore to avoid the training phase entirely and free the user from such work. Given sufficient time, a self adapting rule based runtime [24] can automatically generate an optimal decision.

These frameworks have used existing runtime infrastructure such as OpenMPI to facilitate parameter value based feedback (ie:- using standardized parameters and attributes of MPI) for dynamic rule generation and adaptation. At the heart of the rule based decision engine is the rule table, where expressions (ie:- sets of rules) can be constructed via standardized parameters, operators and terminal functions (terminals refer to the function pointers that correspond to a particular collective algorithm and segment size). At each runtime iteration window a feedback loop modifies or develops the rule table according to the measured performance data, as sketched below. We show in section 4 how dynamic tuning and model learning can be applied to collective applications.
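The following toy sketch illustrates the rule table idea in the spirit of [24], not the actual implementation: each rule maps a predicate over standardized parameters to a terminal (collective algorithm, segment size), and a feedback hook re-evaluates rules from measured latencies at each iteration window. All names, thresholds and the run_collective callback are assumptions.

```python
# Toy rule table with feedback-driven adaptation (illustrative only).
import time

rules = [
    (lambda msg: msg < 4096,  ("binomial_tree", 0)),       # small messages
    (lambda msg: True,        ("pipelined_ring", 8192)),   # default terminal
]

def select(msg_size):
    """Return the terminal (algorithm, segment size) of the first matching rule."""
    for predicate, terminal in rules:
        if predicate(msg_size):
            return terminal

def feedback(msg_size, terminal, run_collective):
    """Measure the chosen terminal; the engine would promote a faster alternative."""
    t0 = time.perf_counter()
    run_collective(terminal)                 # user-supplied collective invocation
    elapsed = time.perf_counter() - t0
    # ... compare 'elapsed' against measurements of other terminals for this
    # message-size window and rewrite the matching rule if one is faster.
```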

3.4.6 Limitations

• Runtime overhead - the feedback control loop could potentially add significant overhead to the critical path of an application.


• Static rule set - does not necessarily learn new features of the system.

4. APPLICATION CENTRIC TUNING FOR COLLECTIVES

Many of the collective optimization techniques discussed thus far maintain a microscopic view of collective communication patterns, hence tuning focuses solely on improving the latency of a particular operation - for example, for a given collective operation, selecting the best algorithmic choice or the least cost communication model. However, such a standalone perspective on collective communication is generally not a sufficient optimization criterion, since collectives are usually part of a real world problem in which many other components besides the collective operations themselves must fit together seamlessly to achieve optimal performance.

Applications such as mathematical solvers, including many variations of FFTs (1-D, 2-D, 3-D), Integer Sort, N-body solvers and many scientific applications, not only consist of collective operations but also of other modes of communication in the form of point to point, inter-node, intra-node and NUMA aware communication, as well as a large number of computation phases that consume processor power. Therefore it is important to consider a criterion that tunes collectives in the midst of critical application specific factors such as compute/communication phases, load imbalance, irregular memory/IO patterns, noise, etc.

4.1 Overlapping Communication with Computation

A major drawback of distributed memory parallel applications compared with the single shared memory symmetric multiprocessor (SMP) approach is the latency bottleneck incurred by the underlying network interconnect technologies (ie:- Gigabit Ethernet, the Infiniband family of technologies, etc). Therefore enabling host processors to perform computation while network communication proceeds in the background is one of the most desirable properties an application can have towards achieving optimal throughput performance. The overhead caused by network I/O generally surpasses any other internal latency generated by memory access or cpu processing, even with the most cutting edge network technology at hand, making synchronous network operations obsolete for modern day high performance applications.

In theory, true asynchronous communication can happen by delegating the entire operation to a capable network card, which can bypass the host cpu entirely to initiate and eventually complete the communication. However, in practice overlapping computation with communication is not always straightforward because of inherent data dependencies that may limit the overlap potential, intricate low level details of the communication libraries, the rigid nature of common messaging middleware, performance tuning of parameters and portability issues present in legacy high level application code. A number of efforts have shed light on the benefits of communication-computation overlap [5, 69, 21, 49, 38] and showcased many algorithms to leverage the overlap potential in applications such as multi-dimensional FFTs, gradient solvers, LU factorization, sorting, finite difference, etc.

4.1.1 Programmable Overlap

Hoefler, et al have reported [38, 37] a library solution to enable overlap with functional templates driven by non-blocking MPI collectives. The non-blocking framework they present provides a platform for traditional MPI applications to use a collection of patterns to transform kernels involving blocking collective communication into non-blocking versions, allowing applications to extend their overlap window in loop iterations and interleave communication with computation.

Their method is useful when users are compelled to avoid compiler aided complex automatic transformation, which demands extensive static analysis of code to detect data dependencies and guarantee inter-loop independence for the required transformations. The methods proposed by Hoefler, et al [37, 35] use generic programming with a standard compliant C++ compiler to generate expression classes that separate communication and computation. The parametric classes for tiling factor (size of a computation chunk) and communication window size (number of communication requests) can then be leveraged to find the best possible strategy for efficient pipelining of communication and computation. They report the efficacy of this approach with benchmarks (21% gain) as well as applications such as 3D FFT (16% gain).
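The tiling-factor and communication-window idea can be illustrated with non-blocking collectives as in the minimal sketch below. This is written with mpi4py rather than the C++ template framework of [37]; the tile size, window size and the compute_on_previous_tiles() placeholder are assumptions.

```python
# Sketch: pipelined non-blocking collectives with a bounded request window.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
TILE = 1 << 16          # tiling factor: elements reduced per tile
WINDOW = 4              # communication window: outstanding requests

data = np.random.rand(TILE * 16)
out = np.empty_like(data)
pending = []

for i in range(0, data.size, TILE):
    s = slice(i, i + TILE)
    pending.append(comm.Iallreduce(data[s], out[s], op=MPI.SUM))
    if len(pending) == WINDOW:            # keep the pipeline bounded
        MPI.Request.Waitall(pending)
        pending.clear()
    compute_on_previous_tiles()           # hypothetical local work to overlap

MPI.Request.Waitall(pending)              # drain the pipeline
```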

4.1.2 Static Analysis

The previous technique forces the application developer to rethink their application in terms of non-blocking semantics and to deconstruct the programming primitives within it to accommodate such changes, which may be time consuming. Some efforts have focused on delegating these kinds of transformations to compilers, relieving the programmer burden and increasing portability.

Danalis, et al [16] take a canonical application kernel involving a computation and a collective transfer and then apply a general transformation strategy to develop it into an overlap enabled state. The results they report, comparing optimized and non-optimized versions of MPI as well as versions transformed with specialized one-sided low level frameworks, show many possible opportunities for code optimization of collective applications. One of the key distinctions of their approach is that not only the transformations are taken into account, but also the applicability of true asynchronous communication libraries which use underlying RDMA enabled network hardware fabrics such as GasNet [9] and Myrinet/GM.

A common issue with non-blocking I/O libraries is that some operations may immediately return control to the user while the underlying host processor and memory are still busy performing the data transfer. Therefore better utilization of the network hardware is essential, either by the high level communication libraries or the low level programming frameworks, to maximize the benefits of communication and computation overlap. Another subtle consideration highlighted was the search for the optimal granularity of tiling and pipeline length (the maximum requests in flight before being checked for completion) [16] that should be used for overlap, although no solution was provided.

Modern day compilers can perform control and data flow analysis to determine the earliest point at which a data element can be used for communication initiation and the latest point at which a data item can wait for finalization before being used again or redefined.


Even though such dependency analysis is complementary to the application of overlap strategies, these methods have had limited use because compilers do not try to evaluate and manipulate invocations to communication library routines. CC-MPI [44] is an MPI extension effort that tries to optimize applications using collective communication by providing hints about the underlying communication to the application compiler. Compiler based optimization support is useful since it increases the ease of portability of many applications and kernels.

The potential of static compile time optimization for communication and computation overlap is reported in a number of studies [68, 20, 17, 18]. Static source code analysis techniques such as control flow graphs and data flow analysis on program regions have proven useful in software testing, debugging and optimization. These methods have also been applied [10, 31, 72, 75] to parallel programming models such as MPI, providing valuable insight into the relationships between application behavior and communication topology. However, these types of analysis have turned up mixed results in practical usage, where most were limited to benchmark studies with functional prototypes in some supported compiler infrastructure like ROSE [64] or were based on a purely theoretical framework.

A more pragmatic categorization of the wide variety of possible transformations (not limited to applications with collectives) in the context of generalized communication patterns (including collectives) was reported by Danalis, et al in [19], with analysis based on the NAS parallel benchmarks and some scientific application code. The aforementioned study provides a description of the data effects of MPI function calls in traditional data flow analysis and also a systematic safety analysis that a compiler infrastructure can use to determine the safe transformations for code with communication-computation overlap potential. We highlight several of the key transformations below.

A. Conversion from Blocking Collective API.
Blocking collective calls impede the communication-computation overlap potential, thus it is necessary to transform any blocking call into a pair consisting of a non-blocking collective operation and a progress invocation ("Test()" or "Wait()"). The time between the collective call and the progress call is generally called the overlap window - the higher this value, the greater the opportunity for overlap.

B. Communication Library Specific Optimization.
Some collective operations can be replaced by low level library operations that mitigate the latency effects introduced by high level abstractions and library layers. Specialized communication libraries like Gravel [15], Gasnet [9] and Photon [48] can support true asynchronous communication (given the hardware capability) and hardware assisted collective communication (multicast, unicast features), which may enable code transformations for a greater overlap window.

C. Decomposition of Collective Operations.
As discussed in section 2, many collective communication algorithms are a collection of point to point routines that are invoked in some order to achieve the desired collective result. This sequence of point-to-point operations can therefore be in-lined into the program and thus exposed to the application layer. A compiler can then be made aware of the point to point communication algorithm of a collective and restructure the code, optimizing the individual transfers by overlapping them with computation. In the case of hardware powered collectives this strategy is not possible; non-blocking collective frameworks [37] are more suitable for that scenario.

D. Variable Renaming/Cloning.
Similar to "register renaming", false dependencies between variables used in collective communication and computation can be eliminated by cloning the variable into a new one. Variable cloning therefore leads to more effective interleaving of computation with communication operations.

E. Code Motion of Collectives.
Code motion refers to the set of transformations that effectively enlarge the overlap time window by hoisting collective initiation invocations to the earliest possible position in the code, while sinking the completion/termination invocations to the latest position possible.

For example, a function that broadcasts a value and does some independent computation can be transformed by hoisting MPI_Ibcast() to the beginning of the function and fitting all of the computation between it and the sunk MPI_Wait(), just before the return statement, as sketched below. However, such transformations are not always trivial, since too little independent computation can still impede the potential for overlap and too much computation can degrade the collective operation throughput.
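A minimal mpi4py sketch of the transformed function follows; independent_work() and the coefficient buffer are hypothetical placeholders.

```python
# Sketch: hoisted initiation and sunk completion around independent computation.
import numpy as np
from mpi4py import MPI

def solve_step(comm, coeffs):
    req = comm.Ibcast(coeffs, root=0)   # hoisted initiation
    partial = independent_work()        # computation that does not touch coeffs
    req.Wait()                          # sunk completion, just before first use
    return partial + coeffs.sum()
```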

F. Loop Fission / Loop Nest Optimization (LNO).
Some transformations relax data dependencies in the code by splitting computation loops into sections that are dependent and non-dependent w.r.t. communication and hoisting the non-dependent computation section out of the main loop in a safe manner. This transformation is usually applied after regular transformations such as code motion and variable cloning are completed.

G. CCTP - Communication/Computation Tiling and Pipelining.
CCTP [16] is generally applied to point to point communication segments but, depending on the scenario, can be applied to collectives as well [36, 37]. The main idea behind CCTP is to split both the data transfer of a communication invocation and the computation into sufficiently large "tiled" segments (provided the application allows such a transform), and to create a pipeline that overlaps the split segments. Christian Bell, et al [5] take a 3-dimensional FFT and perform the necessary split operations to transform an FFT plane into either slabs (multiple rows) or pencils (single row) for the Alltoall() transpose operation over the network.

H. Loop Peeling.
A collective involving a neighbor/stencil exchange may define the communication buffer only in the first few iterations. In such cases the communication round can be "peeled" out of the loop to enable overlap.

4.1.3 Limitations


• Finding optimal tiling and window parameters is hard. A larger tile size may decrease overlap efficiency and increase pipeline start/drain time, while a larger window size results in too many outstanding requests and thus more message matching overhead, congestion, etc.

4.1.4 Dynamic/Hybrid Performance Modeling

Dynamic feedback based performance tuning and profiling [11, 6, 7] have gained traction in recent years despite their obvious drawback of runtime overhead. These approaches have led to on-the-fly adaptive learning models, where the most effective predictors are chosen and inappropriate or inefficient models are automatically discarded.

Additionally, hybrid coupling of dynamic methods with information gained from static transformations, such as the ones discussed in the previous section (4.1.2) [7], has resulted in a more powerful and lower overhead (< 2% for some applications [7]) learning process, with the ability to estimate collective performance with a high degree of accuracy. These studies have reported techniques such as batched model updates and adaptive measurements to minimize the runtime overhead.

4.1.5 Limitations

• Runtime overhead - Dynamic profiling and tuning methods incur considerable runtime overhead due to data collection, model updates and other software tasks.

• Implementation difficulty - Not easy to prototype; extensive knowledge of compiler transformations, data flow and other optimizations is required.

4.2 One-sided Communication for Optimization

One-sided communication has been the primary mode of communication in Partitioned Global Address Space (PGAS) languages/runtime specifications such as UPC [12], Titanium [1] and ParallelX [43], and has lately been integrated into the second and third versions of the Message-Passing Interface (MPI) standard, driven mainly by the success of the communication model. Some studies [5, 42] have also highlighted the use of one-sided communication in bandwidth bound applications, using communication-computation overlap techniques.

4.2.1 Remote Direct Memory Access (RDMA)

The RDMA based one-sided communication model was popularized by the emergence of a number of network technologies in the high performance interconnects arena. However, RDMA has its predecessors in U-Net [84], a customizable Network Interface Architecture, and VIA (Virtual Interface Architecture), which started as low overhead, high throughput alternatives to 2-sided network communication models. U-Net showcased the first glimpse of the potential of fully customizable network interfaces, able to execute offloaded code as well as directly interact with user level buffers for efficient network operations. U-Net managed an interface of "transmit", "receive" and "free" buffers [84] for pinned DMA access, similar to the registered memory found in RDMA of Infiniband [4] and related interconnects (however, the number of registered buffers was predetermined and one free slot was picked from the queue for a transfer).

Unlike the 2-sided messaging model, RDMA based one-sided communication does not require a rendezvous from the remote side, thereby freeing processor resources there. Furthermore, RDMA also mitigates the message matching and sometimes unnecessary message ordering overheads [5] present in 2-sided protocols. The primary motivation behind the one-sided RDMA model is to separate data movement from synchronization. It offers substantial benefits in reducing the costs associated with network operations in one-sided programming models [5]. Specifically, RDMA avoids the synchronization and message matching costs present in the data transfers of a rendezvous protocol, where the data transfer latency can be negligible compared to the other overheads.

The Infiniband Architecture specification [4] defines 2 principal types of transport operations: a) SEND/RECV - a traditional two-sided rendezvous, and b) RDMA - one-sided direct access with READ/WRITE/ATOMIC operations. The former mode requires explicit synchronization from both sides of the transfer, where a matching procedure is executed by the NIC/HCA to figure out the source and destination buffer addresses to initiate the transfer. One implication of this 2-sided transaction is that late posting of receives is considered a fatal error in Infiniband RDMA, and therefore special handling is required for such "unexpected" messages when pre-posting of "RECV" operations is not guaranteed.

In the latter mode, RDMA one-sided operations need to be initiated and handled by only one end (for example, the sender for a WRITE or the receiver for a READ operation), and the initiator must possess all information about the source and destination buffers and the relevant protection keys before the operation is initiated. Hence, as expected, RDMA operations semantically match best with one-sided programming models such as PGAS (partitioned global address space). Each RDMA connection is abstracted in hardware by an entity called a "Queue Pair" or "QP" (a pair because there are 2 queues - send and receive), and each operation generates a Work Queue Element (WQE) on the respective queue. The events corresponding to a transaction completion are pushed to a special queue type called the completion queue or "CQ", which can be polled by applications to handle the messages appropriately. Figure 1 is a simplified depiction of an RDMA operation (both rendezvous and one-sided are shown).

4.2.2 RDMA and Collectives

As discussed in section 2, collective operations (apart from hardware assisted/offloaded collectives) usually consist of a number of point to point communications. Even though it is clear that RDMA is quite useful for point to point latency and bandwidth improvement [5], its efficiency is not as apparent for collective communication. Many studies [77, 78, 52, 47, 63] have in fact reported benefits of using RDMA for collectives in Infiniband enabled clusters and showcased considerable latency and bandwidth gains over traditional two-sided communication. Based on the aforementioned work, we categorize how the RDMA communication paradigm can optimize the performance of collectives under the following sections.

A. Direct Network Operations.
Many modern RDMA interconnect architectures support direct remote memory access without the intervention of the remote host cpu.


Figure 1: RDMA communication for a rendezvous SEND/RECV and a WRITE operation. 1. The receiver posts a rendezvous RECV on the remote QP and the sender posts a rendezvous SEND and an RDMA WRITE request to its QP. 2. Work requests are handed over the wire to the remote end. 3. The remote HCA/NIC matches the rendezvous request. 4. The RDMA transfer is initiated by the remote end according to the requests received. 5. A completion entry is posted on the remote end's CQ (if the request is a signaled transaction). 6. If ACKs are produced, the local CQ is updated for the respective operations.

This mitigates the kernel overhead (context switches, traps) present in the traditional network stack and immediately releases extra cpu cycles that can be directly contributed to running applications and programs. Furthermore, most messaging middleware has a plethora of internal layers that data has to navigate from the point where it was initiated until the underlying hardware is reached. MVAPICH [77] point-to-point communication is based on a software layer called the ADI (Abstract Device Interface), and OpenMPI is based on many transport and data transfer layers in its MCA architecture [74] that provide interfaces to port different interconnects and consequently build different abstractions for collective operations on top of them. However, these software layers add significant overhead to collective communication. A common optimization is to reduce the latency between the caller and the underlying hardware by bypassing the intermediate layers entirely, utilizing low level drivers/libraries from user/kernel space. For example, many MPI implementations including MPICH, MVAPICH and OpenMPI use the Infiniband verbs or InfiniPath PSM interface directly in their collective libraries. LibNBC [39] also has a dedicated OFED (OpenFabrics Enterprise Distribution) backend (although based on SEND/RECV rendezvous) and an IBM driver for collectives apart from the standard high level MPI interface. Specialized low overhead libraries like GasNet, Gravel and Photon can also provide this functionality efficiently.

B. Zero Copy Transfers.
Traditional MPI point to point operations implement eager protocols that inline the message payload with the header information. While this makes data transfer efficient for small and medium sized messages, it also presents a message copying overhead: for each eager transfer the runtime system must copy from an RDMA buffer to the user buffer, causing a scalability bottleneck with respect to message size. RDMA can circumvent this issue by writing directly to user buffers, also known as "zero copy" transfers. RDMA also mitigates any message matching software overhead at the remote end because the transfer is performed without the involvement of the remote side. However, in order to utilize RDMA, the source process needs to know the destination memory address, which can be inlined via a write request header or completion entry (CQE), or transferred from the remote end to the initiator.

C. Pre-Registered Buffer Copies.
RDMA zero copy transfers can be an efficient way to write to large collective buffers, because they amortize the cost of collective user buffer registration and receiver buffer address transfer over a large memory copy operation. However, for small messages this protocol may not be the best option since the registration and address transfer costs can be much larger than the transfer itself. Therefore many collective libraries register a subset of small buffers beforehand (usually at init() time) and keep them as internal scratch space for intermediate copying.

D. Optimizing Rendezvous Protocols.
A zero copy based rendezvous protocol is commonly adopted for collective operations involving large messages. A remote side participating in a traditional rendezvous protocol needs to send buffer addresses for each segment transfer operation. However, this is not suitable for a collective operation since the number of composite point to point operations can be large and redundant


with respect to buffer addresses. Thus caching of buffer addresses and base address manipulation for subsequent iterations are performed so that RDMA can be used directly without any need for address exchange.

E. Optimizing Registration.
Of the software and hardware costs of a typical RDMA transfer, buffer registration has the highest latency cost, commonly around 90+ us [77]. Therefore the large number of point to point operations inside a collective algorithm can incur significant overhead if a buffer registration is carried out for each iteration of the collective. A common optimization technique is to reduce the buffer registration cost by registering the required buffer addresses within a single routine (this is possible because the buffers of a collective are known in advance). Furthermore, pre-registration of pipeline buffers may be beneficial for small message based collectives (amortizing the cost of copying).

F. Collective Offloading.
Network offloading is a concept first introduced by U-Net [84] and advanced by next generation prototypes with cutting edge programmable features. Collectives can now be driven entirely by the network cards, relieving host cpu cores of most of the software overhead of progressing and managing collective operations; this has been coined collective offloading. Several studies [22, 76, 32] have experimented with fully asynchronous collective schedules running on the Portals 4 network interface and ConnectX-2 InfiniBand managed queues [32] with very desirable results. Such an offloading strategy may also provide greater communication and computation overlap by enabling true asynchronous collectives.

4.2.3 Limitations

• Overhead of zero copy RDMA - to use direct RDMA with zero copying in collectives, either the sender or the receiver needs to acquire the respective source and destination buffer information. As mentioned before, the cost of such synchronization coupled with the memory registration cost is significant w.r.t. network latency, and therefore this is not a viable option for small and medium sized collectives even with the aforementioned optimizations in place.

• Difficulty in finding optimal switching points - As described above, a good optimization strategy to alleviate some of the shortcomings of RDMA transactions is to switch between zero copy and non zero copy protocols. However, finding the sweet spot is not a trivial task; many factors, including the collective algorithm, dictate which protocol should be used. A good example for AlltoAll collective algorithms, comparing recursive doubling vs ring implementations, can be found in [77]. Furthermore, other interconnect factors such as reliable or unreliable transport, signaled or memory polled completion, registered pipeline size, etc can dominate the protocol efficiency and hence the switching point. Therefore, as reported by many implementations [77, 78, 47], setting the protocol at hard coded switching points does not necessarily produce optimal collective performance.

5. DISCUSSION

Previous sections have emphasized methods for tuning and optimizing collectives in various contexts, along with their relative strengths and weaknesses. It is evident that there is no single method that outshines the rest or whose advantages significantly outweigh the others. Furthermore, none of the methods take into consideration all possible feature sets that may be involved in solving a specific collective tuning problem; thus they most often generate a predictor function that may not be powerful enough to predict application execution at very large scales (over-fitting or under-fitting). Most methods focus on a handful of features specific to a particular performance optimization context. For example, algorithm type, segmentation and message size are the major concerns for a decision tree based classification criterion, while network latency, bandwidth and overhead matter most for an analytical solution. Additionally, none of the techniques discussed thus far evaluate the performance correlation at scale between parts/phases of an application involving collective operations, and thus they are unable to identify performance bottlenecks and bugs in certain parts of the application domain. Therefore building a high performance multidimensional predictor function that can address these concerns would greatly benefit and speed up the collective optimization process.

5.1 A Unified Collective Tuning Architecture

Table 4 highlights the properties of the collective tuning methods discussed in this paper. We understand that collective tuning is generally performed under two methodologies. The first is to select an existing application, kernel or benchmark involving collectives and direct its runtime profile information into a function that uses some algorithm to output the predicted optimal performance parameters for the given input. For example, this involves generating some form of learning model or predictor function by capturing samples of input measurement data and producing an output which can be identified as a performance metric such as throughput, latency or rank, or some class with one or more parameters such as a collective algorithm and segment size combination. The second method involves manual or hand tuning of the collective runtime system in terms of the underlying communication libraries used, by means of low level optimization efforts on the respective computation and communication methods, especially on the network interconnect stack. Example scenarios are efforts to short circuit a number of software layers between the collective invocation and the actual hardware to shorten the critical path, or the use of programming models and semantics such as non-blocking and one-sided communication that reduce the incurred overhead.

Interestingly, compiler based techniques such as CCTP and loop nest optimization discussed under section 4 may fall into either one of the aforementioned methods. Traditional approaches have focused on loop nesting, tiling and other transformation techniques that demand tedious manual inspection of the respective code regions and then profiling the application for the optimal set of parameters. However, static analysis methods such as automatic affine loop detection [7], data flow analysis and PDG (program dependence graph) [27, 54, 55] methods, aided by modern day compilers, have shed light on the benefits of loop modeling and of automatically generating predictor functions based on the program input.


| Metric | Analytical Modeling (standard collective communication) | Empirical Estimation | Graphical Compression Methods | Decision Trees | Neural Networks | Dynamic Feedback Based Models | Compiler/Runtime Techniques |
| Require a dense data set? | No | Yes | Yes | Yes | Yes | No | N/A (compilers can assist in automated code generation) |
| Time for parameter estimation / model generation | Moderate | N/A (no parameters) | Moderate | High | High | High | Moderate |
| Mean performance penalty (time difference, experimental peak vs predicted) | High | Low | Low | Low | Low | High (will eventually settle into an acceptable low value) | Low |
| Runtime overhead (time for prediction) | Minimum/zero | High (subject to convergence rate) | Low (subject to encoding scheme) | Low | Low | Very high | Low |
| Model accuracy (unseen data) | Relatively low (over-fitting/under-fitting) | Low (over-fitting) | Moderate (depends on data sparseness, shape and dimensionality) | High (subject to model training) | High (subject to model training) | High (subject to model training) | High (subject to model training) |
| Implementation difficulty | High (requires detailed understanding of application and algorithms; difficult to automate) | Low | High (efficient encoding schemes and required data structures for features are hard to design) | Low | Low | Low | Moderate (extensive knowledge of compilers, control and data flow required) |

Table 4: Summary of Techniques used for Collective Tuning and Optimization

As indicated by Table 4, automated compiler techniques do not require a dense data set, since the compiler can model the application or its code region(s) itself, for example modeling loops to optimize some performance metric such as communication and computation overlap.

Furthermore, the summary in Table 4 suggests that supervised machine learning algorithms such as decision trees, neural networks and support vector machines (SVM) in general have a satisfactory accuracy rate, provided that model training is properly performed with appropriate sample quantities (a dense data set) and acceptable ratios of training, test and cross validation sets. A proper model training mechanism will ensure with high probability that the trained model neither over-fits (specializes) to a particular data set nor under-fits (over-generalizes) the data. However, traditional analytical models (LogP, LogGP, Hockney, PRAM) and empirical estimation techniques (AEOS) tend to have lower prediction accuracy (on unseen test data) due to inherent over-fitting to sample data. Besides, designing an analytical model generally requires detailed understanding of the underlying collective communication patterns and other related aspects of performance improvement. Therefore such methods may take the longest time to prototype and test, which is not preferable from a developer's perspective.

Nevertheless, analytical and statistical models do have the advantage of requiring a smaller number of data points to build up a decision function. This is because finding the model parameters


Figure 2: Overview of the Unified Multidimensional Tuning Architecture for Collectives (UMTAC)

is equivalent to solving a small number of linear equations, provided that the analytical solution models the real world behavior. Furthermore, unlike some machine learning models such as neural networks where the model is implicit, analytical models provide a direct functional (mathematical) representation, portraying a direct view of performance critical features which makes it easy to find scalability bottlenecks. Similarly, all methods we subjected to our analysis more often than not excel in some aspect or another, as further elaborated in Table 4. It is impossible, however, to single out one technique that supersedes the rest in all possible outcomes. This is further implied by the fact that some tuning techniques, such as compiler/code transformation and runtime based methods, are unique and essential in functionality for collective optimization, in that they provide the basis for the optimal parameter search and classification problems we discussed earlier (note that runtime optimization is a broad category which also includes the design of generalized or platform specific optimal algorithms for collectives). Thus a hybrid approach that selects suitable methods is very much necessary to leverage the distinct advantages each tuning technique provides.

It is important to note that the context of each method contributes to a unique feature space being utilized. Classification schemes for collectives primarily operate on an input parameter space of collective operation, number of processors and message size, and an output parameter space of collective algorithm and segment size. However, in the case of optimizing application kernels with collective operations, this input feature space can be extended with additional considerations such as the ratio of communication to computation, the ratio of point to point to collective operations, tile, window and other overlap specific parameters, etc. In standard analytical models the input parameter space can become a combination of number of processes, bandwidth, overhead and latency, while the output space may be fixed to a single dimension, "total elapsed time". Furthermore, in the case of static analysis based compiler and runtime optimization, the input space would consist of loop transformation parameters and low level library specific configurations, respectively.

All of the tuning methods discussed thus far take into account either an individual collective operation via a benchmark or the entire application as a whole. In both cases, tuning was focused on only one, or at most a couple, of performance optimization contexts, such as figuring out the optimal collective algorithm and segment size for a specific collective benchmark. Such a scenario can soon become an optimization nightmare if a user/developer wants to optimize collectives in an application under different performance contexts. One issue is the complexity involved in parallel optimization contexts due to the inherent correlation that may be present in seemingly unrelated contexts. For example, one specific collective algorithm and optimal segment size may not be the best possible option under a specific transport parameter, such as an eager RDMA buffer size that was determined independently. In such cases the best results would be obtained if both parameter search problems were coordinated by the likes of a unified parameter search algorithm. Moreover, by combining data from different features we can train a model that is generalized enough to estimate performance at many unobserved instances and scales. Thus we propose a unified multidimensional tuning architecture for collectives (UMTAC, Figure 2) that can yield better results by unifying all possible feature spaces via a systematic benchmark executor and combining different learning algorithms such as linear regression, neural networks, decision trees, etc to generate a high performance predictor function.

To model a multidimensional unified predictor function, the relevant input parameter space may consist of all possible features corresponding to an application or a kernel. Here a fixed output dimension in time "t" is preferable because


it is the primary performance metric considered for most applications. UMTAC takes the liberty of training the model in an arbitrary amount of time as necessary, because offline training usually does not impact the application unless strict milestones are imposed for production purposes. We would also like to consider an entire application in fine granular parts or kernels. UMTAC would output the performance estimation function in terms of kernel index and feature space, as well as the relative ordering of kernels for a given set of inputs I. This has the distinct advantage of enabling a surgical evaluation of a collective based application at different phases of execution, detecting performance bugs at the very early stages of development and effectively reducing the large scale costs involved in mitigating them at a later stage. Additionally, rank based estimation may also empower the user, who can then focus effort only on the relevant parts of a collective application that need optimization the most. A similar technique has been discussed by Wolf, et al [87] for HPC application performance modeling in general. However, their method looks only at a single input predictor function to detect bugs at scale (with the input being the number of processors), using linear regression as a tool. Finally, UMTAC includes a validation component that reinforces the model with more runtime data to successively improve the predictive power of the previously generated model.

In the following sections we present an overview of the components of the proposed UMTAC architecture.

5.2 UMTAC Components

The major components of the proposed UMTAC architecture are: a) Application Profile Generator, b) Benchmark Executor Framework, c) Data Pre-processor, d) Model Generator, e) Model Boost, f) Model Optimizer, g) Model Validator and h) Reactor Core. The main idea behind this architecture is to derive a unified predictor function over a set of performance features that have already been studied under various contexts and that we believe impact the performance of a collective based application. The aforementioned predictor function would be powered by generally best practiced machine learning processes, models and algorithms that can produce accurate performance estimates.

A. Application Profile Generator.
The role of the Application Profile Generator component is to perform the necessary compiler aided transformation tasks on application code for enhancing communication and computation overlap with collectives [7, 17, 19], and to instrument book-keeping code for kernels and profile management. The main functional components of the profile generator can be categorized into the instrumentation tasks required for kernel detection/management and the post static analysis transformations required for collective tuning, as discussed in earlier sections (section 4.1.2).

B. Benchmark Executor Framework.
The Benchmark Executor acts as the workhorse for generating the performance data needed to train ML (Machine Learning) models. One of the key requirements is the ability to efficiently enumerate and execute the application through the many enumerable runtime and program input parameters (some parameters, such as system parameters, are not configurable) according to some known distribution, as sketched below. However, the goal is to build a sufficiently expressive predictor such that it does NOT require a dense data set. Another requirement is that the user should be able to provide a specification for each input parameter (program, runtime and environment specific) describing the parameter's type information, such as name, discrete/continuous and data type, and value information such as range or enumerator type.
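The sketch below illustrates one possible way such an executor could enumerate a declared parameter space; the parameter names, value ranges and the run_benchmark() hook are all assumptions rather than part of the proposed design.

```python
# Sketch: enumerate a user-declared parameter specification and collect samples.
from itertools import product

param_spec = {
    "num_processes":  [2, 4, 8, 16, 32],                  # discrete, enumerated
    "message_size":   [1 << s for s in range(4, 21, 4)],
    "segment_size":   [0, 1024, 8192],
    "coll_algorithm": ["binomial", "ring", "recursive_doubling"],
}

samples = []
for values in product(*param_spec.values()):
    config = dict(zip(param_spec.keys(), values))
    elapsed = run_benchmark(config)                        # hypothetical launcher
    samples.append({**config, "time": elapsed})
```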

C. Data Pre-processor.
This component prepares any data set through sanity checking (outliers and invalid data) and pre-processing before it is subjected to any learning algorithm. Each learning algorithm may impose specific constraints on the training set to reduce the performance overhead of the training phase. For example, skewness in the data is a critical issue in gradient descent type algorithms (for regression, etc) and may result in sub-optimal performance. In such cases transformations such as re-scaling, mean normalization or standardization will be performed on the input data.

For example, standardization (z-score) can be implemented by the following criterion for a sample dataset $D$:

$$U_i^{n} = \frac{U_i - \mu_i}{\sigma_i}, \quad U_i \in D$$
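A one-line numpy sketch of this column-wise standardization over a hypothetical sample matrix:

```python
# Sketch: z-score standardization of each input feature (column).
import numpy as np

D = np.random.rand(1000, 8)                       # rows: samples, cols: features U_i
D_std = (D - D.mean(axis=0)) / D.std(axis=0)
```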

D. Model Generator.
The role of the Model Generator is to generate a reasonably accurate model that fits the data and acts as an estimation function for unseen data through feature learning. UMTAC will use "multivariate linear regression" [66] as the tool for generating the base learning model. The reason for using multivariate linear regression is two-fold. First, linear regression can model any multidimensional feature space with sufficiently high accuracy and performance by eliminating bias and noise effects. Second, it can effectively represent the collective tuning problem such that techniques like regularization and dimensionality reduction can filter out redundant or unwanted features that do not contribute towards performance. For example, linear regression can include any generic function of a representative performance feature, such as the number of processes p, in a form such as $p^i \log(p)^j$, which is a function of p that can be approximated from studies of analytical communication models (Table 3).

One of the major challenges of linear regression is searching for the best fit model. UMTAC will need to generate multiple regression models and select the best among them, with the highest accuracy. Thus designing, training and searching for the best model can take a considerable amount of offline time. However, once a best fit representation is generated, estimations can easily be formulated to figure out performance bugs in collective kernels and to prioritize only a selected set of features that affect performance the most. Thus linear regression provides a succinct mathematical representation of the collective tuning problem which is easy to express, manipulate and debug.

As discussed throughout this paper, there are many features/factors that can dominate the performance of collective based applications/kernels. We believe that the number of processes p is a core factor in any collective optimization problem and therefore it may be included in the model as a base feature. The rest of the features are arbitrary and will


be included based on the support extended by the runtime, application specific program input, library extensions and the profile generator (code transformations). It is important to note that there is no upper bound on the number of features supported, but the following is a generic list of features that may generally be present.

• Collective benchmark features - number of processes (p), message size (m), segment size (s), collective algorithm index

• Application/kernel specific features - number of collective operations, number of point to point operations

• Profile generator features - tile size, window size, number of nested loops, pipeline length, loop decomposition states (enum)

• Runtime specific - eager message size, transport options (for MPI - btl, mtl, etc), collective algorithm, collective segment size

• RDMA/low level network specific - ledger size, work queue size, number of QPs, number of WQ requests, eager message size, RCQ enabled states (enum), other optimization states (enum)

• System specific - memory allocator (enum), page size, noise thresholds/frequency, PAPI counts on cache/TLB/etc

To define our base multivariate regression function, let us assume that we have a total feature set U, defined by two independent sets P and R. The set P contains the features defined as functions of the number of processes p. The set R is a combination of all other features present in the input data set. This distinction is made because the number of processes is a commonly occurring dimension in many analytical models. More importantly, since we have a fair understanding of the growth behavior with the number of processes, which can range from an n-degree polynomial to a log based form such as p log(p), this models a simple regression function with sufficient accuracy with respect to the parameter p.

The above description defines the sets U, P and R as follows.

$$U = \{u_1, u_2, u_3, \ldots, u_n\} = P \cup R, \quad \text{where } |U| = n$$

P is defined as follows. Here P is a symbolic set, while $M, N \subseteq \mathbb{Q}$:

$$P = \bigcup_{i=1}^{n} \bigcup_{j=1}^{m} \, p^{N_i} \log^{M_j} p, \quad \text{where } m = |M|,\; n = |N|$$

Let X be the feature set excluding any p terms, and let $g(X, n)$ be any valid transformation that generates a polynomial expression of order n for the input symbol set X:

$$X = \{f_1, f_2, f_3, \ldots, f_k\}$$

$$R = \bigcup_{i=1}^{k'} g(X_i, n), \quad \text{where } X_i \subseteq X$$

$$\text{such that } k' \leq |X|, \quad \sum_{i=1}^{k'} |X_i| = |X| \quad \text{and} \quad \bigcap_{i=1}^{k'} X_i = \emptyset$$
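The sketch below illustrates how such a feature vector U = P ∪ R might be assembled for a single measurement point; the chosen exponent sets and polynomial order are illustrative assumptions, not values prescribed by the architecture.

```python
# Sketch: build the expanded feature vector U from p and the remaining features.
import itertools
import numpy as np

def build_U(p, other_features, N=(1, 2), M=(0, 1), poly_order=2):
    # P: p^i * log(p)^j terms in the number of processes.
    P = [p**i * np.log(p)**j for i, j in itertools.product(N, M)]
    # R: simple polynomial expansion of the remaining features.
    R = [f**d for f in other_features for d in range(1, poly_order + 1)]
    return np.array(P + R)

u = build_U(p=64, other_features=[1 << 20, 8192])   # e.g. message size, segment size
```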

We can then formulate the linear regression problem as follows, with the variable expression set U in place. We base our regression hypothesis $h_\theta(U)$ on the parameter set $\theta \in \mathbb{R}^{n+1}$:

$$\theta = \{\theta_0, \theta_1, \theta_2, \ldots, \theta_n\}, \quad \text{where } |\theta| = n + 1$$

$$h_\theta(U) = \theta_0 + \theta_1 u_1 + \theta_2 u_2 + \theta_3 u_3 + \ldots + \theta_n u_n$$

$$h_\theta(U) = \theta^T U', \quad \text{where } U' = \begin{bmatrix} 1 \\ U \end{bmatrix}$$

The cost function J(θ) for the linear regression predictor function can be formulated with least squares estimates. However, one problem with a significantly large feature space is the increasing model complexity. Although a complex model can work very well for a training data set, it can eventually lead to the over-fitting problem. Additionally, there is a high probability that the input data in the feature vector is correlated. In order to avoid such issues and normalize the influence of each input feature on the model, the Model Generator component associates a regularization term with the regression model. For regularization, an L1 norm component is generally preferred over L2 [53]. Thus the cost function with regularization for the minimization objective is formulated by the following equation.

Let m = |D| be the size of the data set D, and let λ ∈ R be the regularization coefficient; then

$$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(u^{(i)}) - y^{(i)} \right)^2 + \lambda \, L(\theta)$$

Minimizing the objective function J(θ) can be achieved either analytically or with a numerical iterative method such as gradient descent. Even though the former (analytical) approach is fast and convenient, it can suffer from the matrix non-invertibility problem, so an iterative algorithm such as gradient descent is generally utilized [66]. It is important to note that the regularization coefficient λ is an arbitrary value that is best determined via an experimental technique such as cross-validation.
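A minimal sketch of such an iterative minimization, assuming an L1 penalty and plain (sub)gradient descent over a NumPy design matrix, is given below. The learning rate, iteration count, and data shapes are illustrative assumptions rather than UMTAC defaults, and λ would still be chosen by cross-validation as noted above.

import numpy as np

def fit_l1_regression(U, y, lam=0.1, lr=1e-3, iters=5000):
    """Minimize J(theta) = (1/2m) * ||U' theta - y||^2 + lam * ||theta[1:]||_1
    by subgradient descent. U: (m, n) feature matrix, y: (m,) targets."""
    m, n = U.shape
    Up = np.hstack([np.ones((m, 1)), U])      # prepend bias column -> U'
    theta = np.zeros(n + 1)
    for _ in range(iters):
        residual = Up @ theta - y             # h_theta(u^(i)) - y^(i)
        grad = (Up.T @ residual) / m          # least-squares gradient
        grad[1:] += lam * np.sign(theta[1:])  # L1 subgradient (bias not penalized)
        theta -= lr * grad
    return theta

# Usage with synthetic data: 200 samples, 5 features.
rng = np.random.default_rng(0)
U = rng.normal(size=(200, 5))
y = U @ np.array([2.0, 0.0, -1.0, 0.0, 0.5]) + 0.1 * rng.normal(size=200)
theta = fit_l1_regression(U, y)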

E. Model Boost. Our observation of the existing ML techniques applied to the collective tuning problem is that all learning models are able to achieve a satisfactory performance and accuracy rate if the respective models are trained and supervised to an acceptable level. Therefore we can conclude that all of these methods have the potential to become a strong predictor. For example, new generations of artificial neural networks such as CNNs (convolutional neural networks) and RNNs (recurrent neural networks) have recently become the forerunners among the highest-performing learning algorithms and the de-facto standard in image classification research.

Furthermore, each learning model differs in properties that may lead to different performance characteristics. As an example, some versions of ANNs perform much better at learning non-linear functions over a very large input feature set than a regression- or decision-tree-based predictor. Thus combining or "ensembling" different predictors with the UMTAC base regression model may produce better performance. In practice, the ensemble methods [67, 88] bagging and boosting have been known to produce better results [88] by aggregating sampled data sets and predictors to generate a normalized hybrid predictor function. Thus the UMTAC Model Boost component can utilize these algorithms to combine other predictors with its base regression model.
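As a rough illustration of how bagging and boosting could wrap a base regression model, the sketch below uses scikit-learn's generic ensemble regressors. The estimator choices, parameters, and synthetic data are assumptions for illustration, not a prescribed UMTAC configuration.

import numpy as np
from sklearn.ensemble import BaggingRegressor, GradientBoostingRegressor
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
U = rng.normal(size=(500, 6))                 # synthetic feature matrix
y = U[:, 0] * 3.0 + np.log1p(np.abs(U[:, 1])) + 0.1 * rng.normal(size=500)

# Bagging around an L1-regularized base regression model.
bagged = BaggingRegressor(Lasso(alpha=0.1), n_estimators=25).fit(U, y)

# Boosting as an alternative ensemble of weak learners.
boosted = GradientBoostingRegressor(n_estimators=200).fit(U, y)

# A simple normalized hybrid: average the two ensemble predictions.
hybrid_prediction = 0.5 * (bagged.predict(U) + boosted.predict(U))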

F. Model Optimizer. The main objective of the UMTAC optimizer is to speed up the training and running time of the system. One way to optimize the model is feature reduction, also known as dimensionality reduction. PCA (principal component analysis), factor analysis [66] and multidimensional scaling [50] are among the common algorithms that can detect correlation among the selected input features and scale, rotate or transform them into a reduced set of new dimensions. Additionally, the UMTAC optimizer may consider other optimizations such as efficient compression techniques for the storage and retrieval of model data, reducing memory and I/O bottlenecks, etc. It will also interact with other components of the system in a feedback loop to capture the information necessary for iterative optimization.
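For instance, a dimensionality-reduction pass over the feature matrix could look like the following scikit-learn sketch; retaining 95% of the variance and the synthetic data are assumed choices, not UMTAC requirements.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
U = rng.normal(size=(500, 40))                           # raw feature matrix
U[:, 1] = 0.9 * U[:, 0] + 0.1 * rng.normal(size=500)     # inject a correlated feature

# Standardize, then keep enough principal components to explain 95% of the variance.
U_scaled = StandardScaler().fit_transform(U)
pca = PCA(n_components=0.95)
U_reduced = pca.fit_transform(U_scaled)

print(U.shape, "->", U_reduced.shape)                    # reduced dimensionality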

G. Model Validator. A validator is required by the UMTAC design to ensure that the trained model performs to the expected standard on a given test data set T. To achieve this, the user can provide either an upper-bound limit or a threshold function tr(T) together with the input test data set. If the output exceeds this threshold, such that tr(T) < g(T), the model will be subjected to further refinement. If the expected performance is below the threshold value, the output is directed to the "Reactor Core" component.
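A minimal validator could be expressed as the following check, where both the threshold function tr and the error metric standing in for g(T) are placeholders for whatever the user supplies.

import numpy as np

def validate(predict, T_features, T_targets, tr):
    """Return True if the model meets the user-supplied threshold tr(T) on
    test set T; otherwise the model should be sent back for refinement."""
    error = np.mean((predict(T_features) - T_targets) ** 2)   # g(T): mean squared error
    return error <= tr(T_features)

# Example: accept the model only if its test MSE stays below a fixed bound.
ok = validate(lambda U: np.zeros(len(U)),          # placeholder predictor
              np.zeros((10, 3)), np.zeros(10),
              tr=lambda T: 0.05)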

H. Reactor Core. The proposed functionality of this component is twofold.

• predict performance - has the ability to generate the performance estimator functions g(k_i, U) indexed by application kernels k_i for an optimal parameter set U. If there are q kernels in the collective application, then the total performance estimate is evaluated as $\sum_{i=1}^{q} g(k_i, U)$.

• extrapolate optimal parameter values - using the UMTAC performance model and the given input parameter set U, it searches for optimal parameter values for a given number of processors or a parameter combination V. The Reactor component is responsible for performing a parameter sweep through the enumerable parameter set V ⊆ U using an appropriate technique such as gradient descent or a suitable variant (see the sketch following this list).
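One simple realization of this sweep, assuming an already-trained predictor and a small enumerable parameter set V, is a brute-force evaluation over the cross-product of candidate values; gradient-based refinement could replace the inner loop for continuous parameters. All names below are illustrative.

import itertools

def sweep_optimal_parameters(predict_latency, V):
    """Exhaustively evaluate the predictor over the enumerable parameter set V
    (a dict mapping parameter name -> candidate values) and return the best
    combination. `predict_latency` stands in for the trained predictor."""
    names = list(V)
    best_combo, best_latency = None, float("inf")
    for values in itertools.product(*(V[n] for n in names)):
        combo = dict(zip(names, values))
        latency = predict_latency(combo)
        if latency < best_latency:
            best_combo, best_latency = combo, latency
    return best_combo, best_latency

# Usage with a toy predictor that favors larger segments and lower algorithm indices.
V = {"algorithm_index": [0, 1, 2, 3], "segment_size": [1024, 4096, 16384]}
toy_predictor = lambda c: 100.0 / c["segment_size"] + 5.0 * c["algorithm_index"]
print(sweep_optimal_parameters(toy_predictor, V))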

6. FINAL REMARKS

We have summarized the main research streams on methods for collective communication tuning under statistical, empirical, machine learning, data mining, compiler and runtime-based performance optimization contexts. Such methods are the focus of research work spanning a large number of tooling frameworks, runtimes and applications, so our review could not be exhaustive. However, we have highlighted the breadth of these methods with a static and dynamic categorization, along with their relative strengths and weaknesses.

Fundamental issues in the existing collective tuning problem are the feature explosion that originates from evolving HPC technology and the resulting diversity of tuning methods that cater to individual performance contexts. This inability to converge on a unified process for tuning collectives has resulted in a generation of poor predictor models for collective operations that may only produce the best results in a given optimization context. Moreover, they will fail to produce good performance estimates for collective-based applications at future exascale execution, given that exhaustive training of predictor functions may not be a readily available option.

Several existing works on hybrid predictor functions presented in this paper may mitigate some of the aforementioned issues. However, we envisage further possibilities with the novel UMTAC architecture proposed in this paper. We believe it can address the collective feature explosion problem by combining as many input features as possible while eliminating any correlated and irrelevant features among them. The ultimate goal of the proposed architecture is to produce a strong predictor function for collective applications by coalescing the best learning models and utilizing best practices from the machine learning and data mining domains. Finally, our architecture may also enable users to investigate a collective application/kernel at any granularity to detect performance bottlenecks at early stages of prototyping, which is critical for rapid application development.

7. REFERENCES

[1] A. Aiken, P. Colella, D. Gay, S. Graham, P. Hilfinger, A. Krishnamurthy, B. Liblit, C. Miyamoto, G. Pike, L. Semenzato, et al. Titanium: A high-performance java dialect. Concurr. Comput., 10(11-13):825–836, 1998.

[2] A. Alexandrov, M. F. Ionescu, K. E. Schauser, and C. Scheiman. Loggp: incorporating long messages into the logp model - one step closer towards a realistic model for parallel computation. In Proceedings of the seventh annual ACM symposium on Parallel algorithms and architectures, pages 95–105. ACM, 1995.

[3] M. N. Anyanwu and S. G. Shiva. Comparative analysis of serial decision tree classification algorithms. International Journal of Computer Science and Security, 3(3):230–240, 2009.

[4] I. T. Association et al. InfiniBand Architecture Specification: Release 1.0. InfiniBand Trade Association, 2000.

[5] C. Bell, D. Bonachea, R. Nishtala, and K. Yelick. Optimizing bandwidth limited problems using one-sided communication and overlap. In Proceedings 20th IEEE International Parallel & Distributed Processing Symposium, pages 10–pp. IEEE, 2006.

[6] A. Bhattacharyya and T. Hoefler. Pemogen: Automatic adaptive performance modeling during program runtime. In Proceedings of the 23rd international conference on Parallel architectures and compilation, pages 393–404. ACM, 2014.

[7] A. Bhattacharyya, G. Kwasniewski, and T. Hoefler. Using compiler techniques to improve automatic performance modeling. In 2015 International Conference on Parallel Architecture and Compilation (PACT), pages 468–479. IEEE, 2015.


[8] J. Bilmes, K. Asanovic, C.-W. Chin, and J. Demmel. Optimizing matrix multiply using phipac: a portable, high-performance, ansi c coding methodology. In ACM International Conference on Supercomputing 25th Anniversary Volume, pages 253–260. ACM, 2014.

[9] D. Bonachea. Gasnet specification, v1.1. 2002.

[10] G. Bronevetsky. Communication-sensitive static dataflow for parallel message passing applications. In Proceedings of the 7th annual IEEE/ACM International Symposium on Code Generation and Optimization, pages 1–12. IEEE Computer Society, 2009.

[11] A. Calotoiu, T. Hoefler, M. Poke, and F. Wolf. Using automated performance modeling to find scalability bugs in complex codes. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, page 45. ACM, 2013.

[12] W. W. Carlson, J. M. Draper, D. E. Culler, K. Yelick, E. Brooks, and K. Warren. Introduction to upc and language specification. Technical Report CCS-TR-99-157, IDA Center for Computing Sciences, 1999.

[13] M. Chaarawi, J. M. Squyres, E. Gabriel, and S. Feki. A tool for optimizing runtime parameters of open mpi. In European Parallel Virtual Machine/Message Passing Interface Users' Group Meeting, pages 210–217. Springer, 2008.

[14] D. Culler, R. Karp, D. Patterson, A. Sahay, K. E. Schauser, E. Santos, R. Subramonian, and T. Von Eicken. Logp: Towards a realistic model of parallel computation. In ACM Sigplan Notices, volume 28, pages 1–12. ACM, 1993.

[15] A. Danalis, A. Brown, L. Pollock, M. Swany, and J. Cavazos. Gravel: A communication library to fast path mpi. In European Parallel Virtual Machine/Message Passing Interface Users' Group Meeting, pages 111–119. Springer, 2008.

[16] A. Danalis, K.-Y. Kim, L. Pollock, and M. Swany. Transformations to parallel codes for communication-computation overlap. In Proceedings of the 2005 ACM/IEEE conference on Supercomputing, page 58. IEEE Computer Society, 2005.

[17] A. Danalis, L. Pollock, and M. Swany. Automatic mpi application transformation with asphalt. In 2007 IEEE International Parallel and Distributed Processing Symposium, pages 1–8. IEEE, 2007.

[18] A. Danalis, L. Pollock, M. Swany, and J. Cavazos. Implementing an open64-based tool for improving the performance of mpi programs. 2008.

[19] A. Danalis, L. Pollock, M. Swany, and J. Cavazos. Mpi-aware compiler optimizations for improving communication-computation overlap. In Proceedings of the 23rd international conference on Supercomputing, pages 316–325. ACM, 2009.

[20] D. Das, M. Gupta, R. Ravindran, W. Shivani, P. Sivakeshava, and R. Uppal. Compiler-controlled extraction of computation-communication overlap in mpi applications. In Parallel and Distributed Processing, 2008. IPDPS 2008. IEEE International Symposium on, pages 1–8. IEEE, 2008.

[21] L. D. de Cerio, M. Valero-García, and A. Gonzalez. A method for exploiting communication/computation overlap in hypercubes. Parallel Computing, 24(2):221–245, 1998.

[22] S. Di Girolamo, P. Jolivet, K. D. Underwood, and T. Hoefler. Exploiting offload enabled network interfaces. In 2015 IEEE 23rd Annual Symposium on High-Performance Interconnects, pages 26–33. IEEE, 2015.

[23] T. G. Dietterich and E. B. Kong. Machine learning bias, statistical bias, and statistical variance of decision tree algorithms. Technical report, Department of Computer Science, Oregon State University, 1995.

[24] G. E. Fagg, J. Pjesivac-Grbovic, G. Bosilca, T. Angskun, J. Dongarra, and E. Jeannot. Flexible collective communication tuning architecture applied to open mpi. In Euro PVM/MPI, 2006.

[25] A. Faraj and X. Yuan. Automatic generation and tuning of mpi collective communication routines. In Proceedings of the 19th annual international conference on Supercomputing, pages 393–402. ACM, 2005.

[26] A. Faraj, X. Yuan, and D. Lowenthal. Star-mpi: self tuned adaptive routines for mpi collective operations. In Proceedings of the 20th annual international conference on Supercomputing, pages 199–208. ACM, 2006.

[27] J. Ferrante, K. J. Ottenstein, and J. D. Warren. The program dependence graph and its use in optimization. ACM Transactions on Programming Languages and Systems (TOPLAS), 9(3):319–349, 1987.

[28] R. A. Finkel and J. L. Bentley. Quad trees: a data structure for retrieval on composite keys. Acta Informatica, 4(1):1–9, 1974.

[29] M. Frigo and S. G. Johnson. Fftw: An adaptive software architecture for the fft. In Acoustics, Speech and Signal Processing, 1998. Proceedings of the 1998 IEEE International Conference on, volume 3, pages 1381–1384. IEEE, 1998.

[30] E. Gallardo, J. Vienne, L. Fialho, P. Teller, and J. Browne. Mpi advisor: a minimal overhead tool for mpi library performance tuning. In Proceedings of the 22nd European MPI Users' Group Meeting, page 6. ACM, 2015.

[31] G. Gopalakrishnan, R. M. Kirby, S. Siegel, R. Thakur, W. Gropp, E. Lusk, B. R. De Supinski, M. Schulz, and G. Bronevetsky. Formal analysis of mpi-based parallel programs. Communications of the ACM, 54(12):82–91, 2011.

[32] R. L. Graham, S. Poole, P. Shamis, G. Bloch, N. Bloch, H. Chapman, M. Kagan, A. Shahar, I. Rabinovitz, and G. Shainer. Connectx-2 infiniband management queues: First investigation of the new support for network offloaded collective operations. In Proceedings of the 2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing, pages 53–62. IEEE Computer Society, 2010.

[33] R. Hecht-Nielsen. Theory of the backpropagation neural network. In Neural Networks, 1989. IJCNN., International Joint Conference on, pages 593–605. IEEE, 1989.

[34] R. W. Hockney. The communication challenge for mpp: Intel paragon and meiko cs-2. Parallel Computing, 20(3):389–398, 1994.

[35] T. Hoefler. Principles for coordinated optimization of computation and communication in large-scale parallel systems. ProQuest, 2008.

[36] T. Hoefler. Principles for coordinated optimization of computation and communication in large-scale parallel systems. ProQuest, 2008.

[37] T. Hoefler, P. Gottschling, and A. Lumsdaine. Leveraging non-blocking collective communication in high-performance applications. In Proceedings of the twentieth annual symposium on Parallelism in algorithms and architectures, pages 113–115. ACM, 2008.

[38] T. Hoefler, P. Gottschling, A. Lumsdaine, and W. Rehm. Optimizing a conjugate gradient solver with non-blocking collective operations. Parallel Computing, 33(9):624–633, 2007.

[39] T. Hoefler and A. Lumsdaine. Design, implementation, and usage of libnbc. Open Systems Lab, Indiana University, Tech. Rep, 8, 2006.

[40] T. Hoefler, T. Mehlan, F. Mietke, and W. Rehm. A survey of barrier algorithms for coarse grained supercomputers. 2004.

[41] T. Hoefler and D. Moor. Energy, memory, and runtime tradeoffs for implementing collective communication operations. Journal of Supercomputing Frontiers and Innovations, 1(2):58–75, 2014.

[42] C. Iancu, P. Husbands, and W. Chen. Message strip-mining heuristics for high speed networks. In International Conference on High Performance Computing for Computational Science, pages 424–437. Springer, 2004.

[43] H. Kaiser, M. Brodowicz, and T. Sterling. Parallex: an advanced parallel execution model for scaling-impaired applications. In 2009 International Conference on Parallel Processing Workshops, pages 394–401. IEEE, 2009.

[44] A. Karwande, X. Yuan, and D. K. Lowenthal. Cc–mpi: a compiled communication capable mpi prototype for ethernet switched clusters. In ACM Sigplan Notices, volume 38, pages 95–106. ACM, 2003.

[45] T. Kielmann, H. E. Bal, and K. Verstoep. Fast measurement of logp parameters for message passing platforms. In International Parallel and Distributed Processing Symposium, pages 1176–1183. Springer, 2000.

[46] T. Kielmann, H. E. Bal, and K. Verstoep. Fast measurement of logp parameters for message passing platforms. In International Parallel and Distributed Processing Symposium, pages 1176–1183. Springer, 2000.

[47] S. P. Kini, J. Liu, J. Wu, P. Wyckoff, and D. K. Panda. Fast and scalable barrier using rdma and multicast mechanisms for infiniband-based clusters. In European Parallel Virtual Machine/Message Passing Interface Users' Group Meeting, pages 369–378. Springer, 2003.

[48] E. Kissel and M. Swany. Photon: Remote memory access middleware for high-performance runtime systems. In Parallel and Distributed Processing Symposium Workshops, 2016 IEEE International, pages 1736–1743. IEEE, 2016.

[49] N. Koziris, A. Sotiropoulos, and G. Goumas. A pipelined schedule to minimize completion time for loop tiling with computation and communication overlapping. Journal of Parallel and Distributed Computing, 63(11):1138–1151, 2003.

[50] J. B. Kruskal. Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis. Psychometrika, 29(1):1–27, 1964.

[51] E. Lusk, S. Huss, B. Saphir, and M. Snir. Mpi: A message-passing interface standard, 2009.

[52] A. Mamidala, J. Liu, and D. K. Panda. Efficient barrier and allreduce on iba clusters using hardware multicast and adaptive algorithms. In IEEE Cluster Computing. Citeseer, 2004.

[53] A. Y. Ng. Feature selection, l1 vs. l2 regularization, and rotational invariance. In Proceedings of the twenty-first international conference on Machine learning, page 78. ACM, 2004.

[54] K. J. Ottenstein. Data-flow graphs as an intermediate program form. 1978.

[55] K. J. Ottenstein. An intermediate program form based on a cyclic data-dependency graph. Michigan Technological University, Department of Mathematical and Computer Science, 1981.

[56] S. Pellegrini, J. Wang, T. Fahringer, and H. Moritsch. Optimizing mpi runtime parameter settings by using machine learning. In European Parallel Virtual Machine/Message Passing Interface Users' Group Meeting, pages 196–206. Springer, 2009.

[57] J. Pjesivac-Grbovic. Towards automatic and adaptive optimizations of mpi collective operations. 2007.

[58] J. Pjesivac-Grbovic. Towards automatic and adaptive optimizations of mpi collective operations. pages 70–79, 2007.

[59] J. Pjesivac-Grbovic, G. Bosilca, G. E. Fagg, T. Angskun, and J. J. Dongarra. Decision trees and mpi collective algorithm selection problem. In European Conference on Parallel Processing, pages 107–117. Springer, 2007.

[60] J. Pjesivac-Grbovic, G. Bosilca, G. E. Fagg, T. Angskun, and J. J. Dongarra. Decision trees and mpi collective algorithm selection problem. In European Conference on Parallel Processing, pages 107–117. Springer, 2007.

[61] J. Pjesivac-Grbovic, G. Bosilca, G. E. Fagg, T. Angskun, and J. J. Dongarra. Mpi collective algorithm selection and quadtree encoding. Parallel Computing, 33(9):613–623, 2007.

[62] J. Pjesivac-Grbovic, G. E. Fagg, T. Angskun, G. Bosilca, and J. J. Dongarra. Mpi collective algorithm selection and quadtree encoding. In European Parallel Virtual Machine/Message Passing Interface Users' Group Meeting, pages 40–48. Springer, 2006.

[63] Y. Qian and A. Afsahi. Efficient shared memory and rdma based collectives on multi-rail qsnetii smp clusters. Cluster Computing, 11(4):341–354, 2008.


[64] D. Quinlan. Rose: Compiler support for object-oriented frameworks. Parallel Processing Letters, 10(02n03):215–226, 2000.

[65] R. Rabenseifner. Automatic profiling of mpi applications with hardware performance counters. In European Parallel Virtual Machine/Message Passing Interface Users' Group Meeting, pages 35–42. Springer, 1999.

[66] A. C. Rencher. Methods of Multivariate Analysis. North-Holland, 2012.

[67] L. Rokach. Pattern classification using ensemble methods, volume 75. World Scientific, 2010.

[68] J. C. Sancho, K. J. Barker, D. J. Kerbyson, and K. Davis. Quantifying the potential benefit of overlapping communication and computation in large-scale scientific applications. In SC 2006 Conference, Proceedings of the ACM/IEEE, pages 17–17. IEEE, 2006.

[69] J. C. Sancho and D. J. Kerbyson. Improving the performance of multiple conjugate gradient solvers by exploiting overlap. In European Conference on Parallel Processing, pages 688–697. Springer, 2008.

[70] S. Sankaran, J. M. Squyres, B. Barrett, V. Sahay, A. Lumsdaine, J. Duell, P. Hargrove, and E. Roman. The lam/mpi checkpoint/restart framework: System-initiated checkpointing. International Journal of High Performance Computing Applications, 19(4):479–493, 2005.

[71] M. Schulz and B. R. De Supinski. Pn mpi tools: A whole lot greater than the sum of their parts. In Proceedings of the 2007 ACM/IEEE conference on Supercomputing, page 30. ACM, 2007.

[72] D. R. Shires and L. Pollock. Program flow graph construction for static analysis of explicitly parallel message-passing programs. Technical report, DTIC Document, 2000.

[73] Q. O. Snell, A. R. Mikler, and J. L. Gustafson. Netpipe: A network protocol independent performance evaluator. In IASTED international conference on intelligent information management and systems, volume 6. Washington, DC, USA, 1996.

[74] J. M. Squyres and A. Lumsdaine. The component architecture of open mpi: Enabling third-party collective algorithms. In Component Models and Systems for Grid Applications, pages 167–185. Springer, 2005.

[75] M. M. Strout, B. Kreaseck, and P. D. Hovland. Data-flow analysis for mpi programs. In 2006 International Conference on Parallel Processing (ICPP'06), pages 175–184. IEEE, 2006.

[76] H. Subramoni, K. Kandalla, S. Sur, and D. K. Panda. Design and evaluation of generalized collective communication primitives with overlap using connectx-2 offload engine. In 2010 18th IEEE Symposium on High Performance Interconnects, pages 40–49. IEEE, 2010.

[77] S. Sur, U. K. R. Bondhugula, A. Mamidala, H.-W. Jin, and D. K. Panda. High performance rdma based all-to-all broadcast for infiniband clusters. In International Conference on High-Performance Computing, pages 148–157. Springer, 2005.

[78] S. Sur, U. K. R. Bondhugula, A. Mamidala, H.-W. Jin, and D. K. Panda. High performance rdma based all-to-all broadcast for infiniband clusters. In International Conference on High-Performance Computing, pages 148–157. Springer, 2005.

[79] R. Thakur and W. D. Gropp. Improving the performance of collective operations in mpich. In European Parallel Virtual Machine/Message Passing Interface Users' Group Meeting, pages 257–267. Springer, 2003.

[80] R. Thakur, R. Rabenseifner, and W. Gropp. Optimization of collective communication operations in mpich. International Journal of High Performance Computing Applications, 19(1):49–66, 2005.

[81] S. S. Vadhiyar, G. E. Fagg, and J. Dongarra. Automatically tuned collective communications. In Proceedings of the 2000 ACM/IEEE Conference on Supercomputing, page 3. IEEE Computer Society, 2000.

[82] V. N. Vapnik. Statistical learning theory. Xu JH and Zhang XG, translation. Beijing: Publishing House of Electronics Industry, 2004, 1998.

[83] R. Vuduc, J. W. Demmel, and J. A. Bilmes. Statistical models for empirical search-based performance tuning. International Journal of High Performance Computing Applications, 18(1):65–94, 2004.

[84] M. Welsh, D. Oppenheimer, and D. Culler. U-net/sle: A java-based user-customizable virtual network interface. Scientific Programming, 7(2):147–156, 1999.

[85] R. C. Whaley and J. J. Dongarra. Automatically tuned linear algebra software. In Proceedings of the 1998 ACM/IEEE conference on Supercomputing, pages 1–27. IEEE Computer Society, 1998.

[86] R. C. Whaley, A. Petitet, and J. J. Dongarra. Automated empirical optimizations of software and the atlas project. Parallel Computing, 27(1):3–35, 2001.

[87] F. Wolf. Automatic performance modeling of hpc applications.

[88] M. Wozniak, M. Grana, and E. Corchado. A survey of multiple classifier systems as hybrid systems. Information Fusion, 16:3–17, 2014.