
An Auto-Tuning Framework for Parallel Multicore Stencil Computations

Shoaib Kamil†‡, Cy Chan†⋆, Leonid Oliker†, John Shalf†, Samuel Williams†

†CRD/NERSC, Lawrence Berkeley National Laboratory, Berkeley, CA 94720
‡EECS Department, University of California at Berkeley, Berkeley, CA 94720
⋆CSAIL, Massachusetts Institute of Technology, Cambridge, MA 02139

Abstract

Although stencil auto-tuning has shown tremendous potential in effectively utilizing architectural resources, it has hitherto been limited to single kernel instantiations; in addition, the large variety of kernels used in practice makes this computation pattern difficult to assemble into a library. This work presents a stencil auto-tuning framework that significantly advances programmer productivity by automatically converting a straightforward Fortran 95 stencil expression to tuned implementations in Fortran, C, or CUDA, thus allowing performance portability across diverse computer architectures, including the AMD Barcelona, Intel Nehalem, Sun Victoria Falls, and the latest NVIDIA GPUs. Results show that our generalized methodology delivers significant performance gains of up to 22× speedup over the reference serial implementation. Overall we demonstrate that such domain-specific auto-tuners hold enormous promise for architectural efficiency, programmer productivity, performance portability, and algorithmic adaptability on existing and emerging multicore systems.

1 Introduction

Petascale systems are becoming available to the computational science community with an increasing diversity of architectural models. These high-end systems, like all computing platforms, will increasingly rely on software-controlled on-chip parallelism to manage the trade-offs between performance, energy efficiency, and reliability [1]. This results in a daunting problem for performance-oriented programmers. Scientific progress will be substantially slowed without productive programming models and tools that allow programmers to efficiently utilize massive on-chip concurrency. The challenge of our decade is to create new programming models and tools that enable concise program expression while exposing fine-grained, explicit parallelism to the hardware across a diversity of chip multiprocessors (CMPs). Exotic programming models and domain-specific languages have been proposed to meet this challenge but run counter to the desire to preserve the enormous investment in existing software infrastructure.

This work presents a novel approach for addressing these conflicting requirements for stencil-based computations using a generalized auto-tuning framework. Our framework takes as input a straightforward Fortran 95 stencil expression and automatically generates tuned implementations in Fortran, C, or CUDA, thus providing performance portability across diverse architectures that range from conventional multicore processors to some of the latest graphics processing units (GPUs). This approach enables a viable migration path from existing applications to codes that provide scalable intra-socket parallelism across a diversity of emerging chip multiprocessors — preserving portability and allowing for productive code design and evolution.

Our work addresses the performance and portability of stencil (nearest-neighbor) computations, a class of algorithms at the heart of many calculations involving structured (rectangular) grids, including both implicit and explicit partial differential equation (PDE) solvers. These solvers constitute a large fraction of scientific applications in such diverse areas as heat diffusion, climate science, electromagnetics, and fluid dynamics. Previous efforts have successfully developed stencil-specific auto-tuners [4, 14, 21], which search over a set of optimizations and their parameters to minimize runtime and provide performance portability across a variety of architectural designs. Unfortunately, the stencil auto-tuning work to date has been limited to static kernel instantiations with pre-specified data structures. As such, these tuners are not suitable for encapsulation into libraries usable by most real-world applications.

The focus of our study is to examine the potential of a generalized stencil auto-parallelization and auto-tuning framework, which can effectively optimize a broad range of stencil computations with varying data structures, dimensionalities, and topologies. Our novel methodology first builds an abstract representation from a straightforward user-provided Fortran stencil problem specification. We then use this intermediate representation to explore numerous auto-tuning transformations. Finally, our infrastructure is capable of generating a variety of shared-memory parallel (SMP) backend code instantiations, allowing optimized performance across a broad range of architectures.

To demonstrate the flexibility of our framework, we examine three stencil computations with a variety of different computational characteristics, arising from the 3D Laplacian, Divergence, and Gradient differential operators. Auto-parallelized and auto-tuned performance is then shown on several leading multicore platforms, including the AMD Barcelona, Sun Victoria Falls, NVIDIA GTX280, and the recently released Intel Nehalem. Results show that our generalized methodology can deliver significant performance gains of up to 22× speedup compared with the reference serial version, while allowing portability across diverse CMP technologies. Furthermore, while our framework only required a few minutes of human effort to instrument each stencil code, the resulting code achieved performance comparable to previous hand-optimized code that required several months of tedious work to produce.

Overall we show that such domain-specific auto-tuners hold enormous promise for architectural efficiency, programmer productivity, performance portability, and algorithmic adaptability on existing and future multicore systems.

2 Related Work

Auto-tuning has been applied to a number of scientific kernels, most successfully to dense and sparse linear algebra. ATLAS [24] is a system that implements BLAS (basic linear algebra subroutines) and some LAPACK [15] kernels using compile-time auto-tuning. Similarly, OSKI [23] applies auto-tuning techniques to sparse linear algebra kernels, using a combination of compile-time and run-time tuning. FFTW [8] is a similar system for producing auto-tuned, efficient signal processing kernels.

Unlike the systems above, SPIRAL [19] is a recent auto-tuning framework that implements a compiler for a specific class of kernels, producing high-performance tuned signal processing kernels. While previous auto-tuners relied on simple string manipulation, SPIRAL's designers defined an algebra suitable for describing a class of kernels, and built a tuning system for that class. Our work's goal is to create a similar system for stencil auto-tuning.

Work on optimizing stencil calculations has primarily focused on tiling optimizations [16, 20, 21] that attempt to exploit locality by performing operations on cache-sized blocks of data before moving on to the next block. A study of stencil optimization [3] on (single-core) cache-based platforms found that tiling optimizations were primarily effective when the problem size exceeded the on-chip cache's ability to exploit temporal recurrences. Previous work in stencil auto-tuning for multicore and GPUs [3, 4] demonstrated the potential for greatly speeding up stencil kernels, but concentrated on a single kernel (the 7-point Laplacian in 3D). Because the tuning system was hand-coded for that particular kernel, it cannot easily be ported to other stencil instantiations. In addition, it does not allow easy composition of different optimizations, or the integration of different search strategies such as hill-climbing.

Compiler optimizations for stencils concentrate on the polyhedral model [2] for improving performance by altering traversals to minimize (modeled) memory overhead. Auto-parallelization of stencil kernels is also possible using the polyhedral model [6]. Future work will combine and compare the polyhedral model with our auto-tuning system to explore the tradeoffs of simple models and comprehensive auto-tuning. Some planned optimizations (particularly those that alter the data structures of the grid) are currently not handled by the polyhedral model.

The goal of our framework is not just to automatically generate parallel stencil codes and tune the associated parallelization parameters, but also to tune the parallelization parameters in tandem with lower-level serial optimizations. Doing so will find the globally optimal combination for a given parallel machine.

Previous work also includes the ParAgent [18] tool, which uses static analysis to help minimize communication between compute nodes in a distributed memory system. In contrast, our framework addresses both locality and parallelization parameters on a shared memory system. Like our framework, the PLuTo system [6] strives to simultaneously optimize parameters for both data locality and parallelization. Their approach determines a parameterization through the minimization of a unified cost function that incorporates aspects of both intra-tile locality and inter-tile communication. Our work differs from these methods primarily in that we leverage auto-tuning to find optimal solutions, which may provide better results in cases where machine characteristics are difficult to model effectively using static analysis.

This work presents a novel advancement in auto-tuning stencil kernels by building a framework that incorporates experience gained from building kernel-specific tuners. In particular, the framework has the applicability of a general domain-specific compiler like SPIRAL, while supporting multiple backend architectures. In addition, modularity allows the framework to, in the future, support data structure transformations and additional front- and back-ends using a simple plugin architecture. A preliminary overview of our methodology was presented at a recent Cray User's Group Workshop [13]; our current work extends this framework for a wider spectrum of optimizations and architectures, including Victoria Falls and GPUs, as well as incorporating performance model expectations and analysis.

3 Stencil & Architectures

Stencil computations on regular grids are at the core of a wide range of scientific codes. These applications are often implemented using iterative finite-difference techniques that sweep over a spatial grid, performing nearest-neighbor computations called stencils. In a stencil operation, each point in a multidimensional grid is updated with weighted contributions from a subset of its neighbors within a fixed distance in both time and space, locally solving a discretized version of the PDE for that data element. These operations are then used to build solvers that range from simple Jacobi iterations to complex multigrid and adaptive mesh refinement methods.

             Cache Refs.  Flops    Compulsory  Write-back  Write-Allocate  Capacity  Naïve      Tuned      Expected
             per Stencil  per      Read        Traffic     Traffic         Miss      Arith.     Arith.     Auto-tuning
Stencil      (doubles)    Stencil  Traffic                                 Traffic   Intensity  Intensity  Benefit
Laplacian    8            8        8 Bytes     8 Bytes     8 Bytes         16 Bytes  0.20       0.33       1.66×
Divergence   7            8        24 Bytes    8 Bytes     8 Bytes         16 Bytes  0.14       0.20       1.40×
Gradient     9            6        8 Bytes     24 Bytes    24 Bytes        16 Bytes  0.08       0.11       1.28×

Table 1. Average stencil characteristics. Arithmetic intensity is defined as total flops / total bytes. Capacity misses represent a reasonable estimate for cache-based superscalar processors. Auto-tuning benefit is a reasonable estimate based on the improvement in arithmetic intensity, assuming a memory-bound kernel without conflict misses.

Stencil calculations perform repeated sweeps through data structures that are typically much larger than the data caches of modern microprocessors. As a result, these computations generate high memory traffic for relatively little computation, causing performance to be bound by memory throughput rather than floating-point operations. Reorganizing these stencil calculations to take full advantage of memory hierarchies has therefore been the subject of much investigation over the years.

Although these recent studies have successfully shown auto-tuning's ability to achieve performance portability across the breadth of existing multicore processors, they have been constrained to a single stencil instantiation, thus failing to provide broad applicability to general stencil kernels due to the immense effort required to hand-write auto-tuners. In this work, we rectify this limitation by evolving the auto-tuning methodology into a generalized code generation framework, allowing significant flexibility compared to previous approaches that use prepackaged sets of limited-functionality library routines. Our approach complements existing compiler technology and accommodates new architecture-specific languages such as CUDA. Implementation of these kernels using existing languages and compilers destroys domain-specific knowledge. As such, compilers have difficulty proving that code transformations are safe, and even more difficulty transforming data layout in memory. The framework side-steps the complex task of analysis and presents a simple, uniform, and familiar interface for expressing stencil kernels as a conventional Fortran expression — while presenting a proof-of-concept for other potential classes of domain-specific generalized auto-tuners.

3.1 Benchmark Kernels

To show the broad utility of our framework, we select three conceptually easy-to-understand, yet deceptively difficult-to-optimize stencil kernels arising from the application of the finite difference method to the Laplacian (u′ ← ∇²u), Divergence (u ← ∇·F), and Gradient (F ← ∇u) differential operators. Details of these kernels are shown in Figure 1 and Table 1. All three operators are implemented using central differences on a 3D rectahedral block-structured grid via Jacobi's method (out-of-place), and benchmarked on a 256³ grid. Note that although the code generator has no restrictions on data structure, for brevity we only explore the use of a structure of arrays for vector fields. As described below, these kernels have such low arithmetic intensity that they are expected to be memory-bandwidth bound, and thus deliver performance approximately equal to the product of their arithmetic intensity — defined as the ratio of arithmetic operations to memory traffic — and the system stream bandwidth.

(a) Laplacian:

```fortran
do k=2,nz-1,1
  do j=2,ny-1,1
    do i=2,nx-1,1
      uNext(i,j,k) = alpha*u(i,j,k) +              &
                     beta*( u(i+1,j,k)+u(i-1,j,k)+ &
                            u(i,j+1,k)+u(i,j-1,k)+ &
                            u(i,j,k+1)+u(i,j,k-1) )
    enddo
  enddo
enddo
```

(b) Divergence:

```fortran
do k=2,nz-1,1
  do j=2,ny-1,1
    do i=2,nx-1,1
      u(i,j,k) = alpha*( x(i+1,j,k)-x(i-1,j,k) ) + &
                 beta *( y(i,j+1,k)-y(i,j-1,k) ) + &
                 gamma*( z(i,j,k+1)-z(i,j,k-1) )
    enddo
  enddo
enddo
```

(c) Gradient:

```fortran
do k=2,nz-1,1
  do j=2,ny-1,1
    do i=2,nx-1,1
      x(i,j,k) = alpha*( u(i+1,j,k)-u(i-1,j,k) )
      y(i,j,k) = beta *( u(i,j+1,k)-u(i,j-1,k) )
      z(i,j,k) = gamma*( u(i,j,k+1)-u(i,j,k-1) )
    enddo
  enddo
enddo
```

Figure 1. (a) Laplacian, (b) Divergence, and (c) Gradient stencils, shown as code passed to the parser. The original figure also includes a 3D visualization of each nearest-neighbor stencil operator and its memory access pattern as the stencil sweeps from left to right, with color denoting the Cartesian component of the vector fields (scalar fields are gray).

Table 1 presents the characteristics of the three stencil operators and sets performance expectations. Similar to the 3C's cache model [12], we break memory traffic into compulsory read, write-back, write-allocate, and capacity misses. A naïve implementation will produce memory traffic equal to the sum of these components, and will therefore achieve the arithmetic intensity shown (total flops / total bytes), ranging from 0.20 down to 0.08. The auto-tuning effort explored in this work attempts to improve performance by eliminating capacity misses; thus it is possible to bound the resultant arithmetic intensity based only on compulsory read, write-back, and write-allocate memory traffic. For the three examined kernels, capacity misses account for dramatically different fractions of the total memory traffic. Thus, we can also bound the resultant potential performance boost from auto-tuning per kernel — 1.66×, 1.40×, and 1.28× for the Laplacian, Divergence, and Gradient respectively. Moreover, note that the kernels' auto-tuned arithmetic intensities vary substantially from each other, ranging from 0.33 down to 0.11. As such, performance is expected to vary proportionally, as predicted by the Roofline model [25].
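As a concrete check on the Table 1 numbers (a worked example using the Laplacian row): eliminating the 16 bytes of capacity-miss traffic per point raises the intensity from 8 flops per 40 bytes to 8 flops per 24 bytes, giving the quoted benefit:

```latex
\[
\frac{I_{\mathrm{tuned}}}{I_{\mathrm{naive}}}
  = \frac{8/(8+8+8)}{8/(8+8+8+16)}
  = \frac{0.33}{0.20}
  \approx 1.66\times
\]
```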

Core Architecture         AMD Barcelona    Intel Nehalem    Sun Niagara2       NVIDIA GT200 SM
Type                      superscalar,     superscalar,     HW multithread,    HW multithread,
                          out of order     out of order     dual issue†        SIMD
Clock (GHz)               2.30             2.66             1.16               1.3
DP GFlop/s                9.2              10.7             1.16               2.6
Local-Store               —                —                —                  16KB∗∗
L1 Data Cache             64KB             32KB             8KB                —
Private L2 Cache          512KB            256KB            —                  —

System                    Opteron 2356     Xeon X5550       UltraSparc T5140   GeForce GTX280
                          (Barcelona)      (Gainestown)     (Victoria Falls)
# Sockets                 2                2                2                  1
Cores per Socket          4                4                8                  30
Threads per Socket‡       4                8                64                 240
Primary Memory            HW prefetch      HW prefetch      Multithreading     Multithreading
Parallelism Paradigm                                                           with coalescing
Shared Last-Level Cache   2×2MB            2×8MB            2×4MB              —
DRAM Capacity             16GB             12GB             32GB               1GB (device)
                                                                               4GB (host)
DRAM Pin                  21.33            51.2             42.66 (read)       141 (device)
Bandwidth (GB/s)                                            21.33 (write)      4 (PCIe)
DP GFlop/s                73.6             85.3             18.7               78
DP Flop:Byte Ratio        3.45             1.66             0.29               0.55
Threading                 Pthreads         Pthreads         Pthreads           CUDA 2.0
Compiler                  gcc 4.1.2        gcc 4.3.2        gcc 4.2.0          nvcc 0.2.1221

Table 2. Architectural summary of evaluated platforms. †Each of 2 thread groups may issue up to 1 instruction. ‡A CUDA thread block is considered 1 thread. ∗∗16 KB local-store shared by all concurrent CUDA thread blocks on the SM.

3.2 Experimental Platforms

To evaluate our stencil auto-tuning framework, we examine a broad range of leading multicore designs: AMD Barcelona, Intel Nehalem, Sun Victoria Falls, and NVIDIA GTX280. A summary of key architectural features of the evaluated systems appears in Table 2; space limitations preclude more detailed descriptions of the systems. Observe that, since all architectures have flop:DRAM byte ratios significantly greater than the arithmetic intensities mentioned in Section 3.1, we expect all architectures to be memory bound. Note that the sustained system power data was obtained using an in-line digital power meter while the node was under a full computational load, while chip and GPU card power is based on the maximum Thermal Design Power (TDP), extrapolated from manufacturers' datasheets. Although the node architectures are diverse, most accurately represent building blocks of current and future ultra-scale supercomputing systems.

4 Auto-tuning Framework

Stencil applications use a wide variety of data structures in their implementations, representing grids of multiple dimensionalities and topologies. Furthermore, the details of the underlying stencil applications call for a myriad of numerical kernel operations. Thus, building a static auto-tuning library in the spirit of ATLAS [24] or OSKI [23] to implement the many different stencil kernels is infeasible.

This work presents a proof-of-concept of a generalized auto-tuning approach, which uses a domain-specific transformation and code-generation framework combined with a fully-automated search to replace stencil kernels with their optimized versions. The interaction with the application program begins with simple annotation of the loops targeted for optimization. The search system then extracts each designated loop and builds a test harness for that particular kernel instantiation. Next, the search system uses the transformation and generation framework to apply our suite of auto-tuning optimizations, running the test harness for each candidate implementation to determine its optimal performance. After the search is complete, the optimized implementation is built into an application-specific library that is called in place of the original. The overall flow through the auto-tuning system is shown in Figure 2.

Figure 2. Stencil auto-tuning framework flow. Readable domain-specific code is parsed into an abstract syntax tree representation, strategy and search engines apply transformations in the context of the specific problem, code is generated using specific target backends (Fortran, C with pthreads, CUDA) to produce a myriad of equivalent optimized implementations plus a test harness, and the best-performing implementation and configuration parameters are determined via search.

4.1 Front-End Parsing

The front-end to the transformation engine parses a description of the stencil in a domain-specific language. For simplicity, we use a subset of Fortran 95, since many stencil applications are already written in some flavor of Fortran. Due to the modularity of the transformation engine, a variety of front-end implementations are possible. The result of parsing in our preliminary implementation is an Abstract Syntax Tree (AST) representation of the stencil, on which subsequent transformations are performed.
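For illustration only, the following C sketch shows one plausible shape for such an AST: a stencil expression becomes a tree of constants, offset grid accesses, and binary operations. The actual framework stores its AST as Lisp data (Section 5), so every name and field here is hypothetical.

```c
/* Hypothetical AST node for stencil expressions (illustrative only; the
 * framework's real AST is Lisp data). An expression such as
 * alpha*u(i,j,k) + beta*u(i+1,j,k) becomes a small tree of these nodes. */
typedef enum { NODE_CONST, NODE_ACCESS, NODE_BINOP } node_kind;

typedef struct ast_node {
  node_kind kind;
  union {
    double value;                    /* NODE_CONST: a scalar coefficient    */
    struct {                         /* NODE_ACCESS: grid(i+di, j+dj, k+dk) */
      const char *grid;              /* grid identifier, e.g. "u"           */
      int di, dj, dk;                /* constant offsets from the loop point */
    } access;
    struct {                         /* NODE_BINOP: lhs op rhs              */
      char op;                       /* '+', '-', or '*'                    */
      struct ast_node *lhs, *rhs;
    } binop;
  } u;
} ast_node;
```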

4.2 Stencil Kernel Breadth

Currently, the auto-tuning system handles a specific class of stencil kernels of certain dimensionality and code structure. In particular, the auto-tuning system assumes a 2D or 3D rectahedral grid, and a stencil based on arithmetic operations and table lookups (array accesses). Future work will further extend the generality to allow grids of arbitrary dimensionality. Although this proof-of-concept framework does auto-tune serial kernels with imperfect loop nests, the parallel tuning relies on perfect nesting in order to determine legal domain decompositions and NUMA (non-uniform memory access) page mapping initialization — future framework extensions will incorporate imperfectly nested loops. Additionally, we currently treat boundary calculations as a separate stencil, although future versions may integrate stencils with overlapping traversals into a single stencil. Overall, our auto-tuning system can target and accelerate a large group of stencil kernels currently in use, while active research continues to extend the generality of the framework.

5 Optimization & Codegen

The heart of the auto-tuning framework is the transformation engine and the backend code generation for both serial and parallel implementations. The transformation engine is in many respects similar to a source-to-source translator, but it exploits domain-specific knowledge of the problem space to implement transformations that would otherwise be difficult to implement as a fully generalized loop optimization within a conventional compiler. Serial backend targets generate portable C and Fortran code, while parallel targets include pthreads C code designed to run on a variety of cache-based multicore processor nodes, as well as CUDA versions specifically for the massively parallel NVIDIA GPGPUs.

Once the intermediate form is created from the front-end description, it is manipulated by the transformation engine across our spectrum of auto-tuned optimizations. The intermediate form and transformations are expressed in Lisp code using the portable and lightweight ECL compiler [5], making it simple to interface with the parsing front-ends (written in Flex and YACC) and preserving portability across a wide variety of architectures. Potential future alternatives include implementation of affine scaling transformations or more complex AST representations, such as the one used by LLVM [17], or more sophisticated transformation engines such as the one provided by the Sketch [22] compiler.

Because optimizations are expressed as transformations on the AST, it is possible to combine them in ways that would otherwise be difficult using simple string substitution. For example, it is straightforward to apply register blocking either before or after cache blocking the loop, allowing for a comprehensive exploration of optimization configurations. In the rest of this section, we discuss serial transformations and code generation; auto-parallelization and parallel-specific transformations and generators are explored in Section 5.2.

5.1 Serial Optimizations

Several common optimizations have been implemented in the framework as AST transformations, including loop unrolling/register blocking (to improve innermost loop efficiency), cache blocking (to expose temporal locality and increase cache reuse), and arithmetic simplification/constant propagation. These optimizations are implemented to take advantage of the specific domain of interest: Jacobi-like stencil kernels of arbitrary dimensionality. Future transformations will include those shown in previous work [4]: better utilization of SIMD instructions and common subexpression elimination (to improve arithmetic efficiency), cache bypass (to eliminate cache fills), and explicit software prefetching. Additionally, future work will support aggressive memory and code structure transformations.
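To illustrate what the unrolling/register-blocking transformation produces, the following C sketch shows the Laplacian sweep unrolled by a factor of two in the unit-stride dimension. It is a hand-written analogue of the generated code, not actual framework output; the flat array layout and the even interior width are illustrative assumptions.

```c
/* Illustrative 2-way register-blocked (unrolled) Laplacian sweep in C.
 * Assumes a flat [z][y][x] layout of nx*ny*nz doubles, and that the
 * interior width (nx-2) is even so the unrolled loop divides evenly. */
void laplacian_rb2(int nx, int ny, int nz, double alpha, double beta,
                   const double *u, double *unext)
{
  const long sy = nx, sz = (long)nx * ny;     /* y and z strides */
  for (int k = 1; k < nz - 1; k++)
    for (int j = 1; j < ny - 1; j++)
      for (int i = 1; i < nx - 1; i += 2) {   /* RX = 2 register block */
        long c0 = k * sz + j * sy + i, c1 = c0 + 1;
        unext[c0] = alpha * u[c0] + beta * (u[c0+1]  + u[c0-1] +
                                            u[c0+sy] + u[c0-sy] +
                                            u[c0+sz] + u[c0-sz]);
        unext[c1] = alpha * u[c1] + beta * (u[c1+1]  + u[c1-1] +
                                            u[c1+sy] + u[c1-sy] +
                                            u[c1+sz] + u[c1-sz]);
      }
}
```

The two independent update chains give the compiler more freedom to schedule loads and floating-point operations, which is exactly the innermost-loop efficiency this transformation targets.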

We also note that, although the current set of optimizations may seem identical to existing compiler optimizations, future strategies such as memory structure transformations will be beyond the scope of compilers, since such optimizations are specific to stencil-based computations. Additionally, the fact that our framework's transformations yield code that outperforms compiler-only optimized versions shows that compiler algorithms cannot always prove that these (safe) optimizations are allowed. Thus, a domain-specific code generator run by the user has the freedom to implement transformations that a compiler may not.

5.2 Parallelization & Code Generation

Given the stencil transformation framework, we now present parallelization optimizations, as well as cache- and GPU-specific optimizations. The shared-memory parallel code generators leverage the serial code generation routines to produce the version run by each individual thread. Because the parallelization mechanisms are specific to each architecture, both the strategy engines and code generators must be tailored to the desired targets. For the cache-based systems (Intel, AMD, Sun) we chose pthreads for lightweight parallelization; on the NVIDIA GPU, the only parallelization option is CUDA thread blocks that execute in a SPMD (single program multiple data) fashion.

Since the parallelization strategy influences code structure, the AST — which represents code run on each individual thread — must be modified to reflect the chosen parallelization strategy. The parallel code generators make the necessary modifications to the AST before passing it to the serial code generator.

5.2.1 Multicore-specific Optimizations and Code Generation

Following the effective blocking strategy presented in previous studies [4], we decompose the problem space into core blocks, as shown in Figure 3. The size of these core blocks can be tuned to avoid capacity misses in the last-level cache. Each core block is further divided into thread blocks such that threads sharing a common cache can cooperate on a core block. Though our code generator is capable of using variable-sized thread blocks, we set the size of the thread blocks equal to the size of the core blocks to help reduce the size of the auto-tuning search space. The threads of a thread block are then assigned chunks of contiguous core blocks in a round-robin fashion until the entire problem space has been accounted for. Finally, each thread's stencil loop is register blocked to best utilize registers and functional units. The core block size, thread block size, chunk size, and register block size are all tunable by the framework.

The code generator creates a new set of loops for each thread to iterate over its assigned set of thread blocks. Register blocking is accomplished through strip mining and loop unrolling via the serial code generator.
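The resulting traversal can be sketched as follows: a minimal C illustration of round-robin chunk assignment over core blocks, with the block sizes and chunk length fixed for clarity (in the framework, all of these are tuned parameters).

```c
/* Minimal sketch of the generated traversal: thread `tid` of `nthreads`
 * claims chunks of `chunk` consecutive core blocks round-robin, then runs
 * the serial (register-blocked) kernel over each claimed block. Core-block
 * dimensions CX/CY/CZ are assumed to divide nx/ny/nz evenly here. */
enum { CX = 128, CY = 8, CZ = 128 };

void sweep_core_blocks(int tid, int nthreads, int chunk,
                       int nx, int ny, int nz,
                       void (*kernel)(int x0, int x1, int y0, int y1,
                                      int z0, int z1))
{
  const int nbx = nx / CX, nby = ny / CY, nbz = nz / CZ;
  const int nblocks = nbx * nby * nbz;
  for (int c = tid * chunk; c < nblocks; c += nthreads * chunk)
    for (int b = c; b < c + chunk && b < nblocks; b++) {
      int bx = b % nbx;               /* decode linear block id -> (bx,by,bz) */
      int by = (b / nbx) % nby;
      int bz = b / (nbx * nby);
      kernel(bx * CX, (bx + 1) * CX,  /* serial kernel runs one core block */
             by * CY, (by + 1) * CY,
             bz * CZ, (bz + 1) * CZ);
    }
}
```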

NUMA-aware memory allocation is implemented by pinning threads to the hardware and taking advantage of the first-touch page mapping policy during data initialization. The code generator analyzes the decomposition and has the appropriate processor touch the memory during initialization.
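The first-touch mechanism itself is standard; a minimal pthreads sketch (Linux-specific affinity call, with contiguous slab partitioning assumed for brevity) looks like this:

```c
/* Minimal sketch of NUMA-aware first-touch initialization: each thread is
 * pinned to a core and then initializes exactly the slab it will later
 * compute on, so the OS maps those pages onto that thread's socket.
 * Linux/glibc-specific; _GNU_SOURCE is required for the affinity API. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

typedef struct { int tid, nthreads; long n; double *u; } ft_arg;

static void *first_touch(void *p)
{
  ft_arg *a = (ft_arg *)p;
  cpu_set_t set;
  CPU_ZERO(&set);
  CPU_SET(a->tid, &set);                      /* pin thread tid to core tid */
  pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

  long lo = a->tid * (a->n / a->nthreads);    /* this thread's slab */
  long hi = (a->tid + 1) * (a->n / a->nthreads);
  for (long i = lo; i < hi; i++)
    a->u[i] = 0.0;                            /* first touch maps the page */
  return NULL;
}
```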

5.2.2 CUDA-specific Optimizations and Code Generation

CUDA programming is oriented around CUDA thread blocks, which differ from the thread blocks used in the previous section. CUDA thread blocks are vector elements mapped to the scalar cores (lanes) of a streaming multiprocessor. The vector conceptualization facilitates debugging of performance issues on GPUs. Moreover, CUDA thread blocks are analogous to threads running SIMD code on superscalar processors. Thus, parallelization on the GTX280 is a straightforward SPMD domain decomposition among CUDA thread blocks; within each CUDA thread block, work is parallelized in a SIMD manner.

To effectively exploit cache-based systems, code optimizations attempt to employ unit-stride memory access patterns and maintain small cache working sets through cache blocking — thereby leveraging spatial and temporal locality. In contrast, the GPGPU model forces programmers to write a program for each CUDA thread. Thus, spatial locality may only be achieved by ensuring that memory accesses of adjacent threads (in a CUDA thread block) reference contiguous segments of memory to exploit hardware coalescing. Consequently, our GPU implementation ensures spatial locality for each stencil point by tasking adjacent threads of a CUDA thread block to perform stencil operations on adjacent grid locations. Some performance will be lost as not all coalesced memory references are aligned to 128-byte boundaries.

Figure 3. Four-level problem decomposition: In (a), a node block (the full NX × NY × NZ grid) is broken into smaller chunks of core blocks (CX × CY × CZ). All core blocks in a chunk are processed by the same subset of threads. One core block from the chunk in (a) is magnified in (b) and decomposed into thread blocks (TX × TY); a thread block should exploit common resources among threads. Finally, the magnified thread block in (c) is decomposed into register blocks (RX × RY × RZ), which exploit data-level parallelism. The +X dimension is unit stride.

The CUDA code generator is capable of exploring the myriad different ways of dividing the problem among CUDA thread blocks, as well as tuning both the number of threads in a CUDA thread block and the access pattern of the threads. For example, in a single time step, a CUDA thread block of 256 CUDA threads may access a tile of 32 × 4 × 2 contiguous data elements; the thread block would then iterate this tile shape over its assigned core block. In many ways, this exploration is analogous to register blocking within each core block on cache-based architectures.
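To make the coalescing constraint concrete, the plain-C sketch below enumerates the thread coordinates of one such 32 × 4 × 2 tile and the linear grid element each thread would touch. On a GPU these coordinates would arrive as threadIdx values; here they are ordinary loops, so this illustrates the index arithmetic only and is not generated CUDA.

```c
/* Illustrative index mapping for a 32x4x2 tile of 256 "CUDA threads".
 * tx is the fastest-varying coordinate and maps to the unit-stride (+X)
 * dimension, so 32 adjacent threads touch 32 consecutive doubles,
 * satisfying the hardware coalescing rules. Grid dims are placeholders. */
#include <stdio.h>

int main(void)
{
  const int nx = 256, ny = 256;        /* grid leading dimensions */
  const int ox = 0, oy = 0, oz = 0;    /* tile origin within the core block */
  for (int tz = 0; tz < 2; tz++)
    for (int ty = 0; ty < 4; ty++)
      for (int tx = 0; tx < 32; tx++) {
        long elem = (long)(oz + tz) * ny * nx
                  + (long)(oy + ty) * nx + (ox + tx);
        if (ty == 0 && tz == 0)        /* print one row's worth of lanes */
          printf("lane %2d -> element %ld\n", tx, elem);
      }
  return 0;
}
```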

Our code generator currently only supports the use of global device memory, and so does not take advantage of the low-latency, local-store-style shared memory present on the GPU. As such, the generated code does not take advantage of temporal locality of the memory accesses. Future work will incorporate support for exploitation of CUDA shared memory.

6 Auto-Tuning Strategy Engine

In this section, we describe how the auto-tuner searches the enormous parameter space of serial and parallel optimizations described in previous sections. Because the combined parameter space of the preceding optimizations is so large, it is clearly infeasible to try all possible strategies. In order to reduce the number of code instantiations the auto-tuner must compile and evaluate, we used strategy engines to enumerate an appropriate subset of the parameter space for each platform.

The strategy engines enumerate only those parameter combinations (strategies) in the subregion of the full search space that best utilize the underlying architecture. For example, cache blocking in the unit-stride dimension could be practical on the Victoria Falls architecture, while on Barcelona or Nehalem, the presence of hardware prefetchers makes such a transformation non-beneficial [3].

Further, the strategy engines keep track of parameter interactions to ensure that only legal strategies are enumerated. For example, since the parallel decomposition changes the size and shape of the data block assigned to each thread, the space of legal serial optimization parameters depends on the values of the parallel parameters. The strategy engines ensure all such constraints (in addition to other hardware restrictions such as the maximum number of threads per processor) are satisfied during enumeration.
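A minimal sketch of this constrained enumeration, using the Barcelona/Nehalem ranges from Table 3 (no blocking in the unit-stride dimension, CY from 8 and CZ from 128 upward); the power-of-two stepping and the divisibility legality check are illustrative assumptions:

```c
/* Minimal sketch of strategy-engine enumeration for a 256^3 problem:
 * emit only (CX, CY, CZ) core-block candidates that are legal, i.e.
 * that evenly tile the grid. Ranges follow Table 3's x86 column; the
 * power-of-two stepping is an illustrative assumption. */
#include <stdio.h>

int main(void)
{
  const int NX = 256, NY = 256, NZ = 256;
  const int cx = NX;                      /* no unit-stride blocking on x86 */
  for (int cy = 8; cy <= NY; cy *= 2)
    for (int cz = 128; cz <= NZ; cz *= 2)
      if (NX % cx == 0 && NY % cy == 0 && NZ % cz == 0)
        printf("candidate core block: %d x %d x %d\n", cx, cy, cz);
  return 0;
}
```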

For each parameter combination enumerated by the strategy engine, the auto-tuner's search engine then directs the parallel and serial code generator components to produce the code instantiation corresponding to that strategy. The auto-tuner runs each instantiation and records the time taken on the target machine. After all enumerated strategies have been timed, the fastest parameter combination is reported to the user, who can then link the optimized version of the stencil into their existing code.

Table 3 in the Appendix shows the attempted optimizations and the associated parameter subspace explored by the strategy engines corresponding to each of our tested platforms. While the search engine currently does a comprehensive search over the parameter subspace dictated by the strategy engine, future work will include more intelligent search mechanisms such as hill-climbing or machine learning techniques [9], where the search engine can use timing feedback to dynamically direct the search.

7 Performance Evaluation

In this section, we examine the performance quality and expectations of our auto-parallelizing and auto-tuning framework across the four evaluated architectural platforms. The impact of our framework on each of the three kernels is compared in Figure 4, showing performance of: the original serial kernel (gray), auto-parallelization (blue), auto-parallelization with NUMA-aware initialization (purple), and auto-tuning (red). The GTX280 reference performance (blue) is based on a straightforward implementation that maximizes CUDA thread parallelism. We do not consider the impact of host transfer overhead; previous work [4] examined this potentially significant bottleneck in detail. Overall, results are ordered such that threads first exploit multithreading within a core, then multiple cores on a socket, and finally multiple sockets. Thus, on Nehalem, the two-thread case represents one fully-packed core; similarly, the GTX280 requires at least 30 CUDA thread blocks to utilize the 30 cores (streaming multiprocessors).

7.1 Auto-Parallelization Performance

The auto-parallelization scheme specifies a straightforward domain decomposition over threads in the least unit-stride dimension, with no core, thread, or register blocking. To examine the quality of the framework's auto-parallelization capabilities, we compare performance with a parallelized version using OpenMP [7], which ensures proper NUMA memory decomposition via the first-touch pinning policy. Results, shown as yellow diamonds in Figure 4, show that performance is well correlated with our framework's NUMA-aware auto-parallelization. Furthermore, our approach slightly improves Barcelona's performance, while Nehalem and Victoria Falls see up to a 17% and 25% speedup (respectively) compared to the OpenMP version, indicating the effectiveness of our auto-parallelization methodology even before auto-tuning.

7.2 Performance Expectations

When tuning any application, it is important to know when you have reached the architectural peak performance and have little to gain from continued optimization. We make use of a simple empirical performance model to establish this point of diminishing returns and use it to evaluate how close our automated approach can come to the machine limits. We now examine achieved performance in the context of this simple model based on the hardware's characteristics. Assuming all kernels are memory bound and do not suffer from an abundance of capacity misses, we approximate the performance bound as the product of streaming bandwidth and each stencil's arithmetic intensity (0.33, 0.20, and 0.11 — as shown in Table 1). Using an optimized version of the OpenMP Stream benchmark, which we modified to reflect the number of read and write streams for each kernel, we obtain expected peak performance based on memory bandwidth for the CPUs. For the GPU, we use two versions of Stream: one that consists of exclusively read traffic, and another that is half read and half write.
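In equation form, the bound is the arithmetic intensity I from Table 1 times the measured Stream bandwidth B. As a purely illustrative example (the 30 GB/s figure is hypothetical, not a measurement from our platforms), the Laplacian bound would be:

```latex
\[
P \;\le\; I \times B
  \;=\; 0.33\ \tfrac{\mathrm{flops}}{\mathrm{byte}}
        \times 30\ \tfrac{\mathrm{GB}}{\mathrm{s}}
  \;\approx\; 9.9\ \mathrm{GFlop/s}.
\]
```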

Figure 4. Laplacian (top row), Divergence (middle row), and Gradient (bottom row) performance (GFlop/s) as a function of auto-parallelization and auto-tuning on the four evaluated platforms: Barcelona, Nehalem, and Victoria Falls versus thread count, and GTX280 versus CUDA thread-block count. Note: the green region marks the range in performance extrapolated from Stream bandwidth. For comparison, the yellow diamond shows performance achieved using the original stencil kernel with OpenMP pragmas and NUMA-aware initialization.

Our model's expected performance range is represented as a green line (for the CPUs) and a green region (for the GPUs) in Figure 4. For Barcelona and Nehalem, our optimized kernels obtain performance essentially equivalent to peak memory bandwidth. For Victoria Falls, the obtained bandwidth is around 20% less than peak for each of the kernels, because our framework does not currently implement software prefetching and array padding, which are critical for performance on this architecture. Finally, the GTX280 results were also below our performance model bound, likely due to the lack of array padding [4]. Overall, our fully tuned performance closely matches our model's expectations, while highlighting areas which could benefit from additional optimizations.

7.3 Performance Portability

The auto-tuning framework takes a serial specification of the stencil kernel and achieves a substantial performance improvement, due to both auto-parallelization and auto-tuning. Overall, Barcelona and Nehalem see between 1.7× and 4× improvement for both the one- and two-socket cases over the conventional parallelized case, and up to 10× improvement over the serial code. The results also show that auto-tuning is essential on Victoria Falls, enabling much better scalability and increasing performance by 2.5× and 1.4× on 64 and 128 threads respectively in comparison to the conventional parallelized case, and a full 22× improvement over an unparallelized example. Finally, auto-tuning on the GTX280 boosted performance by 1.5× to 2× across the full range of kernels — a substantial improvement over the baseline CUDA code, which is implicitly parallel. This clearly demonstrates the performance portability of this framework across the sample kernels.

Overall, we achieve substantial performance improvements across a diversity of architectures — from GPUs to multi-socket multicore x86 systems. The auto-tuner is able to achieve results that are extremely close to the architectural peak performance of the system, which is ultimately limited by memory bandwidth. This level of performance portability using a common specification of kernel requirements is unprecedented for stencil codes, and speaks to the robustness of the generalized framework.

7.4 Programmer Productivity Benefits

We now compare our framework's performance in the context of programming productivity. Our previous work [4] presented the results of Laplacian kernel optimization using a hand-written auto-tuning code generator, which required months of Perl script implementation and was inherently limited to a single kernel instantiation. In contrast, utilizing our framework across a broad range of possible stencils only requires a few minutes to annotate a given kernel region and pass it through our auto-parallelization and auto-tuning infrastructure; thus tremendously improving productivity as well as kernel extensibility.

Currently our framework does not implement several hand-tuned optimizations [4], including SIMDization, padding, or the use of cache bypass (movntpd). However, comparing results over the same set of optimizations, we find that our framework attains excellent performance that is comparable to the hand-written version. We obtain near-identical results on the Barcelona and even higher results on the Victoria Falls platform (6 GFlop/s versus 5.3 GFlop/s). A significant disparity is seen on the GTX280, where previous hand-tuned Laplacian results attained 36 GFlop/s, compared with our framework's 13 GFlop/s. For the CUDA implementations, our automated version only utilizes optimizations and code structures applicable to general stencils, while the hand-tuned version explicitly discovered and exploited the temporal locality specific to the Laplacian kernel — thus maximizing performance, but limiting the method's applicability. Future work will continue incorporating additional optimization schemes into our automated framework.

7.5 Architectural Comparison

Figure 5 shows a comparative summary of the fully tuned performance on each architecture. The GTX280 consistently attains the highest performance, due to its massive parallelism at high clock rates, but transfer times from system DRAM to board memory through the PCI Express bus are not included and could significantly impact performance [4]. The recently-released Intel Nehalem system offers a substantial improvement over the previous-generation Intel Clovertown by eliminating the front-side bus in favor of on-chip memory controllers. The Nehalem obtains the best overall performance of the cache-based systems, due to the combination of high memory bandwidth per socket and hardware multithreading to fully utilize the available bandwidth. Additionally, Victoria Falls obtains high performance, especially given its low clock speed, thanks to massive parallelism combined with an aggregate memory bandwidth of 64 GB/s.

Power efficiency, measured in (average stencil) MFlop/s/Watt, is also shown in Figure 5. For the GTX280 measurements we show the power efficiencies both with (red) and without the host system (pink). The GTX280 shows impressive gains over the cache-based architectures if considered as a standalone device, but if system power is included, the GTX280's advantage is diminished and the Nehalem becomes the most power-efficient architecture evaluated in this study.

Figure 5. Peak performance (GFlop/s) for the Laplacian, Divergence, and Gradient kernels, and average power efficiency (MFlop/s/Watt), after auto-tuning and parallelization on Barcelona, Nehalem, Victoria Falls, and GTX280. GTX280 power efficiency is shown based on system power as well as the card alone.

8 Summary and Conclusions

Performance programmers are faced with the enormous challenge of productively designing applications that effectively leverage the computational resources of leading multicore designs, while allowing for performance portability across the myriad of current and future CMP instantiations. In this work, we introduce a fully generalized framework for stencil auto-tuning that takes the first steps towards making complex chip multiprocessors and GPUs accessible to domain scientists, in a productive and performance-portable fashion — demonstrating gains of up to 22× speedup compared with the default serial version.

Overall we make a number of important contributions that include: (i) introduction of a high-performance, multi-target framework for auto-parallelizing and auto-tuning multidimensional stencil loops; (ii) presentation of a novel tool chain based on an abstract syntax tree (AST) for processing, transforming, and generating stencil loops; (iii) description of an automated parallelization process for targeting multidimensional stencil codes on both cache-based multicore architectures as well as GPGPUs; (iv) achievement of excellent performance on our evaluation suite using three important stencil access patterns; (v) utilization of a simple performance model that effectively predicts the expected performance range for a given kernel and architecture; and (vi) demonstration that automated frameworks such as these can enable greater programmer productivity by reducing the need for individual, hand-coded auto-tuners.

The modular architecture of our framework enables it to be extended through the development of additional parser, strategy engine, and code generator modules. Future work will concentrate on extending the scope of optimizations (see Section 7.4), including cache bypass, padding, and prefetching. Additionally, we plan to extend the CUDA backend for a general local-store implementation, thus leveraging temporal locality to improve performance and allowing extensibility to other local-store architectures such as the Cell processor. We also plan to expand our framework to broaden the range of allowable stencil computation classes (see Section 4.2), including in-place and multigrid methods. Finally, we plan to demonstrate our framework's applicability by investigating its impact on large-scale scientific applications, including a forthcoming optimization study of an icosahedral atmospheric climate simulation [10, 11].


References

[1] K. Asanovic, R. Bodik, B. Catanzaro, et al. The landscape of parallel computing research: A view from Berkeley. Technical Report UCB/EECS-2006-183, EECS, University of California, Berkeley, 2006.

[2] C. Bastoul. Code generation in the polyhedral model is easier than you think. In PACT '04: Parallel Architectures and Compilation Techniques, Washington, DC, 2004.

[3] K. Datta, S. Kamil, S. Williams, L. Oliker, J. Shalf, and K. Yelick. Optimization and performance modeling of stencil computations on modern microprocessors. SIAM Review, 51(1):129–159, 2009.

[4] K. Datta, M. Murphy, V. Volkov, et al. Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures. In Proceedings of SC '08, Austin, Texas, 2008.

[5] Embeddable Common Lisp. http://ecls.sourceforge.net/.

[6] U. Bondhugula et al. A practical automatic polyhedral parallelizer and locality optimizer. SIGPLAN Not., 43(6):101–113, 2008.

[7] The OpenMP API Specification for Parallel Programming. http://openmp.org.

[8] M. Frigo. A fast Fourier transform compiler. In Proceedings of PLDI '99, pages 169–180, 1999.

[9] A. Ganapathi, K. Datta, A. Fox, and D. Patterson. A case for machine learning to optimize multicore performance. In Workshop on Hot Topics in Parallelism, March 2009.

[10] Green Flash. http://www.lbl.gov/CS/html/greenflash.html.

[11] R. Heikes and D. Randall. Numerical integration of the shallow-water equations on a twisted icosahedral grid. Part I: Basic design and results of tests. Mon. Wea. Rev., 123:1862–1880, 1995.

[12] M. D. Hill and A. J. Smith. Evaluating associativity in CPU caches. IEEE Trans. Comput., 38(12):1612–1630, 1989.

[13] S. Kamil, C. Chan, S. Williams, et al. A generalized framework for auto-tuning stencil computations. In Cray User Group, 2009.

[14] S. Kamil, K. Datta, S. Williams, et al. Implicit and explicit optimizations for stencil computations. In Workshop on Memory Systems Performance and Correctness, San Jose, CA, 2006.

[15] LAPACK: Linear Algebra PACKage. http://www.netlib.org/lapack/.

[16] A. Lim, S. Liao, and M. Lam. Blocking and array contraction across arbitrarily nested loops using affine partitioning. In ACM Symposium on Principles and Practice of Parallel Programming, June 2001.

[17] LLVM Homepage. http://llvm.org/.

[18] S. Mitra, S. C. Kothari, J. Cho, and A. Krishnaswamy. ParAgent: A domain-specific semi-automatic parallelization tool, pages 141–148. Springer, 2000.

[19] M. Püschel, J. Moura, J. Johnson, et al. SPIRAL: Code generation for DSP transforms. Proceedings of the IEEE, special issue on "Program Generation, Optimization, and Adaptation", 93(2):232–275, 2005.

[20] G. Rivera and C. Tseng. Tiling optimizations for 3D scientific computations. In Proceedings of SC '00, Dallas, TX, November 2000.

[21] S. Sellappa and S. Chatterjee. Cache-efficient multigrid algorithms. International Journal of High Performance Computing Applications, 18(1):115–133, 2004.

[22] A. Solar-Lezama, G. Arnold, L. Tancau, et al. Sketching stencils. In International Conference on Programming Languages Design and Implementation (PLDI), June 2007.

[23] R. Vuduc, J. Demmel, and K. Yelick. OSKI: A library of automatically tuned sparse matrix kernels. In Proc. of SciDAC 2005, J. of Physics: Conference Series, June 2005.

[24] R. C. Whaley, A. Petitet, and J. Dongarra. Automated empirical optimization of software and the ATLAS project. Parallel Computing, 27(1-2):3–35, 2001.

[25] S. Williams, A. Waterman, and D. Patterson. Roofline: An insightful visual performance model for floating-point programs and multicore architectures. Communications of the ACM, April 2009.

Appendix

Optimization Category  Parameter Name           Barcelona/Nehalem  Victoria Falls  GTX280
Data Allocation        NUMA Aware               X                  X               N/A
Domain                 Core Block Size     CX   NX                 {8...NX}        {16†...NX}
Decomposition                              CY   {8...NY}           {8...NY}        {16†...NY}
                                           CZ   {128...NZ}         {128...NZ}      {16†...NZ}
                       Thread Block Size   TX   CX                 CX              {1...CX/16}
                                           TY   CY                 CY              {CY/16...CY}‡
                                           TZ   CZ                 CZ              {CZ/16...CZ}‡
                       Chunk Size               {1...(NX×NY×NZ)/(CX×CY×CZ×NThreads)}  N/A
Low Level              Array Indexing           X                  X               X
                       Register Block Size RX   {1...8}            {1...8}         1
                                           RY   {1...2}            {1...2}         1∗
                                           RZ   {1...2}            {1...2}         1∗

Table 3. Attempted optimizations and the associated parameter spaces explored by the auto-tuner for a 256³ stencil problem (NX, NY, NZ = 256). All numbers are in terms of doubles. †Actual values for minimum core block dimensions for the GTX280 depend on problem size. ‡Thread block size is constrained by a maximum of 256 threads in a CUDA thread block, with at least 16 threads coalescing memory accesses in the unit-stride dimension. ∗The CUDA code generator is capable of register blocking the Y and Z dimensions, but due to a confirmed bug in the NVIDIA nvcc compiler, register blocking was not explored in our auto-tuned results.