
Decoupling Algorithms from Schedules for Easy Optimization of Image Processing Pipelines

Abstract

Using existing programming tools, writing high-performance image processing code requires sacrificing readability, portability, and modularity. We argue that this is a consequence of conflating what computations define the algorithm with decisions about storage and the order of computation. We refer to these latter two concerns as the schedule, including choices of tiling, fusion, recomputation vs. storage, vectorization, and parallelism.

We propose a representation for feed-forward imaging pipelines that separates the algorithm from its schedule, enabling high performance without sacrificing code clarity. This decoupling simplifies the algorithm specification: images and intermediate buffers become functions over an infinite integer domain, with no explicit storage or boundary conditions. Imaging pipelines are compositions of functions. Programmers separately specify scheduling strategies for the various functions composing the algorithm, which allows them to efficiently explore different optimizations without changing the algorithmic code.

We demonstrate the power of this representation by expressing a range of recent image processing applications in an embedded domain-specific language, and compiling them for ARM, x86, and GPUs. Our compiler targets SIMD units, multiple cores, and complex memory hierarchies. We demonstrate that it can handle algorithms such as a camera raw pipeline, the bilateral grid, fast local Laplacian filtering, and image segmentation. The algorithms expressed in our language are both shorter and faster than state-of-the-art implementations.

Keywords: Image Processing, Compilers, Performance

1 Introduction

Computational photography algorithms require highly efficient implementations to be used in practice, especially on power-constrained mobile devices. This is not a simple matter of programming in a low-level language like C. The performance difference between naive C and highly optimized C is often an order of magnitude. Unfortunately, this usually comes at the cost of programmer pain and code complexity, as computation must be reorganized to achieve memory efficiency and parallelism.

For image processing, the global organization of execution and storage is critical. Image processing pipelines are both wide and deep: they consist of many data-parallel stages that benefit hugely from parallel execution across pixels, but stages are often memory bandwidth limited—they do little work per load and store. Gains in speed therefore come not just from optimizing the inner loops, but also from global program transformations such as tiling and fusion that exploit producer-consumer locality down the pipeline. The best choice of transformations is architecture-specific; implementations optimized for an x86 multicore and a modern GPU often bear little resemblance to each other.

In this paper we enable simpler high-performance code by separating the intrinsic algorithm from the decisions about how to run efficiently on a particular machine (Fig. 2).

(a) Clean C++ : 9.94 ms per megapixel

void blur(const Image &in, Image &blurred) {
  Image tmp(in.width(), in.height());
  for (int y = 0; y < in.height(); y++)
    for (int x = 0; x < in.width(); x++)
      tmp(x, y) = (in(x-1, y) + in(x, y) + in(x+1, y))/3;
  for (int y = 0; y < in.height(); y++)
    for (int x = 0; x < in.width(); x++)
      blurred(x, y) = (tmp(x, y-1) + tmp(x, y) + tmp(x, y+1))/3;
}

(b) Fast C++ (for x86) : 0.90 ms per megapixel

void fast_blur(const Image &in, Image &blurred) {
  __m128i one_third = _mm_set1_epi16(21846);
  #pragma omp parallel for
  for (int yTile = 0; yTile < in.height(); yTile += 32) {
    __m128i a, b, c, sum, avg;
    __m128i tmp[(256/8)*(32+2)];
    for (int xTile = 0; xTile < in.width(); xTile += 256) {
      __m128i *tmpPtr = tmp;
      for (int y = -1; y < 32+1; y++) {
        const uint16_t *inPtr = &(in(xTile, yTile+y));
        for (int x = 0; x < 256; x += 8) {
          a = _mm_loadu_si128((__m128i*)(inPtr-1));
          b = _mm_loadu_si128((__m128i*)(inPtr+1));
          c = _mm_load_si128((__m128i*)(inPtr));
          sum = _mm_add_epi16(_mm_add_epi16(a, b), c);
          avg = _mm_mulhi_epi16(sum, one_third);
          _mm_store_si128(tmpPtr++, avg);
          inPtr += 8;
        }
      }
      tmpPtr = tmp;
      for (int y = 0; y < 32; y++) {
        __m128i *outPtr = (__m128i *)(&(blurred(xTile, yTile+y)));
        for (int x = 0; x < 256; x += 8) {
          a = _mm_load_si128(tmpPtr+(2*256)/8);
          b = _mm_load_si128(tmpPtr+256/8);
          c = _mm_load_si128(tmpPtr++);
          sum = _mm_add_epi16(_mm_add_epi16(a, b), c);
          avg = _mm_mulhi_epi16(sum, one_third);
          _mm_store_si128(outPtr++, avg);
        }
      }
    }
  }
}

(c) Halide : 0.90 ms per megapixel

Func halide_blur(Func in) {
  Func tmp, blurred;
  Var x, y, xi, yi;

  // The algorithm
  tmp(x, y) = (in(x-1, y) + in(x, y) + in(x+1, y))/3;
  blurred(x, y) = (tmp(x, y-1) + tmp(x, y) + tmp(x, y+1))/3;

  // The schedule
  blurred.tile(x, y, xi, yi, 256, 32).vectorize(xi, 8).parallel(y);
  tmp.chunk(x).vectorize(x, 8);

  return blurred;
}

Figure 1: The code at the top computes a 3×3 box filter using the composition of a 1×3 box filter and a 3×1 box filter. Using vectorization, multithreading, tiling, and fusion, we can make this algorithm more than 10× faster on a quad-core x86 CPU (middle). However, in doing so we've lost readability and portability. Our compiler separates the algorithm description from its schedule, achieving the same performance without making the same sacrifices (bottom). For the full details about how this test was carried out, see the supplemental material.


[Figure 2 data:

Camera Raw Pipeline: Optimized NEON ASM (Nokia N900): 463 lines, 772 ms. Halide: 145 lines algorithm + 23 lines schedule; Nokia N900: 741 ms; quad-core x86: 51 ms. 2.75x shorter, 5% faster than tuned assembly.

Local Laplacian Filter: C++, OpenMP+IPP (quad-core x86): 262 lines, 627 ms. Halide: 62 lines algorithm + 7 lines schedule; quad-core x86: 293 ms; CUDA GPU: 48 ms (13x). 3.7x shorter, 2.1x faster.

Bilateral Grid: Tuned C++ (quad-core x86): 122 lines, 472 ms. Halide: 34 lines algorithm + 6 lines schedule; quad-core x86: 80 ms; CUDA GPU: 11 ms (42x). 3x shorter, 5.9x faster.

Snake Image Segmentation: Vectorized MATLAB (quad-core x86): 67 lines, 3800 ms. Halide: 148 lines algorithm + 7 lines schedule; quad-core x86: 55 ms; CUDA GPU: 3 ms (1267x). 70x faster.

Porting to new platforms does not change the algorithm code, only the schedule.]

Figure 2: We compare algorithms in our prototype language, Halide, to state-of-the-art implementations of four image processing applications, ranging from MATLAB code to highly optimized NEON vector assembly [Adams et al. 2010; Aubry et al. 2011; Paris and Durand 2009; Li et al. 2010]. Halide code is compact, modular, portable, and delivers high performance across multiple platforms. All speedups are expressed relative to the reference implementation.

To understand the challenge of efficient image processing, consider a 3×3 box filter implemented as separate horizontal and vertical passes. We might write this in C++ as a sequence of two loop nests (Fig. 1.a). An efficient implementation on a modern CPU requires SIMD vectorization and multithreading. But once we start to exploit parallelism, the algorithm becomes bottlenecked on memory bandwidth. Computing the entire horizontal pass before the vertical pass destroys producer-consumer locality—horizontally blurred intermediate values are computed long before they are consumed by the vertical blur pass—doubling the storage and memory bandwidth required. Exploiting locality requires interleaving the two stages, by tiling and fusing the loops. Tiles must be carefully sized for alignment, and efficient fusion requires subtleties like redundantly computing values on the overlapping boundaries of intermediate tiles. The resulting implementation is over 10× faster on a quad-core CPU, but together, these optimizations have fused two simple, independent steps into a single intertwined, non-portable mess (Fig. 1.b).

We believe the right answer is to separate the intrinsic algorithm—what is computed—from the concerns of efficient mapping to machine execution—decisions about storage and the ordering of computation. We call these choices of how to map an algorithm onto resources in space and time the schedule.

Image processing exhibits a rich space of schedules. Pipelines tend to be deep and heterogeneous (in contrast to signal processing or array-based scientific code). Efficient implementations must trade off between storing intermediate values and recomputing them when needed. However, intentionally introducing recomputation is seldom considered by traditional compilers. In our approach, the programmer specifies an algorithm and its schedule separately. This makes it easy to explore various optimization strategies without obfuscating the code or accidentally modifying the algorithm itself.

Functional languages provide a natural model for separating the what from the when and where. Divorced from explicit storage, images are no longer arrays populated by procedures, but are instead pure functions that define the value at each point in terms of arithmetic, reductions, and the application of other functions. A functional representation also allows us to omit boundary conditions, making images functions over an infinite integer domain.

In this representation, the algorithm only defines the value of each function at each point, and the schedule specifies:

• The order in which points in the domain of a function are evaluated, including the exploitation of parallelism, and mapping onto SIMD execution units.
• The order in which points in the domain of one function are evaluated relative to points in the domain of another function.
• The memory location into which the evaluation of a function is stored, including registers, scratchpad memories, and regions of main memory.
• Whether a value is recomputed, or from where it is loaded, at each point a function is used.

Once the programmer has specified an algorithm and a schedule, our compiler combines them into an efficient implementation. Optimizing execution for a given architecture requires modifying the schedule, but not the algorithm. The representation of the schedule is compact (e.g. Fig. 1.c), so exploring the performance of many options is fast and easy. We can most flexibly schedule operations which are data parallel, with statically analyzable access patterns, but we also support the reductions and bounded irregular access patterns that occur in image processing.

In addition to this model of scheduling (Sec. 3), we present:

• A prototype embedded language called Halide, for functional algorithm and schedule specification (Sec. 4).
• A compiler which translates functional algorithms and optimized schedules into efficient machine code for x86 and ARM, including SSE and NEON SIMD instructions, and CUDA GPUs, including synchronization and placement of data throughout the specialized memory hierarchy (Sec. 5).
• A range of applications implemented in our language, composed of common image processing operations such as convolutions, histograms, image pyramids, and complex stencils. Using different schedules, we compile them into optimized programs for x86 and ARM CPUs, and a CUDA GPU (Sec. 6). For these applications, the Halide code is compact, and performance is state of the art (Fig. 2).


2 Prior Work

Loop transformation   Most compiler optimizations for numerical programs are based on loop analysis and transformation, including auto-vectorization, loop interchange, fusion, and tiling. The polyhedral model is a powerful tool for transforming imperative programs [Feautrier 1991], but traditional loop optimizations do not consider recomputation of values: each point in each loop is computed only once. In image processing, recomputing some values—rather than storing, synchronizing around, and reloading them—can be a large performance win (Sec. 6.2), and is central to the choices we consider during optimization.

Data-parallel languages   Many data-parallel languages have been proposed. Particularly relevant in graphics, CUDA and OpenCL expose an imperative data-parallel programming model which can target both GPUs and multicore CPUs with SIMD units [Buck 2007; OpenCL]. Like C, they allow the specification of very high performance implementations for many algorithms, but because parallel work distribution, synchronization, and memory are all explicitly managed by the programmer, complex algorithms are often not composable in these languages, and the optimizations required are often specific to an architecture, so code must be rewritten for different platforms.

Intel's Array Building Blocks provides an embedded language for data-parallel array processing in C++ [ArBB]. As in our representation, whole pipelines of operations are built up and optimized globally by a compiler. It delivers impressive performance for many algorithms on Intel CPUs. However, the inherently imperative structure—in particular the explicit specification of storage locations—fundamentally affords less flexibility in scheduling a given pipeline. Trading off recomputation vs. storage is challenging in this representation, and is not considered by the compiler.

Image processing languages   Shantzis described a framework and runtime model for image processing systems based on graphs of operations which process tiles of data [Shantzis 1994]. This is the inspiration for many scalable and extensible image processing systems, including our own.

Apple's CoreImage and Adobe's PixelBender include kernel languages for specifying individual point-wise operations on images [CoreImage; PixelBender]. Kernels compile into optimized code for multiple architectures, including GPUs. Neither optimizes across graphs of kernels, which often contain complex communication like stencils, and neither supports reductions or nested parallelism within kernels.

The SPIRAL system [Puschel et al. 2005] uses a domain-specific language, SPL, for specifying linear signal processing operations independent of their schedule. Complementary mapping functions describe how these operations should be turned into efficient code for a particular architecture, similarly to our schedule specifications. It enables high performance across a range of architectures for linear filtering pipelines, by making deep use of mathematical identities on linear filters. Computational photography algorithms often do not fit within a strict linear filtering model. Our work can be seen as an attempt to generalize this approach to a broader class of programs.

Elsewhere in graphics, the real-time graphics pipeline has been a hugely successful abstraction precisely because the schedule is separated from the specification of the shaders. This allows GPUs and drivers to efficiently execute a wide range of programs with little programmer control over parallelism and memory management. This separation of concerns is extremely effective, but it is specific to the design of a single pipeline. That pipeline also exhibits different characteristics than image processing pipelines, where reductions and stencil communication are common, and kernel fusion is essential for efficiency. Embedded DSLs have also been used to specify the shaders themselves, directly inside the host C++ program that configures the pipeline [McCool et al. 2002].

MATLAB is also extremely successful as a language for image processing. Its high-level syntax enables terse expression of many algorithms, and its widely-used library of built-in functionality shows that the ability to compose modular library functions is invaluable for programmer productivity. However, simply bundling fast implementations of individual kernels is not sufficient for fast execution on modern machines, where optimization across stages in a pipeline is essential for efficient use of parallelism and memory.

Pan introduced a functional model for image processing much like our own [Elliott 2001]. In Pan, images are functions from coordinates to values. Modest differences exist (Pan's images are functions over a continuous coordinate domain, while in ours the domain is discrete), but Pan is a close sibling of our intrinsic algorithm representation. However, it has no corollary to our complementary model of scheduling and ultimate compilation. It exists only as a direct embedding within Haskell, and is not compiled for high-performance execution.

3 Representing Algorithms and Schedules

We propose a functional representation for image processing pipelines that separates the intrinsic algorithm from the schedule with which it will be executed. In this section we describe the representation for each of these components, and how they combine to create a fully-specified program.

3.1 The Intrinsic Algorithm

Our algorithm representation is functional. Values that would be mutable arrays in an imperative language are instead functions from coordinates to values. We represent images as pure functions defined over an infinite integer domain, where the value of a function at a point represents the color of the corresponding pixel. Imaging pipelines are specified as chains of functions. Functions may either be simple expressions in their arguments, or reductions. The expressions which define functions are side-effect free, and are much like those in any simple functional language, including:

• Arithmetic and logical operations;
• Loads from external images;
• If-then-else expressions (semantically equivalent to the ?: ternary operator in C);
• References to named values (which may be function arguments, or expressions defined by a functional let construct);
• Calls to other functions, including external C ABI functions.

For example, our separable 3×3 box filter in Figure 1 is expressed as a chain of two functions in x, y. The first horizontally blurs the input; the second vertically blurs the output of the first.

This representation is simpler than most functional languages. We omit higher-order functions, dynamic recursion, and richer data structures such as tuples and lists. Functions simply map from integer coordinates to a scalar result. This representation is sufficient to represent a wide range of image processing algorithms, and these constraints enable extremely flexible analysis and transformation of algorithms during compilation. Constrained versions of more advanced features, such as higher-order functions and tuples, are reintroduced as syntactic sugar, but they do not change the underlying representation (Sec. 4.1).


UniformImage in(UInt(8), 2);
Func histogram, cdf, out;
RVar rx(0, in.width()), ry(0, in.height()), ri(0, 255);
Var x, y, i;

histogram(in(rx, ry))++;
cdf(i) = 0;
cdf(ri) = cdf(ri-1) + histogram(ri);
out(x, y) = cdf(in(x, y));

Figure 3: Histogram equalization uses a reduction to compute a histogram, a scan to integrate it into a cdf, and a point-wise operation to remap the input using the cdf. The iteration domains for the reduction and scan are expressed by the programmer using RVars. Like all functions in our representation, histogram and cdf are defined over an infinite domain. At entries not touched by the reduction step they are zero-valued. For cdf this is specified explicitly. For histogram this is implicit in the ++ operator.

Reduction functions   From the perspective of a caller, reductions, such as histograms, are still scalar-valued functions over an infinite output domain. Their definition, however, is more than a simple expression in the arguments. A reduction is specified by:

• An initial expression, which specifies a value at each point in the output domain in terms of the function arguments.
• A list of reduction variables, bounded by minimum and maximum expressions.
• A reduction expression, which redefines the value of the function at a computed output coordinate as a function of the reduction variables and recursive references to the same function.

The value at a given point in the output domain is defined by the last reduction expression that touched that output coordinate, given a lexicographic traversal of all values of the reduction variables. Any point which was not touched by a reduction expression has the value of the initial expression.

Reduction expressions are usually recursive. For example, histogram in Figure 3 defines a new value in terms of the old value at the same point, while cdf defines a new value in terms of the value to the left. While we semantically define a strict lexicographic traversal order over the reduction variables, many common reductions (such as histogram) are associative, and may be executed in parallel given appropriate atomics. Scans like cdf are inherently more challenging to parallelize. We do not yet address this.

3.2 The Schedule

Our formulation of imaging pipelines as chains of functions intentionally omits choices of when and where these functions should be computed. The programmer separately specifies this using a schedule. A schedule describes not only the order of evaluation of points within the producer and consumer, but also what is stored and what is recomputed. The schedule further describes mapping onto parallel execution resources such as threads, SIMD units, and GPU blocks. It is constrained only by the fundamental dependence between points in different functions (values must be computed before they are used).

Schedules are demand-driven: for each pipeline stage, they specify how the inputs should be evaluated, starting from the output of the full pipeline. Formally, when a callee function such as tmp in Fig. 1(c) is invoked in a caller such as blurred, we need to decide how to schedule it with respect to the caller.

We currently allow four types of caller-callee relationships (Fig. 4). Some of them lead to additional choices, including traversal order and subdivision of the domain, with possibly recursive scheduling decisions for the sub-regions.

Inline: Compute as needed, do not store   In the simplest case, the callee is evaluated directly at the single point requested by the caller, like a function call in a traditional language. Its value at that point is computed from the expression which defines it, and passed directly into the calling expression. Reductions may not be inlined because they are not defined by a single expression; they require evaluation over the entire reduction domain before they can return a value. Inlining performs redundant computation whenever a single point is referred to in multiple places. However, even when it introduces significant amounts of recomputation, inlining can be the most efficient option. This is because image processing code is very often constrained by memory bandwidth, and inlining passes values between functions without touching memory.

Root: Precompute entire required region   At the other extreme, we can compute the value of the callee for the entire subdomain needed by the caller before evaluating any points in the caller. In our blur example, this means evaluating and storing all of the horizontal pass (tmp) before beginning the vertical pass (blurred). We call this call schedule root. Every point is computed exactly once, but storage and locality may be lost: the intermediate buffer required may be large, and points in the callee are unlikely to still be in a cache when they are finally used. This schedule is equivalent to the most common structure seen in naive C or MATLAB image processing code: each stage of the algorithm is evaluated in its entirety, and then stored as a whole image in memory.

While evaluating the callee, there are further choices in the traversal of the required subdomain. A root schedule must specify, for each dimension of the subdomain, whether it is traversed:

• sequentially,
• in parallel,
• unrolled by a constant factor,
• or vectorized by a constant factor.

The schedule also specifies the relative traversal order of the dimensions (e.g. row- vs. column-major).

The schedule does not specify the bounds in each dimension. The bounds of the domain required of each stage are inferred during compilation (Sec. 5.2). Ultimately, these become expressions in the size of the requested output image. Leaving bounds specification to the compiler makes the algorithm and schedule simpler and more flexible. Explicit bounds are only required for indexing expressions not analyzable by the compiler, such as the result of a reduction. In these cases, we require the algorithm to explicitly clamp the problematic index.

The schedule may also split a dimension into inner and outer components, which can then be treated separately. For example, to represent evaluation in tiles, we can split x into outer and inner dimensions xo and xi, and similarly split y into yo and yi, which can then be traversed in the order yo, xo, yi, xi (as illustrated in the lower right of Fig. 4). After a dimension has been split, the inner and outer components must still be scheduled using any of the options discussed above.


[Figure 4 diagram: evaluation-order illustrations for the four call schedules (Inline: compute as needed, do not store; Chunk: compute, use, then discard subregions; Root: precompute entire required region; Reuse: load from an existing buffer), and, below, the implemented domain traversal orders: serial y, serial x; serial x, serial y; serial y, vectorized x; parallel y, vectorized x; and split x into 2xo+xi and y into 2yo+yi, serial yo, xo, yi, xi.]

Figure 4: We model scheduling an imaging pipeline as the set of choices that must be made for each stage about how to evaluate each input. Here we consider blurred's dependence on tmp, from the example in Fig. 1. blurred may inline tmp, computing values on demand and not storing anything for later reuse. This gives excellent temporal locality, but each point of tmp will be computed three times. blurred may also compute and consume tmp in larger chunks. This provides some temporal locality, and performs redundant computation at the chunk boundaries. blurred may simply compute all of tmp before using any of it. We call this root. It computes each point of tmp only once, but temporal locality is poor—each value is unlikely to still be in cache when it is needed. Finally, if some other consumer (in green on the right) had already evaluated all of tmp as root, blurred could simply reuse that data. If blurred evaluates tmp as root or chunked, then there are further choices to make about the order in which to compute the given region of tmp. The choices we implement are shown at the bottom.

Splitting a dimension expands its bounds to be a multiple of the extent of the inner dimension. Vectorizing or unrolling a dimension similarly rounds its extent up to the nearest multiple of the factor used. Such bounds expansion is always legal given our representation of images as functions over infinite domains.

These choices amount to specifying a complete loop nest which traverses the required region of the output domain. The schedule for a reduction must specify a pair of loop nests: one for its initialization (over the output domain), and one for its update (over the reduction domain). In the latter case, the bounds are given by the definition of the reduction, and do not need to be inferred later.

Chunk: Compute, use, then discard subregions   Alternatively, a function can be chunked with respect to a dimension of the caller. Each iteration of the caller over that dimension first precomputes all values of the callee needed for that iteration only. Chunking interleaves the computation of sub-regions of the caller and the callee, trading off producer-consumer locality and reduced storage footprint for potential recomputation when chunks required for different iterations of the caller overlap. Because a chunk is a region, it requires the same choices defining the traversal of its dimensions as a root schedule. Its bounds are also similarly inferred. Chunked call schedules, combined with split iteration dimensions, describe the common pattern of loop tiling and stripmining (as taken advantage of in Fig. 1).

Reuse: Load from an existing buffer   Finally, if a function is computed in chunks or at the root for one caller, another caller may reuse that evaluation. Reusing a chunked evaluation is only legal if it is also in scope for the new caller. Reuse is typically the best option when available.

Imaging applications exhibit a fundamental tension between total fusion down the pipeline (inline), which maximizes producer-consumer locality at the cost of recomputation of shared values, and breadth-first execution (root), which eliminates recomputation at the cost of locality. This is often resolved by splitting a function's domain and chunking the functions upstream at a finer granularity. This achieves reuse for the inner dimensions, and producer-consumer locality for the outer ones. Choosing the granularity trades off between locality, storage footprint, and recomputation. A key purpose of our schedule representation is to span this continuum, so that the best choice may be made in any given context.

3.3 The Fully Specified Program

Lowering an intrinsic algorithm with a specific schedule produces a fully specified imperative program, with a defined order of operations and placement of data. The resulting program is made up of ordered imperative statements, including:

• Stores of expression values to array locations;
• Sequential and parallel for loops, which define a range of variable values over which a contained statement should be executed;
• Producer-consumer edges, which define an array to be allocated (its size given by a potentially dynamic expression), a block of statements which may write to it, and a block of statements which may read from it, after which it may be freed.

This is a general imperative program representation, but we don't need to analyze or transform programs in this form. Most of the challenging optimization has already been performed in the lowering from intrinsic algorithm to imperative program. And because the compiler generates all imperative allocation and execution constructs, it has a deep knowledge of their semantics and constraints, which can be very challenging to infer from arbitrary imperative input.


Our lowered imperative program may still contain symbolic bounds which need to be resolved. A final bounds inference pass infers concrete bounds based on dependence between the bounds of different loop variables in the program (Sec. 5.2).

4 The Language

We construct imaging pipelines in this representation using a prototype language embedded in C++, which we call Halide. A chain of Halide functions can be JIT compiled and used immediately, or it can be compiled to an object file and header to be used by some other program (which need not link against Halide).

Expressions.   The basic expressions are constants, domain variables, and calls to Halide functions. From these, we use C++ operator overloading to build arithmetic operations, comparisons, and logical operations. Conditional expressions, type-casting, transcendentals, external functions, etc. are described using calls to provided intrinsics. For example, the expression select(x > 0, sqrt(cast<float>(x)), f(x+1)) returns either the square root of x, or the application of some Halide function f to x+1, depending on the sign of x. Finally, debug expressions evaluate to their first argument, and print the remainder of their arguments at evaluation time. They are useful for inspecting values in flight.

Functions   are defined in a functional programming style. The following code constructs a Halide function over a two-dimensional domain that evaluates to the product of its arguments:

Func f;
Var x, y;
f(x, y) = x * y;

Reductions   are produced by defining a function twice: once for its initial value, and once for its reduction step. The reduction step should be in terms of reduction variables (of type RVar), which include expressions describing their bounds. The span of all reduction variables referenced defines the reduction domain. The left-hand side of the update definition may be a computed location rather than simple variables (Fig. 3).

In many cases we can infer bounds of reduction variables based on their use. We can also infer reasonable initial values in common cases: if a reduction is a sum, the initial value defaults to zero; if it is a product, it defaults to one. The following code takes advantage of both of these features to compute a histogram over the image im:

Func histogram;
RVar x, y;
histogram(im(x, y))++;

Uniforms   describe the run-time parameters of an imaging pipeline. They may be scalars or entire images (in particular the input image). When using Halide as a JIT compiler, uniforms can be bound by assigning to them. Statically-compiled Halide functions will expose all referenced uniforms as top-level function arguments. The following C++ code builds a Halide function that brightens its input using a uniform parameter:

// A floating point parameter
Uniform<float> scale;
// A two-dimensional floating-point image
UniformImage input(Float(32), 2);
Var x, y;

Func bright;
bright(x, y) = input(x, y) * scale;

We can JIT compile and use our function immediately by calling realize, or we can statically compile it using compileToFile. For example, we can apply the above brighten function immediately:

Image<float> im = load("input.png");
input = im;
scale = 2.0f;
Image<float> output = bright.realize(im.width(), im.height());

Alternatively, we can statically compile with bright.compileToFile("bright"). This produces bright.o and bright.h, which together define a C-callable function with the following type signature:

void bright(float scale, buffer_t *in, buffer_t *out);

where buffer_t is a bare-bones image struct defined in the same header.

4.1 Syntactic Sugar

While the constructs above are sufficient to express any Halide algorithm, functional languages typically provide other features that are useful in this context. We provide restricted forms of several of these via syntactic sugar.

Higher-order functions.   While Halide functions may only have integer arguments, the code that builds a pipeline may include C++ functions that take and return Halide functions. These are effectively compile-time higher-order functions, and they let us write generic operations on images. For example, consider the following operator which shrinks an image by subsampling:

// Return a new Halide function that subsamples f
Func subsample(Func f) {
  Func g; Var x, y;
  g(x, y) = f(2*x, 2*y);
  return g;
}

C++ functions that deal in Halide expressions are also a convenient way to write generic code. As the host language, C++ can be used as a metaprogramming layer to more conveniently construct Halide pipelines containing repetitive substructures.

Partial application.   When performing trivial point-wise operations on entire images, it is often clearer to omit pixel indices. For example, if we wish to define f as equal to a plus a subsampling of b, then f = a + subsample(b) is clearer than f(x, y) = a(x, y) + subsample(b)(x, y). We therefore support such partial application of Halide functions. Any operator which combines partially applied functions is automatically lifted to a point-wise operation over the omitted arguments.

Tuples.   We overload the C++ comma operator to allow for tuples of expressions. A tuple generates an anonymous function that maps from an index to that element of the tuple. The tuple is then treated as a partial application of this function. For example, given expressions r, g, and b, the definition f(x, y) = (r, g, b) creates a three-dimensional function (in this case representing a color image) whose last argument selects between r, g, and b. It is equivalent to f(x, y, c) = select(c==0, r, select(c==1, g, b)).


Inline reductions.   We provide syntax for inlining the most commonly-occurring reduction patterns: sum, product, maximum, and minimum. These simplified reduction operators use all reduction variables referenced within as the reduction domain. For example, a blurred version of some image f can be defined as follows:

Func blurry; Var x, y;
RVar i(-2, 5), j(-2, 5);
blurry(x, y) = sum(f(x+i, y+j));

4.2 Specifying a Schedule

Once the description of an algorithm is complete, the programmer specifies a desired partial schedule for each function. The compiler fills in any remaining choices using simple heuristics, and tabulates the scheduling decisions for each call site. The function representing the output is scheduled as root. Other functions are scheduled as inline by default. This behavior can be modified by calling one of the two following methods:

• im.root() schedules the first use of im as root, and schedules all other uses to reuse that instance.
• im.chunk(x) schedules im as chunked over x, which must be some dimension of the caller of im. A similar reuse heuristic applies; for each unique x, only one use is scheduled as chunk, and the others reuse that instance.

If im is scheduled as root or chunk, we must also specify the traversal order of the domain. By default it is traversed serially in scanline order. This can be modified using the following methods:

• im.transpose(x, y) moves iteration over x outside of y in the traversal order (i.e., this switches from row-major to column-major traversal).
• im.parallel(y) indicates that each row of im should be computed in parallel across y.
• im.vectorize(x, k) indicates that x should be split into vectors of size k, and each vector should be executed using SIMD.
• im.unroll(x, k) indicates that the evaluation of im should be unrolled across the dimension x by a factor of k.
• im.split(x, xo, xi, k) subdivides the dimension x into outer and inner dimensions xo and xi, where xi ranges from zero to k. xo and xi can then be marked as parallel, serial, vectorized, or even recursively split.
• im.tile(x, y, xi, yi, k, l) is a convenience method that splits x by a factor of k, and y by a factor of l, then transposes the inner dimension of y with the outer dimension of x to effect traversal over tiles.
• im.gpu(bx, by, tx, ty) maps execution to the CUDA model, by marking bx and by as corresponding to block indices, and tx and ty as corresponding to thread indices within each block.
• im.gpuTile(x, y, k, l) is a similar convenience method to tile. It splits x and y by k and l respectively, and then maps the resulting four dimensions to CUDA's notion of blocks and threads.

Schedules that would require substantial transformation of code written in C can be specified tersely, and in a way that does not touch the statement of the algorithm. Furthermore, each scheduling method returns a reference to the function, so calls can be chained: e.g. im.root().vectorize(x, 4).transpose(x, y).parallel(x) directs the compiler to evaluate im in columns of width 4, operating on every column in parallel, with each thread walking down its column serially.

5 Compiler Implementation

The Halide compiler lowers imaging pipelines into machine code for ARM, x86, and PTX. It is built on top of the LLVM compiler infrastructure [LLVM], which it uses for conventional scalar optimizations, register allocation, and machine code generation. While LLVM provides some degree of platform neutrality, the final stages of lowering must be architecture-specific to produce high-performance machine code. Compilation proceeds as shown in Fig. 5.

[Figure 5 diagram: Halide functions and a partial schedule flow through desugaring, schedule generation, lowering to the imperative representation, and bounds inference, producing architecture-specific LLVM bitcode that is emitted either as a JIT-compiled function pointer or as a statically-compiled object file and header.]

Figure 5: The programmer writes a pipeline of Halide functions and partially specifies their schedules. The compiler then removes syntactic sugar (such as tuples), generates a complete schedule, and uses it to lower the pipeline into an imperative representation. Bounds inference is then performed to inject expressions that compute the bounds of each loop and the size of each intermediate buffer. The representation is then further lowered to LLVM IR, and handed off to LLVM to compile to machine code.

5.1 Lowering

After the programmer has created an imaging pipeline and specified its schedule, the first role of the compiler is to transform the functional representation of the algorithm into an imperative one using the schedule. The schedule is tracked as a table mapping from each call site to its call schedule. For root and chunked schedules, it also contains an ordered list of dimensions to traverse, and how they should be traversed (serial, parallel, vectorized, unrolled) or split.

The compiler works iteratively from the end of the pipeline upwards, considering each function after all of its uses. This requires that the pipeline be acyclic. It first initializes a seed by generating the imperative code that realizes the output function over its domain. It then proceeds up the pipeline, either inlining function bodies, or injecting loop nests that allocate storage and evaluate each function into that storage.

The structure of each loop nest, and the location it is injected, are precisely specified by the schedule: a function scheduled as root has realization code injected at the top of the code generated so far; functions scheduled as chunked over some variable have realization code injected at the top of the body of the corresponding loop; inline functions have their uses directly replaced with their function bodies; and functions that reuse other realizations are skipped over for now. Reductions are lowered into a sequential pair of loop nests: one for the initialization, and one for the reduction step.

The ultimate goal of lowering is to replace calls to functions with loads from their realizations. We defer this until after bounds inference.


5.2 Bounds Inference

The compiler then determines the bounds of the domain over which each use of each function must be evaluated. These bounds are typically not statically known at compile time; they will almost certainly depend on the sizes of the input and output images. The compiler is responsible for injecting the appropriate code to compute these bounds. Working through the list of functions, the compiler considers all uses of each function, and derives expressions that give the minimum and maximum possible argument values. This is done using symbolic interval arithmetic. For example, consider the following pseudocode that uses f:

for (i from a to b) g[i] = f(i+1) + f(i*2)

Working from the inside out, it is easy to deduce that f must be evaluated over the range [min(a + 1, a * 2), max(b + 1, b * 2)], and so expressions that compute these are injected just before the realization of f. Reductions must also consider the bounds of the expressions that determine the location of updates.

This analysis can fail in one of two ways. First, interval arithmetic can be over-conservative. If x ∈ [0, a], then interval arithmetic computes the bounds of x(a − x) as [0, a²], instead of the actual bounds [0, a²/4]. We have yet to encounter a case like this in practice; in image processing, dependence between functions is typically either affine or data-dependent.

Second, the compiler may not be able to determine any bound for some values, e.g. a value returned by an external function. These cases often correspond to code that would be unsafe if implemented in equivalent C. Unbounded expressions used as indices cause the compiler to throw an error.

In either case, the programmer can assist the compiler using min and max expressions to simultaneously declare and enforce the bounds of any troubling expression.

Now that expressions giving the bounds of each function have been computed, we replace references to functions with loads from or stores to their realizations, and perform a conventional constant-folding and simplification pass. The imperative representation is then translated directly to LLVM IR with a few architecture-specific modifications.

5.3 CPU Code Generation

Generating machine code from our imperative representation is largely left to LLVM, with two caveats.

First, LLVM IR has no concept of a parallel for loop. For the CPU targets we implement these by lifting the body of the for loop into a separate function that takes as arguments a loop index and a closure containing the referenced external state. At the original site of the loop we insert code that generates a work queue containing a single task representing all instances of the loop body. A thread pool then nibbles at this task until it is complete. If a worker thread encounters a nested parallel for loop, this is pushed onto the same task queue, with the thread that encountered it responsible for managing the corresponding task.

Second, while LLVM has native vector types, it does not reliably generate good vector code in many cases on both ARM (targeting the NEON SIMD unit) and x86 (using SSE). In these cases we peephole optimize patterns in our representation, replacing them with calls to architecture-specific intrinsics. For example, while it is possible to perform efficient strided vector loads on ARM for small strides, naive use of LLVM compiles them as general gathers. We can leverage more information than is available to LLVM to generate better code.

5.4 GPU Code Generation

When targeting the GPU, the compiler still generates functions with the same calling interface: a host function which takes scalar and buffer arguments. We compile the Halide algorithm into a heterogeneous program which manages both host and device execution.

The schedule describes how portions of the algorithm should be mapped to GPU execution. It tags dimensions as corresponding to the grid dimensions of the GPU's data-parallel execution model (threads and blocks, across up to 3 dimensions). Each of the resulting loop nests is mapped to a GPU kernel, launched over a grid large enough to contain the number of threads and blocks active at the widest point in that loop nest. Operations scheduled outside the kernel loop nests execute on the host CPU, using the same scheduling primitives and generating the same highly optimized x86/SSE code as when targeting the host CPU alone.

Fusion is achieved by scheduling functions inline, or by chunking at the GPU block dimension. We can describe a wide space of kernel fusion choices for complex pipelines simply by changing the schedule.
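As a sketch of what such a schedule might look like, using the tile and chunk primitives from Fig. 1 (the gpuGrid call is hypothetical shorthand for the dimension tagging described above, and the Func names are placeholders):

    Func blurx, blury; Var x, y, tx, ty;
    // Split the output into 16x16 tiles; after tile(), x and y index tiles and
    // tx, ty index pixels within a tile. Tag them as CUDA blocks and threads.
    blury.tile(x, y, tx, ty, 16, 16).gpuGrid(x, y, tx, ty);  // gpuGrid: hypothetical
    // Fusion choice 1: chunk the producer at the block dimension, so each block
    // computes (and stages in shared memory) just the slice its tile needs.
    blurx.chunk(x);
    // Fusion choice 2: leave blurx unscheduled so it is inlined into blury,
    // trading recomputation for no intermediate storage at all.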

The host side of the generated code is responsible for managing most data allocation and movement, GPU kernel launch, and synchronization. Allocations scheduled outside GPU thread blocks are allocated in host memory, managed by the host runtime, and copied to GPU global memory when and if they are needed by a kernel. Allocations within thread blocks are allocated in GPU shared memory, and allocations within threads in GPU thread-local memory.
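The copy-when-needed policy amounts to tracking, per buffer, where the freshest copy lives. The sketch below is schematic only; the struct and helper functions are invented for illustration and are not our runtime's API:

    #include <cstddef>

    // Hypothetical thin wrappers over the device driver (e.g. memory alloc/copy).
    void *device_malloc(size_t bytes);
    void  copy_host_to_device(void *dev, const void *host, size_t bytes);
    void  copy_device_to_host(void *host, const void *dev, size_t bytes);

    // Hypothetical per-buffer bookkeeping kept by the host runtime.
    struct DeviceBuffer {
        void  *host, *dev;   // dev stays null until a kernel first needs it
        size_t bytes;
        bool   host_dirty;   // host copy modified since last upload
        bool   dev_dirty;    // device copy modified since last download
    };

    // Called before launching a kernel that reads buf.
    void copy_to_device_if_needed(DeviceBuffer &buf) {
        if (!buf.dev) buf.dev = device_malloc(buf.bytes);
        if (buf.host_dirty) {
            copy_host_to_device(buf.dev, buf.host, buf.bytes);
            buf.host_dirty = false;
        }
    }

    // Called before host code reads buf after device kernels have written it.
    void copy_to_host_if_needed(DeviceBuffer &buf) {
        if (buf.dev_dirty) {
            copy_device_to_host(buf.host, buf.dev, buf.bytes);
            buf.dev_dirty = false;
        }
    }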

Finally, we allow associative reductions to be executed in parallel on the GPU using its native atomic operations.

6 Applications and Evaluation

We present four image processing applications that test different aspects of our approach. For each we compare both our performance and our implementation complexity to existing optimized solutions. The results are summarized in Fig. 2. The Halide source for each application can be found in the supplemental materials.

All performance results are reported as the best of five runs on a 3GHz quad-core x86 desktop, a Nokia N900 mobile phone with a 600MHz ARM OMAP3 CPU, a dual core ARM OMAP4 development board (equivalent to an iPad 2), and an NVIDIA Tesla C2070 GPU (equivalent to a mid-range consumer GPU). In all cases, the algorithm code does not change between targets.

6.1 Camera Pipeline

We implement a simple camera pipeline that converts raw data from the image sensor into color images (Fig. 6). This pipeline performs four tasks: hot-pixel suppression, demosaicking, color correction, and a tone curve that applies gamma correction and contrast enhancement. This reproduces the software pipeline from the Frankencamera [Adams et al. 2010], which was written in a heavily optimized mixture of vector intrinsics and raw ARM assembly targeted at the OMAP3 processor in the Nokia N900. Our code is shorter and simpler, while also slightly faster and portable to other platforms.

The tightly bounded stencil communication down the pipeline makes fusion of stages to save bandwidth and storage a critical optimization for this application. In the Frankencamera implementation, the entire pipeline is computed on small tiles to take advantage of producer-consumer locality and minimize memory footprint. Within each tile, the evaluation of each stage is vectorized. These strategies render the algorithm illegible (see the supplemental material).


[Figure 6 diagram: pipeline stages Denoise, Demosaic, Color correct, Tone curve.]

Figure 6: The basic camera post-processing pipeline is a feed-forward pipeline in which each stage either considers only nearby neighbors (denoise and demosaic) or is point-wise (color correct and tone curve). The best schedule computes the entire pipeline in small tiles in order to take advantage of producer-consumer locality. This introduces redundant computation in the overlapping tile boundaries, but the reduction in memory bandwidth more than makes up for it.

Portability is sacrificed completely; an entirely separate, slower C version of the pipeline has to be included in the Frankencamera source in order to be able to run the pipeline on a desktop processor.

We can express the same optimizations used in the Frankencamera assembly, separately from the algorithm: the output is tiled, and each stage is computed in chunks within those tiles, and then vectorized. This requires one line of scheduling choices per pipeline stage. With these transformations, our implementation takes 741 ms to process a 5 megapixel raw image on a Nokia N900 running the Frankencamera code, while the Frankencamera implementation takes 772 ms. We specify the algorithm in 145 lines of code, and the schedule in 23 (see supplemental material). The Frankencamera code uses 463 lines to specify both. Our implementation is also portable, whereas the Frankencamera assembly is entirely platform specific. The same Halide code compiles to multithreaded x86 SSE code, which takes 51 ms on our quad-core desktop.
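To make "one line of scheduling per stage" concrete, a schedule in this spirit might read as follows; the Func names are placeholders for our pipeline stages, and the actual schedule (including the ARM-specific parameters) is in the supplemental material:

    Func denoised, demosaiced, corrected, curved;  // pipeline stages (placeholders)
    Var x, y, xi, yi;
    // Tile the output and vectorize within each tile.
    curved.tile(x, y, xi, yi, 32, 32).vectorize(xi, 8);
    // Compute each producer in chunks inside its consumer's tiles, vectorized,
    // so intermediates stay small and hot in cache.
    corrected.chunk(x).vectorize(x, 8);
    demosaiced.chunk(x).vectorize(x, 8);
    denoised.chunk(x).vectorize(x, 8);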

6.2 Local Laplacian Filters

One of the most important tasks in producing compelling photographic images is adjusting local contrast. Paris et al. [2011] introduced local Laplacian filters for this purpose. The technique was then modified and accelerated by Aubry et al. [2011] (Fig. 7). This algorithm exhibits a high degree of data parallelism, which the original authors took advantage of to produce an optimized implementation using a combination of Intel Performance Primitives [IPP] and OpenMP [OpenMP].

We implemented this algorithm in Halide, and explored several strategies for scheduling it efficiently on several different machines. The statement of the algorithm did not change during the exploration of the space of plausible schedules. We found that on several x86 platforms, the best performance came from a complex schedule involving inlining certain stages, and vectorizing and parallelizing the rest. Using this schedule on our quad-core desktop, processing a 4 megapixel image takes 293 ms. On the same processor the hand-optimized version used by Aubry et al. takes 627 ms. The reference implementation requires 262 lines of C++, while in Halide the same algorithm is 62 lines. The schedule is specified using seven lines of code.
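A sketch in the spirit of that schedule (the Func names are placeholders, root() stands for the "computed as root" choice discussed below, and the real seven lines are in the supplemental material):

    Func gPyramid, lPyramid, outGPyramid, output; Var x, y;
    // Small point-wise stages keep the default schedule (inlined into consumers).
    // The pyramid stages that dominate the runtime are computed as root,
    // parallelized across scanlines, and vectorized across x.
    gPyramid.root();    gPyramid.parallel(y).vectorize(x, 8);
    lPyramid.root();    lPyramid.parallel(y).vectorize(x, 8);
    outGPyramid.root(); outGPyramid.parallel(y).vectorize(x, 8);
    output.parallel(y).vectorize(x, 8);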

A schedule equivalent to naive C, with all major stages scheduled root, performs much less redundant computation than the fastest schedule, but takes 1.5 seconds because it sacrifices producer-consumer locality and is limited by memory bandwidth. The best schedule on a dual core ARM OMAP4 processor is slightly different.

Figure 7: The local Laplacian filter enhances local contrast using Gaussian and Laplacian image pyramids. The pipeline mixes images at different resolutions with a complex network of dependencies. While we show three pyramid levels here, for our four megapixel test image we used eight.

While the same stages should be inlined, vectorization is not worth the extra instructions, as the algorithm is bandwidth-bound rather than compute-bound. On the ARM processor the algorithm takes 5.5 seconds with vectorization and 4.2 seconds without. Naive evaluation takes 9.7 seconds. The best schedule for the ARM takes 427 ms on the x86—50% slower than the best x86 schedule. (A range of schedule choices from our exploration, along with their performance on several architectures, are shown at the end of this application’s source code in our supplemental material.)

This algorithm maps well to the GPU, where processing the same four-megapixel image takes only 49 milliseconds. The best schedule evaluates most stages as root, but fully fuses (inlines) all of the Laplacian pyramid levels wherever they are used, trading increased computation for reduced bandwidth and storage. Each stage is split into 32x32 tiles that each map to a single CUDA block. The same algorithm statement then compiles to 83 total invocations of 25 distinct CUDA kernels, combined with host CPU code that precomputes lookup tables, manages device memory and data movement, and synchronizes the long chain of kernel invocations. Writing such code by hand is a daunting prospect, and would not allow for the rapid performance-space exploration that Halide provides.

6.3 The Bilateral Grid

The bilateral filter [Paris et al. 2009] is used to decompose images into local and global details. It is efficiently computed with the bilateral grid algorithm [Chen et al. 2007; Paris and Durand 2009]. This pipeline combines three different types of operation (Fig. 8). First, the grid is constructed with a reduction, in which a weighted histogram is computed over each tile of the input. These weighted histograms become columns of the grid, which is then blurred with a small-footprint filter. Finally, the grid is sampled using trilinear interpolation at irregular data-dependent locations to produce the output image.

We implemented this algorithm in Halide and found that the best schedule for the CPU simply parallelizes each stage across an appropriate axis. The only stage regular enough to benefit from vectorization is the small-footprint blur, but for commonly used filter sizes the time taken by the blur is insignificant. Using this schedule on our quad-core x86 CPU, we compute a bilateral filter of a four megapixel input using typical filter parameters (spatial standard deviation of 8 pixels, range standard deviation of 0.1) in 80 ms. In comparison, the moderately-optimized C++ version provided by Paris and Durand [2009] takes 472 ms using a single thread on the same machine. Our single-threaded runtime is 254 ms; some of our speedup is due to parallelizing, and some is due to generating superior scalar code.


[Figure 8 diagram: grid construction (reduction), blurring, slicing.]

Figure 8: The bilateral filter smoothes detail without losing strong edges. It is useful for a variety of photographic applications including tone-mapping and local contrast enhancement. The bilateral grid computes a fast bilateral filter by scattering the input image onto a coarse three-dimensional grid using a reduction. This grid is blurred, and then sampled to produce the smoothed output.

We use 34 lines of code to describe the algorithm, and 6 to describe its schedule, compared to 122 lines in the C++ reference.
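The corresponding schedule sketch (placeholder names for the stages in Fig. 8; the actual six lines are in the supplemental material):

    Func grid, blurz, blurx, blury, sliced; Var x, y, z;
    // Each stage is computed as root and parallelized over a convenient axis.
    grid.root();   grid.parallel(z);
    blurz.root();  blurz.parallel(z);
    blurx.root();  blurx.parallel(z);
    // The small-footprint blur is the only stage regular enough to vectorize.
    blury.root();  blury.parallel(z).vectorize(x, 4);
    sliced.parallel(y);   // the output: slice scanlines in parallel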

We first tried running the same algorithm on the GPU using a schedule which performs the reduction over each tile of the input image on a single CUDA block, with each thread within a block responsible for one input pixel. Halide detected the parallel reduction, and automatically inserted atomic floating point adds to memory. The speed was unimpressive—only 2× faster than our optimized CPU code, due to the high degree of contention. We modified the schedule to use one thread per tile of the input, with each thread walking serially over the reduction domain. This one-line change in schedule gives us a runtime of 11 ms for the same four megapixel image. Halide still conservatively injects an atomic add operation, but with no contention this is in fact 10% faster than non-atomic floating point read-modify-writes to the same memory location, as it offloads the addition to a separate unit. The same 34-line Halide algorithm now runs over 40× faster than the more verbose reference C++ implementation.

6.4 Image Segmentation using Level Sets

Active contour selection (a.k.a. snakes [Kass et al. 1988]) is a method for segmenting objects from a background (Fig. 9). It is well suited for medical applications. We implemented the algorithm proposed by Li et al. [2010]. The algorithm is iterative, and can be interpreted as a gradient-descent optimization of a 2D function. Each update of this function is composed of three terms (Fig. 9), each a combination of differential quantities computed with small 3 × 1 and 1 × 3 stencils and point-wise nonlinear operations, such as normalizing the gradients.

We factored this algorithm into three feed-forward pipelines. Two simple pipelines create images that are invariant to the optimization loop, and one primary pipeline performs a single iteration of the optimization loop. While Halide can represent bounded iteration over the outer loop using a reduction, it is more naturally expressed in the imperative host language. We construct and chain together these pipelines at runtime, using Halide as a just-in-time compiler, in order to perform a fair evaluation against the reference implementation from Li et al., which is written in MATLAB. MATLAB is notoriously slow when misused, but this code expresses all operations in the data-parallel, array-wise notation that MATLAB executes most efficiently.
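Structurally, the host-side driver looks like the sketch below; every name in it is a hypothetical stand-in for how a pipeline is built and evaluated through the JIT, intended only to show how the loop-invariant pipelines and the per-iteration pipeline are chained:

    // Hypothetical helpers: build each pipeline as a Func and JIT-evaluate it.
    Image run_once(Func pipeline, const Image &in);
    Image run_step(Func step, const Image &phi, const Image &g);

    void segment(Func init, Func edges, Func step,
                 const Image &input, int iterations, Image &result) {
        Image g   = run_once(edges, input);  // loop-invariant, computed once
        Image phi = run_once(init,  input);  // initial level-set function
        for (int i = 0; i < iterations; i++)
            phi = run_step(step, phi, g);    // one fully fused update per iteration
        result = phi;
    }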

On a 1600 × 1200 test image, our Halide implementation takes 55 ms per iteration of the optimization loop on our quad-core x86, whereas the MATLAB implementation takes 3.8 seconds.

Figure 9: Active contours segment objects from the background. Level-set approaches are particularly useful to cope with smooth objects and when the number of elements is unknown. The algorithm iterates a series of differential operators and nonlinear functions to progressively refine the selection. The final result is a set of curves that tightly delineate the objects of interest (in red on the right image).

Our schedule is expressed in a single line: we parallelize and vectorize the output of each iteration, while leaving every other function to be inlined by default. The bulk of the speedup comes not from vectorizing or parallelizing; without them, our implementation still takes just 202 ms per iteration. The biggest difference is that we have completely fused the operations that make up one iteration. MATLAB expresses algorithms as sequences of many simple array-wise operations, and is heavily limited by memory bandwidth. It is equivalent to scheduling every operation as root, which is a poor choice for algorithms like this one.

The fully fused form of this algorithm is also ideal for the GPU, where it takes 3 ms per iteration.

6.5 Discussion and Future Work

The performance gains we’ve found in this section demonstrate the feasibility and power of separating algorithms from their schedules. Changing the schedule enables a single algorithm definition to achieve high performance on a diversity of targets. On a single machine, it enables rapid performance-space exploration. Furthermore, the algorithm specification becomes considerably more concise once scheduling concerns are separated.

While the set of scheduling choices we enumerate proved sufficient for these applications, there are other interesting options that our representation could incorporate, such as sliding window schedules in which multiple evaluations are interleaved to reduce storage, or dynamic schedules in which functions are computed lazily and then cached for reuse. We are also exploring autotuning and heuristic optimization enabled by our ability to enumerate the space of legal schedules. We further believe we can continue to clarify the algorithm specification with more aggressive inference.

Some image processing algorithms include constructs beyond the capabilities of our current representation, such as non-image data structures like lists and graphs, and optimization algorithms that use iteration-until-convergence. We believe that these and other patterns can also be decoupled from their schedules, but this remains future work.

7 Conclusion

Image processing pipelines are simultaneously deep and wide; they contain many simple stages that operate on large amounts of data. As a result, the gap between a naive schedule and one that achieves highly parallel execution with efficient use of the memory hierarchy is large: often an order of magnitude. And speed matters for image processing.


People expect image processing that is interactive, that runs on their cell phone or their camera. An order of magnitude in speed is often the difference between an algorithm being used in practice and not being used at all.

With existing tools, closing this gap requires ninja programming skills; imaging pipelines must be painstakingly and globally transformed to simultaneously maximize parallelism and memory efficiency. The resulting code is often impossible to modify, reuse, or port efficiently to other processors. In this paper we have demonstrated that it is possible to earn this order of magnitude with less programmer pain, by separately specifying the algorithm and its schedule—the decisions about ordering of computation and storage that are critical for performance but irrelevant to correctness. Decoupling the algorithm from its schedule has allowed us to compile simple expressions of complex image processing pipelines into implementations with state-of-the-art performance across a diversity of devices.

References

ADAMS, A., TALVALA, E.-V., PARK, S. H., JACOBS, D. E., AJDIN, B., GELFAND, N., DOLSON, J., VAQUERO, D., BAEK, J., TICO, M., LENSCH, H. P. A., MATUSIK, W., PULLI, K., HOROWITZ, M., AND LEVOY, M. 2010. The Frankencamera: An experimental platform for computational photography. ACM Transactions on Graphics 29, 4 (July), 29:1–29:12.

ARBB. Intel array building blocks. http://software.intel.com/en-us/articles/intel-array-building-blocks/.

AUBRY, M., PARIS, S., HASINOFF, S. W., KAUTZ, J., AND DURAND, F. 2011. Fast and robust pyramid-based image processing. Tech. Rep. MIT-CSAIL-TR-2011-049, Massachusetts Institute of Technology.

BUCK, I. 2007. GPU computing: Programming a massively parallel processor. In CGO ’07: Proceedings of the International Symposium on Code Generation and Optimization, IEEE Computer Society, Washington, DC, USA, 17.

CHEN, J., PARIS, S., AND DURAND, F. 2007. Real-time edge-aware image processing with the bilateral grid. ACM Transactions on Graphics 26, 3 (July), 103:1–103:9.

COREIMAGE. Apple Core Image programming guide. http://developer.apple.com/library/mac/#documentation/GraphicsImaging/Conceptual/CoreImaging/ci_intro/ci_intro.html.

ELLIOTT, C. 2001. Functional image synthesis. In Proceedings of Bridges.

FEAUTRIER, P. 1991. Dataflow analysis of array and scalar references. International Journal of Parallel Programming 20.

IPP. Intel integrated performance primitives. http://software.intel.com/en-us/articles/intel-ipp/.

KASS, M., WITKIN, A., AND TERZOPOULOS, D. 1988. Snakes: Active contour models. International Journal of Computer Vision 1, 4.

LI, C., XU, C., GUI, C., AND FOX, M. D. 2010. Distance regularized level set evolution and its application to image segmentation. IEEE Transactions on Image Processing 19, 12 (December), 3243–3254.

LLVM. The LLVM compiler infrastructure. http://llvm.org.

MCCOOL, M. D., QIN, Z., AND POPA, T. S. 2002. Shader metaprogramming. In Graphics Hardware 2002, 57–68.

OPENCL. The OpenCL specification, version 1.2. http://www.khronos.org/registry/cl/specs/opencl-1.2.pdf.

OPENMP. OpenMP. http://openmp.org/.

PARIS, S., AND DURAND, F. 2009. A fast approximation of the bilateral filter using a signal processing approach. International Journal of Computer Vision 81, 1, 24–52.

PARIS, S., KORNPROBST, P., TUMBLIN, J., AND DURAND, F. 2009. Bilateral filtering: Theory and applications. Foundations and Trends in Computer Graphics and Vision.

PARIS, S., HASINOFF, S. W., AND KAUTZ, J. 2011. Local Laplacian filters: Edge-aware image processing with a Laplacian pyramid. ACM Transactions on Graphics 30, 4.

PIXELBENDER. Adobe Pixel Bender reference. http://www.adobe.com/content/dam/Adobe/en/devnet/pixelbender/pdfs/pixelbender_reference.pdf.

PUSCHEL, M., MOURA, J. M. F., JOHNSON, J., PADUA, D., VELOSO, M., SINGER, B., XIONG, J., FRANCHETTI, F., GACIC, A., VORONENKO, Y., CHEN, K., JOHNSON, R. W., AND RIZZOLO, N. 2005. SPIRAL: Code generation for DSP transforms. Proceedings of the IEEE, special issue on “Program Generation, Optimization, and Adaptation” 93, 2, 232–275.

SHANTZIS, M. A. 1994. A model for efficient and flexible image computing. In Proceedings of the 21st annual conference on Computer graphics and interactive techniques, ACM, New York, NY, USA, SIGGRAPH ’94, 147–154.
