Online Submission ID: 0141
Decoupling Algorithms from Schedules for Easy Optimization of Image Processing Pipelines
Abstract

Using existing programming tools, writing high-performance image processing code requires sacrificing readability, portability, and modularity. We argue that this is a consequence of conflating what computations define the algorithm with decisions about storage and the order of computation. We refer to these latter two concerns as the schedule, including choices of tiling, fusion, recomputation vs. storage, vectorization, and parallelism.

We propose a representation for feed-forward imaging pipelines that separates the algorithm from its schedule, enabling high performance without sacrificing code clarity. This decoupling simplifies the algorithm specification: images and intermediate buffers become functions over an infinite integer domain, with no explicit storage or boundary conditions. Imaging pipelines are compositions of functions. Programmers separately specify scheduling strategies for the various functions composing the algorithm, which allows them to efficiently explore different optimizations without changing the algorithmic code.

We demonstrate the power of this representation by expressing a range of recent image processing applications in an embedded domain-specific language, and compiling them for ARM, x86, and GPUs. Our compiler targets SIMD units, multiple cores, and complex memory hierarchies. We demonstrate that it can handle algorithms such as a camera raw pipeline, the bilateral grid, fast local Laplacian filtering, and image segmentation. The algorithms expressed in our language are both shorter and faster than state-of-the-art implementations.

Keywords: Image Processing, Compilers, Performance
1 Introduction

Computational photography algorithms require highly efficient implementations to be used in practice, especially on power-constrained mobile devices. This is not a simple matter of programming in a low-level language like C. The performance difference between naive C and highly optimized C is often an order of magnitude. Unfortunately, this usually comes at the cost of programmer pain and code complexity, as computation must be reorganized to achieve memory efficiency and parallelism.

For image processing, the global organization of execution and storage is critical. Image processing pipelines are both wide and deep: they consist of many data-parallel stages that benefit hugely from parallel execution across pixels, but stages are often memory bandwidth limited—they do little work per load and store. Gains in speed therefore come not just from optimizing the inner loops, but also from global program transformations such as tiling and fusion that exploit producer-consumer locality down the pipeline. The best choice of transformations is architecture-specific; implementations optimized for an x86 multicore and a modern GPU often bear little resemblance to each other.

In this paper we enable simpler high-performance code by separating the intrinsic algorithm from the decisions about how to run efficiently on a particular machine (Fig. 2).
(a) Clean C++ : 9.94 ms per megapixel

void blur(const Image &in, Image &blurred) {
  Image tmp(in.width(), in.height());
  for (int y = 0; y < in.height(); y++)
    for (int x = 0; x < in.width(); x++)
      tmp(x, y) = (in(x-1, y) + in(x, y) + in(x+1, y))/3;
  for (int y = 0; y < in.height(); y++)
    for (int x = 0; x < in.width(); x++)
      blurred(x, y) = (tmp(x, y-1) + tmp(x, y) + tmp(x, y+1))/3;
}

(b) Fast C++ (for x86) : 0.90 ms per megapixel

void fast_blur(const Image &in, Image &blurred) {
  __m128i one_third = _mm_set1_epi16(21846);
  #pragma omp parallel for
  for (int yTile = 0; yTile < in.height(); yTile += 32) {
    __m128i a, b, c, sum, avg;
    __m128i tmp[(256/8)*(32+2)];
    for (int xTile = 0; xTile < in.width(); xTile += 256) {
      __m128i *tmpPtr = tmp;
      for (int y = -1; y < 32+1; y++) {
        const uint16_t *inPtr = &(in(xTile, yTile+y));
        for (int x = 0; x < 256; x += 8) {
          a = _mm_loadu_si128((__m128i*)(inPtr-1));
          b = _mm_loadu_si128((__m128i*)(inPtr+1));
          c = _mm_load_si128((__m128i*)(inPtr));
          sum = _mm_add_epi16(_mm_add_epi16(a, b), c);
          avg = _mm_mulhi_epi16(sum, one_third);
          _mm_store_si128(tmpPtr++, avg);
          inPtr += 8;
        }
      }
      tmpPtr = tmp;
      for (int y = 0; y < 32; y++) {
        __m128i *outPtr = (__m128i *)(&(blurred(xTile, yTile+y)));
        for (int x = 0; x < 256; x += 8) {
          a = _mm_load_si128(tmpPtr+(2*256)/8);
          b = _mm_load_si128(tmpPtr+256/8);
          c = _mm_load_si128(tmpPtr++);
          sum = _mm_add_epi16(_mm_add_epi16(a, b), c);
          avg = _mm_mulhi_epi16(sum, one_third);
          _mm_store_si128(outPtr++, avg);
        }
      }
    }
  }
}

(c) Halide : 0.90 ms per megapixel

Func halide_blur(Func in) {
  Func tmp, blurred;
  Var x, y, xi, yi;

  // The algorithm
  tmp(x, y) = (in(x-1, y) + in(x, y) + in(x+1, y))/3;
  blurred(x, y) = (tmp(x, y-1) + tmp(x, y) + tmp(x, y+1))/3;

  // The schedule
  blurred.tile(x, y, xi, yi, 256, 32).vectorize(xi, 8).parallel(y);
  tmp.chunk(x).vectorize(x, 8);

  return blurred;
}

Figure 1: The code at the top computes a 3x3 box filter using the composition of a 1x3 box filter and a 3x1 box filter. Using vectorization, multithreading, tiling, and fusion, we can make this algorithm more than 10× faster on a quad-core x86 CPU (middle). However, in doing so we've lost readability and portability. Our compiler separates the algorithm description from its schedule, achieving the same performance without making the same sacrifices (bottom). For the full details about how this test was carried out, see the supplemental material.
Camera Raw Pipeline
  Reference: Optimized NEON ASM, 463 lines; Nokia N900: 772 ms
  Halide: 145-line algorithm + 23-line schedule; Nokia N900: 741 ms; quad-core x86: 51 ms
  2.75x shorter, 5% faster than tuned assembly

Local Laplacian Filter
  Reference: C++, OpenMP+IPP, 262 lines; quad-core x86: 627 ms
  Halide: 62-line algorithm + 7-line schedule; quad-core x86: 293 ms; CUDA GPU: 48 ms (13x)
  3.7x shorter, 2.1x faster

Snake Image Segmentation
  Reference: Vectorized MATLAB, 67 lines; quad-core x86: 3800 ms
  Halide: 148-line algorithm + 7-line schedule; quad-core x86: 55 ms; CUDA GPU: 3 ms (1267x)
  70x faster

Bilateral Grid
  Reference: Tuned C++, 122 lines; quad-core x86: 472 ms
  Halide: 34-line algorithm + 6-line schedule; quad-core x86: 80 ms; CUDA GPU: 11 ms (42x)
  3x shorter, 5.9x faster

Porting to new platforms does not change the algorithm code, only the schedule.

Figure 2: We compare algorithms in our prototype language, Halide, to state of the art implementations of four image processing applications, ranging from MATLAB code to highly optimized NEON vector assembly [Adams et al. 2010; Aubry et al. 2011; Paris and Durand 2009; Li et al. 2010]. Halide code is compact, modular, portable, and delivers high performance across multiple platforms. All speedups are expressed relative to the reference implementation.
To understand the challenge of efficient image processing, consider a 3 × 3 box filter implemented as separate horizontal and vertical passes. We might write this in C++ as a sequence of two loop nests (Fig. 1.a). An efficient implementation on a modern CPU requires SIMD vectorization and multithreading. But once we start to exploit parallelism, the algorithm becomes bottlenecked on memory bandwidth. Computing the entire horizontal pass before the vertical pass destroys producer-consumer locality—horizontally blurred intermediate values are computed long before they are consumed by the vertical blur pass—doubling the storage and memory bandwidth required. Exploiting locality requires interleaving the two stages, by tiling and fusing the loops. Tiles must be carefully sized for alignment, and efficient fusion requires subtleties like redundantly computing values on the overlapping boundaries of intermediate tiles. The resulting implementation is over 10× faster on a quad-core CPU, but together, these optimizations have fused two simple, independent steps into a single intertwined, non-portable mess (Fig. 1.b).
We believe the right answer is to separate the intrinsic algorithm—what is computed—from the concerns of efficient mapping to machine execution—decisions about storage and the ordering of computation. We call these choices of how to map an algorithm onto resources in space and time the schedule.
Image processing exhibits a rich space of schedules. Pipelines tend to be deep and heterogeneous (in contrast to signal processing or array-based scientific code). Efficient implementations must trade off between storing intermediate values, or recomputing them when needed. However, intentionally introducing recomputation is seldom considered by traditional compilers. In our approach, the programmer specifies an algorithm and its schedule separately. This makes it easy to explore various optimization strategies without obfuscating the code or accidentally modifying the algorithm itself.
Functional languages provide a natural model for separating the what from the when and where. Divorced from explicit storage, images are no longer arrays populated by procedures, but are instead pure functions that define the value at each point in terms of arithmetic, reductions, and the application of other functions. A functional representation also allows us to omit boundary conditions, making images functions over an infinite integer domain.
In this representation, the algorithm only defines the value of each function at each point, and the schedule specifies:

• The order in which points in the domain of a function are evaluated, including the exploitation of parallelism, and mapping onto SIMD execution units.

• The order in which points in the domain of one function are evaluated relative to points in the domain of another function.

• The memory location into which the evaluation of a function is stored, including registers, scratchpad memories, and regions of main memory.

• Whether a value is recomputed, or from where it is loaded, at each point a function is used.
Once the programmer has specified an algorithm and a schedule, our compiler combines them into an efficient implementation. Optimizing execution for a given architecture requires modifying the schedule, but not the algorithm. The representation of the schedule is compact (e.g. Fig. 1.c), so exploring the performance of many options is fast and easy. We can most flexibly schedule operations which are data parallel, with statically analyzable access patterns, but also support the reductions and bounded irregular access patterns that occur in image processing.
In addition to this model of scheduling (Sec. 3), we present:

• A prototype embedded language called Halide, for functional algorithm and schedule specification (Sec. 4).

• A compiler which translates functional algorithms and optimized schedules into efficient machine code for x86 and ARM, including SSE and NEON SIMD instructions, and CUDA GPUs, including synchronization and placement of data throughout the specialized memory hierarchy (Sec. 5).

• A range of applications implemented in our language, composed of common image processing operations such as convolutions, histograms, image pyramids, and complex stencils. Using different schedules, we compile them into optimized programs for x86 and ARM CPUs, and a CUDA GPU (Sec. 6). For these applications, the Halide code is compact, and performance is state of the art (Fig. 2).
2 Prior Work

Loop transformation Most compiler optimizations for numerical programs are based on loop analysis and transformation, including auto-vectorization, loop interchange, fusion, and tiling. The polyhedral model is a powerful tool for transforming imperative programs [Feautrier 1991], but traditional loop optimizations do not consider recomputation of values: each point in each loop is computed only once. In image processing, recomputing some values—rather than storing, synchronizing around, and reloading them—can be a large performance win (Sec. 6.2), and is central to the choices we consider during optimization.
Data-parallel languages Many data-parallel languages have been proposed. Particularly relevant in graphics, CUDA and OpenCL expose an imperative data-parallel programming model which can target both GPUs and multicore CPUs with SIMD units [Buck 2007; OpenCL]. Like C, they allow the specification of very high performance implementations for many algorithms, but because parallel work distribution, synchronization, and memory are all explicitly managed by the programmer, complex algorithms are often not composable in these languages, and the optimizations required are often specific to an architecture, so code must be rewritten for different platforms.
Intel's Array Building Blocks provides an embedded language for data-parallel array processing in C++ [ArBB]. Like in our representation, whole pipelines of operations are built up and optimized globally by a compiler. It delivers impressive performance for many algorithms on Intel CPUs. However, the inherently imperative structure—in particular the explicit specification of storage locations—fundamentally affords less flexibility in scheduling a given pipeline. Trading off recomputation vs. storage is challenging in this representation, and is not considered by the compiler.
Image processing languages Shantzis described a framework and runtime model for image processing systems based on graphs of operations which process tiles of data [Shantzis 1994]. This is the inspiration for many scalable and extensible image processing systems, including our own.
Apple's CoreImage and Adobe's PixelBender include kernel languages for specifying individual point-wise operations on images [CoreImage; PixelBender]. Kernels compile into optimized code for multiple architectures, including GPUs. Neither optimizes across graphs of kernels, which often contain complex communication like stencils, and neither supports reductions or nested parallelism within kernels.
The SPIRAL system [Puschel et al. 2005] uses a domain-specific language, SPL, for specifying linear signal processing operations independent of their schedule. Complementary mapping functions describe how these operations should be turned into efficient code for a particular architecture, similarly to our schedule specifications. It enables high performance across a range of architectures for linear filtering pipelines, by making deep use of mathematical identities on linear filters. Computational photography algorithms often do not fit within a strict linear filtering model. Our work can be seen as an attempt to generalize this approach to a broader class of programs.
Elsewhere in graphics, the real-time graphics pipeline has been a hugely successful abstraction precisely because the schedule is separated from the specification of the shaders. This allows GPUs and drivers to efficiently execute a wide range of programs with little programmer control over parallelism and memory management. This separation of concerns is extremely effective, but it is specific to the design of a single pipeline. That pipeline also exhibits different characteristics than image processing pipelines, where reductions and stencil communication are common, and kernel fusion is essential for efficiency. Embedded DSLs have also been used to specify the shaders themselves, directly inside the host C++ program that configures the pipeline [McCool et al. 2002].
MATLAB is also extremely successful as a language for image processing. Its high level syntax enables terse expression of many algorithms, and its widely-used library of built-in functionality shows that the ability to compose modular library functions is invaluable for programmer productivity. However, simply bundling fast implementations of individual kernels is not sufficient for fast execution on modern machines, where optimization across stages in a pipeline is essential for efficient use of parallelism and memory.
Pan introduced a functional model for image processing much like our own [Elliott 2001]. In Pan, images are functions from coordinates to values. Modest differences exist (Pan's images are functions over a continuous coordinate domain, while in ours the domain is discrete), but Pan is a close sibling of our intrinsic algorithm representation. However, it has no corollary to our complementary model of scheduling and ultimate compilation. It exists only as a direct embedding within Haskell, and is not compiled for high performance execution.
3 Representing Algorithms and Schedules

We propose a functional representation for image processing pipelines that separates the intrinsic algorithm from the schedule with which it will be executed. In this section we describe the representation for each of these components, and how they combine to create a fully-specified program.
3.1 The Intrinsic Algorithm

Our algorithm representation is functional. Values that would be mutable arrays in an imperative language are instead functions from coordinates to values. We represent images as pure functions defined over an infinite integer domain, where the value of a function at a point represents the color of the corresponding pixel. Imaging pipelines are specified as chains of functions. Functions may either be simple expressions in their arguments, or reductions. The expressions which define functions are side-effect free, and are much like those in any simple functional language, including:

• Arithmetic and logical operations;

• Loads from external images;

• If-then-else expressions (semantically equivalent to the ?: ternary operator in C);

• References to named values (which may be function arguments, or expressions defined by a functional let construct);

• Calls to other functions, including external C ABI functions.

For example, our separable 3 × 3 box filter in Figure 1 is expressed as a chain of two functions in x, y. The first horizontally blurs the input; the second vertically blurs the output of the first.
This representation is simpler than most functional languages. We omit higher-order functions, dynamic recursion, and richer data structures such as tuples and lists. Functions simply map from integer coordinates to a scalar result. This representation is sufficient to represent a wide range of image processing algorithms, and these constraints enable extremely flexible analysis and transformation of algorithms during compilation. Constrained versions of more advanced features, such as higher-order functions and tuples, are reintroduced as syntactic sugar, but they do not change the underlying representation (Sec. 4.1).

UniformImage in(UInt(8), 2);
Func histogram, cdf, out;
RVar rx(0, in.width()), ry(0, in.height()), ri(0, 255);
Var x, y, i;

histogram(in(rx, ry))++;
cdf(i) = 0;
cdf(ri) = cdf(ri-1) + histogram(ri);
out(x, y) = cdf(in(x, y));

Figure 3: Histogram equalization uses a reduction to compute a histogram, a scan to integrate it into a cdf, and a point-wise operation to remap the input using the cdf. The iteration domains for the reduction and scan are expressed by the programmer using RVars. Like all functions in our representation, histogram and cdf are defined over an infinite domain. At entries not touched by the reduction step they are zero-valued. For cdf this is specified explicitly. For histogram this is implicit in the ++ operator.
Reduction functions From the perspective of a caller, reductions, such as histograms, are still scalar-valued functions over an infinite output domain. Their definition, however, is more than a simple expression in the arguments. A reduction is specified by:

• An initial expression, which specifies a value at each point in the output domain in terms of the function arguments.

• A list of reduction variables, bounded by minimum and maximum expressions.

• A reduction expression, which redefines the value of the function at a computed output coordinate as a function of the reduction variables and recursive references to the same function.

The value at a given point in the output domain is defined by the last reduction expression that touched that output coordinate, given a lexicographic traversal of all values of the reduction variables. Any point which was not touched by a reduction expression has the value of the initial expression.
Reduction expressions are usually recursive. For example, histogram in Figure 3 defines a new value in terms of the old value at the same point, while cdf defines a new value in terms of the value to the left. While we semantically define a strict lexicographic traversal order over the reduction variables, many common reductions (such as histogram) are associative, and may be executed in parallel given appropriate atomics. Scans like cdf are inherently more challenging to parallelize. We do not yet address this.
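These reduction semantics can be sketched in plain C++ (this is illustrative scaffolding, not Halide code; the helper name equalize_lut and the one-slot index shift standing in for cdf(-1) = 0 are our own):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Plain-C++ sketch of the reduction semantics of Figure 3. Each reduction
// traverses its domain in lexicographic order; any point not touched by an
// update keeps its initial value (here, zero).
std::vector<int> equalize_lut(const std::vector<uint8_t> &in) {
    // histogram: initially zero everywhere; update: histogram(in(rx, ry))++
    std::vector<int> histogram(256, 0);
    for (uint8_t v : in) histogram[v]++;   // lexicographic over (rx, ry)

    // cdf: initially zero; update: cdf(ri) = cdf(ri - 1) + histogram(ri).
    // Indices are shifted by one so cdf_buf[0] plays the role of cdf(-1) = 0.
    std::vector<int> cdf_buf(257, 0);
    for (int i = 0; i < 256; i++) cdf_buf[i + 1] = cdf_buf[i] + histogram[i];
    return cdf_buf;
}
```

The strict traversal order is what makes the scan in cdf well-defined: each entry reads the entry its own update just wrote.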
3.2 The Schedule

Our formulation of imaging pipelines as chains of functions intentionally omits choices of when and where these functions should be computed. The programmer separately specifies this using a schedule. A schedule describes not only the order of evaluation of points within the producer and consumer, but also what is stored and what is recomputed. The schedule further describes mapping onto parallel execution resources such as threads, SIMD units, and GPU blocks. It is constrained only by the fundamental dependence between points in different functions (values must be computed before they are used).
Schedules are demand-driven: for each pipeline stage, they specify how the inputs should be evaluated, starting from the output of the full pipeline. Formally, when a callee function such as tmp in Fig. 1(c) is invoked in a caller such as blurred, we need to decide how to schedule it with respect to the caller.

We currently allow four types of caller-callee relationships (Fig. 4). Some of them lead to additional choices, including traversal order and subdivision of the domain, with possibly recursive scheduling decisions for the sub-regions.
Inline: Compute as needed, do not store In the simplest case, the callee is evaluated directly at the single point requested by the caller, like a function call in a traditional language. Its value at that point is computed from the expression which defines it, and passed directly into the calling expression. Reductions may not be inlined because they are not defined by a single expression; they require evaluation over the entire reduction domain before they can return a value. Inlining performs redundant computation whenever a single point is referred to in multiple places. However, even when it introduces significant amounts of recomputation, inlining can be the most efficient option. This is because image processing code is very often constrained by memory bandwidth and inlining passes values between functions without touching memory.
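A plain C++ sketch of inlining for the blur of Figure 1 (stand-in functions, not compiler output; the counter tmp_evals is ours) makes the recomputation visible:

```cpp
#include <cassert>

// Plain-C++ sketch of the "inline" schedule: nothing is stored, so every
// use of tmp re-evaluates it, and tmp_evals counts the redundant work.
static int tmp_evals = 0;

int in_img(int x, int y) { return x + y; }   // stand-in for the input image

int tmp(int x, int y) {                      // horizontal blur, inlined
    tmp_evals++;
    return (in_img(x - 1, y) + in_img(x, y) + in_img(x + 1, y)) / 3;
}

int blurred(int x, int y) {                  // vertical blur
    return (tmp(x, y - 1) + tmp(x, y) + tmp(x, y + 1)) / 3;
}
```

With in_img(x, y) = x + y both blurs are identities, so blurred(3, 3) is 6, at a cost of three evaluations of tmp; evaluating blurred(3, 4) next would recompute tmp(3, 3) and tmp(3, 4), the recomputation inlining trades for locality.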
Root: Precompute entire required region At the other extreme, we can compute the value of the callee for the entire subdomain needed by the caller before evaluating any points in the caller. In our blur example, this means evaluating and storing all of the horizontal pass (tmp) before beginning the vertical pass (blurred). We call this call schedule root. Every point is computed exactly once, but storage and locality may be lost: the intermediate buffer required may be large, and points in the callee are unlikely to still be in a cache when they are finally used. This schedule is equivalent to the most common structure seen in naive C or MATLAB image processing code: each stage of the algorithm is evaluated in its entirety, and then stored as a whole image in memory.
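For contrast with inlining, a sketch of the root schedule in plain C++ (illustrative; RootBlur and the stand-in input are our own): tmp is evaluated once over its entire required region, rows -1 through H, and stored before blurred reads anything:

```cpp
#include <cassert>
#include <vector>

// Plain-C++ sketch of the "root" schedule for the blur of Figure 1.
// The constructor precomputes and stores all of tmp; blurred only loads.
// Every point is computed exactly once, at the cost of a full-size buffer.
struct RootBlur {
    int W, H;
    std::vector<int> tmp_buf;                         // (H + 2) rows of W
    int in(int x, int y) const { return x + y; }      // stand-in input
    int tmp_at(int x, int y) const { return tmp_buf[(y + 1) * W + x]; }

    RootBlur(int w, int h) : W(w), H(h), tmp_buf((h + 2) * w) {
        for (int y = -1; y <= H; y++)                 // entire region of tmp
            for (int x = 0; x < W; x++)
                tmp_buf[(y + 1) * W + x] =
                    (in(x - 1, y) + in(x, y) + in(x + 1, y)) / 3;
    }
    int blurred(int x, int y) const {
        return (tmp_at(x, y - 1) + tmp_at(x, y) + tmp_at(x, y + 1)) / 3;
    }
};

// Convenience wrapper for quick checks.
int root_blurred(int W, int H, int x, int y) { return RootBlur(W, H).blurred(x, y); }
```

Note that the buffer covers the whole image plus boundary rows: this is the storage and locality cost the text describes.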
While evaluating the callee, there are further choices in the traversal of the required subdomain. A root schedule must specify, for each dimension of the subdomain, whether it is traversed:

• sequentially,

• in parallel,

• unrolled by a constant factor,

• or vectorized by a constant factor.

The schedule also specifies the relative traversal order of the dimensions (e.g. row- vs. column-major).
The schedule does not specify the bounds in each dimension. The bounds of the domain required of each stage are inferred during compilation (Sec. 5.2). Ultimately, these become expressions in the size of the requested output image. Leaving bounds specification to the compiler makes the algorithm and schedule simpler and more flexible. Explicit bounds are only required for indexing expressions not analyzable by the compiler, such as the result of a reduction. In these cases, we require the algorithm to explicitly clamp the problematic index.
The schedule may also split a dimension into inner and outer components, which can then be treated separately. For example, to represent evaluation in tiles, we can split x into outer and inner dimensions xo and xi, and similarly split y into yo and yi, which can then be traversed in the order yo, xo, yi, xi (as illustrated in the lower right of Fig. 4). After a dimension has been split, the inner and outer components must still be scheduled using any of the options discussed above.
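The loop nest this split produces can be sketched in plain C++ (the name tiled_order is ours, not Halide's; the domain extents are assumed to be multiples of the tile size for simplicity):

```cpp
#include <cassert>
#include <vector>

// Plain-C++ sketch of splitting x and y by a factor and traversing in the
// order yo, xo, yi, xi: the result is tile-by-tile traversal, as in the
// lower right of Fig. 4.
std::vector<int> tiled_order(int W, int H, int tile) {
    std::vector<int> visit;                    // row-major index of each visit
    for (int yo = 0; yo < H; yo += tile)       // outer (tile) loops
        for (int xo = 0; xo < W; xo += tile)
            for (int yi = 0; yi < tile; yi++)  // inner (within-tile) loops
                for (int xi = 0; xi < tile; xi++)
                    visit.push_back((yo + yi) * W + (xo + xi));
    return visit;
}
```

For a 4 × 4 domain split by 2, the first tile visits row-major indices 0, 1, 4, 5 before any point of the second tile.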
[Figure 4 diagram: four panels showing tmp feeding blurred under each call schedule — Inline (compute as needed, do not store), Chunk (compute, use, then discard subregions), Root (precompute entire required region), and Reuse (load from an existing buffer) — each annotated with the order in which points are evaluated. Below, the implemented domain traversal orders: Serial y, Serial x; Serial x, Serial y; Serial y, Vectorized x; Parallel y, Vectorized x; and Split x into 2xo+xi, Split y into 2yo+yi, Serial yo, xo, yi, xi.]
Figure 4: We model scheduling an imaging pipeline as the set of choices that must be made for each stage about how to evaluate each input. Here we consider blurred's dependence on tmp, from the example in Fig. 1. blurred may inline tmp, computing values on demand and not storing anything for later reuse. This gives excellent temporal locality, but each point of tmp will be computed three times. blurred may also compute and consume tmp in larger chunks. This provides some temporal locality, and performs redundant computation at the chunk boundaries. blurred may simply compute all of tmp before using any of it. We call this root. It computes each point of tmp only once, but temporal locality is poor—each value is unlikely to still be in cache when it is needed. Finally, if some other consumer (in green on the right) had already evaluated all of tmp as root, blurred could simply reuse that data. If blurred evaluates tmp as root or chunked, then there are further choices to make about the order in which to compute the given region of tmp. The choices we implement are shown at the bottom.
Splitting a dimension expands its bounds to be a multiple of the extent of the inner dimension. Vectorizing or unrolling a dimension similarly rounds its extent up to the nearest multiple of the factor used. Such bounds expansion is always legal given our representation of images as functions over infinite domains.
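The rounding rule amounts to one line of integer arithmetic (the name expanded_extent is ours, for illustration):

```cpp
#include <cassert>

// Sketch of the bounds expansion described above: splitting, vectorizing,
// or unrolling by a factor rounds the extent up to the next multiple of
// that factor. Evaluating the extra points is harmless because functions
// are defined over an infinite domain.
int expanded_extent(int extent, int factor) {
    return ((extent + factor - 1) / factor) * factor;
}
```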
These choices amount to specifying a complete loop nest which traverses the required region of the output domain. The schedule for a reduction must specify a pair of loop nests: one for its initialization (over the output domain), and one for its update (over the reduction domain). In the latter case, the bounds are given by the definition of the reduction, and do not need to be inferred later.
Chunk: Compute, use, then discard subregions Alternatively, a function can be chunked with respect to a dimension of the caller. Each iteration of the caller over that dimension first precomputes all values of the callee needed for that iteration only. Chunking interleaves the computation of sub-regions of the caller and the callee, trading off producer-consumer locality and reduced storage footprint for potential recomputation when chunks required for different iterations of the caller overlap. Because a chunk is a region, it requires the same choices defining the traversal of its dimensions as a root schedule. Its bounds are also similarly inferred. Chunked call schedules, combined with split iteration dimensions, describe the common pattern of loop tiling and stripmining (as taken advantage of in Fig. 1).
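A plain C++ sketch of chunking tmp over rows of blurred (illustrative names; assumes a width of at most 64) shows both the small footprint and the overlap recomputation:

```cpp
#include <cassert>

// Plain-C++ sketch of the "chunk" schedule for the blur of Figure 1. Each
// call precomputes the three rows of tmp that one row of blurred needs,
// consumes them, then discards the chunk. Rows shared with the previous
// chunk are recomputed; the counter makes that redundant work visible.
static int chunk_tmp_evals = 0;

int chunk_in(int x, int y) { return x + y; }   // stand-in input

int chunked_blurred_at(int W, int y, int x) {
    int tmp_chunk[3][64];                      // rows y-1, y, y+1 of tmp
    for (int dy = -1; dy <= 1; dy++)           // precompute this row's chunk
        for (int cx = 0; cx < W; cx++) {
            chunk_tmp_evals++;
            tmp_chunk[dy + 1][cx] =
                (chunk_in(cx - 1, y + dy) + chunk_in(cx, y + dy) +
                 chunk_in(cx + 1, y + dy)) / 3;
        }
    return (tmp_chunk[0][x] + tmp_chunk[1][x] + tmp_chunk[2][x]) / 3;
}
```

Computing rows y = 1 and then y = 2 of a width-4 image performs 24 evaluations of tmp even though only 16 distinct points are needed: the two rows shared by adjacent chunks are computed twice.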
Reuse: Load from an existing buffer Finally, if a function is computed in chunks or at the root for one caller, another caller may reuse that evaluation. Reusing a chunked evaluation is only legal if it is also in scope for the new caller. Reuse is typically the best option when available.
Imaging applications exhibit a fundamental tension between total fusion down the pipeline (inline), which maximizes producer-consumer locality at the cost of recomputation of shared values, and breadth-first execution (root), which eliminates recomputation at the cost of locality. This is often resolved by splitting a function's domain and chunking the functions upstream at a finer granularity. This achieves reuse for the inner dimensions, and producer-consumer locality for the outer ones. Choosing the granularity trades off between locality, storage footprint, and recomputation. A key purpose of our schedule representation is to span this continuum, so that the best choice may be made in any given context.
3.3 The Fully Specified Program

Lowering an intrinsic algorithm with a specific schedule produces a fully specified imperative program, with a defined order of operations and placement of data. The resulting program is made up of ordered imperative statements, including:

• Stores of expression values to array locations;

• Sequential and parallel for loops, which define a range of variable values over which a contained statement should be executed;

• Producer-consumer edges, which define an array to be allocated (its size given by a potentially dynamic expression), a block of statements which may write to it, and a block of statements which may read from it, after which it may be freed.
This is a general imperative program representation, but we don't need to analyze or transform programs in this form. Most challenging optimization has already been performed in the lowering from intrinsic algorithm to imperative program. And because the compiler generates all imperative allocation and execution constructs, it has a deep knowledge of their semantics and constraints, which can be very challenging to infer from arbitrary imperative input.

Our lowered imperative program may still contain symbolic bounds which need to be resolved. A final bounds inference pass infers concrete bounds based on dependence between the bounds of different loop variables in the program (Sec. 5.2).
4 The Language

We construct imaging pipelines in this representation using a prototype language embedded in C++, which we call Halide. A chain of Halide functions can be JIT compiled and used immediately, or it can be compiled to an object file and header to be used by some other program (which need not link against Halide).

Expressions. The basic expressions are constants, domain variables, and calls to Halide functions. From these, we use C++ operator overloading to build arithmetic operations, comparisons, and logical operations. Conditional expressions, type-casting, transcendentals, external functions, etc. are described using calls to provided intrinsics. For example, the expression select(x > 0, sqrt(cast<float>(x)), f(x+1)) returns either the square root of x, or the application of some Halide function f to x+1, depending on the sign of x. Finally, debug expressions evaluate to their first argument, and print the remainder of their arguments at evaluation time. They are useful for inspecting values in flight.
Functions are defined in a functional programming style. The following code constructs a Halide function over a two-dimensional domain that evaluates to the product of its arguments:

Func f;
Var x, y;
f(x, y) = x * y;
Reductions are produced by defining a function twice: once for its initial value, and once for its reduction step. The reduction step should be in terms of reduction variables (of type RVar), which include expressions describing their bounds. The span of all reduction variables referenced defines the reduction domain. The left-hand side of the update definition may be a computed location rather than simple variables (Fig. 3).

In many cases we can infer the bounds of reduction variables based on their use. We can also infer reasonable initial values in common cases: if a reduction is a sum, the initial value defaults to zero; if it is a product, it defaults to one. The following code takes advantage of both of these features to compute a histogram over the image im:

Func histogram;
RVar x, y;
histogram(im(x, y))++;
Uniforms describe the run-time parameters of an imaging pipeline. They may be scalars or entire images (in particular the input image). When using Halide as a JIT compiler, uniforms can be bound by assigning to them. Statically-compiled Halide functions will expose all referenced uniforms as top-level function arguments. The following C++ code builds a Halide function that brightens its input using a uniform parameter:

// A floating-point parameter
Uniform<float> scale;
// A two-dimensional floating-point image
UniformImage input(Float(32), 2);

Var x, y;
Func bright;
bright(x, y) = input(x, y) * scale;
We can JIT compile and use our function immediately by calling realize, or we can statically compile it using compileToFile. For example, we can apply the above brighten function immediately:

Image<float> im = load("input.png");
input = im;
scale = 2.0f;
Image<float> output = bright.realize(im.width(), im.height());

Alternatively, we can statically compile with bright.compileToFile("bright"). This produces bright.o and bright.h, which together define a C-callable function with the following type signature:

void bright(float scale, buffer_t *in, buffer_t *out);

where buffer_t is a bare-bones image struct defined in the same header.
4.1 Syntactic Sugar

While the constructs above are sufficient to express any Halide algorithm, functional languages typically provide other features that are useful in this context. We provide restricted forms of several of these via syntactic sugar.

Higher-order functions. While Halide functions may only have integer arguments, the code that builds a pipeline may include C++ functions that take and return Halide functions. These are effectively compile-time higher-order functions, and they let us write generic operations on images. For example, consider the following operator which shrinks an image by subsampling:

// Return a new Halide function that subsamples f
Func subsample(Func f) {
    Func g;
    Var x, y;
    g(x, y) = f(2*x, 2*y);
    return g;
}

C++ functions that deal in Halide expressions are also a convenient way to write generic code. As the host language, C++ can be used as a metaprogramming layer to more conveniently construct Halide pipelines containing repetitive substructures.
Partial application. When performing trivial point-wise operations on entire images, it is often clearer to omit pixel indices. For example, if we wish to define f as equal to a plus a subsampling of b, then f = a + subsample(b) is clearer than f(x, y) = a(x, y) + subsample(b)(x, y). We therefore support such partial application of Halide functions. Any operator which combines partially applied functions is automatically lifted to a point-wise operation over the omitted arguments.

Tuples. We overload the C++ comma operator to allow for tuples of expressions. A tuple generates an anonymous function that maps from an index to that element of the tuple. The tuple is then treated as a partial application of this function. For example, given expressions r, g, and b, the definition f(x, y) = (r, g, b) creates a three-dimensional function (in this case representing a color image) whose last argument selects between r, g, and b. It is equivalent to f(x, y, c) = select(c==0, r, select(c==1, g, b)).
Inline reductions. We provide syntax for inlining the most commonly-occurring reduction patterns: sum, product, maximum, and minimum. These simplified reduction operators use all reduction variables referenced within as the reduction domain. For example, a blurred version of some image f can be defined as follows:

Func blurry;
Var x, y;
RVar i(-2, 5), j(-2, 5);
blurry(x, y) = sum(f(x+i, y+j));
4.2 Specifying a Schedule

Once the description of an algorithm is complete, the programmer specifies a desired partial schedule for each function. The compiler fills in any remaining choices using simple heuristics, and tabulates the scheduling decisions for each call site. The function representing the output is scheduled as root. Other functions are scheduled as inline by default. This behavior can be modified by calling one of the two following methods:

• im.root() schedules the first use of im as root, and schedules all other uses to reuse that instance.

• im.chunk(x) schedules im as chunked over x, which must be some dimension of the caller of im. A similar reuse heuristic applies; for each unique x, only one use is scheduled as chunk, and the others reuse that instance.

If im is scheduled as root or chunk, we must also specify the traversal order of the domain. By default it is traversed serially in scanline order. This can be modified using the following methods:
• im.transpose(x, y) moves iteration over x outside of y in the traversal order (i.e., this switches from row-major to column-major traversal).

• im.parallel(y) indicates that each row of im should be computed in parallel across y.

• im.vectorized(x, k) indicates that x should be split into vectors of size k, and each vector should be executed using SIMD.

• im.unroll(x, k) indicates that the evaluation of im should be unrolled across the dimension x by a factor of k.

• im.split(x, xo, xi, k) subdivides the dimension x into outer and inner dimensions xo and xi, where xi ranges from zero to k. xo and xi can then be marked as parallel, serial, vectorized, or even recursively split.

• im.tile(x, y, xi, yi, k, l) is a convenience method that splits x by a factor of k, and y by a factor of l, then transposes the inner dimension of y with the outer dimension of x to effect traversal over tiles.

• im.gpu(bx, by, tx, ty) maps execution to the CUDA model, by marking bx and by as corresponding to block indices, and tx and ty as corresponding to thread indices within each block.

• im.gpuTile(x, y, k, l) is a similar convenience method to tile. It splits x and y by k and l respectively, and then maps the resulting four dimensions to CUDA's notion of blocks and threads.
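The split and tile directives are purely loop transformations, which can be sketched in plain C++ (our own hand-written illustration, not compiler output). For simplicity this sketch assumes the extents divide evenly by the tile sizes; the real compiler must also handle the ragged edges.

```cpp
#include <vector>

// Scanline-order realization of a function im(x, y) = x + y.
std::vector<int> realize_simple(int w, int h) {
    std::vector<int> out(w * h);
    for (int y = 0; y < h; y++)
        for (int x = 0; x < w; x++)
            out[y * w + x] = x + y;
    return out;
}

// im.tile(x, y, xi, yi, k, l): split x into (xo, xi) and y into (yo, yi),
// then transpose y's inner dimension with x's outer dimension, giving
// loop order yo, xo, yi, xi -- i.e., traversal proceeds tile by tile.
// Assumes k divides w and l divides h.
std::vector<int> realize_tiled(int w, int h, int k, int l) {
    std::vector<int> out(w * h);
    for (int yo = 0; yo < h / l; yo++)
        for (int xo = 0; xo < w / k; xo++)
            for (int yi = 0; yi < l; yi++)
                for (int xi = 0; xi < k; xi++) {
                    int x = xo * k + xi, y = yo * l + yi;
                    out[y * w + x] = x + y;
                }
    return out;
}
```

Both traversals compute identical values; only the order of the stores (and thus the memory access pattern) differs.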
Schedules that would require substantial transformation of code written in C can be specified tersely, and in a way that does not touch the statement of the algorithm. Furthermore, each scheduling method returns a reference to the function, so calls can be chained: e.g., im.root().vectorize(x, 4).transpose(x, y).parallel(x) directs the compiler to evaluate im in columns of width 4, operating on every column in parallel, with each thread walking down its column serially.
5 Compiler Implementation

The Halide compiler lowers imaging pipelines into machine code for ARM, x86, and PTX. It is built on top of the LLVM compiler infrastructure [LLVM], which it uses for conventional scalar optimizations, register allocation, and machine code generation. While LLVM provides some degree of platform neutrality, the final stages of lowering must be architecture-specific to produce high-performance machine code. Compilation proceeds as shown in Fig. 5.

Figure 5: The programmer writes a pipeline of Halide functions and partially specifies their schedules. The compiler then removes syntactic sugar (such as tuples), generates a complete schedule, and uses it to lower the pipeline into an imperative representation. Bounds inference is then performed to inject expressions that compute the bounds of each loop and the size of each intermediate buffer. The representation is then further lowered to LLVM IR, and handed off to LLVM to compile to machine code.
5.1 Lowering

After the programmer has created an imaging pipeline and specified its schedule, the first role of the compiler is to transform the functional representation of the algorithm into an imperative one using the schedule. The schedule is tracked as a table mapping from each call site to its call schedule. For root and chunked schedules, it also contains an ordered list of dimensions to traverse, and how they should be traversed (serial, parallel, vectorized, unrolled) or split.

The compiler works iteratively from the end of the pipeline upwards, considering each function after all of its uses. This requires that the pipeline be acyclic. It first initializes a seed by generating the imperative code that realizes the output function over its domain. It then proceeds up the pipeline, either inlining function bodies, or injecting loop nests that allocate storage and evaluate each function into that storage.

The structure of each loop nest, and the location it is injected, are precisely specified by the schedule: a function scheduled as root has realization code injected at the top of the code generated so far; functions scheduled as chunked over some variable have realization code injected at the top of the body of the corresponding loop; inline functions have their uses directly replaced with their function bodies; and functions that reuse other realizations are skipped over for now. Reductions are lowered into a sequential pair of loop nests: one for the initialization, and one for the reduction step.

The ultimate goal of lowering is to replace calls to functions with loads from their realizations. We defer this until after bounds inference.
5.2 Bounds Inference

The compiler then determines the bounds of the domain over which each use of each function must be evaluated. These bounds are typically not statically known at compile time; they will almost certainly depend on the sizes of the input and output images. The compiler is responsible for injecting the appropriate code to compute these bounds. Working through the list of functions, the compiler considers all uses of each function, and derives expressions that give the minimum and maximum possible argument values. This is done using symbolic interval arithmetic. For example, consider the following pseudocode that uses f:

for (i from a to b) g[i] = f(i+1) + f(i*2)
Working from the inside out, it is easy to deduce that f must be evaluated over the range [min(a + 1, 2a), max(b + 1, 2b)], and so expressions that compute these bounds are injected just before the realization of f. Reductions must also consider the bounds of the expressions that determine the location of updates.
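The interval arithmetic involved can be sketched in a few lines of C++ (our own illustration; the compiler performs this symbolically, so a and b remain unresolved until runtime, whereas here we evaluate numerically):

```cpp
#include <algorithm>
#include <utility>

using Interval = std::pair<int, int>;  // [min, max]

Interval add(Interval a, int c) { return {a.first + c, a.second + c}; }

Interval mul(Interval a, int c) {      // scaling by c may swap the endpoints
    int lo = a.first * c, hi = a.second * c;
    return {std::min(lo, hi), std::max(lo, hi)};
}

Interval hull(Interval a, Interval b) {  // union over the two uses of f
    return {std::min(a.first, b.first), std::max(a.second, b.second)};
}

// Bounds over which f must be realized for the loop
//   for (i from a to b) g[i] = f(i+1) + f(i*2)
Interval bounds_of_f(int a, int b) {
    Interval i = {a, b};
    return hull(add(i, 1), mul(i, 2));
}
```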
This analysis can fail in one of two ways. First, interval arithmetic can be over-conservative. If x ∈ [0, a], then interval arithmetic computes the bounds of x(a − x) as [0, a²], instead of the actual bounds [0, a²/4]. We have yet to encounter a case like this in practice; in image processing, dependence between functions is typically either affine or data-dependent.
Second, the compiler may not be able to determine any bound for some values, e.g. a value returned by an external function. These cases often correspond to code that would be unsafe if implemented in equivalent C. Unbounded expressions used as indices cause the compiler to throw an error.

In either case, the programmer can assist the compiler using min and max expressions to simultaneously declare and enforce the bounds of any troubling expression.

Now that expressions giving the bounds of each function have been computed, we replace references to functions with loads from or stores to their realizations, and perform a conventional constant-folding and simplification pass. The imperative representation is then translated directly to LLVM IR with a few architecture-specific modifications.
5.3 CPU Code Generation

Generating machine code from our imperative representation is largely left to LLVM, with two caveats.

First, LLVM IR has no concept of a parallel for loop. For the CPU targets we implement these by lifting the body of the for loop into a separate function that takes as arguments a loop index and a closure containing the referenced external state. At the original site of the loop we insert code that generates a work queue containing a single task representing all instances of the loop body. A thread pool then nibbles at this task until it is complete. If a worker thread encounters a nested parallel for loop, this is pushed onto the same task queue, with the thread that encountered it responsible for managing the corresponding task.
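A much-simplified sketch of this strategy (our own illustration, not the actual runtime): the loop body becomes a separate function taking the loop index plus a closure over referenced state, and a pool of workers claims iterations until the range is exhausted. The real runtime uses a shared task queue, so nested parallel loops can be pushed onto it; a single atomic counter stands in for that queue here.

```cpp
#include <atomic>
#include <functional>
#include <thread>
#include <vector>

void parallel_for(int begin, int end,
                  const std::function<void(int)> &body, int workers = 4) {
    std::atomic<int> next(begin);          // next unclaimed loop index
    std::vector<std::thread> pool;
    for (int w = 0; w < workers; w++)
        pool.emplace_back([&] {
            // each worker atomically claims indices until none remain
            for (int i = next++; i < end; i = next++)
                body(i);
        });
    for (auto &t : pool) t.join();
}
```

Because each index is claimed exactly once, the body may freely write to disjoint locations of shared state without further synchronization.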
Second, while LLVM has native vector types, it does not reliably generate good vector code in many cases on either ARM (targeting the NEON SIMD unit) or x86 (using SSE). In these cases we peephole optimize patterns in our representation, replacing them with calls to architecture-specific intrinsics. For example, while it is possible to perform efficient strided vector loads for small strides on both architectures, naive use of LLVM compiles them as general gathers. We can leverage more information than is available to LLVM to generate better code.
5.4 GPU Code Generation

When targeting the GPU, the compiler still generates functions with the same calling interface: a host function which takes scalar and buffer arguments. We compile the Halide algorithm into a heterogeneous program which manages both host and device execution.

The schedule describes how portions of the algorithm should be mapped to GPU execution. It tags dimensions as corresponding to the grid dimensions of the GPU's data-parallel execution model (threads and blocks, across up to 3 dimensions). Each of the resulting loop nests is mapped to a GPU kernel, launched over a grid large enough to contain the number of threads and blocks active at the widest point in that loop nest. Operations scheduled outside the kernel loop nests execute on the host CPU, using the same scheduling primitives and generating the same highly optimized x86/SSE code as when targeting the host CPU alone.

Fusion is achieved by scheduling functions inline, or by chunking at the GPU block dimension. We can describe a wide space of kernel fusion choices for complex pipelines simply by changing the schedule.

The host side of the generated code is responsible for managing most data allocation and movement, GPU kernel launch, and synchronization. Allocations scheduled outside GPU thread blocks are allocated in host memory, managed by the host runtime, and copied to GPU global memory when and if they are needed by a kernel. Allocations within thread blocks are allocated in GPU shared memory, and allocations within threads in GPU thread-local memory.

Finally, we allow associative reductions to be executed in parallel on the GPU using its native atomic operations.
6 Applications and Evaluation

We present four image processing applications that test different aspects of our approach. For each we compare both our performance and our implementation complexity to existing optimized solutions. The results are summarized in Fig. 2. The Halide source for each application can be found in the supplemental materials.

All performance results are reported as the best of five runs on a 3 GHz quad-core x86 desktop, a Nokia N900 mobile phone with a 600 MHz ARM OMAP3 CPU, a dual-core ARM OMAP4 development board (equivalent to an iPad 2), and an NVIDIA Tesla C2070 GPU (equivalent to a mid-range consumer GPU). In all cases, the algorithm code does not change between targets.
6.1 Camera Pipeline

We implement a simple camera pipeline that converts raw data from the image sensor into color images (Fig. 6). This pipeline performs four tasks: hot-pixel suppression, demosaicking, color correction, and a tone curve that applies gamma correction and contrast enhancement. This reproduces the software pipeline from the Frankencamera [Adams et al. 2010], which was written in a heavily optimized mixture of vector intrinsics and raw ARM assembly targeted at the OMAP3 processor in the Nokia N900. Our code is shorter and simpler, while also slightly faster and portable to other platforms.
The tightly bounded stencil communication down the pipeline makes fusion of stages to save bandwidth and storage a critical optimization for this application. In the Frankencamera implementation, the entire pipeline is computed on small tiles to take advantage of producer-consumer locality and minimize memory footprint. Within each tile, the evaluation of each stage is vectorized. These strategies render the algorithm illegible (see the supplemental material). Portability is sacrificed completely; an entirely separate, slower C version of the pipeline has to be included in the Frankencamera source in order to be able to run the pipeline on a desktop processor.

Figure 6: The basic camera post-processing pipeline is a feed-forward pipeline in which each stage either considers only nearby neighbors (denoise and demosaic), or is point-wise (color correct and tone curve). The best schedule computes the entire pipeline in small tiles in order to take advantage of producer-consumer locality. This introduces redundant computation in the overlapping tile boundaries, but the reduction in memory bandwidth more than makes up for it.
We can express the same optimizations used in the Frankencamera assembly, separately from the algorithm: the output is tiled, each stage is computed in chunks within those tiles, and each stage is then vectorized. This requires one line of scheduling choices per pipeline stage. With these transformations, our implementation takes 741 ms to process a 5 megapixel raw image on a Nokia N900, while the Frankencamera implementation takes 772 ms. We specify the algorithm in 145 lines of code, and the schedule in 23 (see supplemental material). The Frankencamera code uses 463 lines to specify both. Our implementation is also portable, whereas the Frankencamera assembly is entirely platform-specific. The same Halide code compiles to multithreaded x86 SSE code, which takes 51 ms on our quad-core desktop.
6.2 Local Laplacian Filters

One of the most important tasks in producing compelling photographic images is adjusting local contrast. Paris et al. [2011] introduced local Laplacian filters for this purpose. The technique was then modified and accelerated by Aubry et al. [2011] (Fig. 7). This algorithm exhibits a high degree of data parallelism, which the original authors took advantage of to produce an optimized implementation using a combination of Intel Performance Primitives [IPP] and OpenMP [OpenMP].

We implemented this algorithm in Halide, and explored several strategies for scheduling it efficiently on several different machines. The statement of the algorithm did not change during the exploration of the space of plausible schedules. We found that on several x86 platforms, the best performance came from a complex schedule involving inlining certain stages, and vectorizing and parallelizing the rest. Using this schedule on our quad-core desktop, processing a 4 megapixel image takes 293 ms. On the same processor the hand-optimized version used by Aubry et al. takes 627 ms. The reference implementation requires 262 lines of C++, while in Halide the same algorithm is 62 lines. The schedule is specified using seven lines of code.
A schedule equivalent to naive C, with all major stages scheduled as root, performs much less redundant computation than the fastest schedule, but takes 1.5 seconds because it sacrifices producer-consumer locality and is limited by memory bandwidth. The best schedule on a dual-core ARM OMAP4 processor is slightly different. While the same stages should be inlined, vectorization is not worth the extra instructions, as the algorithm is bandwidth-bound rather than compute-bound. On the ARM processor the algorithm takes 5.5 seconds with vectorization and 4.2 seconds without. Naive evaluation takes 9.7 seconds. The best schedule for the ARM takes 427 ms on the x86, 50% slower than the best x86 schedule. (A range of schedule choices from our exploration, along with their performance on several architectures, is shown at the end of this application's source code in our supplemental material.)

Figure 7: The local Laplacian filter enhances local contrast using Gaussian and Laplacian image pyramids. The pipeline mixes images at different resolutions with a complex network of dependencies. While we show three pyramid levels here, for our four megapixel test image we used eight.
This algorithm maps well to the GPU, where processing the same four-megapixel image takes only 49 milliseconds. The best schedule evaluates most stages as root, but fully fuses (inlines) all of the Laplacian pyramid levels wherever they are used, trading increased computation for reduced bandwidth and storage. Each stage is split into 32x32 tiles that each map to a single CUDA block. The same algorithm statement then compiles to 83 total invocations of 25 distinct CUDA kernels, combined with host CPU code that precomputes lookup tables, manages device memory and data movement, and synchronizes the long chain of kernel invocations. Writing such code by hand is a daunting prospect, and would not allow for the rapid performance-space exploration that Halide provides.
6.3 The Bilateral Grid

The bilateral filter [Paris et al. 2009] is used to decompose images into local and global details. It is efficiently computed with the bilateral grid algorithm [Chen et al. 2007; Paris and Durand 2009]. This pipeline combines three different types of operation (Fig. 8). First, the grid is constructed with a reduction, in which a weighted histogram is computed over each tile of the input. These weighted histograms become columns of the grid, which is then blurred with a small-footprint filter. Finally, the grid is sampled using trilinear interpolation at irregular, data-dependent locations to produce the output image.
We implemented this algorithm in Halide and found that the best schedule for the CPU simply parallelizes each stage across an appropriate axis. The only stage regular enough to benefit from vectorization is the small-footprint blur, but for commonly used filter sizes the time taken by the blur is insignificant. Using this schedule on our quad-core x86 CPU, we compute a bilateral filter of a four megapixel input using typical filter parameters (spatial standard deviation of 8 pixels, range standard deviation of 0.1) in 80 ms. In comparison, the moderately-optimized C++ version provided by Paris and Durand [2009] takes 472 ms using a single thread on the same machine. Our single-threaded runtime is 254 ms; some of our speedup is due to parallelizing, and some is due to generating superior scalar code. We use 34 lines of code to describe the algorithm, and 6 to describe its schedule, compared to 122 lines in the C++ reference.

Figure 8: The bilateral filter smoothes detail without losing strong edges. It is useful for a variety of photographic applications including tone-mapping and local contrast enhancement. The bilateral grid computes a fast bilateral filter by scattering the input image onto a coarse three-dimensional grid using a reduction. This grid is blurred, and then sampled to produce the smoothed output.
We first tried running the same algorithm on the GPU using a schedule which performs the reduction over each tile of the input image on a single CUDA block, with each thread within a block responsible for one input pixel. Halide detected the parallel reduction, and automatically inserted atomic floating-point adds to memory. The speed was unimpressive: only 2× faster than our optimized CPU code, due to the high degree of contention. We modified the schedule to use one thread per tile of the input, with each thread walking serially over the reduction domain. This one-line change in schedule gives us a runtime of 11 ms for the same four megapixel image. Halide still conservatively injects an atomic add operation, but with no contention this is in fact 10% faster than non-atomic floating-point read-modify-writes to the same memory location, as it offloads the addition to a separate unit. The same 34-line Halide algorithm now runs over 40× faster than the more verbose reference C++ implementation.
6.4 Image Segmentation using Level Sets

Active contour selection (a.k.a. snakes [Kass et al. 1988]) is a method for segmenting objects from a background (Fig. 9). It is well suited for medical applications. We implemented the algorithm proposed by Li et al. [2010]. The algorithm is iterative, and can be interpreted as a gradient-descent optimization of a 2D function. Each update of this function is composed of three terms (Fig. 9), each of them a combination of differential quantities computed with small 3 × 1 and 1 × 3 stencils, and point-wise nonlinear operations, such as normalizing the gradients.
We factored this algorithm into three feed-forward pipelines. Two simple pipelines create images that are invariant to the optimization loop, and one primary pipeline performs a single iteration of the optimization loop. While Halide can represent bounded iteration over the outer loop using a reduction, it is more naturally expressed in the imperative host language. We construct and chain together these pipelines at runtime, using Halide as a just-in-time compiler, in order to perform a fair evaluation against the reference implementation from Li et al., which is written in MATLAB. MATLAB is notoriously slow when misused, but this code expresses all operations in the data-parallel, array-wise notation that MATLAB executes most efficiently.
On a 1600 × 1200 test image, our Halide implementation takes 55 ms per iteration of the optimization loop on our quad-core x86, whereas the MATLAB implementation takes 3.8 seconds. Our schedule is expressed in a single line: we parallelize and vectorize the output of each iteration, while leaving every other function to be inlined by default. The bulk of the speedup comes not from vectorizing or parallelizing; without them, our implementation still takes just 202 ms per iteration. The biggest difference is that we have completely fused the operations that make up one iteration. MATLAB expresses algorithms as sequences of many simple array-wise operations, and is heavily limited by memory bandwidth. It is equivalent to scheduling every operation as root, which is a poor choice for algorithms like this one.

The fully-fused form of this algorithm is also ideal for the GPU, where it takes 3 ms per iteration.

Figure 9: Adaptive contours segment objects from the background. Level-set approaches are particularly useful to cope with smooth objects and when the number of elements is unknown. The algorithm iterates a series of differential operators and nonlinear functions to progressively refine the selection. The final result is a set of curves that tightly delineate the objects of interest (in red on the right image).
6.5 Discussion and Future Work

The performance gains we have found in this section demonstrate the feasibility and power of separating algorithms from their schedules. Changing the schedule enables a single algorithm definition to achieve high performance on a diversity of targets. On a single machine, it enables rapid performance-space exploration. Furthermore, the algorithm specification becomes considerably more concise once scheduling concerns are separated.

While the set of scheduling choices we enumerate proved sufficient for these applications, there are other interesting options that our representation could incorporate, such as sliding window schedules, in which multiple evaluations are interleaved to reduce storage, or dynamic schedules, in which functions are computed lazily and then cached for reuse. We are also exploring autotuning and heuristic optimization enabled by our ability to enumerate the space of legal schedules. We further believe we can continue to clarify the algorithm specification with more aggressive inference.

Some image processing algorithms include constructs beyond the capabilities of our current representation, such as non-image data structures like lists and graphs, and optimization algorithms that use iteration-until-convergence. We believe that these and other patterns can also be decoupled from their schedules, but this remains future work.
7 Conclusion
Image processing pipelines are simultaneously deep and wide; they contain many simple stages that operate on large amounts of data. As a result, the gap between a naive schedule and a highly parallel execution that makes efficient use of the memory hierarchy is large, often an order of magnitude. And speed matters for image processing. People expect image processing that is interactive, that runs on their cell phone or their camera. An order of magnitude in speed is often the difference between an algorithm being used in practice, or not at all.
With existing tools, closing this gap requires ninja programming skills; imaging pipelines must be painstakingly and globally transformed to simultaneously maximize parallelism and memory efficiency. The resulting code is often impossible to modify, reuse, or port efficiently to other processors. In this paper we have demonstrated that it is possible to earn this order of magnitude with less programmer pain, by separately specifying the algorithm and its schedule: the decisions about ordering of computation and storage that are critical for performance but irrelevant to correctness. Decoupling the algorithm from its schedule has allowed us to compile simple expressions of complex image processing pipelines to implementations with state-of-the-art performance across a diversity of devices.
References
ADAMS, A., TALVALA, E.-V., PARK, S. H., JACOBS, D. E., AJDIN, B., GELFAND, N., DOLSON, J., VAQUERO, D., BAEK, J., TICO, M., LENSCH, H. P. A., MATUSIK, W., PULLI, K., HOROWITZ, M., AND LEVOY, M. 2010. The Frankencamera: An experimental platform for computational photography. ACM Transactions on Graphics 29, 4 (July), 29:1–29:12.

ARBB. Intel Array Building Blocks. http://software.intel.com/en-us/articles/intel-array-building-blocks/.

AUBRY, M., PARIS, S., HASINOFF, S. W., KAUTZ, J., AND DURAND, F. 2011. Fast and robust pyramid-based image processing. Tech. Rep. MIT-CSAIL-TR-2011-049, Massachusetts Institute of Technology.

BUCK, I. 2007. GPU computing: Programming a massively parallel processor. In CGO ’07: Proceedings of the International Symposium on Code Generation and Optimization, IEEE Computer Society, Washington, DC, USA, 17.

CHEN, J., PARIS, S., AND DURAND, F. 2007. Real-time edge-aware image processing with the bilateral grid. ACM Transactions on Graphics 26, 3 (July), 103:1–103:9.

COREIMAGE. Apple Core Image programming guide. http://developer.apple.com/library/mac/#documentation/GraphicsImaging/Conceptual/CoreImaging/ci_intro/ci_intro.html.

ELLIOTT, C. 2001. Functional image synthesis. In Proceedings of Bridges.

FEAUTRIER, P. 1991. Dataflow analysis of array and scalar references. International Journal of Parallel Programming 20.

IPP. Intel Integrated Performance Primitives. http://software.intel.com/en-us/articles/intel-ipp/.

KASS, M., WITKIN, A., AND TERZOPOULOS, D. 1988. Snakes: Active contour models. International Journal of Computer Vision 1, 4.

LI, C., XU, C., GUI, C., AND FOX, M. D. 2010. Distance regularized level set evolution and its application to image segmentation. IEEE Transactions on Image Processing 19, 12 (December), 3243–3254.

LLVM. The LLVM compiler infrastructure. http://llvm.org.

MCCOOL, M. D., QIN, Z., AND POPA, T. S. 2002. Shader metaprogramming. In Graphics Hardware 2002, 57–68.

OPENCL. The OpenCL specification, version 1.2. http://www.khronos.org/registry/cl/specs/opencl-1.2.pdf.

OPENMP. OpenMP. http://openmp.org/.

PARIS, S., AND DURAND, F. 2009. A fast approximation of the bilateral filter using a signal processing approach. International Journal of Computer Vision 81, 1, 24–52.

PARIS, S., KORNPROBST, P., TUMBLIN, J., AND DURAND, F. 2009. Bilateral filtering: Theory and applications. Foundations and Trends in Computer Graphics and Vision.

PARIS, S., HASINOFF, S. W., AND KAUTZ, J. 2011. Local Laplacian filters: Edge-aware image processing with a Laplacian pyramid. ACM Transactions on Graphics 30, 4.

PIXELBENDER. Adobe Pixel Bender reference. http://www.adobe.com/content/dam/Adobe/en/devnet/pixelbender/pdfs/pixelbender_reference.pdf.

PUSCHEL, M., MOURA, J. M. F., JOHNSON, J., PADUA, D., VELOSO, M., SINGER, B., XIONG, J., FRANCHETTI, F., GACIC, A., VORONENKO, Y., CHEN, K., JOHNSON, R. W., AND RIZZOLO, N. 2005. SPIRAL: Code generation for DSP transforms. Proceedings of the IEEE, special issue on “Program Generation, Optimization, and Adaptation” 93, 2, 232–275.

SHANTZIS, M. A. 1994. A model for efficient and flexible image computing. In Proceedings of the 21st annual conference on Computer graphics and interactive techniques, ACM, New York, NY, USA, SIGGRAPH ’94, 147–154.