
Project Final Report

High Performance Pipeline Compiler

Yong He, Yan Gu

1 Introduction

Writing stream processing programs directly in low level languages, such as C++, is tedious and bug prone. Many systems have been developed to simplify the programming of stream processing applications and to distribute computation to various devices (e.g. a GPU or a cluster), such as GRAMPS [3], StreamIt [4], BSGP [1], Storm [2] and DryadLINQ [5]. These systems propose a set of general purpose programming constructs to hide the complexity of buffer management and computation scheduling. However, all of them compromise performance in certain ways to remain general purpose.

We recognized a special type of stream processing which can be compiled into much faster code: pipelines. Unlike general purpose stream processing, which models computation as a directed acyclic graph of computation nodes (kernels), pipelines are formed by chaining kernels without introducing complex inter-kernel dataflow. As a result, scheduling and buffer management can be done more efficiently, making it possible to generate high performance code that runs as fast as hand tuned C++ code.

One example of a pipeline application is the rendering pipeline. In a rendering pipeline, a set of vertices is provided as input to the system. The pipeline computes internal triangle representations from these vertices, clips them against the screen, rasterizes them into pixels and finally shades the pixels to create an image. Traditionally, a rendering pipeline is written by hand and heavily tuned for the highest possible performance. As a result, real-world intricacies such as limited buffer sizes and low-level optimizations make it extremely hard to reshape the pipeline or to change its schedule.

We designed a new programming language for writing pipeline applications that compile to highly parallelized code with performance rivaling hand tuned code. Our language extends the C syntax to express stages (computation nodes in our system), buffers used to synchronize between stages, and pipelines assembled by chaining stages and buffers. The compiler analyzes the stage code and stage connections to figure out the most efficient scheduling for the given pipeline, and generates a highly parallelized pipeline implementation with performance comparable to hand tuned code. Programs written in our pipeline specification language are compact and easy to maintain (our sample rendering pipeline specification contains only 143 lines of code and is compiled to 756 lines of C++ code). Therefore it is very easy to change the pipeline and experiment with different schedules using our language.


2 Programming Model Design

We start from a simple C-like language without pointers, and extend it with the following constructs:

Stage. A stage is a kernel function that takes one or more items from its input stream, and emits zero or more items to its output stream. The number of input items must be statically defined, while the number of output items can be fully dynamic at runtime. A stage should contain one or more emit statements that push an item to its output stream.

Pipe. A standard pipe is a place where stages take inputs from and write outputs to. Pipes guarantee the correct ordering between items. For example, if a pipe containing items [x1, x2, x3] is passed to a stage ({y1(xi), · · · , yn(xi)} = f(xi)) that outputs to another pipe, the resulting pipe will contain items [y1(x1), · · · , yn(x1), y1(x2), · · · , yn(x2), y1(x3), · · · , yn(x3)].

There are two main variants of the standard pipe: source and sink. A source is the input to the whole pipeline, and a sink is the output. When linked to a host application, the user of a pipeline must provide data to each source and may optionally register callbacks to handle data streamed into a sink.

Grid. A grid is treated as a special kind of pipe by the system, except that it represents a fixed-dimension, fixed-size array of cells. A grid can have 1, 2 or 3 dimensions. Only items marked as Element1D, Element2D or Element3D can be piped to grids with the corresponding dimensions. Stages that emit items to a grid must specify the index of each item, and the item will be output to the specified location in the grid. If the specified location already contains an item, a merge function is called to merge the incoming item and the existing item into a new item, and the new item is stored.

References. Items manipulated by stages can contain references to an item from a previous stage. The existence of references affects the scheduling of a pipeline, because the referenced item stored in the output buffer of the previous stage must be kept alive while the stage that uses it is executed. Most stream processing systems do not allow such references, because they make scheduling difficult and make it very hard to generate distributed implementations. However, references are crucial in generating high performance code that runs on a single machine, and we choose to support them in order to reduce bandwidth and memory copying overhead.

Pipeline assembler. When all stages and internal pipes/grids have been defined, a pipeline assembler specifies how to assemble every piece together to form a pipeline.

2.1 Example Pipeline

The following example code, written in the pipeline specification language, demonstrates several important constructs. The pipeline takes a stream of integers; for each integer x, the Square stage emits x · x and 2x to the output stream, which is handled by external code that prints the result.


parallel stage Square() : int>>int
{
    emit in*in;
    emit in*in*2;
}
Source<int> input;
Sink<int> output;
pipe SimplePipe: input>>Square()>>output;

The pipeline compiler takes this code and produces a C++ class that represents the pipeline.

The interface of generated C++ code is shown below.

#ifndef OUT_H
#define OUT_H
#ifndef PIPE_SYSTEM_HEADER
#define PIPE_SYSTEM_HEADER
class SinkHandler
{
public:
    virtual void ProcessBuffer(void * data, int count) = 0;
};
#endif
class SimplePipe
{
public:
    virtual ~SimplePipe() {}
    virtual void Flush() = 0;
    virtual void SetInput_input(void * data, int count) = 0;
    virtual void SetSinkHandler_output(SinkHandler * handler) = 0;
};
SimplePipe * CreateSimplePipe();
void DestroySimplePipe(SimplePipe * obj);
#endif

To use the pipeline, the host application calls CreateSimplePipe() to create an instance of the pipeline. It then calls SimplePipe::SetInput_input() to feed the source with an input stream of integers. Optionally, it can call SimplePipe::SetSinkHandler_output() to register a callback for the sink. Finally, it calls SimplePipe::Flush() to run the pipeline.

3 Scheduling Pipelines

The compiler schedules pipelines onto current CPU architectures. The resulting schedule is expected to be fully parallelized and as efficient as hand tuned code. We only exploit intra-stage parallelism, i.e. we do not try to run two different stages at the same time, since proper handling of the producer-consumer relationship requires fine grained synchronization between threads, which is costly on current hardware.


3.1 Scheduling pipe connections

When two stages (S1 and S2) are connected by a pipe, a fixed-size buffer (referred to as B) is allocated to store outputs from S1. Before S1 is invoked, we run a pre-pass to determine the maximum number of inputs S1 can consume, so that the outputs from S1 fit in B. This can be done by generating a pre-pass version of each stage kernel that returns only the number of output items for a given input item X. Then, the pre-pass stage kernel of S1 is called in parallel on a segment of the input stream at a time to collect output rates for each element in the segment. If buffer B is large enough to hold the entire segment, the next segment is fetched to the pre-pass kernels, until B is full. After the number of consumable input items is determined, we pre-compute the output location for each input item by doing a prefix sum over the output rates computed in the pre-pass. We then run the stage kernel for these items in parallel, generating the actual outputs and storing them in B, as shown in Figure 1.


Figure 1: Scheduling of a pipe connection.

Because the buffer size is limited, a stage may not be able to consume all items in its input stream at once, so the input stream must be split into many batches. In this case, the stage fetches the first batch of input, does the computation and fills its output buffer, calls the rest of the pipeline to drain the output buffer, and then fetches the next batch of input and repeats the process. Since the same strategy applies to all following stages, the generated code takes the shape of nested loops, with each stage lying at a deeper level.

3.2 Scheduling grid connections

Unlike pipes, grids have different semantics: a grid can merge two items at the same location, and hence it is not constrained by the number of incoming items. It is wise to accumulate as many items as possible in a grid before continuing to the rest of the pipeline, because accumulating more items into the grid gives more chances for items targeting the same location to be merged, saving computation. In the extreme case, we can delay the execution of the following stages until all initial inputs are drained. However, this is possible only when the rest of the pipeline does not reference any items produced in previous stages. If such a reference exists, then the referenced buffer cannot be flushed to accommodate the next batch of input, and we must schedule the rest of the pipeline in order to drain the referenced buffer. In short, we would like to delay the stages behind a grid as much as possible, but no later than the latest end of lifetime of the referenced items. Figure 2 demonstrates an example of grid scheduling. In this pipeline, stage 3 accesses items in pipe 1 through references, so items in pipe 1 must not be flushed before stage 3 has executed and finished processing all the references. The resulting schedule is:

1. run stage 1 to process a batch of items from source and fill the buffer of pipe 1;

2. run stage 2 to process a batch of items in pipe 1;

3. repeat step 2 multiple times until everything in pipe 1 is processed;

4. run stage 3 to consume the grid, possibly accessing items in pipe 1;

5. goto step 1 and process next batch of items in source.

On the other hand, note that if stage 3 does not use any references into pipe 1, we can keep running stage 1 and stage 2 until every item in the source has been processed, and then run stage 3 only once to finish the pipeline.

Source >> Stage 1 >> Pipe 1 >> Stage 2 >> Grid >> Stage 3 >> Sink

Figure 2: An example pipeline containing grid and references.

The compiler relies on inter-stage dependency analysis to generate such a schedule. In the first step, the compiler analyzes each stage to identify the dependent fields in its input items. In the second step, inter-stage information containing the source of references is propagated through the pipeline to figure out the actual dependences. We have carefully designed the language semantics to make this analysis easy. First, references must be passed through the pipeline. A stage can produce a structure containing reference fields. Reference fields must be initialized when the structure is defined, and only input items to a stage can be assigned to a reference field. For the example in Figure 2, the only way for stage 3 to get access to pipe 1 is to have stage 2 pass a reference of its input to stage 3. To do so, stage 2 must


return a structure containing a reference field. Stage 2 assigns its input to this reference field and emits the structure to the grid. When stage 3 fetches the structure from the grid, it can use this reference field to access the item in pipe 1. Since programs can only be written this way, the compiler knows that when a structure containing a reference field is initialized and emitted, the stage creates a reference to its input stream. By propagating this information through the pipeline, the stage that uses a reference can know where the reference comes from.

4 Optimizations

When scheduling pipe connections, the pre-pass can be greatly simplified if we know that a stage's output rate is invariant to its input, i.e. the stage produces the same number of output items regardless of the actual input value. In that case, by running the stage's pre-pass code only once to get the exact output rate, we immediately know the maximum number of consumable inputs by dividing the output buffer size by this output rate. This avoids executing the pre-pass kernel once per input item, reducing the complexity of a pre-pass from O(n) to O(1).

To determine whether a stage has an invariant output rate, we run a dead code elimination style analysis on the stage code. Initially, each emit instruction is marked as alive, but we do not mark the emitted variable. The rest of the analysis is exactly the same as standard dead code elimination, except that we do not actually remove the instructions. If any instruction that reads the input is marked as alive, the analyzer returns "output variant"; otherwise it returns "output invariant".

Because the language has arrays and structures (which are compiled into pointer instructions), we run an alias analysis beforehand to get a correct result.

In our language, stages cannot interact with the external context other than by emitting items to their output streams. Therefore, all functions are side-effect free and we can apply more aggressive dead code elimination. We found that the C++ compiler fails to recognize this fact, so we make our compiler perform dead code elimination before generating C++ code. This is particularly useful in generating stage pre-pass code, because in many cases the return values of external functions do not determine the output rate.

5 Implementation

Our implementation of the pipeline compiler contains approximately 11,000 lines of C++ code. The compiler does not rely on any third party projects or libraries. The implementation includes comprehensive semantics checking, our self-designed IL (with a C++ code emitter), control flow graph utilities, pointer alias analysis and a dead code elimination optimizer. We have also developed a control flow graph visualizer for debugging purposes, as shown in Figure 3.


Figure 3: Our control flow graph visualizer tool.

6 Evaluation

To evaluate the performance of a compiled pipeline, we implemented a simple rendering pipeline using our pipeline specification language. The code of the rendering pipeline in our pipeline specification language is included in the appendix. Figure 4 shows the renderer rendering the "Sibenik Cathedral" scene containing 75,284 triangles.

Our compiler offers an option to generate performance instrumentation code for benchmark purposes. This data can be used to compute the scheduling overhead. We measured the overheads of all four stages in the rendering pipeline; the performance and scheduling overhead under different optimization settings are shown in Figure 5. We ran the experiments on a PC with an Intel Core i7-3820 3.0GHz quad core CPU and 16GB memory.

The renderer generated by our compiler achieves performance matching a hand tuned renderer. Our previous heavily optimized renderer finishes the frame shown in Figure 4 at 35fps, while the generated renderer achieves 29fps. Note that our hand tuned renderer exploits SIMD instructions for further parallelism, while the simple rendering pipeline implemented in the pipeline specification language does not leverage SIMD. We expect that a proper SIMD implementation in our language will achieve equal or better performance than our previous renderer.

Figure 4: The compiled renderer running in action.

7 Surprises

We initially thought that optimizations such as dead-code elimination could be well handled by the C++ compiler, so we did not implement this optimization in our compiler. However, when we generated C++ code and checked the compiled assembly, we found that the C++ compiler failed to recognize that the called external function is side-effect free and did not apply the expected optimizations, and the resulting performance was unsatisfactory. We then decided to implement dead code elimination ourselves, which added another 2,000 lines of code to our system, since our compiler does not depend on any existing compiler framework such as LLVM. Fortunately, we still managed to finish it in time.

We had not planned invariant output rate analysis and the simplified scheduling for this special type of stage until we compiled our first renderer and discovered that stages with invariant output rates are actually common in many graphics applications.



Figure 5: Runtime performance of the compiled rendering pipeline at different optimization settings, for the four stages (shade vertex, assemble triangle, rasterize, shade fragment). Left: unoptimized; middle: dead-code elimination on stage pre-pass code; right: simplified pre-pass for stages with invariant output rate, plus dead-code elimination on stage pre-pass code. The dark colored region represents time spent in core computation, and the entire bar represents total stage time.

8 Conclusion

In this project, we proposed a new programming language for pipeline applications. Our compiler is able to analyze a pipeline and generate a high quality implementation. With our language, it is much easier to experiment with different variations of pipelines and study their performance behavior. For example, the programmer can switch to another schedule by simply changing the type of connection between stages, or by changing the way data is passed through the pipeline (by reference or by value); the compiler does all the rest to generate the best implementation.

For future work, we would like to extend the system to support more types of connections, and we also want to study how to schedule pipelines onto next generation heterogeneous architectures, where the CPU and GPU are placed on the same chip and share caches. In that case, being able to schedule different stages to different computation cores at the same time becomes critical.

9 Credit distribution

We believe that we have done more than 125% of our initially expected work, and the total

credit should be distributed equally among the authors.


References

[1] Qiming Hou, Kun Zhou, and Baining Guo. BSGP: bulk-synchronous GPU programming. ACM Trans. Graph., page 12, 2008.

[2] https://github.com/nathanmarz/storm/wiki.

[3] Jeremy Sugerman, Kayvon Fatahalian, Solomon Boulos, Kurt Akeley, and Pat Hanrahan. GRAMPS: A programming model for graphics pipelines. ACM Trans. Graph., 28(1):4:1–4:11, February 2009.

[4] William Thies, Michal Karczmarek, and Saman P. Amarasinghe. StreamIt: A language for streaming applications. In Proceedings of the 11th International Conference on Compiler Construction, CC '02, pages 179–196, London, UK, 2002. Springer-Verlag.

[5] Yuan Yu, Michael Isard, Dennis Fetterly, Mihai Budiu, Ulfar Erlingsson, Pradeep Kumar Gunda, and Jon Currey. DryadLINQ: a system for general-purpose distributed data-parallel computing using a high-level language. In Proceedings of the 8th USENIX Conference on Operating Systems Design and Implementation, OSDI '08, pages 1–14, Berkeley, CA, USA, 2008. USENIX Association.


Appendix: source code of a simple rendering pipeline

#include "RendererUtilities.h"

Source<int> Indices;
in float[] Vertices;
in int VertexAttributeSize;
in int VertexShaderOutputSize;
in int RenderState; // pointer to render state
in int ScreenWidth;
in int ScreenHeight;

struct Triangle
{
    float a0, b0, c0;
    float a1, b1, c1;
    float Z0, Z1, Z2;
    float divisor;
    int IsClipped;
    float[3] triCoord0, triCoord1, triCoord2; // clip vertex coordinates
    int MinX, MaxX;
    int MinY, MaxY;
    ref float[] vertex; // reference to vertices
}

struct Fragment : Element2D
{
    float alpha, beta, gamma, z;
    ref Triangle triangle;
}

struct Color : Element2D
{
    float R, G, B, A, Z;
}

extern void RunVertexShader(int renderState, float[] result, float[] vertex);
extern int ClipTriangle(int renderState, float[] vertices, Triangle[] triangles);
extern int ClipTriangle_Count(int renderState, float[] vertices);
extern Color RunFragmentShader(int renderState, Fragment f);

parallel stage ShadeVertex() : int >> float
{
    float[256] buffer;
    RunVertexShader(RenderState, buffer, Vertices + in * VertexAttributeSize);
    for (int i = 0; i < VertexShaderOutputSize; i++)
        emit buffer[i];
}

parallel stage AssembleTriangle() : float[VertexShaderOutputSize*3] >> Triangle
{
    claim ClipTriangle_Count(RenderState, in);
    Triangle[7] triangles;
    int i;
    for (i = 0; i < 7; i++)
        triangles[i] = Triangle{vertex:in};
    int numTriangles = ClipTriangle(RenderState, in, triangles);
    for (i = 0; i < numTriangles; i++)
        emit triangles[i];
}

parallel stage Rasterize() : Triangle >> Fragment
{
    float invW1 = 1.0f/in.vertex[3];
    float invW2 = 1.0f/in.vertex[3+VertexShaderOutputSize];
    float invW3 = 1.0f/in.vertex[3+VertexShaderOutputSize*2];
    float divisor = 1.0f/in.divisor;
    for (int i = in.MinX; i <= in.MaxX; i++)
        for (int j = in.MinY; j <= in.MaxY; j++)
        {
            Fragment f = Fragment{triangle:in};
            f.index0 = i;
            f.index1 = j;
            float x = i + 0.5f;
            float y = j + 0.5f;
            f.beta = in.a0*x + in.b0*y + in.c0;
            if (f.beta < 0.0f) continue;
            f.gamma = in.a1*x + in.b1*y + in.c1;
            if (f.gamma < 0.0f) continue;
            f.alpha = in.divisor - f.beta - f.gamma;
            if (f.alpha < 0.0f) continue;
            f.beta *= divisor;
            f.gamma *= divisor;
            f.alpha *= divisor;
            f.alpha = invW1*f.alpha;
            f.beta = invW2*f.beta;
            f.gamma = invW3*f.gamma;
            float interInvW = 1.0f/(f.alpha+f.beta+f.gamma);
            f.alpha *= interInvW;
            f.beta *= interInvW;
            f.gamma *= interInvW;
            f.z = in.Z0*f.alpha + in.Z1*f.beta + in.Z2*f.gamma;
            emit f;
        }
}

parallel stage ShadeFragment() : Fragment >> Color
{
    Color rs = RunFragmentShader(RenderState, in);
    rs.index0 = in.index0;
    rs.index1 = in.index1;
    rs.Z = in.z;
    emit rs;
}

void zMerge(ref Fragment f, Fragment newFrag)
{
    if (newFrag.z < f.z)
        f = newFrag;
}

Pipe<float> shadedVertices;
Pipe<Triangle> clippedTriangles;
Grid<Fragment> gbuffer(ScreenWidth, ScreenHeight, zMerge);
Sink<Color> image;

pipe RenderPipe: Indices >> ShadeVertex() >> shadedVertices
              >> AssembleTriangle() >> clippedTriangles
              >> Rasterize() >> gbuffer
              >> ShadeFragment() >> image;