Parallel Tessellation Using Compute Shaders

Group 1:

David Sierra

Matthew Faller

Erwin Holzhauser

April 27th, 2015

Sponsored by:


Table of Contents

Executive Summary
Project Motivation
Uses of Tessellation
About Tessellation
Tessellation Hardware
Specifications and Requirements
Implementation Specification
Metrics
Research
Integrated Development Environment Choice
Code::Blocks (Codeblocks)
Dev-C++
Eclipse
Microsoft Visual Studio
Various GPU Programming Languages
C++ AMP
OpenCL
DirectCompute (DX Compute)
CUDA
Microsoft Reference Rasterizer
OpenGL Specification
Detailed Design
Tessellating Isolines
General Overview
Input Description
Output Description
Processing tessellation factors
Point Generation
PlacePointIn1D
Point Connectivity
Proposed Parallelism Technique Design
Parallelizing Point Generation
Parallelizing Point Connectivity
Attempted Isoline Parallelization Techniques
Run Times of the various implementations
Tessellation of Triangles
Processing Tessellation Factors
Parallel Triangle Tessellation Design
General Overview of Quads
Input Description
Output Description
Process Tessellation Factors
Parallel Quad Tessellation Design
High Level Design
Detailed Design
Processing Tessellation Factors
Attempted Parallel Implementations
Experimental Results:
Design Summary
Isolines
Triangles
Quads
Project Administration
Facilities and Equipment
Personal Work
Erwin Holzhauser
Matthew Faller
David Sierra
Lessons Learned
Erwin Holzhauser
Matthew Faller
David Sierra
Project Plan and Milestones
Testing Methodology
Testing Harness
Test Cases
Error Reporting Conventions
Project Summary and Conclusions


Executive Summary

Tessellation is a process in which low detail surfaces are subdivided into higher

detail surfaces. This allows developers to save memory and bandwidth by only

having them supply low detail models over the memory bus. Then when they need

higher detail models, the graphics card can dynamically generate additional detail

in real time.

AMD and other hardware vendors usually implement this functionality in fixed

function hardware on the GPU die. This hardware is extremely fast and efficient

but it can only be used for tessellation. This means that when the hardware isn’t

tessellating, it is sitting idle and doing nothing. The scope of our project is to explore

a more general purpose software approach using the general purpose shading

units on the GPU.

The main goal of our project is to implement the entirety of the tessellator’s logic as a compute shader. Our project sponsor, Advanced Micro Devices, Inc. (AMD), imposed a language choice of Microsoft’s DirectCompute because they believed it would be the easiest to use overall. DirectCompute is an API, released alongside DirectX 11, for programming the graphics card’s general purpose shader units; its shaders are written in HLSL. Our main deliverable is therefore a collection of DirectCompute source files (.hlsl) capable of handling all input values the tessellator can possibly expect and generating the correct output for each case.

Our other expected deliverable is a detailed performance report comparing the

performance of our software implementation to the fixed function hardware’s

performance. The performance metric AMD is expecting us to measure is triangles

per second generated by the tessellator. All development will be done using

Microsoft Visual Studio because of its tight integration with DirectCompute making

it easier to develop and profile our code.

Figure 1: In this image we're tessellating a low resolution teapot. The face selected in red is a quad. Source: http://caig.cs.nctu.edu.tw/


Before the algorithm is described, we will take a moment to describe what a patch is. A patch is a grid of points that maps onto the face of a model and describes how it will look in the 3D world. For two patches to connect, their shared edges must have the same number of points, spaced in exactly the same way. In essence, the outer edges of a patch must match up with those of adjacent patches, while the inside of the patch can be divided in any way.

Figure 2: These 2 radically different patches can connect because their outer points align

The tessellation algorithm takes in a standard set of inputs. The first is the tessellation shape, which can be Isoline, Triangle, or Quad. Each value corresponds to a target shape the algorithm is expected to generate: Isoline is a grid of lines, Triangle is just a triangle, and Quad is a rectangle composed of triangles. The second input is a set of outer tessellation factors, one for each edge. For example, a quad needs 4 outer tessellation factors while a triangle needs only 3. The outer tessellation factors can differ for each edge because the output patch (grid of points) needs to be able to connect to other patches as described above. The third input is a pair of inner tessellation factors. Even though the user inputs 2 inner tessellation factors, Isolines use neither of them and Triangles use only 1. These inner tessellation factors describe how the inside of the shape will be divided.

The last input is the tessellation partitioning mode. This can be Integer, Pow2, Fractional Odd, or Fractional Even. These values describe how the points will be spaced out. Integer and Pow2 modes produce evenly spaced points, while the fractional modes allow a much wider range of input factors.

Figure 3: Fractional Even vs Fractional Odd mode on a quad
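As a rough illustration of how these inputs fit together, the sketch below models them in C++. All type and function names are hypothetical and are not taken from the project's code. The isoline outer factor count of 2 is an assumption based on the isoline input description in the Detailed Design section.

```cpp
#include <array>
#include <cassert>

// Hypothetical names; the report does not show its actual input structures.
enum class Domain { Isoline, Triangle, Quad };
enum class Partitioning { Integer, Pow2, FractionalOdd, FractionalEven };

// Number of outer (edge) tessellation factors each domain uses.
// Quad = 4 and Triangle = 3 come from the text; Isoline = 2 is an assumption
// based on the two factors listed in the isoline input description.
constexpr int outerFactorCount(Domain d) {
    return d == Domain::Quad ? 4 : d == Domain::Triangle ? 3 : 2;
}

// Number of inner tessellation factors each domain actually uses.
constexpr int innerFactorCount(Domain d) {
    return d == Domain::Quad ? 2 : d == Domain::Triangle ? 1 : 0;
}

struct TessellatorInput {
    Domain domain;
    Partitioning partitioning;
    std::array<float, 4> outer;  // only the first outerFactorCount(domain) are read
    std::array<float, 2> inner;  // only the first innerFactorCount(domain) are read
};

// Total number of factors the tessellator reads for a patch of this domain.
int usedFactorCount(const TessellatorInput& in) {
    int n = outerFactorCount(in.domain) + innerFactorCount(in.domain);
    assert(n <= 6);  // 4 outer + 2 inner is the most any domain uses
    return n;
}
```

A quad uses all six slots, which is why a fixed-size layout like this one is enough for every domain.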

The general purpose shaders that we will be programming for contain a very large number of ALUs. The card given to us by AMD contains over 2800 of them. Knowing this, we hope to outperform the specialized hardware by using thousands of ALUs to parallelize our calculations.

Project Motivation

Uses of Tessellation

Tessellation is a powerful feature that can give 3d objects an incredible amount of

detail without loading a large mesh file onto the GPU. Instead a smaller 3d model

that takes up less space is moved to the GPU, freeing up space for other resources

such as textures. Whenever the model is drawn, detail is added via tessellation

before textures and lighting calculations have been applied, giving the same effect

as if the large model had been utilized.

Figure 4: Tessellated Toad. Credit: Crytek, cryengine3 tech demo.


Sometimes, when a model is viewed from a distance, a low level of detail is

acceptable or even preferred since it can be drawn faster. This is a common and

important optimization technique when rendering a complicated scene with

hundreds of meshes.

This is commonly referred to as LOD (level of detail). Traditionally, level of detail had to be handled by creating multiple versions of the same model, each with fewer polygons. Not only is this technique tedious for a 3D artist to implement, it can also be very costly, since creating these additional models is very time consuming. Implementing LOD using tessellation significantly reduces the time the artist needs to spend remaking the same model ad nauseam.

Dynamic tessellation can be used to perform LOD on a per-triangle basis, depending on a number of desired factors. Most often each triangle is tessellated based on its distance from the camera, but the level of tessellation can also be controlled by the angle between its face normal and the camera.
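A distance-based policy of this kind might look like the following sketch. The function name, the near and far distances, and the linear falloff are all illustrative assumptions; the limits of 1 and 64 are the integer-mode range given in Table 1 of the Detailed Design section.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical helper: maps camera distance to a tessellation factor.
// Patches at or closer than nearDist get the maximum factor, patches at or
// beyond farDist get the minimum, with a linear falloff in between.
float distanceToTessFactor(float distance, float nearDist, float farDist) {
    assert(farDist > nearDist);
    const float minFactor = 1.0f;
    const float maxFactor = 64.0f;
    float t = (distance - nearDist) / (farDist - nearDist);
    t = std::min(std::max(t, 0.0f), 1.0f);  // clamp to [0, 1]
    return maxFactor + t * (minFactor - maxFactor);
}
```

In a real pipeline this kind of calculation would live in the hull shader, which is the stage that chooses the factors for each patch.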

Figure 5: Face Normals pointing out away from the mesh into the environment. Credit: http://flylib.com/books/en/2.451.1.14/1/


About Tessellation

Tessellation is a stage in the DirectX pipeline that allows a mesh object to become more complex. In brief, tessellation can give a virtual environment an unprecedented level of high quality visuals. The Direct3D 11 pipeline is split into a series of 8 stages, three of which pertain directly to tessellation: the Hull Shader, Tessellator, and Domain Shader.

The hull shader calculates on a per-patch basis the level of detail needed for the

particular patch. The desired detail is controlled by the tessellation factors that the

hull shader determines. When the hull shader has finished calculating all of the

factors for a patch, the factors are passed to the tessellator.

The tessellator is responsible for generating primitives of three domain types:

Isolines
o A simple line
Triangles
o A simple triangle shape
Quads
o A quadrilateral composed of triangles

[Figure: pipeline diagram showing the Hull Shader Stage, Tessellator Stage, and Domain Shader Stage]

The tessellator subdivides one of these primitive geometries into one that has more segments. In the case of isolines, it produces a line composed of additional points and also outputs multiple displaced instances of the line.

When the tessellator has run to completion, the next stage of the pipeline, the domain shader, is called. The hull shader also has the option of passing the tessellator factors that will cause the entire patch to be culled. In such a case, the tessellator is skipped and the pipeline moves immediately to the domain shader stage.

Figure 6: The original, undivided line is on the left with the new lines on the right hand side.

Figure 7: Output for quads (left) and Triangles (right)


The domain shader takes the UV coordinates (barycentric coordinates for triangles) output by the tessellator and calculates the correct position in 3D space of the new vertex for each of these coordinates. Typically the domain shader uses a curve or surface evaluation algorithm to position the new vertices, such as Bézier, B-spline, or NURBS evaluation.
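To make the domain shader's job concrete, here is a minimal CPU-side sketch of one such evaluation: a cubic Bézier curve evaluated at a tessellator-generated coordinate u using De Casteljau's algorithm. The names Vec3, mix, and evalCubicBezier are hypothetical, and a real domain shader would be written in HLSL rather than C++.

```cpp
#include <cassert>

struct Vec3 { float x, y, z; };

// Linear interpolation between two points.
static Vec3 mix(const Vec3& a, const Vec3& b, float t) {
    return { a.x + (b.x - a.x) * t,
             a.y + (b.y - a.y) * t,
             a.z + (b.z - a.z) * t };
}

// De Casteljau evaluation of a cubic Bezier curve with control points
// p[0..3] at parameter u in [0, 1]. The parameter u plays the role of the
// coordinate the tessellator emits for a point along an isoline.
Vec3 evalCubicBezier(const Vec3 p[4], float u) {
    assert(u >= 0.0f && u <= 1.0f);
    Vec3 a = mix(p[0], p[1], u);
    Vec3 b = mix(p[1], p[2], u);
    Vec3 c = mix(p[2], p[3], u);
    Vec3 d = mix(a, b, u);
    Vec3 e = mix(b, c, u);
    return mix(d, e, u);
}
```

Each tessellated point is evaluated independently of the others, which is what makes this stage a natural fit for the GPU.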

Tessellation Hardware

The primitive generation portion of tessellation is implemented on special fixed-function hardware. This hardware is designed to calculate the primitive point generation and primitive index connectivity quickly, and does an adequate job. However, using fixed hardware of this nature has several downsides:

1. Only one use
Time, effort, and money must go into the design, integration, and testing of complicated hardware that has zero reuse.

2. Limited Bandwidth
The hardware only has a limited amount of throughput and cannot scale when the GPU demands more tessellation.

3. Takes up space on the GPU die
The hardware throughput could increase by taking up extra space on the die with more powerful hardware. However, this would mean more power consumption and less chip space for other more important components. So there is a practical limit to the resources that can be dedicated to this hardware.

Figure 8: The flow of a single isoline patch through the tessellation pipeline (Hull Shader, Tessellator with a tessellation factor of 10, i.e. 10 segments, then Domain Shader). The less detailed isoline is subdivided by the tessellator, then moved by the domain shader into a smooth arc.

Shader Performance Gains

The shaders take advantage of the general purpose computing power now available on modern GPUs via the use of compute languages. These compute shaders run in a similar fashion to pixel and vertex shaders, applying a single instruction concurrently across thousands of pieces of data. Not only could an intelligent implementation be fast, it has the potential to outperform the hardware. In addition, as the general purpose processing cores on the GPU increase in performance, so too will a parallel shader implementation.

Figure 9: B-Spline algorithm with six control points interacting with an isoline. Credit: http://en.wikipedia.org/wiki/B-spline


Specifications and Requirements

Implementation Specification

The implementation needs to process many threads in parallel.
o Must use an HLSL compute shader
o The exact structure will be discussed in detail in a later section, but there are a few ways we might split it up:
    - Give each patch its own thread to perform tessellation.
    - For a given triangle, split up the calculation into many smaller calculations, i.e. divide and conquer in parallel.
    - Operate on a per-patch level, performing each point generation / index connection in parallel.

The system will be faster than the Microsoft Reference Rasterizer.
o This is a very naïve implementation, so hopefully gaining speed over the reference rasterizer will not prove difficult.

The system will be faster than the AMD tessellation hardware.
o It is important that we end up with much higher throughput (maybe an entire mesh can be tessellated faster with our implementation).
o It is worth noting that we also need to not tie up too many resources on the GPU. If our implementation is fast but consumes the entire GPU, it is not acceptable either.

The system will tessellate three domains: isolines, triangles, and quads.
o Each of these has its own tessellation factors that affect how the geometry is tessellated.
o There are also 4 different ways to partition the geometry:
    - Fractional odd
    - Fractional even
    - Integer
    - Power of 2
o 6 tessellation factors per patch

Metrics

Our implementation needs to match the output of the Microsoft Reference Rasterizer bit-for-bit.
o This metric will take our output, the reference output, and run a simple diff to see if there is a match.
Performance will be measured in triangles per GPU clock cycle
Performance will be measured using AMD proprietary diagnostics tools
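The bit-for-bit requirement is stricter than an ordinary floating point comparison. The sketch below shows one way such a check could be written on the CPU side. The function names are hypothetical and this is not the project's actual test harness. It compares raw bytes with std::memcmp, so +0.0 and -0.0 count as different even though they compare equal with ==.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Returns the index of the first element whose bit pattern differs between
// the two buffers, or -1 if every element matches. The buffers must have
// the same length.
std::ptrdiff_t firstBitMismatch(const std::vector<float>& ours,
                                const std::vector<float>& reference) {
    assert(ours.size() == reference.size());
    for (std::size_t i = 0; i < ours.size(); ++i) {
        if (std::memcmp(&ours[i], &reference[i], sizeof(float)) != 0)
            return static_cast<std::ptrdiff_t>(i);
    }
    return -1;
}

// True only when both buffers have the same length and identical bits.
bool bitwiseEqual(const std::vector<float>& ours,
                  const std::vector<float>& reference) {
    return ours.size() == reference.size() &&
           firstBitMismatch(ours, reference) < 0;
}
```

Reporting the index of the first mismatch rather than a plain pass or fail makes it easier to trace a failure back to the point that produced it.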

Research

Integrated Development Environment Choice

The reference rasterizer given to us was written in C++ so our range of IDEs to

choose from was actually quite large. The IDEs we tried were Code::Blocks, Dev-

C++, Eclipse, and Visual Studio.

Code::Blocks (Codeblocks)

The first IDE we tried was Codeblocks. We came to it first because it is used very frequently in the UCF undergrad course track. We also came in knowing that it was a simple IDE for simple projects, but decided to give it an honest try anyway, as it would reduce potential time wasted learning the ins and outs of a new IDE. Although it proved adequate for our simpler projects, once our project scope began to expand and our lines of code started to balloon, Codeblocks struggled to keep up.

Figure 10: Screenshot of Code::Blocks

Dev-C++

The second IDE we tried was Dev-C++. We decided to try it because one of our group members had used it before in a programming class. At first it seemed pretty good, but we soon realized it suffered from the same problem as Codeblocks: it does not scale well to large projects. Even worse, Dev-C++ is now sparsely updated. Not only is this an undesirable trait in general, but GPU programming is a relatively new and growing field and we would like software that can keep up with it.

Figure 11: Screenshot of Dev-C++

Eclipse

Third, we tried Eclipse. Eclipse actually surprised us as a powerful C++ IDE. Our entire group had only known of it as “the IDE from Java class”, so when we found out that it supported C++ and actually had tons of features on top of that we were stoked. After our first meeting with AMD we were told to just have fun exploring OpenCL and HLSL for a while. We initially chose Eclipse and OpenCL because Eclipse is open source, OpenCL is an open standard, and both are cross platform. Working with Eclipse and OpenCL was our group’s first foray into GPU computing and we had very few complaints. Eclipse and OpenCL both had plenty of tutorials and documentation online. The biggest problem we had with Eclipse was not even its fault. During our next meeting at AMD we were told that we would be using Microsoft’s High Level Shading Language (HLSL). This made it very obvious that we would have to learn to use Visual Studio in order to get the most out of the language.

Figure 12: Screenshot of Eclipse

Microsoft Visual Studio

Finally we ended up at Visual Studio 2013. The primary reason we ended up here was that it is tightly integrated with the language we had to use (HLSL). The reason we stayed is that it ended up being everything Eclipse was, but better. The UI was smoother, the auto complete functionality was top notch, and the debugging capabilities blew us away. The killer feature of the debugger is its watch capability. With the watch feature you can mark any variable to be watched while stepping through code. From then on, whenever the variable’s value changes the watch window automatically updates. You can also watch an expression, so that the variable is processed by a function in your code before it is displayed. For example, instead of watching variable x you can watch Math.sqrt(x) and have that value displayed in real time. Visual Studio’s UI was also one of the most customizable we had ever seen. You can partition the window in as many ways as you would like. There is no denying how useful it is to have 5 windows open, all editing the same file, when you have a 5000 line source file that you are trying to dissect and understand.

Another great feature of Visual Studio is its Peek Definition function (Alt + F12). This feature opens a nested window in the code editor that peeks at another function’s definition. It is extremely useful when you have source files that are almost 5000 lines long.

Figure 13: Watching a fixed point number while showing its more readable floating point representation

Figure 14: Visual Studio’s Peek Definition feature

Looking into the future, Visual Studio 2015 is slated to have a new GPU performance profiler allowing us to analyze frame rates, frame times, and GPU utilization. It is nearly impossible to profile a graphics card with the current crop of IDEs unless you have proprietary software from the GPU vendors, so it is very nice that Visual Studio will have one included.

Various GPU Programming Languages

When we were first tasked with toying around with GPU programming we were given a wide range of languages to choose from. We ended up choosing Microsoft’s DX Compute, but we spent some time dabbling in C++ AMP, OpenCL, and even Nvidia’s CUDA.

C++ AMP

AMP is a C++ library developed by Microsoft with the purpose of making it extremely easy to run GPU code from within a C++ program. We would say that they have succeeded with this. In order to run any code, all you need to do is include some headers and call a special function that executes a for loop on the GPU, which takes no more than about 10 lines. The only problem is that you have little if any control over performance, and it is difficult to get advanced functionality out of the library.

Figure 15: Promotional screenshot of Visual Studio's new GPU profiling tool. Source: blogs.msdn.com

OpenCL

OpenCL is a multi-device programming framework developed by Khronos, the same group that develops OpenGL. As such, it is an open standard and cross platform. This is the real reason we tried it after we ruled out AMP. The best thing about it was the wealth of online tutorials covering many kinds of GPU programming. In addition to the tutorials, much of the open source GPU software we found was written in OpenCL. This was convenient because it gave a glimpse into the high level design of GPU applications. Despite how awesome OpenCL was, DX Compute had a killer feature that we were not very willing to give up.

DirectCompute (DX Compute)

Our group did not even know DX Compute existed, and there was a good reason for that: it is mainly a DirectX 11 feature, and DirectX 11 is not extremely popular with developers nowadays. It also is not open source or free to use with enterprise applications. The reason we chose it is that AMD strongly recommended it. First of all, there is an adaptive tessellation example written by Microsoft that we can use as a reference. Also, the reference rasterizer given to us by AMD already defines an interface for our project. This makes it easy for us to perform a variety of tests as we develop our solution. The only thing we do not like about DX Compute is the verbose syntax. It really is a handful for inexperienced programmers.

CUDA

CUDA is Nvidia’s proprietary GPU programming language. Nvidia may be better known to you as AMD’s direct competitor in graphics, and that alone is reason enough for us not to use their language. Nonetheless we decided to try it, and it was actually quite clever. To perform operations on the GPU you use familiar C functions prefixed with cuda. For example, to allocate memory on the GPU you use cudaMalloc, to free memory on the GPU you use cudaFree, and to copy memory to and from the GPU you use cudaMemcpy. As cool as the language is, we sadly were not even allowed to consider it.

Microsoft Reference Rasterizer

The Microsoft reference rasterizer (RefRast) is an app given to us by AMD to help

us visualize and test the tessellation algorithm.

The RefRast is split up into 2 pieces: the OpenGL renderer and the C implementation of the tessellator. The OpenGL renderer’s only real job is to take the output from the tessellator and use the contents of the index and vertex buffers to draw points and lines on the screen. In the background it takes input from the user to control the tessellation factors and feeds them into the tessellator. The RefRast is also the source of most of the figures in this document.

Figure 16: Screenshot of AMD's Reference Rasterizer

The second half of the RefRast is the C implementation of the tessellator. This implementation provides us with a perfectly accurate tessellator that follows the Microsoft spec 100%. In fact, it is the Microsoft spec: a Microsoft employee wrote the C tessellator and gave it to AMD so they could use it to develop hardware. That explains the code’s layout. The code is not written to be efficient on CPUs at all. In fact, here is a direct quote from the comments:

//There is lots of headroom to make this code run faster on CPUs. It was written merely as a reference for what results hardware should produce, with CPU performance not a consideration.

Figure 17: Quote from the tessellator source code

The code is literally written in such a way that you can lay down circuits on a board as you read the code. While this may be fantastic for AMD hardware engineers, it is quite the nightmare for undergraduate computer science students with little to no computer engineering experience.

Anyway, the RefRast contains something very useful for testing: an interface for an HLSL tessellator. This means our code can just implement the interface and hook right into the RefRast’s rendering capabilities. This makes it easy to diff the results of our tessellator against the results of the reference tessellator. We can even render our data as an overlay on top of the reference data to gain visual insight into bugs in our code.

Overall the RefRast is an invaluable tool, both for its hard to read yet highly detailed code and for its extensible rendering capabilities. It is a shame we only got our hands on it about 2 months after we were assigned the project.

OpenGL Specification

The “OpenGL Specification” is a document that describes the OpenGL graphics system. Version 4.5 of the document is freely available from the OpenGL webpage. The document intends to provide information about the nature and behavior of the OpenGL system, along with requirements for implementation. The specification covers tessellation control shaders, primitive generation, and evaluation shaders under the section for programmable vertex processing. The section on primitive generation is relevant to what we are trying to implement.

Along with an overview of primitive generation, the specification delves further into:

Subdivision
Tessellation Types: Triangles, Quads, and Isolines
Partitioning Modes: Equal Spacing, Fractional Even, Fractional Odd

For the tessellation types, the specification discusses which tessellation factors apply to which tessellation types. For the partitioning modes, the range of values to clamp, rounding of tessellation values, and division of segments along edges are provided.

Detailed Design

Tessellating Isolines

General Overview

The isoline tessellator takes input values, processes them and creates a grid of

lines that have been subdivided based on the input values. Isolines are the

simplest form of tessellation as they require only 2 tessellation factors and a

tessellation mode. They are also extremely fast compared to triangles and quads.

Input Description

Isoline tessellation takes only 3 inputs:

Tessellation Factor 1

o A floating point value describing how many segments the horizontal

lines will be made up of

Tessellation Factor 2

o A floating point value describing the number of horizontal lines

o This is always done in integer tessellation mode for isolines

Tessellation Mode

o Describes how the lines generated by the algorithm will be spaced

o Also used as a guideline for processing input values

Figure 18: Sample output grid


Output Description

Isoline tessellation generates 2 output structures:

Vertex Buffer

o Contains a list of points generated in uv coordinates

Index Buffer

o Contains a list of ints

o These ints are stored 2 at a time and correspond to the endpoints of

each line segment

Processing tessellation factors

If tessellation factor 1 or 2 is less than or equal to 0, then the algorithm short-circuits

and returns nothing. Otherwise we must process the tessellation factors

into more useful numbers.

The first step in processing the tessellation factors is to clamp them to their valid

ranges based on the tessellation mode. Below is a table specifying the valid ranges

the tessellation factors will be clamped to:

Table 1: Valid tessellation factor ranges

Tessellation Mode    Valid Range

Integer              [1, 64]

Pow2                 [1, 64]

Fractional Odd       [1, 63]

Fractional Even      [2, 64]

Figure 19: An example of Fractional Odd and Integer partitioning given that both tessellation factors are set to 4.2


If the tessellation mode is set to one of the integer modes (Integer or Pow2) then

both the tessellation factors must be rounded up to the nearest whole number.

The tessellation parity is then stored. Tessellation parity can either be even or odd

and is based on whether the tessellation factor is even or odd.
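The clamping, rounding, and parity steps above can be sketched as follows. This is a minimal illustration, not the reference tessellator's code: the name `process_factor` and the mode strings are hypothetical, and the choice to take the parity from the partitioning mode in the fractional modes is an assumption borrowed from the triangle section of this document. Following the text, Pow2 is only rounded up to a whole number here.

```python
import math

# Valid ranges from Table 1, keyed by (hypothetical) mode names.
VALID_RANGE = {
    "integer": (1.0, 64.0),
    "pow2": (1.0, 64.0),
    "fractional_odd": (1.0, 63.0),
    "fractional_even": (2.0, 64.0),
}


def process_factor(factor, mode):
    """Clamp one tessellation factor and return (factor, parity)."""
    low, high = VALID_RANGE[mode]
    factor = min(max(float(factor), low), high)
    if mode in ("integer", "pow2"):
        # Integer modes round up to the next whole number.
        factor = float(math.ceil(factor))
        parity = "odd" if int(factor) % 2 else "even"
    elif mode == "fractional_odd":
        # Assumption: fractional modes take their parity from the mode.
        parity = "odd"
    else:
        parity = "even"
    return factor, parity
```

For example, a factor of 4.2 in integer mode becomes 5 with odd parity, while the same factor in fractional odd mode stays at 4.2.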

Next, the tessellation factor context is computed for the first tessellation factor. The

tessellation factor context is a struct of numbers that are used repeatedly

throughout the tessellation process. Below is a list describing the values to be

stored in the tessellation factor context.

halfTessFactorFraction

o The fractional part of $\frac{\text{tessellation factor}}{2}$

numHalfTessFactorPoints

o Half of the number of points we expect to generate

o The ceiling of $\frac{\text{tessellation factor}}{2}$

splitPointOnFloorHalfTessFactor

o This is an integer that tells the tessellator, when in fractional mode, at what index to insert the small line segment

o Calculation of this number depends on the tessellation parity and is only used in fractional tessellation modes

Even: $\mathrm{RemoveMSB}\left(\left\lfloor \frac{\text{tessellation factor}}{2} \right\rfloor \cdot 2\right) + 1$

Odd: If the tessellation factor is less than 3, then this number is simply 0. Otherwise it is equal to $\mathrm{RemoveMSB}\left(\left(\left\lfloor \frac{\text{tessellation factor}}{2} \right\rfloor - 1\right) \cdot 2\right) + 1$

Figure 20: Image showing what splitPointOnFloorHalfTessFactor represents

invHalfTessFactorCeil

o The upper bound for the length of a segment

o Calculation of this number depends on the tessellation parity

Even: The inverse of $\left\lceil \frac{\text{tessellation factor}}{2} \right\rceil \cdot 2$

Odd: The inverse of $\left\lceil \frac{\text{tessellation factor}}{2} \right\rceil \cdot 2 - 1$

invHalfTessFactorFloor

o The lower bound for the length of a segment

o Calculation of this number depends on the tessellation parity

Even: The inverse of $\left\lfloor \frac{\text{tessellation factor}}{2} \right\rfloor \cdot 2$

Odd: The inverse of $\left\lfloor \frac{\text{tessellation factor}}{2} \right\rfloor \cdot 2 - 1$

Next, the number of points per line is calculated. Once again this relies on the

tessellator parity and is described below in a table.

Odd Parity: $2 \cdot \left\lceil 0.5 + \frac{\text{tessellation factor}}{2} \right\rceil$

Even Parity: $1 + 2 \cdot \left\lceil \frac{\text{tessellation factor}}{2} \right\rceil$

Figure 21: Table showing number of points calculations

Now we must compute the tessellation factor context for the second tessellation

factor. This is done the same way as the first one except for 3 key things:

We force the tessellation mode to Integer mode.

We must round up the second tessellation factor to the next whole number.

(because we are now in integer mode)

We must re-assign the tessellation parity based on whether our new second

tessellation factor is even or odd

We then calculate the number of lines we will be producing. This is calculated in the

same way as the first tessellation factor but we subtract 1 from the final result. This

is because we do not want to draw the final line.

Next we calculate the number of points that will be drawn, which is just equal to

the number of points per line multiplied by the number of lines.
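The counting rules above can be sketched in a few lines. This is an illustrative sketch only: the function names are hypothetical, parities are passed as the strings "odd" and "even", and the factors are assumed to be clamped already.

```python
import math


def num_points_for_factor(factor, parity):
    """Number of points generated along one line for a processed factor."""
    half = factor / 2.0
    if parity == "odd":
        return 2 * math.ceil(0.5 + half)
    return 1 + 2 * math.ceil(half)


def isoline_point_counts(factor1, parity1, factor2):
    """Return (points per line, number of lines, total points)."""
    points_per_line = num_points_for_factor(factor1, parity1)
    # The second factor is always handled in integer mode: round up,
    # then take the parity of the rounded value.
    rounded = math.ceil(factor2)
    parity2 = "odd" if rounded % 2 else "even"
    # One is subtracted because the final line is not drawn.
    num_lines = num_points_for_factor(rounded, parity2) - 1
    return points_per_line, num_lines, points_per_line * num_lines
```

With both factors set to 4 in integer mode, this gives 5 points per line, 4 lines, and 20 points in total.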

Point Generation

Now that we have our processed tessellation factors we can generate the

points. This is done with a nested for loop that loops through all the points in each

line for every line and runs the PlacePointIn1D function. The pseudo code for the

body of the nested loop is provided below.

Set tessellator parity to the parity of the first tessellation factor

u = PlacePointIn1D(first tessellation factor context, current point)

Set tessellator parity to the parity of the second tessellation factor

v = PlacePointIn1D(second tessellation factor context, current line)

Add point (u, v) to list of points

Figure 22: Notice the missing line at the bottom.


PlacePointIn1D

The first thing we do when generating points is make sure that the point we’re on

resides on the left side of the line. We do this because the points we generate are

symmetric about the center of the line. If the point is on the right side of the line,

we set 𝑝𝑜𝑖𝑛𝑡 = 𝑡𝑜𝑡𝑎𝑙 𝑝𝑜𝑖𝑛𝑡𝑠 𝑜𝑛 𝑙𝑖𝑛𝑒 − 𝑝𝑜𝑖𝑛𝑡. If the tessellation parity is set to odd

then we must subtract 1 from this value.

Now we make 2 values: indexOnCeilHalfTessFactor and

indexOnFloorHalfTessFactor. Initially these two numbers are set to point (the index

of the current point we are working on). If the point we are on is greater than the

splitPointOnFloorHalfTessFactor calculated in the tessellation factor context then

we reduce indexOnFloorHalfTessFactor by 1. The reason for this will become

apparent very shortly but remember that splitPointOnFloorHalfTessFactor is the

index at which we insert the small line segments in the fractional tessellation

modes.

We now make two new values which again reference our tessellation factor

contexts:

$$\text{locationOnFloorTessFactor} = \text{indexOnFloorHalfTessFactor} \cdot \text{invHalfTessFactorFloor}$$

$$\text{locationOnCeilTessFactor} = \text{indexOnCeilHalfTessFactor} \cdot \text{invHalfTessFactorCeil}$$

Figure 23: Image highlighting loop execution

Figure 24: Image showing the need to subtract 1 from point when tessellation parity is odd

Now we are ready to calculate the final location of the point we are placing.

$$\text{finalLocation} = \text{locationOnFloorTessFactor} \cdot (1 - \text{halfTessFactorFraction}) + \text{locationOnCeilTessFactor} \cdot \text{halfTessFactorFraction}$$

The location is calculated by taking the minimum possible segment location and

maximum possible segment location and linearly interpolating them according to

how close the tessellation factor is to a whole number. Now obviously, this number

will never be bigger than locationOnCeilTessFactor (maximum segment position),

and it approaches that position as halfTessFactorFraction approaches .5. This is

the same as saying that as the tessellation factor approaches a whole odd (or even

depending on parity) number, the size of the small line segments approaches the

size of the other segments.

Figure 25: Geometric visualization of linear interpolation. Source: Wikipedia


Finally, if we flipped our point at the beginning we must set it to 1 – location since

we are mirroring it about the center of the line.

Now that we have a new generated point, we can store it in the next empty spot in

the vertex buffer.

i          0     1     2     3     4

Point.u    0     .25   .50   .75   1

Point.v    0     0     0     0     0

Table 2: Sample vertex buffer for all N points in row 0
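The PlacePointIn1D steps described above can be sketched as follows. This is a floating point illustration under stated assumptions, not the reference implementation: the dataclass and function names are hypothetical, the context values are supplied by the caller, and the mirroring step uses twice numHalfTessFactorPoints as the reflection point, as in the triangle section of this document.

```python
from dataclasses import dataclass


@dataclass
class TessFactorContext:
    half_tess_factor_fraction: float
    num_half_tess_factor_points: int
    split_point_on_floor_half_tess_factor: int
    inv_half_tess_factor_floor: float
    inv_half_tess_factor_ceil: float


def place_point_in_1d(ctx, point, parity):
    """Return the location in [0, 1] of the point with the given index."""
    # Mirror points on the right half onto the left half.
    flip = point >= ctx.num_half_tess_factor_points
    if flip:
        point = 2 * ctx.num_half_tess_factor_points - point
        if parity == "odd":
            point -= 1
    index_on_ceil = point
    index_on_floor = point
    # Past the split point, the floor layout has one segment fewer.
    if point > ctx.split_point_on_floor_half_tess_factor:
        index_on_floor -= 1
    location_on_floor = index_on_floor * ctx.inv_half_tess_factor_floor
    location_on_ceil = index_on_ceil * ctx.inv_half_tess_factor_ceil
    fraction = ctx.half_tess_factor_fraction
    # Linear interpolation between the floor and ceil layouts.
    location = location_on_floor * (1.0 - fraction) + location_on_ceil * fraction
    return 1.0 - location if flip else location
```

For a tessellation factor of 4 in integer mode, a context of `TessFactorContext(0.0, 2, 3, 0.25, 0.25)` with even parity reproduces the u values of Table 2 for point indices 0 through 4.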

Point Connectivity

Point connectivity is actually quite simple. We have a global array that is initialized

to have a size that is equal to the number of indices that we will end up with. This

is equal to the number of segments per line, multiplied by the number of lines. We

then multiply this by 2 since each segment is defined by 2 points. It is worth noting

that point connectivity only relies on the same tessellation factor context that point

generation relies on. This means that both can be done in parallel.

The actual process of connectivity generation is done in a nested for loop that goes

row by row and column by column and just inserts pairs of ints into the index buffer.

The integers stored in the index buffer correspond to vertices in the vertex buffer.

Once again, the vertex buffer is populated when the tessellator generates the points.
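A minimal sketch of the connectivity loop, assuming the vertex buffer stores the points line by line so that point `i` of line `l` sits at index `l * points_per_line + i` (the function name and this layout are assumptions for illustration):

```python
def isoline_connectivity(points_per_line, num_lines):
    """Build the index buffer: two vertex indices per line segment."""
    indices = []
    for line in range(num_lines):
        base = line * points_per_line  # index of the first point on this line
        for segment in range(points_per_line - 1):
            indices.append(base + segment)      # start of the segment
            indices.append(base + segment + 1)  # end of the segment
    return indices
```

The result has 2 * (segments per line) * (number of lines) entries, matching the size computed above.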

Figure 26: Image showing the process of tessellating from tessellation factor 3 to tessellation factor 5 in fractional odd mode


Proposed Parallelism Technique Design

Parallelizing isolines is going to be the simplest of the trio of tessellation types.

First we must compute the tessellation factor context. This is a largely

sequential process, and its result is needed by both the point generator and the connectivity generator.

Because we need two contexts, one per tessellation factor, and neither depends on the other, we can compute them at the same time to

save a little time.

After the tessellation factor context is computed, we are ready to generate the

points and connectivity. As previously stated these can be done independently of

each other.

Parallelizing Point Generation

Before we go into how we intend to parallelize point generation I will outline some

basic facts about the AMD GCN architecture.

For this example I will be referring to the AMD Radeon R9 290x’s hardware.

Figure 27: Image showing how connectivity is stored in the index buffer


An AMD GPU core consists of:

44 individual compute units.

Each compute unit consists of 4 SIMD vector processors.

Each SIMD vector processor consists of 16 ALUs.

Each SIMD vector processor executes the same instruction on all 16 of its

ALUs. Also, each ALU can operate on a different piece of data. This means

that each compute unit can have all of its 64 ALUs execute the same instruction

on 64 pieces of data. This leads to ridiculously parallelized code that far

surpasses what a normal CPU can do. Remember that each tessellation factor

for isolines maxes out at 64. This means that we can use one compute unit to

compute an entire row of points at the same time. This is the basis for our point

generation optimizations.

Figure 28: Image showing an AMD SIMD vector processor

Figure 29: Image showing a vector operation

Figure 30: High level overview of a compute unit. Source: AnandTech.com

Thus, my proposed technique for point generation is to use one compute unit

to compute rows of values at a time. This would reduce the time complexity of

point generation from an O(n²) operation to an O(n) operation.

As a further optimization, imagine a tessellation factor 1 of 2, a tessellation

factor 2 of 16, and fractional even tessellation mode. With 3 points per line and 16 lines (the final line is not drawn), the number of points we

would end up with would be 48. This means that we can calculate the

coordinates of all 48 points in O(1) time with just 1 compute unit!

In more general terms, the number of iterations required to compute all of our

points is simply:

$$\left\lceil \frac{\text{number of points}}{64} \right\rceil$$

Figure 31: All of these points can be generated in O(1) time


Parallelizing Point Connectivity

We can accomplish the parallelization of point connectivity similarly to how we did

point generation. The computations inside the loop that generate connectivity do

not rely on the results of the previous loop. That means if we have 64 ALUs, we

can generate 64 line segments at a time.

One caveat to this approach is best described with a question. How does each

thread know which segment it should be generating? Well, when a thread is

launched, it is assigned a thread id and group id among other things. If we launch

a number of threads equal to the number of line segments then each thread would

know which segment it is calculating. It would just look at its thread id. Then we

can do some simple calculations to arrive at which 2 points are the endpoints of

our segments. And remember, since this data will be stored in an index buffer, we

only have to store an int that corresponds to the right point in the vertex buffer.
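The "simple calculations" that map a thread id to its segment can be sketched on the CPU as follows. This is a hypothetical illustration, not the shader itself: each loop iteration stands in for one GPU thread, the function names are made up for this example, and the vertex buffer is assumed to store points line by line.

```python
def segment_endpoints(thread_id, points_per_line):
    """Map a flat thread id to the two vertex indices of its segment."""
    segments_per_line = points_per_line - 1
    line = thread_id // segments_per_line    # which line the segment is on
    segment = thread_id % segments_per_line  # which segment within that line
    first = line * points_per_line + segment
    return first, first + 1


def connectivity_by_thread(points_per_line, num_lines):
    """Simulate one thread per segment, each writing its own two ints."""
    num_segments = (points_per_line - 1) * num_lines
    index_buffer = [0] * (2 * num_segments)
    for thread_id in range(num_segments):  # stands in for a parallel dispatch
        first, second = segment_endpoints(thread_id, points_per_line)
        index_buffer[2 * thread_id] = first
        index_buffer[2 * thread_id + 1] = second
    return index_buffer
```

Because every thread writes to its own pair of slots, no thread depends on the output of another, which is what allows 64 segments to be produced at a time.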

Figure 32: 64 of these segments can be generated at a time


Attempted Isoline Parallelization Techniques

One thread group per point

On the surface this idea seems terrible, and that is because it is. While coding a

“one thread per point” solution we thought we were assigning one point to one

ALU, but we ended up assigning one point to one compute unit.

One thread per grid of points

In DirectCompute, a dispatch group is limited to 1024 threads. Given this limitation

we tried having threads compute all data for grids of points in sizes of 8x8, 4x4,

and 2x2. Obviously 2x2 groups were the fastest, but they were not fast enough.

One thread per 4 collinear points

This implementation was similar to the 2x2 grids, but the points were on the same

line, saving some precious cycles only having to calculate the y-value once.

One group of threads (64) per 8x8 cube of points

In this implementation we only launched as many threads as we required. If the

output patch was less than 8x8 then it could have been handled entirely by a single

compute unit. This implementation proved to be the fastest thus far.

Figure 33: Simple flow chart diagramming the proposed parallelization technique


One group of threads per line of points

In this final implementation we launch one group of threads per line of points. This

proves to be the fastest by about half a millisecond.

Why not use both

As a last minute optimization the software actually analyzes the input and decides

whether it should use the cube method or line method to minimize resources used.

Run Times of the various implementations

What this test measures is the time the GPU spends in the compute shader stage

while going through every single possible isoline input in integer mode.

As can be seen, the later implementations were faster across more

classes of hardware.

Figure 34: Graph showing run times of various isoline tessellation implementations (chart title: Isoline Test Results; vertical axis: Time (ms), 0 to 400; horizontal axis: Grid Size, with categories 1x1, 8x8, 4x4, 2x2, 4x1, 8x8 bound, 64x1 bound; series: HD 8490, R9 290X, Intel Integrated)

Tessellation of Triangles

The tessellation of triangles consists of the subdivision of triangles into smaller,

non-overlapping triangles that entirely cover the area of the original triangle.

Subdivision of triangles is dependent on:

Outer Tessellation Factors, t0 through t2

Inner Tessellation Factor, i0

It ignores the Outer Tessellation Factor t3 and Inner Tessellation Factor

i1.

The high-level steps involved in the tessellation of triangles are:

1. Processing Tessellation Factors

2. Point Generation

3. Point Connectivity

Processing Tessellation Factors

Processing of the tessellation factors takes as input:

Outer Tessellation Factors, t0 through t2

Inner Tessellation Factor, i0

It generates the following information utilized in point generation and connectivity:

Per tessellation factor: Clamped value; Parity; Context, defined below; Number of points per edge

Global: Total number of points

Set Flags: Base-case to do minimum tessellation work; Culled Patch

First, the tessellation factors must be checked for the base case where the patch

is culled—that is, not displayed. This is the case where all of the outer tessellation

factors are non-positive, in which case a culled flag is set to let the tessellator know, and further processing of the tessellation factors is aborted.

Next, the tessellation factors must be clamped, that is, raised or lowered so that they fall within a given range. The appropriate ranges depend on the chosen partitioning mode. For integer and power of two partitioning, the outer tessellation factors are clamped to the range of values 1 through 64. Likewise, for fractional even and fractional odd partitioning, the outer tessellation factors are clamped to the range of values 2 through 64 and 1 through 63, respectively. Largely, these same ranges apply to the clamping of the inner tessellation factor, but fractional odd partitioning is a special case: the lower bound of its range is incremented by $2^{-16}$, the smallest value representable in the fixed point representation utilized in the tessellation hardware specification, so that the concentric inner triangle later generated does not overlap with the outermost triangle (See Figure 1). Because tessellation factors are read as floating point numbers, those factors with fractional parts must be rounded up to the next integer for the integer and power of two partitioning modes.

Figure 1 – Inner Triangle Does Not Overlap With Outermost Triangle

Next, the vertex and index buffers are cleared before the bulk of the tessellation factor processing. In hardware, these buffers have enough memory to support the four tessellation factors set to the maximum values of 64, which would produce upwards of 3,000 vertices and 6,000 triangle subdivisions.

For integer and power of two partitioning, the parity of each tessellation factor is set to the parity of its clamped value. Otherwise, the parities for all tessellation values are set to the parity corresponding to the chosen partitioning; for fractional even and fractional odd partitioning, all tessellation factor parities are set to even and odd, respectively.


There is another base case for integer, power of two, and fractional odd partitioning modes where all tessellation factors are equal to one; in this case, a single triangle is output (See Figure 2). Now that the tessellation factors are clamped to their appropriate ranges, this base case can be checked. If it applies, a flag is set to let the tessellator know that it will be doing the minimum amount of work, and further processing of the tessellation factors is aborted.

Figure 2 – Special Case; All Tessellation Factors Equal To One

Next, the context (a collection of values useful for point generation and connectivity) for each tessellation factor is computed as a function of the factor and its parity.

Tessellation Factor Context Variables:

invNumSegmentsOnFloorTessFactor := $\frac{1}{\text{numFloorSegments}}$

invNumSegmentsOnCeilTessFactor := $\frac{1}{\text{numCeilSegments}}$

halfTessFactorFraction := $\text{halfTessFactor} - \text{floorHalfTessFactor}$

numHalfTessFactorPoints := $\text{ceilHalfTessFactor}$

splitPointOnFloorHalfTessFactor :=

o If $\text{ceilHalfTessFactor} = \text{floorHalfTessFactor}$, some value is picked for the tessellator to ignore; the hardware chooses $\text{numHalfTessFactorPoints} + 1$.

o For odd tessellation factor parity,

If $\text{floorHalfTessFactor} = 1$, 0

Otherwise, $(\mathrm{RemoveMSB}(\text{floorHalfTessFactor} - 1) \ll 1) + 1$

o Otherwise, $(\mathrm{RemoveMSB}(\text{floorHalfTessFactor}) \ll 1) + 1$

Where,

halfTessFactor := $\frac{\text{tessellation factor}}{2}$

o If the parity of the tessellation factor is odd, or halfTessFactor is equal to 0.5, halfTessFactor is incremented by 0.5.

floorHalfTessFactor := $\lfloor \text{halfTessFactor} \rfloor$

ceilHalfTessFactor := $\lceil \text{halfTessFactor} \rceil$

numFloorSegments := $\text{floorHalfTessFactor} \cdot 2$

o For odd tessellation factor parity, the value is decremented by 1.

numCeilSegments := $\text{ceilHalfTessFactor} \cdot 2$

o For odd tessellation factor parity, the value is decremented by 1.

RemoveMSB(x: float): float, is a function that removes the most significant bit from a float.
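The context computation can be sketched as below. Treat it as an illustration under assumptions rather than the hardware algorithm: it uses floating point instead of fixed point, `remove_msb` operates on the integer value of its argument, the two reciprocals are taken over numFloorSegments and numCeilSegments, and halfTessFactor is bumped by 0.5 when the parity is odd or when it equals 0.5. All Python names are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class TessFactorContext:
    inv_num_segments_on_floor: float
    inv_num_segments_on_ceil: float
    half_tess_factor_fraction: float
    num_half_tess_factor_points: int
    split_point_on_floor_half_tess_factor: int


def remove_msb(value):
    """Clear the most significant set bit of a non-negative integer."""
    if value <= 0:
        return 0
    return value & ~(1 << (value.bit_length() - 1))


def compute_context(tess_factor, parity):
    """Build the context for one clamped tessellation factor."""
    half = tess_factor / 2.0
    if parity == "odd" or half == 0.5:
        half += 0.5
    floor_half = math.floor(half)
    ceil_half = math.ceil(half)
    num_half_points = ceil_half
    if ceil_half == floor_half:
        # Whole-number case: pick a split point that is never reached.
        split = num_half_points + 1
    elif parity == "odd":
        split = 0 if floor_half == 1 else (remove_msb(floor_half - 1) << 1) + 1
    else:
        split = (remove_msb(floor_half) << 1) + 1
    num_floor_segments = floor_half * 2
    num_ceil_segments = ceil_half * 2
    if parity == "odd":
        num_floor_segments -= 1
        num_ceil_segments -= 1
    return TessFactorContext(
        inv_num_segments_on_floor=1.0 / num_floor_segments,
        inv_num_segments_on_ceil=1.0 / num_ceil_segments,
        half_tess_factor_fraction=half - floor_half,
        num_half_tess_factor_points=num_half_points,
        split_point_on_floor_half_tess_factor=split,
    )
```

For a factor of 4 with even parity, both reciprocals come out as 0.25, the fraction is 0, and the split point is set past the last half point so that it is ignored.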

Finally, the number of points corresponding to each tessellation factor is calculated. For the outer tessellation factors, this directly corresponds to the number of points for each respective edge. For the inner tessellation factor, the value corresponds to the number of points for the line that runs along the edge of the inner concentric triangle adjacent to the outer triangle. Given the minimum bound on the tessellation factors for the different tessellation partitioning modes, the minimum point count for the inner tessellation factor is 4 for odd partitioning, and 3 for all others.

Figure 3 – Minimum Point Count for Inner Tessellation Factor (figure label: Inner Tessellation Point Count = 3)

For odd parity tessellation factors, the number of points is given by:

$$\text{number of points}_{odd} = \left\lceil 0.5 + \frac{\text{tessellation factor}}{2} \right\rceil \cdot 2$$

Similarly, the number of points for other tessellation factor parities is given by:

$$\text{number of points}_{other} = \left( \left\lceil \frac{\text{tessellation factor}}{2} \right\rceil \cdot 2 \right) + 1$$

The inside edge point base offset is given by:

$$\left( \sum_{edge=1}^{3} \text{number of points}_{edge} \right) - 3$$

Finally, the total number of points is given by:

$$\left( \sum_{edge=1}^{3} \text{number of points}_{edge} \right) - 3 + \text{number of interior points}$$

Where,

$$\text{number of interior rings} := (\text{number of points for inside tessellation factor} \gg 1) - 1$$

$$\text{number of interior points} := \begin{cases} 3 \cdot \left( \text{number of interior rings} \cdot (\text{number of interior rings} + 1) - \text{number of interior rings} \right), & \text{tessellation factor parity} = \text{odd} \\ 3 \cdot \left( \text{number of interior rings} \cdot (\text{number of interior rings} + 1) \right) + 1, & \text{otherwise} \end{cases}$$
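As a sanity check on the counting formulas, the following sketch computes the inside edge point base offset and the total point count for a triangle patch. The function names are hypothetical, the factors are assumed to be clamped already, and parities are passed as the strings "odd" and "even".

```python
import math


def num_points_for_factor(factor, parity):
    """Number of points along one edge for a processed factor."""
    half = factor / 2.0
    if parity == "odd":
        return math.ceil(0.5 + half) * 2
    return math.ceil(half) * 2 + 1


def triangle_point_counts(outer_factors, outer_parities, inner_factor, inner_parity):
    """Return (inside edge point base offset, total number of points)."""
    edge_points = [
        num_points_for_factor(f, p) for f, p in zip(outer_factors, outer_parities)
    ]
    # Each corner is shared by two edges, so three points are counted twice.
    base_offset = sum(edge_points) - 3
    inside_points = num_points_for_factor(inner_factor, inner_parity)
    rings = (inside_points >> 1) - 1
    if inner_parity == "odd":
        interior = 3 * (rings * (rings + 1) - rings)
    else:
        interior = 3 * (rings * (rings + 1)) + 1  # +1 for the center point
    return base_offset, base_offset + interior
```

With every factor set to 4 and even parity, the outer ring contributes 12 points, the single inner ring contributes 6, and the center point brings the total to 19.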

Point Generation

The tessellator generates points along the rings (or concentric triangles) in a spiraling, clockwise (or counterclockwise) fashion, from the outermost ring towards the innermost rings. As the outermost and inner rings are generated as a function of the outer and inner tessellation factors, respectively, the outermost ring points can be computed separately from the points of the inner rings; this presents a clear opportunity for parallelization. Non-odd parity of the inner tessellation factor is a special case that implies the innermost ring be a single point, as opposed to a triangle (Figure 3); this case is handled separately.

Figure 4 – Spiraling Point Generation & Center Point Special Case

As point generation iteratively generates sequential points along sequential edges, in the chosen orientation, we need to keep indices for the current edge and point for each ring, as well as the point offset for purposes of storage in the vertex buffer.

Let us define the clockwise ordering of the edges for a ring as U, V, and W. Points generated along these edges are defined by a three-tuple of barycentric coordinates (u, v, w) with respect to U, V, W. Coordinate ‘w’ can be implicitly defined, however, as a function of ‘u’ and ‘v’.


Outermost Vertices with Barycentric Coordinate Labeling (figure labels: vertices (0, 0, 1), (0, 1, 0), and (1, 0, 0); Edge 0 is U, Edge 1 is V, Edge 2 is W)

For the point generation of each outer edge, we begin by calculating the parity of the edge and the index of the edge’s end point, given by (number of points for edge) − 1. The number of points is decremented because we do not want to include the last point along the edge, as the next edge begins with it. We need the parity per edge because we need to reverse the orientation in which points are generated along some edges. This is because ‘u’ and ‘v’ alternate increasing and decreasing for coordinates along the axes of said edges. For points along the edge U (edge 0) and edge W (edge 2) axes, we have ‘v’ and ‘u’ coordinate values decreasing, respectively, so we have to reverse the orientation of the points along these edges; these correspond to even parity edges. Similarly, edge V (edge 1), which has ‘u’ increasing and an odd edge parity, does not require a flip of the orientation along which points are generated.

Per edge, we start with the initial point and iterate through the end point, incrementing the point offset with every point for every edge. We calculate the index of the point’s positioning along the axis of the current edge using the edge parity, which is given by:

$$q := \begin{cases} \text{index of point}_{edge\ i}, & \text{edge parity} = \text{odd} \\ \text{end point}_{edge\ i} - \text{index of point}_{edge\ i}, & \text{edge parity} = \text{even} \end{cases}$$

Increasing and Decreasing of ‘u’ and ‘v’ Coordinates Along Axes (figure labels: ‘u’ decreasing; ‘u’ increasing; edge parity := even; edge parity := odd; edge parity := even & 0x1)

Now that the index for point placement is adjusted for the parity of the edge, we

have to define the point in barycentric space. For each point from 0 through the

end point corresponding to edges U, V, and W, the point is given by:

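$$(u, v, w) = \begin{cases} (0,\ \text{location},\ w), & \text{edge } U \text{ (edge 0)} \\ (\text{location},\ 0,\ w), & \text{edge } V \text{ (edge 1)} \\ (\text{location},\ 1 - \text{location},\ w), & \text{edge } W \text{ (edge 2)} \end{cases}$$

Where $w$ and $\text{location}$ are defined by:

$$w := 1 - u - v,$$

and,

$$\text{location} := \begin{cases} 0.5, & p = \text{numHalfTessFactorPoints} \\ \text{locationOnFloorHalfTessFactor} \cdot (1 - \text{halfTessFactorFraction}) + \text{locationOnCeilHalfTessFactor} \cdot \text{halfTessFactorFraction}, & \text{otherwise} \end{cases}$$

$$\text{locationOnCeilHalfTessFactor} := p \cdot \text{invNumSegmentsOnCeilTessFactor},$$

$$\text{locationOnFloorHalfTessFactor} := \begin{cases} (p - 1) \cdot \text{invNumSegmentsOnFloorTessFactor}, & p > \text{splitPointOnFloorHalfTessFactor} \\ p \cdot \text{invNumSegmentsOnFloorTessFactor}, & \text{otherwise} \end{cases}$$

and,

$$p := \begin{cases} (\text{numHalfTessFactorPoints} \ll 1) - q - 1, & q \geq \text{numHalfTessFactorPoints} \text{ and tessellation parity} = \text{odd} \\ (\text{numHalfTessFactorPoints} \ll 1) - q, & q \geq \text{numHalfTessFactorPoints} \\ q, & \text{otherwise} \end{cases}$$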

It is important to note, for the formula for location, that the complement (1 − location)

is taken if q is greater than or equal to numHalfTessFactorPoints.

For location: splitPointOnFloorHalfTessFactor, halfTessFactorFraction,

invNumSegmentsOnCeilTessFactor, invNumSegmentsOnFloorTessFactor, and

numHalfTessFactorPoints are given by the tessellation factor context for the

tessellation factor corresponding to the edge being worked on. Outer tessellation

factors 0, 1, and 2, correspond to edges U, V, and W, respectively. The points are

stored in the vertex buffer at the index of the point offset.
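The outer ring loop can be sketched as follows. To keep the example short it assumes integer partitioning, in which case the placement of point index q along an edge with tessellation factor n reduces to q / n; the fractional modes would instead need the full location calculation described above. The function name is hypothetical.

```python
def outer_ring_points(outer_factors):
    """Barycentric (u, v, w) points of the outermost ring, integer mode only."""
    points = []
    for edge, n in enumerate(outer_factors):  # edges U, V, W
        parity = edge & 1
        end_point = (n + 1) - 1  # an integer factor n gives n + 1 points
        # The end point is skipped because the next edge starts with it.
        for p in range(end_point):
            q = p if parity else end_point - p  # reverse even-parity edges
            location = q / n
            if edge == 0:
                u, v = 0.0, location
            elif edge == 1:
                u, v = location, 0.0
            else:
                u, v = location, 1.0 - location
            points.append((u, v, 1.0 - u - v))
    return points
```

For outer factors (2, 2, 2) this yields six points: the three corners and the three edge midpoints, each appearing once, in the order in which they would be written to the vertex buffer.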

Similarly to the outermost ring, points for the inner rings are calculated iteratively

from the outermost inner rings towards the center ring, along the edges in a

clockwise fashion. The number of inner rings is given by:

𝑛𝑢𝑚𝑃𝑜𝑖𝑛𝑡𝑠𝐹𝑜𝑟𝐼𝑛𝑠𝑖𝑑𝑒𝑇𝑒𝑠𝑠𝐹𝑎𝑐𝑡𝑜𝑟 ≫ 1,

Where numPointsForInsideTessFactor comes from the processed tessellation

factors.

Because the points for all edges of the inner rings are generated from a single

tessellation factor, each edge has the same number of segments (and points) per

ring. Per ring, the start and end points are given as follows: $\text{start point} := \text{ring}$

and $\text{end point} := \text{numPointsForInsideTessFactor} - 1 - \text{start point}$. This is

because each inner ring has two fewer points per edge than the corresponding edge

of the ring surrounding it, and the property still holds that we exclude the last point along

the edge, since the following edge begins with it.

All Edges of Innermost Triangles, Per Triangle, Have Equal Points and Segments.

For each edge of the inner rings, the parity is still relevant, because the property

still holds that we must switch the orientation of points generated along the axes

of the edges to generate points sequentially.

Inner Tessellation Factor: 7 (7 Segments for Ring 1)

For Ring 1, each edge has 6 points.

For Ring 2, each edge has 4 points.

Generally, Ring i Points Per Edge := (Ring 1 Points) – 2*(i – 1)

Figure labels: Ring 0; Ring 1; Ring 2 Start Point For Edge 1; Ring 2 End Point For Edge 1

We calculate the placement of the point along the axis of the current edge using the edge parity, which is given by:

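$$q_{inner} := \begin{cases} \text{index of point}_{edge\ i}, & \text{edge parity} = \text{odd} \\ \text{end point}_{edge\ i} - (\text{index of point}_{edge\ i} - \text{start point}), & \text{edge parity} = \text{even} \end{cases}$$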

Now that the index for point placement is adjusted for the parity of the edge, we

have to define the point in barycentric space. For each point from 0 through the

end point corresponding to edges U, V, and W, the point is given by:

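$$(u, v, w) = \begin{cases} \left(\text{perp},\ \text{location} - \frac{\text{perp}}{2},\ w\right), & \text{edge } U \text{ (edge 0)} \\ \left(\text{location} - \frac{\text{perp}}{2},\ \text{perp},\ w\right), & \text{edge } V \text{ (edge 1)} \\ \left(\text{location} - \frac{\text{perp}}{2},\ 1 - \left(\text{location} - \frac{\text{perp}}{2}\right) - \text{perp},\ w\right), & \text{edge } W \text{ (edge 2)} \end{cases}$$

Where $\text{perp}$ is given similarly to $\text{location}$ by:

$$\text{perp} := \begin{cases} 0.5, & p_{perp} = \text{numHalfTessFactorPoints} \\ \text{locationOnFloorHalfTessFactor} \cdot (1 - \text{halfTessFactorFraction}) + \text{locationOnCeilHalfTessFactor} \cdot \text{halfTessFactorFraction}, & \text{otherwise} \end{cases}$$

$$\text{locationOnCeilHalfTessFactor} := p_{perp} \cdot \text{invNumSegmentsOnCeilTessFactor},$$

$$\text{locationOnFloorHalfTessFactor} := \begin{cases} (p_{perp} - 1) \cdot \text{invNumSegmentsOnFloorTessFactor}, & p_{perp} > \text{splitPointOnFloorHalfTessFactor} \\ p_{perp} \cdot \text{invNumSegmentsOnFloorTessFactor}, & \text{otherwise} \end{cases}$$

and,

$$p_{perp} := \begin{cases} (\text{numHalfTessFactorPoints} \ll 1) - q_{inner} - 1, & q_{inner} \geq \text{numHalfTessFactorPoints} \text{ and tessellation parity} = \text{odd} \\ (\text{numHalfTessFactorPoints} \ll 1) - \text{start point}, & q_{inner} \geq \text{numHalfTessFactorPoints} \end{cases}$$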

In the formula for perp, similarly to that for location, the complement (1 − perp)

is taken if q_inner is greater than or equal to numHalfTessFactorPoints. After the above

calculations for perp, the value is multiplied by two-thirds. Location is defined as

for the outermost ring, except that for perp and location:

splitPointOnFloorHalfTessFactor, halfTessFactorFraction,

invNumSegmentsOnCeilTessFactor, invNumSegmentsOnFloorTessFactor, and

numHalfTessFactorPoints are given by the tessellation factor context for the inner

tessellation factor.

Lastly, the special case is handled where non-odd parity of the inner tessellation

factor produces a single point at the center. It is simply hard-coded as $\left(\frac{1}{3}, \frac{1}{3}, \frac{1}{3}\right)$. The

points are stored in the vertex buffer at the index of the point offset.
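The inner ring loop can be sketched in the same simplified setting as before. This sketch assumes an integer inner tessellation factor n, so that both location and the unscaled perp value reduce to an index divided by n; it also assumes rings are numbered from 1 up to (number of inside points >> 1) - 1. The function name is hypothetical.

```python
def inner_ring_points(inner_factor):
    """Barycentric points of the inner rings, integer partitioning only."""
    n = inner_factor
    num_points = n + 1           # points along one edge for an integer factor
    num_rings = num_points >> 1
    points = []
    for ring in range(1, num_rings):
        start_point = ring
        end_point = num_points - 1 - start_point
        # Distance of this ring from the outer edge, scaled by two-thirds.
        perp = (ring / n) * (2.0 / 3.0)
        for edge in range(3):    # edges U, V, W
            parity = edge & 1
            for p in range(start_point, end_point):
                q = p if parity else end_point - (p - start_point)
                along = q / n - perp / 2.0
                if edge == 0:
                    u, v = perp, along
                elif edge == 1:
                    u, v = along, perp
                else:
                    u, v = along, 1.0 - along - perp
                points.append((u, v, 1.0 - u - v))
    if n % 2 == 0:
        # Non-odd parity: the innermost ring collapses to the center point.
        points.append((1.0 / 3.0, 1.0 / 3.0, 1.0 / 3.0))
    return points
```

For an inner factor of 4 this produces the six points of one inner triangle followed by the center point, which agrees with the interior point count formula given earlier.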


Special Case; Inner Tess Factor = 6, Pictured Above. Non-Odd Inner Tess Factor Parity Produces a Degenerate Triangle—Single Point—at the Center.

Point Connectivity

Like point generation, point connectivity occurs in a clockwise (or counterclockwise), spiraling fashion from the outermost ring towards the innermost ring. Triangles are formed between all rings, one triangle side at a time per ring.

Point Connectivity occurs in a spiraling, clockwise fashion from the outermost ring towards the center. Triangles are generated in the order indexed at the center of the triangle.

Several variables are necessary for the stitching (connecting) of points between two given edges, i.e. per ring. First, we need to know which ring we are working on because this is an iterative process; an index is maintained for this. Second, we need the tessellation factor context for the outer and inner edge of the rings we are working through. Finally, we need the offset of the point for the beginning of the inner and outer edges for the side of the ring we are working on.

Point connectivity of the outermost ring is a separate case from the point connectivity of the inner rings. Per ring, the connectivity of the U (first) and V (second) edges are a separate case from the connectivity of the W (third) edge.

For the outermost ring, stitching of the edge points is a function of the tessellation factor of the outer edge in question and the inner tessellation factor. This is because the point generation of the outer edges are a function of the outer tessellation factors, and similarly, the inner edges of the inner tessellation factor. The stitching per ring edge is divided into a first and second half, with certain special cases that can result in a “quad”, or an outward pointing triangle—inner to outer edge—or similarly defined inward facing triangle at the center. Triangles are stitched by connection of points from the outer and inner edge, with an edge normally beginning at the point where the last left off. Points on either edge are advanced forward in an alternating fashion as the point connections are defined, until the second-to-last point for each edge is reached, marking the beginning of the next side of the ring.

Example of initial Outer-Outer-Inner stitch, and Inner-Outer-Inner stitch, from below.

Outer-Outer-Inner Stitch

Inner-Outer-Inner Stitch


For the first half of the ring stitching, the first triangle generated is defined by the first outer point, the second outer point, and the first inner point; the outer point offset has been incremented, as no further triangles can be defined with the first outer point without overlapping the first triangle. For convention, such a stitch will be referred to as outer-outer-inner—i.e. the triangle is defined by connecting the currently indexed outer edge point to the following outer edge point, thereby incrementing it, and then finishing the triangle by a connection to the currently indexed inner edge point. Through the first half, the ring is advanced by alternating between inner-outer-inner and outer-outer-inner stitches.

The special middle cases occur when either (or both):

1. The tessellation factor parity for the outer and inner edge differs.

2. The inner tessellation factor is of odd parity.

When the second condition is true and the parities of the inner and outer tessellation factor are equal, a “quad” is formed in the middle by two triangles of inner-outer-inner and inner-outer-outer stitches.

Special case Inner Tessellation Factor is odd (3), and Inner and Outer Tessellation Factors are equal, forming a center “quad” for Edge U of the outermost ring. The quad consists of triangles indexed 2 and 3.


When the first condition holds and the inner tessellation factor is of even parity, an inward-pointing triangle is formed by an inner-outer-outer stitch.

In the remaining case, where both conditions hold, an outward-pointing triangle is formed by an inner-outer-inner stitch.
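The three middle cases can be summarized as a small decision function. This is only a sketch of the rules as stated above; the enum and function names are hypothetical and do not come from the project's code.

```cpp
// Hypothetical names; only the decision rules are taken from the text.
enum class Parity { Even, Odd };
enum class MiddleCase { None, Quad, InwardTriangle, OutwardTriangle };

MiddleCase classifyMiddle(Parity outer, Parity inner) {
    const bool parityDiffers = (outer != inner);     // condition 1
    const bool innerIsOdd = (inner == Parity::Odd);  // condition 2

    if (!parityDiffers && innerIsOdd) return MiddleCase::Quad;
    if (parityDiffers && !innerIsOdd) return MiddleCase::InwardTriangle;
    if (parityDiffers && innerIsOdd) return MiddleCase::OutwardTriangle;
    return MiddleCase::None;  // neither condition holds: no special middle case
}
```

For example, an odd outer factor paired with an odd inner factor (such as 3 and 3) yields the center quad described for Edge U of the outermost ring.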

For the second half of the ring stitching, the ring is advanced by alternating between outer-outer-inner and inner-outer-inner stitches as in the first half. The sequential combination of alternating stitch patterns the ring advances with is determined with a look-up table.

Stitching of the inner tessellation rings is simply a function of the inner tessellation factor. This is because all points generated along the edges of the concentric inner triangles are determined by the processed inner tessellation factor. Stitching of the edge of an inner tessellation ring begins with an outer-outer-inner stitch, and then divides the stitching along the length of the triangle side into two mirrored halves. For the first half, connectivity of the ring edge consists of strictly alternating outer-inner-inner and outer-outer-inner stitches; each pair of these forms a “diagonal” that points from the outer edge towards the inner edge. For the second, mirrored half, connectivity consists of strictly alternating inner-outer-outer and inner-outer-inner stitches. The end of the ring edge is a base case outer-outer-inner stitch.

If the inner tessellation factor parity is odd, the inner edges of the centermost ring constitute a single triangle; otherwise, the innermost ring constitutes a single point.

Parallel Triangle Tessellation Design

The work for triangle tessellation is split into two separate shaders: point generation and point connectivity, which run in parallel. This was the simplest approach to implement, made possible by the inward-spiraling pattern in which points are generated and connected. Because we are guaranteed the order and quantity of points generated along edges, their two-dimensional coordinates are not needed for point connectivity.

All that is needed for the above approach is a set of preprocessed data derived from the tessellation factors; this includes such information as edge parities, the number of points per edge, and additional information used to determine the spacing between points. This information is stored in a structure, initialized with the tessellation factors for the primitive patch being tessellated. The CPU passes this structure to the two shaders through a read-write structured buffer, the contents of which get written to group shared memory. While a separate shader could be dedicated to this preprocessing alone, its share of the overall tessellation workload was small enough that the point generation and connectivity shaders were implemented to compute it for themselves.


The point generation shader was parallelized to allow a single point calculation per thread, based on its thread ID. For triangle tessellation where all of the corresponding tessellation factors (three outer and one inner) are set to the maximum of sixty-four, 4225 points are computed. Still, even in these ideally large cases for parallelization, only an insignificant speedup was achieved from a point-per-thread approach, presumably due to the overhead of each thread calculating its own contextual information about where its point is located (e.g. the corresponding edge and ring).

Diagram labels: CPU; Dispatch Threads; Point Generation Shader (Process Tess Factors, Tess Fact Context, Generate Points); Point Connectivity Shader (Process Tess Factors, Tess Fact Context, Generate Connections); Vertex Buffer; Index Buffer.

For point generation, 67 thread groups of 64 threads are launched. This approach has the benefit of aligning groups of work along wavefronts, which consist of 64 threads each. While this approach allows nearly every thread of each thread group to do work at the higher tessellation factors, many threads are wasted at the lower factors. One way to mitigate this would be to separate the preprocessing of tessellation factors from point generation, so that the quantity of points to be generated can be used to dynamically allocate threads for the point generation shader. However, a downside to this approach is the inability of compute shaders to dispatch work for themselves; the CPU would have to dispatch the preprocessing work, read back the relevant information, and feed it back to the respective shaders. This is not preferable, as CPU-GPU intercommunication is very expensive.
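The figure of 67 groups follows from rounding the 4225 points up to whole groups of 64 threads. A minimal sketch of that arithmetic is shown here; the constant and function names are hypothetical, and the same ceiling division is what a dynamically sized dispatch would need on the CPU side.

```cpp
#include <cstdint>

// Hypothetical names; 64 is the thread group size stated in the text.
constexpr std::uint32_t kThreadsPerGroup = 64;

// Number of thread groups needed so that every point gets one thread.
constexpr std::uint32_t groupsFor(std::uint32_t numPoints) {
    return (numPoints + kThreadsPerGroup - 1) / kThreadsPerGroup;
}

// Threads in the dispatched groups that have no point to compute.
constexpr std::uint32_t idleThreads(std::uint32_t numPoints) {
    return groupsFor(numPoints) * kThreadsPerGroup - numPoints;
}
```

With a fixed launch of 67 groups, a patch that only produces a handful of points leaves almost all 4288 threads idle, which is the waste at low factors noted above.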

Potential improvements include reducing the overhead of threads computing their own contextual information and applying significant parallelization to point connectivity. As with quadrilaterals, the complex connectivity pattern (including the patching of incorrectly produced intermediate values) proves hard to parallelize while maintaining correctness.


General Overview of Quads

When the type of geometric primitive is set to quad, our Tessellator subdivides a single quadrilateral into more detailed geometry. There are separate levels of detail for the inside and outside of the new quad. Each outer edge can be split into a different number of segments. Likewise, the number of inside horizontal and vertical segments can be set to different levels, as shown below.

In figure 1, the outer-left side of the quad is set to 3 segments and the inside horizontal and vertical portions of the quad are set to 3 and 4 segments, respectively.

Figure 35: A subdivided quad with different outer and inner detail levels. Connected Triangles are numbered in blue.

Input Description

The tessellation factors that control the level of detail are as follows:

Inner1 – (horizontal or U-Axis)

Inner2 – (vertical or V-Axis)

Outer1 – (left)

Outer2 – (top)

Outer3 – (right)

Outer4 – (bottom)

Each of these is a floating point number that represents how many segments to create.

Output Description

The quad Tessellator outputs both a vertex and index buffer.

Vertex Buffer
o The vertex buffer stores a (U, V) coordinate for each generated point.

Index Buffer
o The index buffer stores the order that vertices are connected to each other.

Process Tessellation Factors

For each of the tessellation factors, a series of “magic numbers” is calculated that contains useful information. First, if any of the tessellation factors is equal to or below zero, a flag is set and the patch is culled later on. Otherwise, the factors are clamped based on which partitioning mode is used. The two fractional modes have different ranges from the integer modes, as shown in the following table:

Mode Factor Range

Integer [1, 64]

Pow2 [1, 64]

Fractional Even [2, 64]

Fractional Odd [1, 63]

If the partitioning mode is set to one of the integer modes (integer, or Pow2), then the ceiling of the tessellation factor is stored in the magic numbers structure. The tessellation factor’s even or odd parity is stored as well, unless the Tessellator is set to a fractional mode, in which case the mode’s parity is stored. At this point, after clamping and rounding have been completed, if all factors are set to 1, then we set a special flag to do the minimum amount of work. In this case, since subdividing a line into 1 segment is rather trivial, no additional information is required.
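The culling, clamping, rounding, and parity rules above can be sketched as a single function. This is a minimal illustration under the stated rules only; the type and function names are hypothetical, and the minimum-work flag is left out.

```cpp
#include <algorithm>
#include <cmath>

// Hypothetical names; only the ranges and rules come from the text.
enum class Partitioning { Integer, Pow2, FractionalEven, FractionalOdd };

struct ProcessedFactor {
    float value;  // clamped factor (rounded up in the integer modes)
    bool culled;  // true when the raw factor was <= 0
    bool odd;     // stored parity: the factor's in integer modes, the mode's otherwise
};

ProcessedFactor processFactor(float raw, Partitioning mode) {
    ProcessedFactor out{0.0f, false, false};
    if (raw <= 0.0f) {  // the patch is culled later on
        out.culled = true;
        return out;
    }

    // Clamp to the range listed for the partitioning mode.
    const float lo = (mode == Partitioning::FractionalEven) ? 2.0f : 1.0f;
    const float hi = (mode == Partitioning::FractionalOdd) ? 63.0f : 64.0f;
    out.value = std::min(std::max(raw, lo), hi);

    if (mode == Partitioning::Integer || mode == Partitioning::Pow2) {
        out.value = std::ceil(out.value);  // integer modes store the ceiling
        out.odd = (static_cast<int>(out.value) % 2) == 1;
    } else {
        out.odd = (mode == Partitioning::FractionalOdd);
    }
    return out;
}
```

A raw factor of 3.2 in integer mode would come out as 4 with even parity, while 70 in fractional odd mode would be clamped to 63 with odd parity.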

If some of the tessellation factors are greater than 1, additional information is needed for each factor. This is called the tessellation factor context, or TessFactorCtx, which stores the following:

Variable Description

halfTessFactor: tessFactor / 2. Note: 0.5 is added to this number on the following conditions: the mode is fractional odd, or halfTessFactor is itself 0.5.

halfTessFactorFloor: Floor(halfTessFactor)

halfTessFactorCeil: Ceil(halfTessFactor)

halfTessFactorFraction: halfTessFactor – halfTessFactorFloor

invHalfTessFactorFloor: 1 / (2 * halfTessFactorFloor - 1) if the mode is ODD, otherwise 1 / (2 * halfTessFactorFloor). Used in placing points into the correct position using linear interpolation later on.

invHalfTessFactorCeil: 1 / (2 * halfTessFactorCeil - 1) if the mode is ODD, otherwise 1 / (2 * halfTessFactorCeil). Used in placing points into the correct position using linear interpolation later on.

splitPoint: Computed in three steps: (A) remove the most significant bit of halfTessFactorFloor; (B) if fractional_odd, subtract 1; (C) multiply by 2 and add 1. A very important value in determining which point is the “split point” for the fractional even and odd partitioning modes. Unused for other modes.

numFloorSegments: halfTessFactorFloor * 2. Note: in fractional odd mode, subtract 1 from this.

numCeilSegments: halfTessFactorCeil * 2. Note: in fractional odd mode, subtract 1 from this.

numPointsForOutsideEdge[4]: Stores the number of points that the tessellation factor will generate per edge.

numPointsForInside[2]: Stores the number of points that the inner tessellation factors will generate for both inside axes.

tessellationParityInner[2]: Stores whether the rounded inner tessellation factor is even or odd.

tessellationParityOuter[4]: Stores whether the rounded outer factors are even or odd.
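As a rough illustration of how the half-factor entries in the table relate to each other, the sketch below computes them for one already-clamped factor. The struct and function names are hypothetical, splitPoint and the point counts are deliberately left out, and the code follows only the formulas listed in the table.

```cpp
#include <cmath>

// Hypothetical container for the half-factor entries of the table.
struct HalfFactorInfo {
    float halfTessFactor;
    float halfTessFactorFloor;
    float halfTessFactorCeil;
    float halfTessFactorFraction;
    int numFloorSegments;
    int numCeilSegments;
    float invHalfTessFactorFloor;
    float invHalfTessFactorCeil;
};

// Assumes tessFactor has already been clamped to its mode's range (>= 1).
HalfFactorInfo computeHalfFactorInfo(float tessFactor, bool oddMode) {
    HalfFactorInfo c{};
    c.halfTessFactor = tessFactor / 2.0f;
    if (oddMode || c.halfTessFactor == 0.5f) {
        c.halfTessFactor += 0.5f;  // adjustment noted in the table
    }
    c.halfTessFactorFloor = std::floor(c.halfTessFactor);
    c.halfTessFactorCeil = std::ceil(c.halfTessFactor);
    c.halfTessFactorFraction = c.halfTessFactor - c.halfTessFactorFloor;

    // Odd mode has one fewer segment than twice the half factor.
    const int oddAdjust = oddMode ? 1 : 0;
    c.numFloorSegments = static_cast<int>(c.halfTessFactorFloor) * 2 - oddAdjust;
    c.numCeilSegments = static_cast<int>(c.halfTessFactorCeil) * 2 - oddAdjust;

    // The inverse values are 1 / (number of segments), as in the table.
    c.invHalfTessFactorFloor = 1.0f / static_cast<float>(c.numFloorSegments);
    c.invHalfTessFactorCeil = 1.0f / static_cast<float>(c.numCeilSegments);
    return c;
}
```

For a factor of 4.5 in odd mode this gives a floor of 3 segments and a ceiling of 5 segments, with a fraction of 0.75 between them.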

For each outside edge and inner axis of the quad, we store the number of points that are going to be generated based on the corresponding tessellation factor. The total number of outside points is stored in the magic numbers as the base offset for the inside points. The total number of both outside and inside points is stored here as well.


Point Generation

At the start of point generation, the tessellator sets an integer named pointOffset to zero. This variable is used as an index for accessing the vertex buffer. Quad point generation is split into two processes: outside point generation, which generates the points for each of the four edges of the quad, and inner point generation, which generates all the points interior to those edges. Taken together, the process can be thought of as traversing the quad in a spiral pattern. Such a spiral pattern takes advantage of the fact that a given set of points always lies along a line perpendicular to one of the two U-V axes, so one coordinate of every point on that line stays constant. This is not unlike the equation of a horizontal or vertical line, the difference being that here each point is a UV coordinate as described earlier.

Outside Points

As the edges are traversed, odd edges have a constant V location. Conversely, even edges have a constant U location. Edge 0 has a U coordinate of 0, while edge 2 has a U coordinate of 1. Edge 1 has a V coordinate of 0, while edge 3 has a V coordinate of 1. The portion of the UV coordinate that is not constant is calculated via the placePointIn1D function. This is done per edge by a for-loop that runs from 0 to an endpoint calculated ahead of time in the TessFactorContext. The endpoint is reduced by 1, since the next edge’s loop will already calculate that point.

Figure 2: The edges of the quad are labeled 0 - 3. Each of the four points shown, (0, V), (U, 0), (1, V), and (U, 1), could be any arbitrary point along an edge.

Beginning with edge 0, points are placed bottom to top along the V axis. On edge 1, points are placed from left to right along the U axis. On edge 2, points are placed top to bottom along the V axis, and finally on edge 3, points are placed right to left along the U axis. This allows for simple code reuse since both even edges and both odd edges use the same point calculations, except in opposite order. On edges 1 and 2, the order is flipped by subtracting the current point from the end point. Each time a point is placed into the vertex buffer, the pointOffset variable is incremented. As an example:

DefinePoint(1, param, pointOffset++)

The first argument is the U-coordinate, the second argument is the V-Coordinate, and the third is the pointOffset. This example is placing points along edge 2. After the last point on the last edge is placed, the outer point generation is completed.

Inside Points

The inside point generation is a bit more complex, and involves 3 nested for loops.

for each ring
    for each edge (in a ring)
        for each point (on an edge)

Figure 3: The spiral pattern of the outside points. The numbers boxed outside depict the values passed to placePointIn1D. The interior of the quad is labeled “Inside Spiral Portion”.

The number of rings is calculated using data from the tessellation factor context. Each tessellation factor has an associated number of points that it would generate if it were tessellating one line. For example, a tessellation factor of 3 would generate 4 points, since a line would be divided into 3 segments. This can be seen in figure 3.

Variable Calculation

(int) startPoint This is just the current ring, renamed for clarity based on how it will be used later.

(int) numRings Min(numPointsInner1, numPointsInner2) / 2

(int) endpoint[0] numPointsInner1 – 1 – startPoint

(int) endpoint[1] numPointsInner2 – 1 – startPoint

To determine the number of inside rings, take the smaller of the point counts associated with the two inner factors. Halving this number yields the number of rings. At the beginning, the ring loop initializes the startPoint to 1, and the two endpoints are calculated based on the point counts associated with the two inner factors. Each iteration of the ring loop increments the startPoint and recalculates both endpoints: endpoint[0] and endpoint[1].

Figure 36: Each ring places points along 4 edges, and each edge contains some number of points based on the inner factors.
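The ring bookkeeping described above can be sketched with two small helpers. The function names are hypothetical; the formulas are the ones from the variable table (numRings, endpoint[0], endpoint[1]).

```cpp
#include <algorithm>
#include <array>

// Number of rings: half of the smaller inner point count (integer division).
int numRings(int numPointsInner1, int numPointsInner2) {
    return std::min(numPointsInner1, numPointsInner2) / 2;
}

// endpoint[0] and endpoint[1] for the ring that starts at startPoint.
std::array<int, 2> ringEndpoints(int numPointsInner1, int numPointsInner2,
                                 int startPoint) {
    return {{numPointsInner1 - 1 - startPoint,
             numPointsInner2 - 1 - startPoint}};
}
```

With 9 points on each inner axis and a startPoint of 1, both endpoints come out as 7; incrementing the startPoint to 2 for the next ring moves them to 6.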

The number of edges to use for the edge loop is 4, since each ring has 4 edges. The edge logic is similar to the previous outer edge calculations but with some additional challenges. The primary difficulty lies in proper placement of the perpendicular portion of the UV coordinate. When calculating the outer edges, this value was trivially either zero or one. Now, the value can lie anywhere in [0, 1] and changes depending on the ring being calculated. Each iteration of the edge loop calculates several important values, including the perpendicular portion of the UV coordinate.

Variable Calculation

(int) parity[0]: oddOrEvenParity(edge), where edge is the current edge. This governs whether an edge is moving along the U or V axis.

(int) parity[1]: oddOrEvenParity(edge + 1), where edge + 1 is the next edge. This governs whether an edge is moving along the U or V axis.

(int) perpendicularAxisPoint: For edges 0 and 1: startPoint. For edges 2 and 3: endpoint[parity[0]]. The axis point is passed to the placePointIn1D function and changes depending on the edge and the parity of the current edge.

(float) perpParam: The perpendicular portion (either U or V) returned by placePointIn1D.

After the perpendicularAxisPoint is calculated it is passed to the placePointIn1D function which then returns the perpendicular U or V coordinate to use in the next loop.

The innermost loop is responsible for calculating the second portion of the UV coordinate. It loops from p = startPoint to p < endpoint[parity[1]]. In other words, its terminating condition depends on the parity of the next edge (really the inverse parity of the current edge). When the edge is edge 0 or 3, the order of the points is reversed. For example, if looping from p = 1 to p < 4, instead of placing points in the order {1, 2, 3}, they are placed as {4, 3, 2}. This reversed point, q, is calculated as:

q = endpoint[parity[1]] – (p – startPoint)
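A short sketch of the reversed traversal is given here, using a hypothetical helper that collects the q values for one edge. It only applies the formula above; endPoint stands for endpoint[parity[1]].

```cpp
#include <vector>

// Returns the q values visited when p runs from startPoint up to (but not
// including) endPoint and the order is reversed.
std::vector<int> reversedOrder(int startPoint, int endPoint) {
    std::vector<int> order;
    for (int p = startPoint; p < endPoint; ++p) {
        order.push_back(endPoint - (p - startPoint));
    }
    return order;
}
```

Calling reversedOrder(1, 4) yields {4, 3, 2}, matching the example in the text, and reversedOrder(1, 7) yields {7, 6, 5, 4, 3, 2}.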

Once this reverse point has been determined, the placePointIn1D function is called to calculate the second portion of the UV coordinate. A point is then defined based on the parity of the edge. If the edge is odd, then the U coordinate is the perpendicular coordinate that will be reused. Otherwise, the V coordinate will see reuse.

Odd: DefinePoint(perpParam, param, pointOffset++)

Even: DefinePoint(param, perpParam, pointOffset++)

Figure 37: The inner for loop for the first edge calculates the values to pass to placePointIn1D: {7, 6, 5, 4, 3, 2}. The V axis tess factor here has been rounded to 8, and numPointsInner2 is calculated as 9. The startPoint is 1, and the endpoint = 9 – 1 – startPoint = 7. Point order is reversed since the edge is even (0).

After the last ring has been completed, inner point generation is finished, except for two exceptional cases that occur only when an inner tessellation factor is rounded to an even number. When this occurs, the middle portion of the ring becomes degenerate, i.e. it degenerates into a single row or column of points, and the logic that normally calculates a ring fails. This means that two additional loops need to be created to handle these two special cases.

The easiest way to handle this behavior is to run the regular ring loop just as before, but afterward use the following pseudo-code:

If (tessParityInner[0] == EVEN OR tessParityInner[1] == EVEN)
    If numPointsInner1 (U-Axis) > numPointsInner2 (V-Axis)
        for each point
            DefinePoint(p, 0.5, pointOffset++)
    If numPointsInner1 (U-Axis) <= numPointsInner2 (V-Axis)
        for each point
            DefinePoint(0.5, p, pointOffset++)

Figure 38: The degenerate rings of the two edge cases are shown in red, while the regular rings are shown in blue.


This ensures that the middle line in both of these cases is filled in properly along the center value of 0.5 for the U or V coordinate. This technique also works regardless of the number of rings or the total number of points.

With both the inner and outer point generation completed, the Vertex Buffer now contains all of the correct UV coordinates.

Figure 39: No matter the total number of points or the width, there will always be only one degenerate ring, and it will always lie along the line U = 0.5 or V = 0.5.

Point Connectivity

The purpose of point connectivity is to create a series of indices into the vertex buffer that together define a primitive geometric shape that the graphics card can draw and rasterize. In the case of a quad, that primitive is the triangle; even the simplest non-tessellated quad is made up of two triangles and four points. Each of the two triangles would normally require three vertices, for a total of six. The index buffer allows two of the vertices to be reused, since at least two vertices must be shared by these triangles. In longer triangle strips, these savings can be quite significant.

When a quad is made of two triangles, the vertex buffer contains the following:

Vertex Number U V

0 0.0f 0.0f

1 0.0f 1.0f

2 1.0f 1.0f

3 1.0f 0.0f

The direction that the triangles are wound depends on the vertex order that is specified in the Index Buffer. There are two possible winding directions:

o Counter clockwise (CCW) Example: {0, 2, 1}

o Clockwise (CW) Example: {0, 1, 2}

Index Corresponding Vertex

0 0

1 2

2 1

3 0

4 3

5 1
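One way to check the winding of an index triple against the vertex table is the sign of the triangle's signed area. The sketch below is an illustration only: the names are hypothetical, and it assumes U points right and V points up, so a positive area means counter clockwise.

```cpp
struct UV {
    float u;
    float v;
};

// The four vertices of the untessellated quad, as listed in the vertex table.
const UV kQuadVertices[4] = {
    {0.0f, 0.0f}, {0.0f, 1.0f}, {1.0f, 1.0f}, {1.0f, 0.0f}};

// Twice the signed area of triangle (a, b, c); positive means counter
// clockwise under the assumed axis orientation (U right, V up).
float signedAreaTimesTwo(const UV& a, const UV& b, const UV& c) {
    return (b.u - a.u) * (c.v - a.v) - (b.v - a.v) * (c.u - a.u);
}

bool isCounterClockwise(int i0, int i1, int i2) {
    return signedAreaTimesTwo(kQuadVertices[i0], kQuadVertices[i1],
                              kQuadVertices[i2]) > 0.0f;
}
```

Under that assumption, the triple {0, 2, 1} is reported as counter clockwise and {0, 1, 2} as clockwise, in agreement with the two examples listed above.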

Figure 8: The triangles are numbered in blue. The four vertices are numbered 0 - 3. The winding direction for the triangles is counter clockwise (ccw).

A programmer using the DirectX 11 or OpenGL pipelines can switch between these two winding orders as needed, so both must be supported by the Tessellator. As stated earlier, the triangles are placed together in much the same way as a strip of triangles.

The strip inside a quad, however, has been twisted into a spiral pattern due to the order of the point generation. The first triangle is made from the first point on the outside edge, the first point on the first inside ring, and the second point on the outside edge. The next triangle is made from the second point on the outside edge, the second point on the first inside ring, and the third point on the outside edge. The pattern continues until that side of the quad has been traversed.

For points after the halfway point, the direction of the diagonals is flipped, almost as if the triangles had been calculated backwards, from triangle 6 to triangle 4. In the first example, all of the tessellation factors have been set to 4, causing the rings to line up neatly. When inside and outside tessellation factors are set to differing levels of detail, issues begin to crop up in the proper sequence of the triangles’ connectivity.

Figure 40: A Triangle strip made of 7 vertices, v0-v6. Credit: Khronos.org

Figure 41: The triangles are connected in order along the spiral.

This irregular difference in the number of points can be corrected by calculating the correct triangle connections based on:

o Number of inside points for a given edge.
o Number of outside points for a given edge.
o Type of needed diagonal connection:
1. Inside to outside
2. Inside to outside (except middle)
3. Diagonals mirrored

Furthermore, since the connections made in the index buffer are completely independent of the actual UV locations of the points, this technique can work for any parallel pair of inside and outside edges. This portion of the algorithm is placed inside the stitchRegular() function for reuse.

Figure 42: For the left and right edges, the connectivity between the inside of the quad and the outside of the quad is no longer 1 - 1.

The whole connectivity algorithm for quads works as follows:

For each ring
    For each outer/inner edge pair
        Call stitchRegular, passing:
            insideEdgePoint: the starting point for the inner row or column of points
            outsideEdgePoint: the starting point for the outer row or column of points
            numInsideEdgePoints: in the previous figure, there are 3 inside edge points per edge
            baseIndexOffset: the base index that will be used for emitting new triangles into the index buffer (an index for the index buffer)
            Diagonal type: based on the tessellation factor and the side of the quad

Figure 43: In this case, there would be two problem points. Both are highlighted in red.


In most cases, no incorrect triangle will be generated in the index buffer using the aforementioned method. However, at the end of each ring, the very last triangle always contains a wrong point. The indexing of the final point is what causes this error. Recall that each ring of points begins on a certain start index, and that the very last point of a complete ring would itself be that same start index. The stitchRegular() function has no context or concept of these rings, so instead of detecting the end of a ring and indexing the vertex of the triangle as the first point, it instead believes erroneously that a new point exists. One simple fix to this problem is to create a small lookup table that contains the indices of these incorrect points, along with the new points, and “patch” any inconsistencies as they are created.

A special case also exists when one of the inside tessellation factors is odd. This causes a degenerate row of triangles that is missed by the ring-based triangle connections. Much like the degenerate row of points during point generation, this row of triangles can occur either horizontally or vertically along the U–V axes. The method to handle these two cases is nearly identical to the method used in point generation.

Figure 44: The degenerate row of quads is shown in red.

After the normal ring algorithm has run to completion, check if either of the two inner factors is odd. Next, check which of the two factors will generate the most points. These values have conveniently been pre-calculated in the TessFactorCtx as numPointsForInside[0] (U-Axis) and numPointsForInside[1] (V-Axis). If the number of points along the U-Axis is greater, a quad strip is connected along the U-Axis. Otherwise, if the number of points along the V-Axis is greater, a strip is connected along the V-Axis. If the number of points for both axes is equal, a single quad is connected in the center. After the correct two rows of points are determined, they are processed via stitchRegular(), which connects some number of primitive quads; that is to say, each connected quad has only 4 points and two triangles.
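To make the idea of a strip of primitive quads concrete, the sketch below connects two parallel rows of points into two triangles per quad. It is not the stitchRegular() function from the text: the helper name is hypothetical, and it assumes both rows hold the same number of points, stored contiguously and running in the same direction in the vertex buffer.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper: emits index triples for the quads between row A
// (starting at vertex index rowA) and row B (starting at rowB), where each
// row holds numPoints consecutive vertices.
std::vector<std::uint32_t> connectRowPair(std::uint32_t rowA,
                                          std::uint32_t rowB,
                                          std::uint32_t numPoints) {
    std::vector<std::uint32_t> indices;
    for (std::uint32_t i = 0; i + 1 < numPoints; ++i) {
        const std::uint32_t a0 = rowA + i;
        const std::uint32_t a1 = rowA + i + 1;
        const std::uint32_t b0 = rowB + i;
        const std::uint32_t b1 = rowB + i + 1;
        // Each primitive quad has 4 points and is split into two triangles.
        indices.insert(indices.end(), {a0, b0, a1});
        indices.insert(indices.end(), {a1, b0, b1});
    }
    return indices;
}
```

Two rows of 3 points each produce 2 quads, i.e. 4 triangles and 12 indices; two rows of 2 points produce the single center quad. The resulting winding depends on how the rows are laid out, so a real implementation would still need to honor the requested winding order.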

Figure 45: An equal number of points for the inner tessellation factors causes a single quad to be missing from the center.


After the last triangle is connected and placed into the index buffer, the triangle connectivity has run to completion. Although our understanding of quad connectivity as described above is mostly complete, this is an area of the tessellation project still being researched, so certain details and edge cases still need to be fleshed out. Regardless, the algorithm works closely enough to our current model that we do not feel that our final design will be drastically different.

Parallel Quad Tessellation Design

High Level Design

The input to the quad primitive generator will simply be the six floating point tessellation factors as described in the quad tessellation section. The needed context for each tessellation factor will be placed into a single read-write buffer on the GPU, including the unprocessed tessellation factors. Based on the raw tessellation factors, two buffers will be created on the GPU with sufficient space to act as the Vertex and Index buffers. After the raw tessellation factors have been loaded onto the GPU, a compute shader will be dispatched to process the tessellation factors. When this shader’s execution has completed, four more compute shaders will be dispatched – each with access to the now processed tessellation context. The first two of these shaders will handle point generation while the last two will handle triangle connectivity.

The point generation shaders are dispatched at the same time as the point connectivity shaders because, even though these processes seem interdependent, they can be completed separately without any data shared between them. One of the reasons for this is that the points are generated in such a regular pattern that only the number of points and the TessFactorContext are needed for the point connectivity.

Figure 46: Each of the needed shaders has access to the six tess. factor contexts. Diagram labels: Raw Factors; Tess. F. Process Shader; Context [1 – 6]; PointGen Inner Shader; PointGen Outer Shader; ConnGen Outer Shader; ConnGen Inner Shader; Vertex Buffer; Index Buffer.

Detailed Design

Processing Tessellation Factors

Each tessellation factor’s context will be stored in the following structure:

TessFactorCtx

(float) invNumSegmentsOnFloorTessFactor

(float) invNumSegmentsOnCeilTessFactor

(float) halfTessFactorFraction

(float) tessFactor

(int) numHalfTessFactorPoints

(int) splitPoint

(Parity) tessFactorParity

(int) numPointsForTessFactor

(bool) isCulled

(bool) isMinimumWork

An array of length 6 will represent the factors.

TessFactorCtx factors[6];
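On the C++ side, the structure could be declared roughly as follows. This is a sketch under assumptions: the two flags are stored as 32-bit integers on the assumption that HLSL treats bool as a 4-byte value inside a structured buffer, and the Parity enum values are hypothetical.

```cpp
#include <cstdint>

// Hypothetical enum values; the text only names the Parity type.
enum class Parity : std::uint32_t { Even = 0, Odd = 1 };

struct TessFactorCtx {
    float invNumSegmentsOnFloorTessFactor;
    float invNumSegmentsOnCeilTessFactor;
    float halfTessFactorFraction;
    float tessFactor;
    std::int32_t numHalfTessFactorPoints;
    std::int32_t splitPoint;
    Parity tessFactorParity;
    std::int32_t numPointsForTessFactor;
    std::uint32_t isCulled;       // assumption: 4-byte flag to match HLSL bool
    std::uint32_t isMinimumWork;  // assumption: 4-byte flag to match HLSL bool
};

// Ten 4-byte members with no padding, so the stride is easy to mirror in HLSL.
static_assert(sizeof(TessFactorCtx) == 40, "unexpected TessFactorCtx stride");

// One context per tessellation factor: two inner and four outer.
TessFactorCtx factors[6];
```

Keeping every member at 4 bytes means the element stride used when creating the structured buffer (40 bytes here) matches the size of the array element on the CPU.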

The factors must first be loaded onto the GPU. In DirectCompute, this means wrapping the data in several layers of buffers, sub-resources, and shader resource views. The reason that so many layers exist is that there are many different types of buffers that can be created on the GPU, and each of these buffers can be configured to interact efficiently with a large number of threads. Shader Resource Views allow for even more customization by ensuring that a shader interacts with a buffer in a very specific manner.

Figure 47: The initial data is loaded into a view that the shaders can access. Diagram labels: TessFactorCtx; Sub-resource; RWStructured Buffer (Context); Unordered Access View; Shaders on GPU.

For loading the tessellation factors and their initialized context onto the GPU, the array will first be placed into a D3D11_SUBRESOURCE_DATA. A structured buffer will be created to contain this sub-resource. Because the buffer will be accessed by 4 shaders simultaneously, the resource view for the buffer should be an unordered access view (UAV), which allows multiple shaders to read from it concurrently.

Since the array of TessFactorCtx is of length six, the high level shader language (HLSL) shader will declare a thread group of 6 threads, with the thread group attribute looking something like:

[numthreads(6, 1, 1)]

This launches six threads in the X dimension for each thread group dispatched from the C++ code running on the CPU. Each thread is indexed with an X, Y, Z position, so after a single group is dispatched the following threads are running:

Thread (0, 0, 0)

Thread (1, 0, 0)

Thread (2, 0, 0)

Thread (3, 0, 0)

Thread (4, 0, 0)

Thread (5, 0, 0)

Processing each tessellation factor in parallel will then be achieved by each thread calculating the TessFactorCtx that corresponds to its own Dispatch Thread ID.x. When all six threads run to completion, the TessFactorCtx will have been filled in, and execution will resume on the CPU so that the shaders responsible for point generation and point connectivity can be dispatched.

Point Generation

Because the TessFactorCtx has already been loaded onto the GPU, only one buffer needs to be loaded onto the GPU for point generation. This buffer will serve as the Vertex Buffer and will be the output for this stage. Loading the buffer onto the GPU follows a similar pattern to the buffer used for the tessellation context, with two key differences. The buffer is a simple buffer instead of a structured buffer, and it does not require a sub-resource since there is no initial data to load onto the buffer. The signature for the buffer as seen from the point generation HLSL file uses a built-in data type:

RWBuffer<float2> vertexBuffer;


Float2 is a simple data type consisting of an X and Y float, which for the purposes of point output will act as the U-V coordinate storage.

In this proposed parallel implementation, many of the function signatures described previously for quad tessellation will be identical. The primary difference lies in how the nested loops will be traversed. Chiefly, in the shader implementation, they will not be traversed. Instead, the for-loops will be “unwrapped” as much as possible by utilizing clever thread indexing. Take the outer point generation loops as an example:

For each edge
    For each point on an edge
        placePointIn1D

Four threads will be launched in place of the outer for-loop, rather than iterating from edge 0 to 3. Unfortunately, in this case, the inner for-loop cannot be unwrapped due to the potentially irregular nature of the four outer tessellation factors. However, the four threads will allow all four edges to be calculated simultaneously.

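Thread 0
    For each point on an edge
        placePointIn1D
Thread 1
    For each point on an edge
        placePointIn1D
Thread 2
    For each point on an edge
        placePointIn1D
Thread 3
    For each point on an edge
        placePointIn1D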

To place points correctly into the Vertex Buffer, each thread must calculate the baseOffset of its own “first point”. Thread 0 has no baseOffset, since it places the very first point. Each of the subsequent threads must add in the number of points that the previous threads will have placed. These numbers are already calculated in the TessFactorCtx, so it is just a matter of accessing the information per thread.

Figure 48: Both shaders access the Vertex Buffer at the same time, with provisions that ensure they only write to separate locations. (Diagram: the Outer Compute Shader and the Inner Compute Shader both write to the RWBuffer (Vertex Buffer) through an Unordered Access View.)

[Figure: Vertex Buffer layout for the outer points. Thread (0, 0, 0) writes Index 0 through Index 7, thread (1, 0, 0) writes Index 8 through Index 15, and threads (2, 0, 0) and (3, 0, 0) write the ranges that follow.]
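The baseOffset rule for the four outer-edge threads can be sketched in C++ as follows. This is a minimal illustration, not the shader itself: `baseOffsetForThread` and the `pointsPerEdge` array are hypothetical names, and the per-edge point counts are assumed to be the values that the TessFactorCtx already provides.

```cpp
#include <array>
#include <cassert>

// Returns the first Vertex Buffer index that a given outer-edge thread may
// write to: the total number of points placed by all lower-numbered threads.
// pointsPerEdge is a hypothetical input holding the point count per edge.
int baseOffsetForThread(const std::array<int, 4>& pointsPerEdge, int threadId) {
    assert(threadId >= 0 && threadId < 4);
    int offset = 0;  // thread 0 keeps an offset of 0
    for (int edge = 0; edge < threadId; ++edge) {
        offset += pointsPerEdge[edge];
    }
    return offset;
}
```

With eight points on every edge, the threads start at indices 0, 8, 16, and 24, so no two threads ever write to the same location even though they run at the same time.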


The inner point generation is more regular and both outer for-loops can be unrolled into thread IDs as before.

For each ring
    For each edge
        For each point on an edge
            placePointIn1D

Becomes:

Thread(ring, edge)
    For each point on an edge
        placePointIn1D

The shader for the inner points will be dispatched with (ring, edge, 1) threads. Each ring is indexed by the Dispatch Thread ID.x coordinate, and each edge is indexed by the Dispatch Thread ID.y coordinate. For example, the thread handling the calculations for the 4th ring on edge 3 would have the dispatch thread ID (3, 3, 0), while the thread handling the 1st ring on edge 0 would have (0, 0, 0).

Figure 49: An inner point generation with 2 rings and 16 points. A total of 8 threads are dispatched. There are rings 0 and 1. Edges 0 – 3. Each point has the corresponding thread that is responsible for its calculation represented as an ordered triple (ring ID, edge ID, 0).

Because the loops have been unfurled, calculations that a parent loop would normally make once must now be made per child. A good example of this is the perpendicular U-V parameter that was previously calculated in the edge-loop. Now, this parameter must be calculated by each thread. This small amount of extra work pales in comparison to the amount of parallelism that the new thread indexing provides. The nested method would require 1024 sequential steps to calculate the locations of the inner points for inner tessellation factors of 32 on the U-axis and 32 on the V-axis. The parallelized method would dispatch (16, 4, 1) threads, for a total of 64 simultaneous threads. The threads that perform the most work are those of the outermost rings, such as thread (0, 0, 0), which will loop at most 30 times. Naïvely, this makes it seem as if the parallel version takes only 2.9% of the time of the sequential version, but this is not a true order analysis, and it is not based on actual instruction count or execution time. This is an area that will require testing to determine the actual speedup, if any.
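The arithmetic behind the 64-thread and 2.9% figures can be reproduced with a small C++ sketch. The function names are hypothetical, and the inputs (16 rings, 4 edges, 30 iterations, 1024 sequential steps) are simply the numbers quoted above, not measured values.

```cpp
#include <cassert>

// Total threads for a (rings, edges, 1) dispatch.
int dispatchedThreads(int rings, int edges) {
    assert(rings > 0 && edges > 0);
    return rings * edges;
}

// Naive ratio of "longest single thread" to "all sequential steps". This
// ignores instruction counts, memory traffic, and dispatch overhead, which
// is why the text treats the result as an estimate only.
double naiveTimeRatio(int longestThreadIterations, int sequentialSteps) {
    assert(sequentialSteps > 0);
    return static_cast<double>(longestThreadIterations) / sequentialSteps;
}
```

Calling `dispatchedThreads(16, 4)` gives the 64 threads of the example, and `naiveTimeRatio(30, 1024)` gives roughly 0.029, the 2.9% quoted in the text.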

Point Connectivity

The Tessellation Context that the connectivity needs will have already been loaded onto the GPU by the tessellation processing stage. The point connectivity will also launch two compute shaders, one for the outside ring of triangle connections and one for the inside ring. The shaders will access the index buffer each time a triangle is emitted. This index buffer will be loaded into an RWBuffer of integers, and the shaders will have access to the RWBuffer via another unordered access view (UAV).

The current proposed design for the shaders is nearly identical to that of the inner point generation. Since there is a regular number of rings, and a fixed number of sides for each ring, the nested loops can be unwound in the same fashion. The shaders will be dispatched with ring number of threads for the X IDs and edge number of threads for the Y IDs.

Figure 50: After the index buffer is loaded onto the GPU, the shaders access it via the UAV. (Diagram: RWBuffer (Index Buffer), Unordered Access View.)

Attempted Parallel Implementations

The actual implementation ended up being a fair bit different from the proposed implementation for a number of reasons, the primary one being performance. During the initial design phase, an overlooked detail was the impact of dynamic flow control when branching on thread IDs. For the GCN architecture to process a group of threads efficiently, all threads in the group should ideally execute the same instruction. After studying the architecture in further depth, this makes logical sense, since each group physically executes on a SIMD core. Normal branch instructions do not necessarily pose the same risk, because although a branch is taken, all threads that encounter the branch will take it. Unfortunately, the initial design placed threads in situations that guaranteed divergence between all threads.

A second design, one that we had mid-semester, involved completely unrolling all loops instead, with the hope that large sequential accesses would allow for great cache performance. The primary issue with this approach is that it too requires all threads to diverge significantly from each other. A secondary issue, and also a primary cause of the divergence, is that every thread must now correctly calculate its own position within the buffer, the ring that it lies in, and the edge the position is on, all of which are different values arrived at through divergent calculations. Worse yet, that is only the initial setup stage for each location in the output buffer. After these values have been calculated, the calculations for either the vertex or index buffers must be performed, which also creates divergence between all threads.
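To give a rough feel for why divergence is costly, the C++ sketch below uses a deliberately simplified cost model. It assumes, purely for illustration, that all lanes of one SIMD group run in lockstep, so the group stays busy for as long as its slowest lane. The function name `simdUtilization` and the per-lane iteration counts are hypothetical; this is not a model of real GCN scheduling.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <vector>

// Fraction of lane-time spent on useful work when every lane must wait for
// the lane with the most loop iterations (simplified lockstep assumption).
double simdUtilization(const std::vector<int>& iterationsPerLane) {
    assert(!iterationsPerLane.empty());
    int slowest = *std::max_element(iterationsPerLane.begin(), iterationsPerLane.end());
    if (slowest == 0) {
        return 1.0;  // no work at all, so nothing is wasted
    }
    long useful = std::accumulate(iterationsPerLane.begin(), iterationsPerLane.end(), 0L);
    return static_cast<double>(useful) /
           (static_cast<double>(slowest) * iterationsPerLane.size());
}
```

Under this model, four lanes that each loop 30 times are fully utilized, while lanes looping 30, 20, 10, and 0 times reach only 50% utilization, since the faster lanes idle while the slowest one finishes.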

The tessellated primitive for quadrilaterals presents a unique challenge, since the patterns necessary to produce correct output can be quite complex. Correctness was verified through two methods: a side-by-side visualization of the output pattern, and an iterative test script. The test script increases all tessellation factors by a uniform step of 0.01. The parallel implementation was tested with factors over the range [0, 65], with all cases passing. The script ensures that the contents of each index buffer match exactly, while allowing a difference of 0.0005 for the U-V coordinates in the vertex buffer. This accepted difference accounts for inaccuracies in the 32-bit fixed point reference when compared with our more accurate IEEE float implementation.

For performance, our implementation takes advantage of the quad’s square structure, since even when subdivided, the quad will still be composed of quads. There are two distinctions, however: each outermost edge may be subdivided by a different factor, while the remaining interior subdivides in NxM dimensions. The reference algorithm provided by Microsoft emits points in an outside-to-inside square spiral in order to ensure there are never any duplicate points during triangulation. An added benefit of this ordering is that the triangulation may be performed independently of point generation. To take advantage of this inherent parallelism, two thread groups are dispatched to perform these calculations. Each thread group has a group size of 64 to take advantage of AMD’s GCN architecture, which executes threads in 64-wide groups. A group of threads calculates the data for all points on an edge, one edge at a time. When the kernel is first launched, it calculates the aforementioned magic numbers and stores the results in groupshared memory, a low-level data store. The number of threads each kernel uses totals 128. Because of the relatively small kernel size, multiple dispatches may be launched at once, allowing many patches to be processed without filling more than a few compute units.
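The verification rule used by the test script (index buffers must match exactly, U-V coordinates may differ by up to 0.0005) can be sketched in C++ as follows. The `UV` struct and both function names are hypothetical, and using `int` for indices is an assumption based on the index buffer being described as a buffer of integers.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical CPU-side mirror of the float2 entries in the vertex buffer.
struct UV {
    float u;
    float v;
};

// Index buffers must be identical, element for element.
bool indexBuffersMatch(const std::vector<int>& reference, const std::vector<int>& candidate) {
    return reference == candidate;
}

// Vertex buffers must have the same length, and each U and V component may
// differ from the reference by at most the given tolerance.
bool vertexBuffersMatch(const std::vector<UV>& reference,
                        const std::vector<UV>& candidate,
                        float tolerance = 0.0005f) {
    if (reference.size() != candidate.size()) {
        return false;
    }
    for (std::size_t i = 0; i < reference.size(); ++i) {
        if (std::fabs(reference[i].u - candidate[i].u) > tolerance) return false;
        if (std::fabs(reference[i].v - candidate[i].v) > tolerance) return false;
    }
    return true;
}
```

Keeping the two checks separate reflects the different requirements: connectivity is integer data and has no excuse to differ, while the coordinates come from two different number formats.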

Experimental Results:

Design Summary

Isolines

Our final isoline implementation is a shader that calculates the vertex and index data for a single point. The dispatch grouping of the shader is based on the tessellation factors. If the output patch would resemble a cube, the work is dispatched into groups of threads that are each responsible for an 8x8 segment of the output, to reduce resource usage. Otherwise, a group of threads is dispatched for each row of points.

[Chart: CPU vs GPU. Vertical axis: Time (ms), 0 to 2.5. Horizontal axis: Tessellation Factors.]

CPU VS GPU RUN TIME (IN MS) OF TESSELLATOR AT VARIOUS INPUTS

Tessellation factors    CPU (ms)    GPU (ms)
16x16x16x16x16x16       0.00718     0.31857
32x32x32x32x32x32       0.02932     0.76839
48x48x48x48x48x48       0.06026     1.47739
64x64x64x64x64x64       0.11265     2.36981
32x16x8x6x4x2           0.00156     0.13351
64x15x23x14x13x46       0.01564     0.46012

Triangles

The two primary tasks of tessellation, point generation and connectivity, are split into separate shaders that run in parallel. Each shader computes its own copy of the necessary derivative values of the tessellation factor contexts (processed tessellation factors). This avoids unnecessary, expensive communication between the CPU and GPU, at the cost of statically dispatching threads to the shaders: because no information about point generation or connectivity is known before the shader dispatch calls, threads are wasted at lower tessellation values. For point generation, 67-by-64 threads are dispatched to compute one point per thread.

Quads

The quad tessellation begins by processing the input tessellation factors, using a single compute shader to calculate the tessellation factor context for each of the 6 factors.

1. An RWStructuredBuffer is loaded onto the GPU to hold the TessFactorCtx.
2. A compute shader with six threads is dispatched to process the factors.
   a. [numthreads(6, 1, 1)]
3. After the previous shader has finished executing:
   a. An RWBuffer is loaded onto the GPU to be used as the Vertex Buffer.
   b. An RWBuffer is loaded onto the GPU to be used as the Index Buffer.

The number of threads required for point generation is calculated based on information from the tessellation factor context. The outermost point generation requires no additional calculations, so the compute shader that handles the outside points is launched at this point with 4 threads, one for each edge of the quad. In order to calculate the number of threads to dispatch for the inner quad point generation, the number of inside rings is calculated from the TessFactorCtx. After this calculation, ring-by-4 threads are dispatched to calculate the inside points. Both of these compute shaders output their points into the vertex buffer. When both have completed execution, quad point generation is complete.

Figure 51: The high level diagram for quad tessellation. (Diagram: the raw factors pass through the tessellation factor process shader to produce Context [1 – 6]; the PointGen kernel writes the Vertex Buffer and the ConnGen kernel writes the Index Buffer.)

At the same time that the two point generation compute shaders are being prepared for dispatch, the two quad connectivity shaders are also about to launch. For brevity’s sake, it is worth noting that the number of threads and the indexing of said threads are nearly identical to those of the threads dispatched for point generation. After the proper number of threads is calculated for the two connectivity shaders, they are dispatched. Due to the regular nature of the point generation’s spiral pattern, the point connectivity can be stitched completely separately from the point generation. In practice, all the connectivity algorithm needs to generate correct output is the Tess Factor Context.

Figure 52: Index Buffer for quad connectivity. (Diagram: RWBuffer (Index Buffer), Unordered Access View.)

Project Administration

Facilities and Equipment

The facility that we typically use for team collaboration is the EECS senior design lab. This facility is extremely new and clean, which is one of the things that makes the lab such a great meeting place. It is important that we keep the room’s relaxing and clutter-free atmosphere intact, especially since more teams will use it in the coming semesters.

Personal Work

Erwin Holzhauser

The project provides a satisfying balance of research and implementation and an opportunity to branch out to technologies and concepts not covered in a standard computer science curriculum. Research-wise, this project provides the opportunity to gain a deeper general understanding of computer graphics and parallelism, the role of tessellation in rendering surfaces, and the potential performance trade-offs of fixed-function hardware versus equivalent software implementations. Technology-wise and implementation-wise, this project provides the opportunity to learn a shading language well enough to build on an existing code base, and to work with cutting-edge graphics cards. Finally, the implications of improved tessellation methods for CAD and video game applications make this a very attractive project.

Matthew Faller

My passion for the last few years has been learning about the algorithms that are used in computer graphics and game development. My focus thus far has been 2D algorithms and structures such as quadtrees, collision detection, and OpenGL’s fixed function pipeline. This project is an exciting opportunity to learn about programmable graphics and, in particular, parallel computing. Tessellation is a subject that also interests me from a 3D modeling standpoint, since I also use 3D software for design. Naturally, Catmull-Clark is one of my favorite subdivision algorithms.

David Sierra

I do not have any real experience with HLSL, GPU programming, or parallel programming in general. I chose this project because I wanted to get into massively parallel programming on the GPU. I first learned about parallel computing when I read an article in Wired. The article was about high speed trading computers on Wall Street and the industry of having the fastest software and internet connection in order to gain an edge on the stock exchange. Although not massively parallel, the apps that I have written where I have to spawn work in multiple threads have been the most engaging ones to write so far. There is something about the challenge of synchronizing packets of work that makes those programs more fun to write. One thing I did have experience with going into this project was Visual Studio and setting up big C++ projects in it, since it is something I have to work with at my internship. My biggest advantage going into this project is my ability to go into someone’s existing code base, find my way around, and figure out what they were doing. During the first semester of this course, I was also enrolled in COP4331, where we had to make an Android application that had to query multiple web services. For that project I took charge of the background querying of Google services and had to manage multiple threads and callback interfaces. It turned out to be pretty fun and got me even more interested in parallel programming. Next semester our whole group is enrolled in COP 4520, which is the Concepts in Parallel and Distributed Computing class. Overall I am sure it will be an “exciting” semester in the spring.

Lessons Learned

Erwin Holzhauser

Prior to this project, I had negligible Java multithreaded programming experience and absolutely no shader programming or graphics experience. Even then, Java uses a different threading model than DirectCompute shaders; instead of globally assigning work to threads, a DirectCompute shader is written as an instance that each individual thread will run.

Upon completion, I feel more comfortable with DirectCompute, its threading model, and the DirectX 11 graphics pipeline.

As with the completion of any larger-scale program, I feel more confident in my abilities to research a new area of computer science or learn a new technology. Our experience with DirectCompute was painful, as the technology is not heavily documented online, but that only reinforced self-direction in learning new technologies.

I also feel more comfortable with debugging code; while I had some experience with the Java logger, most of my debugging of C/C++ applications has been with planted print statements at points of interest. Now, I feel comfortable with debugging tools such as Visual Studio’s, which allow you to set breakpoints and step into code.

While proactive time management was not a new lesson, it was certainly reinforced during this project out of pure necessity.

Matthew Faller

Although I had a fair amount of experience writing graphics programs for fixed function pipelines, I had never had the chance to dabble in any type of shader, which of course is somewhat sad for a computer graphics enthusiast. To me, shaders were some mystical black box that made incredible things happen behind the scenes in my favorite game engines. After working on this project I have now written vertex, pixel, geometry, and compute shaders. This is in addition to learning how the hull and domain shaders function. The most amazing thing is that the project has expanded my breadth of knowledge in more areas than simply computer graphics and parallel programming. I knew some C++ going into the project, but had never written much more than smaller academic projects in the language. Although C++ is not as intuitive as some languages, I know that I’m a much better programmer for having worked in it for a year while learning a new (and complex) API.

One lesson that I’ve learned is that DirectCompute is best left to projects where the results of a kernel must immediately be passed back into the graphics pipeline. The threading model is nearly identical to that of the widely popular OpenCL, yet DirectCompute has fewer online tutorials, requires more configuration, and has a life interleaved with the D3D11 pipeline. For compute shader projects going forward, I intend to use mostly OpenCL and OpenGL.

David Sierra

Going into this project I had no idea about writing programs for efficiency outside of what I learned in CS1 and CS2. I also didn’t know anything about parallel programming. This project gave me tons of insight into both of these, especially into how to distribute work into efficient chunks and into memory access techniques. It also gave me practical experience developing my own large applications using C++11. Another pretty important lesson I learned was to manage my time better. We would have gotten so much more work done if we had just spent 1 more month doing actual work. The most important thing I think I learned in this project is managing memory in large applications. At some points, my test app was leaking hundreds of megabytes of memory a second and it sure was fun tracking that down.

Project Plan and Milestones

Project Phases:

The project will be broken up into a number of phases, listed below.

1. Plan
2. Research
3. Design
4. Prototype*
5. Implement
6. Test

*Current Phase in Bold

The most important constraints on this project are schedule and scope. The project must be finished by the senior design presentation deadline, otherwise it is a failure. The project must also adhere at minimum to the scope outlined in the specifications section.

Phase        Estimated Duration
Plan         Sep. 10 – Oct. 10
Research     Oct. 10 – Oct. 31
Design       Oct. 15 – Nov. 7
Prototype    Nov. 7 – Dec. 4
Implement    Dec. 4 – Mar. 10
Test         Mar. 10 – Apr. 10

Milestones:

Fall Semester:

Milestone: | Date: | Status:
HLSL Hello, World! Program; Setup Wiki | September 26 | Completed Early
Write sample program with single thread; Setup GitHub and standardize development environment amongst group members | October 3 | Completed Early
Write sample program with multiple threads | October 10 | Completed Early
Signed NDA | October 20 | Completed Late
Survey of contemporary tessellation algorithms and methods | October 24 | Completed
Began Reference Code Analysis | October 29 | Completed Late
Survey of relevant AMD code-base | October 31 | Completed Late
Detailed design of software architecture | November 7 | Completed Late
Naïve, single-threaded implementation of system | December 3 | Late

Spring Semester:

Milestone: | Date: | Status:

Test harness interface complete | December 19 | Started
    The test harness interfaces with the 2D visualizer provided by AMD. This allows for easier unit testing and visual debugging, which will be vital for connectivity. This will also allow for batch tests of every tessellation factor.

Naïve implementation for isolines | December 30 | Started
    This is a shader implementation for isolines that is still as close as possible to the serial reference.

Naïve implementation for triangles | Jan 10 | Started
    This is a shader implementation for triangles that is still as close as possible to the serial reference. This includes both inner-outer point generation and inner-outer connectivity.

Naïve implementation for quads | Jan 10 | Started
    This is a shader implementation for quads that is still as close as possible to the serial reference. This includes both inner-outer point generation and inner-outer connectivity.

Naïve output matches reference | Jan 11 | Incomplete
    Using the test harness, run tests to ensure that all output matches the output expected by the reference rasterizer.

HLSL Tessellation of lines | January 23 | Incomplete
    A highly parallelized version of isoline primitive generation. (Points and Connections)

HLSL Tessellation of triangles | February 20 | Incomplete
    A highly parallelized version of triangle primitive generation. (Points and Connections)

HLSL Tessellation of quads | February 20 | Incomplete
    A highly parallelized version of quad primitive generation. (Points and Connections)

Integration, optimization, until finalized multi-threaded HLSL implementation | March 15 | Incomplete

Integration Testing | March 28 | Incomplete

Testing Methodology

The way in which we were to validate our tessellator output was provided as a fixed specification by AMD. They wanted the vertex and index buffers of our tessellator to match the vertex and index buffers of their reference rasterizer bit for bit.

Figure 53: Screenshot of AMD's reference rasterizer

Once we received their rasterizer we noticed a problem right away. Their rasterizer uses fixed point arithmetic as opposed to standard floating point numbers. Fixed point arithmetic represents numbers with fractional parts by storing the fractional and whole parts in a fixed set of bits that never changes. In practice this number is stored as an unsigned 32-bit integer. For example, AMD’s reference rasterizer stores its fixed point numbers with 1 bit dedicated to the sign, 15 bits reserved for the integer part, and 16 bits reserved for the fractional part. The big upside to this is that you can do fractional math using very fast integer arithmetic hardware. Since AMD used a fixed piece of hardware to accomplish tessellation, this was obviously in their favor, since they wanted to minimize the space taken up by the hardware, especially because when the hardware is not tessellating, it is just sitting there doing nothing.

In our project, though, the graphics cards use standard IEEE 754 floating point calculations and have hardware that accelerates floating point math. Since floating point arithmetic is generally more accurate than fixed point arithmetic, we were encouraged to use it and take advantage of the GPU’s hardware acceleration capabilities. A direct consequence of all this is that our floating point output only matches their reference output to, on average, 3 significant figures.
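The fixed point layout described above (1 sign bit, 15 integer bits, 16 fractional bits) can be illustrated with a short C++ sketch. The function names are hypothetical, and two choices are assumptions made for this sketch rather than facts about the reference rasterizer: the value is held in a signed `int32_t` to keep the arithmetic simple, and conversion rounds to the nearest representable value.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// 16 fractional bits means one unit of the integer equals 1 / 65536.
constexpr double kFixedOne = 65536.0;

// Converts a float to the assumed 16.16 fixed point encoding.
// Only meaningful for |value| < 32768, the range 15 integer bits allow.
std::int32_t floatToFixed(float value) {
    assert(std::fabs(value) < 32768.0f);
    return static_cast<std::int32_t>(std::lround(static_cast<double>(value) * kFixedOne));
}

// Converts the fixed point encoding back to a float.
float fixedToFloat(std::int32_t fixed) {
    return static_cast<float>(fixed / kFixedOne);
}
```

Values such as 0.5 or 0.25 survive the round trip exactly, but a value such as 1/3 comes back slightly changed, because only 16 bits are available for the fraction. Small differences of this kind are why the two implementations cannot be expected to match bit for bit.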

Figure 54: Representing a floating point value in fixed point notation

Testing Harness

In order to streamline testing we have chosen to integrate the reference rasterizer given to us by AMD with Google’s Google Test. Google Test is a C++ testing framework that allows us to easily integrate tests into our code and makes it simple to test a large number of values in a short amount of time. Google Test does this by providing an extensive set of assertions. An assertion is a procedure that evaluates a boolean expression. If the expression does not evaluate to the expected value, the test fails and an error is output. Google Test also makes it easy to insert custom print statements into test case outputs for even more precise diagnostics. This will provide us with nearly instant feedback when we make code changes to any part of the algorithm. These small changes will be extremely numerous during the implementation phase as we scour our implementation for optimizations. For example, we can have a test case where we instantiate 2 tessellators, the C reference tessellator and our HLSL implementation. We can then pass them the same input values and run them. After that we can store the output vertex and index buffers and compare them in a standard for loop. Inside the for loop is a Google Test assertion that expects the two values to be the same. When the values differ, we get output on the screen telling us of the error and the test fails. Google Test can handle hundreds of test cases at a time, so we can get a massive amount of testing done in a relatively short amount of time.

To make testing even simpler, the rasterizer provided to us defines an interface for tessellators. This means we can have our tessellator implement their interface and plug it right into the code they provided. This makes it extremely easy to diff results and run test cases with Google Test. We can even overlay our calculations onto the reference calculations on the screen and get visual feedback on our errors.

Test Cases

Test Objective Correctness of Degenerate Triangle for Even Inner Tessellation Factor

Test Description

Input to the HLSL implementation of the tessellation algorithm tessellation factors t0 = 4.0, t1 = 3.2, t2 = 1.7, i0 = 4.0. Check the output of the vertex and index buffers generated by the HLSL implementation against the output of the vertex and index buffers output by the Microsoft reference rasterizer. This is an edge case, because an even inner tessellation factor results in a ‘degenerate triangle’ (a single point) as the center ring.

Test Conditions For the HLSL implementation and Microsoft reference rasterizer, the tessellation type is set to triangle and the tessellation mode is set to integer.

Expected Results

The HLSL implementation of the tessellation algorithm matches index and vertex buffer output of the Microsoft reference rasterizer.

Figure 55: Google test output

Test Objective Correctness of Fractional Odd Mode Tessellation of Triangles

Test Description

Input to the HLSL implementation of the tessellation algorithm tessellation factors t0 = 1.0, t1 = 1.0, t2 = 1.0, i0 = 1.0. Check the output of the vertex and index buffers generated by the HLSL implementation against the output of the vertex and index buffers output by the Microsoft reference rasterizer. This is an edge case, because fractional odd tessellation of triangles requires a minimum inner tessellation value of 1 + 2^-16. If the inner tessellation value is not clamped correctly, the outer ring will overlap with the inner ring.

Test Conditions For the HLSL implementation and Microsoft reference rasterizer, the tessellation type is set to triangle and the tessellation mode is set to fractional odd.

Expected Results


Test Objective Correctness of Clamping For Fractional Even Mode Tessellation of Triangles

Test Description

Input to the HLSL implementation of the tessellation algorithm tessellation factors t0 = 1, t1 = 1, t2 = 65, i0 = 5. Check the output of the vertex and index buffers generated by the HLSL implementation against the output of the vertex and index buffers output by the Microsoft reference rasterizer. For triangles, fractional even tessellation mode clamps outer and inner tessellation values between 2 and 65.

Test Conditions For the HLSL implementation and Microsoft reference rasterizer, the tessellation type is set to triangle and the tessellation mode is set to fractional even.

Expected Results


Test Objective Correctness of Clamping For Integer Mode Tessellation of Quads

Test Description

Input to the HLSL implementation of the tessellation algorithm tessellation factors t0 = 1, t1 = 66, t2 = -1, i0 = 5, i1=7. Check the output of the vertex and index buffers generated by the HLSL implementation against the output of the vertex and index buffers output by the Microsoft reference rasterizer. For quads, integer tessellation mode clamps outer and inner tessellation values within the range of 1 through 64.

Test Conditions For the HLSL implementation and Microsoft reference rasterizer, the tessellation type is set to quad and the tessellation mode is set to integer.

Expected Results
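The clamping behavior that this test case targets can be sketched in C++ as follows. The function name `clampIntegerFactor` is hypothetical. The 1 through 64 range comes from the test description above, while rounding a fractional factor up to the next whole number is an assumption of this sketch about how integer mode treats in-range inputs.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Clamps a raw tessellation factor into the integer-mode range [1, 64] and
// then rounds it up to a whole number (the rounding is an assumption).
float clampIntegerFactor(float rawFactor) {
    float clamped = std::min(std::max(rawFactor, 1.0f), 64.0f);
    return std::ceil(clamped);
}
```

With the inputs from the test description, 66 is clamped down to 64 and -1 is clamped up to 1, while 1, 5, and 7 pass through unchanged.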


Test Objective Correctness of Degenerate Vertical Quads

Test Description

Input to the HLSL implementation of the tessellation algorithm tessellation factors t0 = 3.0, t1 = 1.2, t2 = 4.1, i0 = 6.0, i1 = 3.2. Check the output of the vertex and index buffers generated by the HLSL implementation against the output of the vertex and index buffers output by the Microsoft reference rasterizer. This is an edge case, because the occurrence of either inner tessellation factor being even, coupled with the first inner tessellation value being greater than the second, results in a vertical degenerate quad—a vertical row of single points—at the center.


Expected Results


Test Objective Correctness of Degenerate Horizontal Quads

Test Description

Input to the HLSL implementation of the tessellation algorithm tessellation factors t0 = 3.0, t1 = 1.2, t2 = 4.1, i0 = 3.2, i1 = 6.0. Check the output of the vertex and index buffers generated by the HLSL implementation against the output of the vertex and index buffers output by the Microsoft reference rasterizer. This is an edge case, because the occurrence of either inner tessellation factor being even, coupled with the second inner tessellation value being greater than the first, results in a horizontal degenerate quad—a horizontal row of single points—at the center.


Expected Results


Test Objective Correctness of Integer Mode Tessellation of Isolines

Test Description

Input to the HLSL implementation of the tessellation algorithm tessellation factors t0 = 1.8 and t1 = 6.2. Check the output of the vertex and index buffers generated by the HLSL implementation against the output of the vertex and index buffers output by the Microsoft reference rasterizer.

Test Conditions For the HLSL implementation and Microsoft reference rasterizer, the tessellation type is set to isoline and the tessellation mode is set to integer.

Expected Results


Test Objective Correctness of Fractional Odd Mode Tessellation of Isolines

Test Description


Test Conditions For the HLSL implementation and Microsoft reference rasterizer, the tessellation type is set to isoline and the tessellation mode is set to fractional odd.

Expected Results


Test Objective Correctness of Fractional Even Mode Tessellation of Isolines

Test Description


Test Conditions For the HLSL implementation and Microsoft reference rasterizer, the tessellation type is set to isoline and the tessellation mode is set to fractional even.

Expected Results


Error Reporting Conventions

Here is the template that we use for reporting bugs and fixes. This type of reporting is important to a large project of this nature, since errors should be treated as a tool to learn from. We will report bugs and problems on Google Docs. Below is just a generalized template to follow; if a bug requires other entries or does not fit the template, alterations may be made. Because our team set up a Bitbucket account, we also have access to automated bug tracking software, and it might be best to use its issue tracking.

Problem: <Name>

Error Code: <my weird error code or stack trace here>

Description: <A description of the problem/bug that fully discloses what went wrong, to our best understanding. If there are things that we do not yet fully understand, also list them.>

Fix: <What is the solution or workaround for the problem/bug? Feel free to also post relevant links to external websites, e.g. stackoverflow.com>

Reported By: <Your Name!>

Add the error report to the bug on Bitbucket to save time:

Problem: vgt_te11_reorder.hpp and .cpp missing.

Error Code: error C1083: Cannot open include file: 'vgt_te11_reorder.h': No such file or directory

Description: vgt_te11_reorder.hpp and .cpp missing from the reference rasterizer.

Fix: We met with Todd from AMD, and he gave us access to the needed files.

Reported By: Matt, David

Problem: DirectX Debug Build

Error Code: Looked like this:
'Shaders.exe': Loaded 'C:\WINDOWS\system32\user32.dll', Cannot find or open the PDB file
'Shaders.exe': Loaded 'C:\WINDOWS\system32\gdi32.dll', Cannot find or open the PDB file
'Shaders.exe': Loaded 'C:\WINDOWS\system32\ole32.dll', Cannot find or open the PDB file
'Shaders.exe': Loaded 'C:\WINDOWS\system32\advapi32.dll', Cannot find or open the PDB file
'Shaders.exe': Loaded 'C:\WINDOWS\system32\rpcrt4.dll', Cannot find or open the PDB file
'Shaders.exe': Loaded 'C:\WINDOWS\system32\secur32.dll', Cannot find or open the PDB file

Description: When trying to build and run a DirectX program that compiled a simple compute shader, the debugger could not find the PDB (debug symbol) files for the system DLLs it loaded.

Fix: Go to Tools → Options → Debugging → Symbols and check the box that lets Visual Studio download any symbol files you do not already have.

http://stackoverflow.com/questions/12954821/cannot-find-or-open-the-pdb-file-in-visual-studio-c-2010

Reported By: Matt

Problem: Downloading wrong DX SDK

Error Code: N/A

Description: The old DirectX SDK is no longer distributed as a separate download; it is now bundled into the Windows SDK.

Fix: Download the Windows SDK instead from here.

Reported By: Matt, Erwin

Problem: Shader Won’t Compile

Error Code: Failed compiling shader:... 80004005

Description: The compute shader in the .hlsl file did not compile when passed to the Microsoft function

D3DCompileFromFile(srcFile, defines, D3D_COMPILE_STANDARD_FILE_INCLUDE, entryPoint, profile, flags, 0, &shaderBlob, &errorBlob);



http://msdn.microsoft.com/en-us/windows/desktop/bg162891.aspx


Fix: The entry point was incorrect. Make sure that the entry point you pass to the above function matches the entry point in your .hlsl code.

Reported By: Matt and David
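One way to catch the entry point mismatch from the report above before the compiler is ever called is a quick text check on the shader source. The sketch below is only an illustration under stated assumptions: `hasEntryPoint` is a hypothetical helper, not part of DirectX or of our code, and it performs a plain textual search, so a match inside a comment, or any other identifier followed by `(`, would also be accepted.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Returns true if `source` contains `entryPoint` as a whole identifier
// followed (after optional whitespace) by an opening parenthesis.
// Hypothetical pre-flight check; it does not parse HLSL.
bool hasEntryPoint(const std::string& source, const std::string& entryPoint) {
    if (entryPoint.empty()) return false;
    auto isIdentChar = [](char c) {
        return std::isalnum(static_cast<unsigned char>(c)) || c == '_';
    };
    std::size_t pos = source.find(entryPoint);
    while (pos != std::string::npos) {
        // The match must not be the tail of a longer identifier.
        bool startsWord = (pos == 0) || !isIdentChar(source[pos - 1]);
        // Skip whitespace between the name and the parenthesis.
        std::size_t next = pos + entryPoint.size();
        while (next < source.size() &&
               std::isspace(static_cast<unsigned char>(source[next]))) {
            ++next;
        }
        bool followedByParen = next < source.size() && source[next] == '(';
        if (startsWord && followedByParen) return true;
        pos = source.find(entryPoint, pos + 1);
    }
    return false;
}
```

Calling this with the contents of the .hlsl file and the same entry point string that is passed to D3DCompileFromFile gives an early, readable warning instead of a bare failure code.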

Problem: Visual Studio Crashes

Error Code: Visual Studio has stopped working.

Description: When using watch expressions in conjunction with fixed-point math, the debugger will sometimes crash Visual Studio while stepping through code or adding a new value.

Fix: Still open! We are not sure what causes this.

Reported By: Matt, Erwin, and David

Problem: Visual Studio Project Will not build

Error Code: Build option becomes greyed out and will not allow the project to build.

Description: Other projects in Visual Studio build without fail, but one particular solution will not allow the user to build, or to build and run.

Fix: The project was failing to build due to a problem with a particular folder. For whatever reason, the folder inside the project was set to read-only, so Visual Studio was unable to write to the project. This meant that on an attempted build the compiler could not link to the PDB file. The solution built once the read-only option was unchecked for the folder and all of its subfolders.

Reported By: Erwin


Project Summary and Conclusions

Our project, Parallel Tessellation Using Compute Shaders, is at its core a software porting job. We are taking a specification designed to run on a fixed-function piece of hardware and porting it to a new language and a different piece of hardware, in the hope that it can do a better job than the fixed-function hardware.

The formal request is that we design and implement a software tessellator written in Microsoft's High-Level Shading Language (HLSL). This software implementation will attempt to outperform the fixed-function hardware by taking advantage of the GPU's highly parallel vector processors (compute units). AMD's compute units can perform the same instruction on up to 64 different pieces of data, enabling a massive amount of parallelization.
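To make the 64-wide execution concrete, here is a minimal host-side C++ sketch of how many thread groups would be dispatched for a given number of work items. It assumes, purely for illustration, that the compute shader is declared with 64 threads per group and that each thread handles one work item (for example, one patch); neither assumption is taken from our implementation, and `kThreadsPerGroup` and `threadGroupsFor` are hypothetical names.

```cpp
#include <cstdint>

// Assumed group size: one thread per lane of a 64-wide compute unit.
constexpr std::uint32_t kThreadsPerGroup = 64;

// Number of thread groups needed so that every work item gets a thread.
// Rounds up, so the last group may contain idle threads that the shader
// has to mask out with a bounds check.
constexpr std::uint32_t threadGroupsFor(std::uint32_t workItems) {
    return workItems / kThreadsPerGroup +
           (workItems % kThreadsPerGroup != 0 ? 1u : 0u);
}
```

For 1000 work items this yields 16 groups, of which the last has 40 active threads and 24 idle ones.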

The software implementation should also not consume an excessive amount of resources in order to achieve its throughput numbers. In addition to achieving its performance goals, its output must match that of the fixed-function hardware to a reasonable degree of accuracy (around 3 significant figures). Some slack was given for the accuracy because the fixed-function hardware and the software implementation use different number formats. The fixed-function hardware uses fixed-point numbers, while the GPU uses standard IEEE 754 floating-point values. This causes some discrepancy in the numbers, as the two formats store the fractional part of a number in different ways.
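The discrepancy between the two number formats, and the kind of tolerance check it calls for, can be illustrated with a short C++ sketch. The fixed-point layout used here (16 fractional bits) is an assumption made for illustration and is not confirmed by this document, and the helper names are hypothetical. The comparison treats "around 3 significant figures" as a relative difference of at most 0.5 × 10^(1 − digits), which is only one possible reading of that requirement.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Assumed fixed-point layout: 16 fractional bits (illustrative only).
constexpr int kFractionBits = 16;

// Quantize a float to fixed point, rounding to the nearest step.
std::int32_t toFixed(float x) {
    return static_cast<std::int32_t>(std::lround(x * (1 << kFractionBits)));
}

// Convert a fixed-point value back to float.
float fromFixed(std::int32_t v) {
    return static_cast<float>(v) / (1 << kFractionBits);
}

// True if a and b agree to roughly `digits` significant figures,
// using a relative tolerance of 0.5 * 10^(1 - digits).
bool agreeToSigFigs(float a, float b, int digits = 3) {
    if (a == b) return true;
    float scale = std::max(std::fabs(a), std::fabs(b));
    float tolerance = 0.5f * std::pow(10.0f, 1.0f - digits) * scale;
    return std::fabs(a - b) <= tolerance;
}
```

A value such as 1/3 cannot be stored exactly in either format, so `fromFixed(toFixed(x))` differs slightly from `x`, yet the two still agree under `agreeToSigFigs`; this is the situation the accuracy slack is meant to cover.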

Along the way we encountered many problems. First, the tessellation specification was very large and took a lot of time to even begin understanding. Second, getting real performance out of the GPU was much harder than we anticipated; even the CPU proved very difficult to outperform. Third, Microsoft's DirectX API is extremely large and took a long time to get a grasp of. Last was the time frame we were given. Seven months is simply not enough time to learn tessellation, DirectX, and GPU programming techniques and still have enough time left to really start optimizing performance.
