
Chapter 1

Introduction

Various software applications such as games, simulators, medicine, design and engineering

need to display highly detailed images. These images frequently need to be dynamic

and sometimes even interactive. One of the main goals of the computer graphics field is to

find ways to display these highly detailed dynamic images at an interactive frame rate.

If we look at another device that shows dynamic images, the television, we can see that it

displays 24 to 30 frames per second. With a frame rate lower than 24, the viewer might

detect popping effects and discontinuity of the displayed images.

The television displays at each frame the image it receives from an input device such as

cables, satellite, antenna or video camera. Contrary to that, a computer usually has to create

these images at run-time. For a given model, it computes the view image from the position

and direction of a viewer or a camera. This process is called rendering.

The graphics hardware renders models that usually consist of computer graphics

primitives, such as vertices, edges, and polygons. The rendering time of a given model is

mainly determined by the number of polygons sent to the graphics hardware, and the

computational power of this hardware. This implies low frame rates when rendering large

models, such as terrains, industrial designs, and weather simulations.

To overcome such limitations, two approaches could be taken. The first approach is to use

stronger and more powerful graphics hardware. This solution is not sufficient, because even

the most powerful graphics hardware available has a limit on the number of polygons it can

render at interactive frame rates. Furthermore, the software has little to do with these


limitations, and the best that can be done is to use pre-computed acceleration techniques

provided by the APIs (OpenGL or DirectX) to try and maximally harness the computational

power of the hardware.

The second approach is to selectively reduce the number of polygons sent to the graphics

hardware by simplifying the geometry of the model. Here, the software can control this number

according to the size of the model, and the computational power of the hardware. However,

there is a tradeoff here, because sending only a part of the model to the hardware implies that

the image displayed will be less detailed. If the model is not downsized appropriately it might

even result in an inaccurate display that does not represent the original model faithfully.

This is the reason why many algorithms for geometric simplifications were introduced in

the last decade. A large portion of these algorithms deals with view-dependent level-of-detail

rendering. These view-dependent algorithms downsize the number of rendered polygons by

reducing the resolution (number of polygons per area unit) of the model. The level of detail

of each region of the model is selected according to the position of the model with respect to

the viewpoint. Regions close to the viewpoint remain in a high level of detail, while regions

of the model that are far from the viewpoint have a much reduced level of detail. These

algorithms reduce the size of the model, but mostly in its less important areas, thus the result

is a higher frame rate, with minimal damage to the quality and detail of the displayed image.

Level-of-detail algorithms are mainly dependent on the CPU, which is relatively slow and

often overloaded. These limitations of the CPU create a bottleneck. In contrast, the graphics

hardware is usually faster than the CPU, and less loaded. Another bottleneck occurs in the

communication between the CPU and the graphics hardware due to the huge amount of data

sent to the hardware at each frame. Recent advances in the programmability of the graphics

hardware allow us to relieve the CPU from some of the work load, and to downsize the

communication load too. In this work we introduce two algorithms that harness the growing

power of current graphics hardware.

The first algorithm caches geometric data on the on-board memory of the graphics

hardware, thus reducing the data traffic between the CPU and the graphics hardware and

helping to relieve the communication bottleneck.


The second algorithm uses the enhanced programmability of current graphics hardware to

relieve the CPU of almost all level-of-detail computations. Most of the computations are done

in the graphics hardware, and this implies better load balancing between the CPU and the

graphics hardware.


1.1 Graphic Hardware Background

1.1.1 What is a GPU?

GPU stands for "Graphics Processing Unit". This term was introduced by NVIDIA [21] in

the late 1990s when the old terms were no longer an accurate description of the graphic

hardware in a PC.

A GPU is a specialized single-chip processor, designed to draw 3D graphics. As such, it is

much faster than the CPU for typical tasks involving 3D graphics. It creates lighting effects

and transforms objects every time a 3D scene is redrawn. These are mathematically intensive

tasks that would otherwise put quite a strain on the CPU. Lifting this burden from the

CPU frees up cycles that can be used for other jobs.

1.1.2 The Potential of GPUs

Over the past 5 years, GPU technology has advanced at an incredible pace. The rendering

rate, as measured in pixels per second, has been approximately doubling every 6 months

during those 5 years. Taking into account the heavy workload that CPUs already have to deal

with, due to their multi-purpose usage, it is a good idea to do some load balancing, by letting

the GPUs do more work.

The recent advances in GPU programmability and precision (32 bit floating point

throughout the pipeline) enable us to offload work from the CPU to the GPU, resulting in an

overall speedup in typical applications.


1.1.2.1 Computational Power

As measured by Buck and Purcell [3] in 2004, a fragment program running on the NVIDIA

GeForce FX 5900 achieved over 20 GFLOPS (giga floating-point operations per second).

Compared with the theoretical 6 GFLOPS of a 3 GHz Pentium 4, it is clear that GPUs are

already faster than CPUs. Keeping in mind that CPU technology is struggling to keep up

with Moore’s law [20] of the doubling of transistors every couple of years, the doubling of

rendering rate every 6 months by the GPUs suggests that GPUs have not only passed CPUs

in performance, but will also continue to outpace CPUs in the future. This conclusion is

not surprising for two major reasons.

The first reason is the specialized nature of GPUs, which makes it easier to use additional

transistors for their computations. Generating images is a highly parallel problem: graphics

hardware designers can repeatedly split up the problem of creating realistic images into more

chunks of work that are smaller and easier to tackle. Then hardware engineers can arrange, in

parallel, the ever-greater number of transistors available to execute all these various chunks

of work.

The second reason is purely economic: the multi-billion dollar video game market is a

pressure cooker that drives innovation in this field.
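To put the growth rates quoted in this section side by side, the following sketch compares a quantity that doubles every 6 months with one that doubles every 2 years over the same 5-year span. The function name is ours, and reading "every couple of years" as exactly 2 years is an assumption made only for illustration.

```python
def growth_factor(years: float, doubling_period_years: float) -> float:
    """Multiplicative growth after `years` for a quantity that doubles once per period."""
    return 2.0 ** (years / doubling_period_years)


# Rendering rate doubling every 6 months for 5 years: ten doublings.
rendering_rate_growth = growth_factor(5, 0.5)

# Transistor count under an assumed doubling period of 2 years: 2.5 doublings.
transistor_growth = growth_factor(5, 2)

print(f"rendering rate: x{rendering_rate_growth:.0f}, transistors: x{transistor_growth:.1f}")
```

Ten doublings give a factor of 1024, while 2.5 doublings give a factor of less than 6. The two quantities (pixels per second and transistor count) are not directly comparable, so the sketch only illustrates how quickly the two paces diverge.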

1.1.2.2 Programmability

The dominant trend in graphics hardware design today is the effort to expose more

programmability within the GPU. Some units of the graphics hardware pipeline that were

once configurable at most are now becoming more and more programmable. The Vertex

Processor and the Fragment Processor units, as will be explained later in this chapter, are

already programmable, and their programmability is increasing with each generation of

GPUs.

Apart from the ability to program some of the units in the graphics hardware pipeline,

which is important in itself, the programming environment is very important too. Working

directly with low-level graphics hardware instructions is tedious. In the last few years, a few

languages for programming the GPU were introduced, such as Sh and Cg [11]. These

languages offer a friendly high-level environment that translates the user’s programs into a

form that the GPU’s hardware can execute.

1.1.3 Historical Background

Prior to the introduction of GPUs, graphics hardware was specialized and expensive. Many

of the concepts, such as vertex transformation and texture mapping, were introduced then,

making those systems very important to the historical development of computer graphics, but

because they were so expensive, they did not achieve the expected mass-market success.

In the late 1990s the first generation of GPUs was introduced. When running most 3D and

2D applications, these GPUs completely relieve the CPU from updating individual pixels.

However, GPUs in that generation suffer from two clear limitations. First, they lack the

ability to transform vertices of 3D objects, and vertex transformations occur in the CPU,

instead. Second, they have a limited set of math operations for combining textures to compute

the final color of the pixels.

In 1999 the second generation of GPUs was introduced. Fast vertex transformation was

the main improvement of these GPUs. The set of math operations for combining

textures and coloring pixels also expanded, which made this generation more

configurable, but still not truly programmable.

In 2001 the third generation of GPUs was introduced. These GPUs let the application

specify a sequence of instructions for processing vertices, thereby providing vertex

programmability rather than merely offering more configurability. Considerably more pixel-

level configurability was available, but these modes were not powerful enough to be

considered truly programmable. Because these GPUs support vertex programmability but

lack true pixel programmability, this generation was a transitional one.

In 2002 the fourth and current generation of GPUs was introduced. These GPUs provide

both vertex-level and pixel-level programmability. This level of programmability opens up


the possibility of offloading complex vertex transformation and pixel-shading operations

from the CPU to the GPU.

All the new graphic cards support the Shader Model 3.0, which gives cutting-edge

programmability for both the vertex and pixel processors.

1.1.4 The Graphics Hardware Pipeline

Figure 1.1: The graphics hardware pipeline.

1.1.4.1 Vertex Transformation

The Vertex Transformation stage (sometimes referred to as T&L – Transform and Lighting)

performs a sequence of math operations on each vertex. Its input is the list of vertices

received from the software running on the CPU. Each vertex has a position, and usually

several other attributes such as a color, a secondary color, texture coordinates and a normal

vector.

The operations performed on each vertex in this stage are transforming the vertex

positions into image positions, generating texture coordinates for texturing, and lighting the

vertex to determine its color.


Figure 1.2: Transformation of vertex positions to image positions, and lighting of the vertices.
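The operations of this stage can be sketched on the CPU as follows. This is a minimal illustration, not the fixed-function pipeline itself: it assumes an OpenGL-style perspective projection, an identity model-view matrix, and a single directional light with purely diffuse shading. All function names are ours.

```python
import math


def mat_vec(m, v):
    """Multiply a 4x4 matrix (list of rows) by a 4-component vector."""
    return [sum(m[r][c] * v[c] for c in range(4)) for r in range(4)]


def perspective(fov_y_deg, aspect, near, far):
    """OpenGL-style perspective projection matrix (camera looks down -z)."""
    f = 1.0 / math.tan(math.radians(fov_y_deg) / 2.0)
    return [
        [f / aspect, 0.0, 0.0, 0.0],
        [0.0, f, 0.0, 0.0],
        [0.0, 0.0, (far + near) / (near - far), 2.0 * far * near / (near - far)],
        [0.0, 0.0, -1.0, 0.0],
    ]


def transform_vertex(position, mvp, width, height):
    """Map a 3D position to image coordinates plus a depth in [-1, 1]."""
    x, y, z, w = mat_vec(mvp, [position[0], position[1], position[2], 1.0])
    ndc_x, ndc_y, ndc_z = x / w, y / w, z / w      # perspective divide
    screen_x = (ndc_x * 0.5 + 0.5) * width         # viewport mapping
    screen_y = (ndc_y * 0.5 + 0.5) * height
    return screen_x, screen_y, ndc_z


def diffuse(normal, light_dir, base_color):
    """Per-vertex diffuse lighting; both vectors are assumed to be unit length."""
    n_dot_l = max(0.0, sum(n * l for n, l in zip(normal, light_dir)))
    return tuple(c * n_dot_l for c in base_color)
```

For example, with `mvp = perspective(90, 1, 1, 10)` a vertex on the viewing axis at `(0, 0, -5)` lands in the center of a 640x480 image, and a vertex whose normal faces away from the light receives the color black.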

1.1.4.2 Primitive Assembly

The Primitive Assembly stage assembles vertices into geometric primitives. Its input is the

list of transformed vertices as outputted by the Vertex Transformation stage along with the

vertex connectivity information received from the software running on the CPU.

This stage assembles the transformed vertices into geometric primitives based on the

geometric primitive batching information that accompanies the sequence of the original

vertices. This results in a sequence of triangles, lines and points.

Figure 1.3: Primitive Assembly stage.
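As an illustration of how connectivity information turns a vertex list into primitives, the sketch below assembles triangles from an index list, both as independent triangles and as a triangle strip. The function names and the sample data are hypothetical; the strip rule (swapping the first two vertices of every odd triangle to keep a consistent winding) follows the usual OpenGL convention.

```python
def assemble_triangle_list(vertices, indices):
    """Every three consecutive indices form one independent triangle."""
    if len(indices) % 3 != 0:
        raise ValueError("a triangle list needs a multiple of three indices")
    return [
        tuple(vertices[i] for i in indices[k:k + 3])
        for k in range(0, len(indices), 3)
    ]


def assemble_triangle_strip(vertices, indices):
    """Each new index forms a triangle with the two indices before it."""
    triangles = []
    for k in range(len(indices) - 2):
        a, b, c = indices[k], indices[k + 1], indices[k + 2]
        if k % 2 == 1:
            a, b = b, a  # keep the winding order consistent
        triangles.append((vertices[a], vertices[b], vertices[c]))
    return triangles


# Four labeled vertices, standing in for transformed vertex records.
VERTICES = ["A", "B", "C", "D"]
```

With these vertices, the index list `[0, 1, 2, 0, 2, 3]` yields the two independent triangles ABC and ACD, while the strip `[0, 1, 2, 3]` yields ABC and CBD using only four indices.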


1.1.4.3 Rasterization

The Rasterization stage determines the set of pixels or fragments covered by a geometric

primitive. Its input is a triangle, line or point as outputted by the Primitive Assembly stage.

First, the primitive may require clipping to the view frustum, i.e. removing a primitive that

is completely outside the field of view, or truncating a primitive that is partially in it.

After the clipping, a primitive may also be discarded in a process known as backface

culling, in which a polygon is discarded if it faces away from the

viewpoint.

A primitive that survives the clipping and culling steps is then rasterized. Polygons, lines

and points are each rasterized according to the rules specified for each type of primitive. The

results of rasterization are a set of pixel positions as well as a set of fragments. The pixel

positions of a primitive are the pixels that will be actually lit on the screen, if this primitive is

to be displayed. A fragment has the exact size of a pixel, but it carries additional attributes

associated with it, such as a depth value, a color, a secondary color and texture coordinates.

These parameters of a fragment are derived from the transformed vertices that make up the

geometric primitive used to generate the fragment. A fragment is actually a potential pixel:

if it passes various tests in the Raster Operations stage, the fragment updates a pixel in the

frame buffer.

Figure 1.4: Rasterization stage. Note that before the Rasterization actually occurred, the back face of the pyramid (the ABD face) was discarded in the backface culling process.
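The culling and rasterization steps can be sketched in software as follows. This is a simplified illustration under several assumptions of ours: triangles are already in image coordinates, counter-clockwise triangles are front-facing, clipping is approximated by clamping the bounding box to the viewport, and depth is the only attribute carried by a fragment.

```python
import math


def edge(a, b, p):
    """Twice the signed area of triangle (a, b, p); positive if counter-clockwise."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])


def rasterize(tri, depths, width, height):
    """Return fragments (x, y, depth) for one triangle given in image coordinates."""
    a, b, c = tri
    area = edge(a, b, c)
    if area <= 0:
        return []  # backface culling: clockwise or degenerate triangles are discarded
    # Clamp the bounding box to the viewport, a crude stand-in for clipping.
    min_x = max(0, math.floor(min(a[0], b[0], c[0])))
    max_x = min(width - 1, math.ceil(max(a[0], b[0], c[0])))
    min_y = max(0, math.floor(min(a[1], b[1], c[1])))
    max_y = min(height - 1, math.ceil(max(a[1], b[1], c[1])))
    fragments = []
    for y in range(min_y, max_y + 1):
        for x in range(min_x, max_x + 1):
            p = (x + 0.5, y + 0.5)  # sample at the pixel center
            w0, w1, w2 = edge(b, c, p), edge(c, a, p), edge(a, b, p)
            if w0 >= 0 and w1 >= 0 and w2 >= 0:
                # Barycentric interpolation of the per-vertex depth values.
                depth = (w0 * depths[0] + w1 * depths[1] + w2 * depths[2]) / area
                fragments.append((x, y, depth))
    return fragments
```

The same barycentric weights could interpolate colors or texture coordinates. Real hardware additionally applies a fill rule for pixels lying exactly on a shared edge, which this sketch omits.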


1.1.4.4 Fragment Interpolation, Texturing, and Coloring

The Fragment Interpolation, Texturing, and Coloring stage determines the final color for

each fragment. Its input is a fragment as outputted by the Rasterization stage.

The fragment’s parameters are interpolated as necessary, then a sequence of texturing and

math operations are performed to determine the final color of the fragment. In addition, this

stage may also determine a new depth for the fragment or even discard the fragment.

Figure 1.5: Interpolation and coloring of the fragments.

1.1.4.5 Raster Operations

The Raster Operations stage performs a final sequence of operations before the frame buffer

is updated. Its inputs are the finalized fragments.

During this stage, a series of tests are performed on each fragment. If any test fails, this

stage discards the fragment without updating the pixel’s color value. These tests include

scissor, alpha, stencil and depth tests; the last of these eliminates hidden surfaces according to

their depth.

After the tests, a blending operation combines the final color of the fragment with the

corresponding pixel’s color value.

Finally, a Frame Buffer write operation replaces the pixel’s color with the new blended

color.


Figure 1.6: The finalized fragments are turned into pixels and are written to the frame buffer.
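The sequence of tests, blending, and frame buffer write described above can be sketched as follows. The sketch makes assumptions that are not taken from any particular hardware: the stencil test is omitted, the alpha test rejects fully transparent fragments, the depth test passes for strictly smaller depth values, and blending is standard "source over destination". The function name and buffer layout are ours.

```python
def raster_ops(fragment, color_buffer, depth_buffer, scissor=None):
    """Apply the per-fragment tests and, if all pass, blend into the frame buffer.

    fragment is (x, y, depth, (r, g, b, alpha)); buffers are indexed [y][x].
    Returns True if the pixel was updated.
    """
    x, y, depth, (r, g, b, alpha) = fragment
    if scissor is not None:                      # scissor test
        x0, y0, x1, y1 = scissor
        if not (x0 <= x < x1 and y0 <= y < y1):
            return False
    if alpha <= 0.0:                             # alpha test (assumed threshold)
        return False
    if depth >= depth_buffer[y][x]:              # depth test: nearer fragments win
        return False
    dst = color_buffer[y][x]
    blended = tuple(alpha * s + (1.0 - alpha) * d for s, d in zip((r, g, b), dst))
    color_buffer[y][x] = blended                 # frame buffer write
    depth_buffer[y][x] = depth
    return True
```

A fragment that lies behind the depth already stored for its pixel is discarded and leaves both buffers untouched, which is how hidden surfaces are eliminated.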

1.1.5 GPU Programmability

Over the years, the programmability of the GPU has steadily increased. The third

generation of GPUs introduced the Programmable Vertex Processor, whereas the fourth (and

current) generation of GPUs introduced the Programmable Fragment Processor, and both

processors have even more programmability in each new Shader Model going out to the

market.

The task of programming the GPU is also getting easier with the help of newly developed

high-level languages specially designed for GPUs, such as Cg [11].

1.1.5.1 Vertex Processor

The Vertex Processor, also known as the Vertex Shader, corresponds to the Vertex

Transformation stage of the graphics pipeline. The third generation of GPUs introduced the

Vertex Processor, which gives programmability to the basic transformation and lighting

operations that were only configurable prior to that.

Each vertex’s attributes, such as position, color, normal and texture coordinates, are

loaded into the Vertex Processor. The Vertex Processor then repeatedly fetches instructions

from the vertex program. The instructions access a set of input registers that contain vector values,

such as position, normal or color. These registers are read-only, but the results of the

computations can be written to the output registers, which are write-only. When the vertex

program terminates, the output registers contain the newly transformed vertex.

Intermediate results can be read from or written to a set of temporary registers also available

in the Vertex Processor.

The new Shader Model 3.0 [10] allows the Vertex Processors many more instructions

(65535 instead of only 256), dynamic flow control, geometry instancing and vertex texture

fetch, which allows displacement mapping or vertex texturing.

1.1.5.2 Fragment Processor

The Fragment Processor, also known as the Fragment Shader, corresponds to the Fragment

Interpolation, Texturing, and Coloring stage of the graphics pipeline. The fourth generation of

GPUs introduced the Fragment Processor, which gives programmability to the basic

interpolation, texturing, and coloring operations that were only configurable prior to that.

Each fragment has parameters that are derived and interpolated from the parameters of the

vertices of that fragment's primitive. These parameters are stored in the input registers, which

are read-only for the fragment program. The Fragment Processor repeatedly fetches

instructions from the fragment program. The instructions access the input registers, and use a

set of temporary registers to calculate intermediate results. The instructions also include

texture fetches, and the textures can be both read from and written to. The final color and,

optionally, the new depth value of the fragment are stored in the output registers, which are

write-only. The fragments can also be discarded by the fragment program.

The new Shader Model 3.0 [10] allows the Fragment Processors many more instructions

(65535 instead of only 96), dynamic flow control and the usage of loops, branches and

subroutines, but almost all of the new features are costly and may decrease performance.


1.1.5.3 Cg

The two programmable processors in the GPU require the application programmer to supply

a program for each processor to execute. Cg [11] provides a language and a compiler that can

translate the user’s shading algorithm into a form that the GPU’s hardware can execute.

The Cg (C for graphics) language closely resembles C [19] and follows C’s

philosophy, in that it is a hardware-oriented, general-purpose language, rather than an

application-specific shading language. It supports both of the major 3D graphics APIs:

OpenGL and Direct3D. The general-purpose nature of the Cg language is its most interesting

aspect, and it can be used to achieve unconventional goals that a traditional

shading language cannot.

1.1.6 Bottlenecks

“A chain is only as strong as its weakest link.” The same rule applies to

computer hardware: the speed of a computer is limited by its slowest component.

When dealing with computer graphics we can refer to three major components: the CPU, the GPU,

and the communication between them. Finding the bottlenecks among these three

components, and within each of them, helps us remove them, thus increasing the overall

speed of our computer graphics applications.

1.1.6.1 CPU

As mentioned previously, GPUs already have more computational power than CPUs, and

the gap between them continues to grow. However, because of the specialized nature of

GPUs, many graphics-related tasks are still done on the CPU. This is probably the

biggest bottleneck in computer graphics. The increased programmability and

general-purpose nature of the new GPUs suggest that more and more tasks can be moved from the CPU

to the GPU, thereby helping to remove the biggest bottleneck, the CPU.

1.1.6.2 GPU

While clearly the GPU is not the main bottleneck, it can still have smaller bottlenecks in its

internal components. Detecting and removing those bottlenecks helps increase the GPU’s

speed, and the overall speed of the system. The graphics hardware (GPU) pipeline consists of

five different stages:

Vertex Transformation

Primitive Assembly

Rasterization

Fragment Interpolation, Texturing, and Coloring

Raster Operations

It might appear easy to find which of these five stages takes most of the computation time.

Once the problematic stage is found, its efficiency could be improved by the hardware

manufacturers. Alternatively, software programmers might design their applications to

do less work in this particular stage. However, this task is not as easy as it

seems. Each of these stages has its own hardware component in the GPU, and all the

components run in parallel. Because of that, we cannot put our finger on a clear

general-case bottleneck in the GPU. Nevertheless, we can find bottlenecks in

specific applications. For instance, when playing games like Doom 3 or Half-Life 2, it is very

likely that there will be a very large polygon count in every frame, hence the first two stages

will be quite stressed. On the other hand, when playing a flight simulator, where the polygon

count is usually lower but the screen is still filled with fragments, the stages involving

fragments will probably have more work than the other stages. These examples indicate that a

clear general bottleneck cannot be found within the GPU.
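Because the stages run in parallel, the time per frame in steady state is governed by the slowest stage rather than by the sum of all stages. The sketch below illustrates this with two workloads. The per-stage timings are entirely hypothetical and were chosen only to mirror the two examples above.

```python
def pipeline_bottleneck(stage_ms):
    """Return (stage name, time in ms) of the slowest stage of a pipelined GPU."""
    stage = max(stage_ms, key=stage_ms.get)
    return stage, stage_ms[stage]


# Hypothetical per-frame stage times in milliseconds.
polygon_heavy = {
    "Vertex Transformation": 9.0,
    "Primitive Assembly": 6.0,
    "Rasterization": 3.0,
    "Fragment Interpolation, Texturing, and Coloring": 4.0,
    "Raster Operations": 2.0,
}
fragment_heavy = {
    "Vertex Transformation": 2.0,
    "Primitive Assembly": 1.0,
    "Rasterization": 5.0,
    "Fragment Interpolation, Texturing, and Coloring": 11.0,
    "Raster Operations": 6.0,
}

for name, workload in (("polygon-heavy", polygon_heavy), ("fragment-heavy", fragment_heavy)):
    stage, ms = pipeline_bottleneck(workload)
    print(f"{name}: limited by {stage} ({ms} ms per frame)")
```

The limiting stage differs between the two workloads, which is why no single stage can be named as the bottleneck of the GPU in general.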


1.1.6.3 CPU/GPU Communication

When a lot of data is being transferred from the CPU to the GPU, the communication

between them easily becomes a bottleneck. This is exactly the reason why Intel introduced

the Accelerated Graphics Port (AGP) back in 1996 [1].

AGP is a high-performance connection between a designated chipset and the graphics

controller used to enhance graphics performance for 3D applications. AGP relieves the

communication bottleneck by adding a dedicated high-speed interface directly between the

chipset and the graphics controller.

Figure 1.7: AGP connection. Courtesy of [1].

AGP uses the main PC memory to hold 3D data sets. Such a scheme allows the AGP to

use an “unlimited” amount of texture memory. To speed up the data transfer, Intel designed the

port as a direct path to the PC's main memory, so AGP is in fact a point-to-point connection

between the graphics card, the system memory and the CPU. This enables storing very large

textures in the main system memory instead of on the limited on-board texture memory of

the graphics card.

Because AGP is very fast and enables a large amount of data to be stored, it is often used as

a caching device for graphics applications. Not only texture data can be stored via AGP;

geometric data can be cached using AGP as well. However, although this caching scheme

gives much better results, the cache is still not in the on-board memory of the graphics card. This

means that no matter how fast the AGP connection is, a large amount of data passing over it

might still create a bottleneck.
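A back-of-the-envelope estimate shows why the traffic matters. The sketch below computes the time needed to send a frame's geometry over the bus, and how that time shrinks when part of the geometry already resides in on-board memory. The vertex count, vertex size, and bandwidth are assumed values, not measurements, and the function names are ours.

```python
def transfer_ms(vertex_count, bytes_per_vertex, bandwidth_bytes_per_s):
    """Time in milliseconds to move the given vertices across the bus."""
    return 1000.0 * vertex_count * bytes_per_vertex / bandwidth_bytes_per_s


def per_frame_ms(vertex_count, bytes_per_vertex, bandwidth_bytes_per_s, cached_fraction=0.0):
    """Transfer time when a fraction of the vertices is already cached on board."""
    uncached = vertex_count * (1.0 - cached_fraction)
    return transfer_ms(uncached, bytes_per_vertex, bandwidth_bytes_per_s)


# Assumed workload: one million vertices of 32 bytes each over a 2 GB/s bus.
uncached_ms = per_frame_ms(1_000_000, 32, 2.0e9)
cached_ms = per_frame_ms(1_000_000, 32, 2.0e9, cached_fraction=0.75)
print(f"no cache: {uncached_ms:.1f} ms, 75% cached: {cached_ms:.1f} ms")
```

Under these assumptions the uncached transfer alone takes about 16 ms, roughly half of the 33 ms available per frame at 30 frames per second, whereas caching three quarters of the vertices on board reduces it to about 4 ms.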


1.2 Level-of-Detail Rendering

A 3D scene that needs to be rendered may contain millions of polygons. A polygon count of

this magnitude exceeds the number of polygons that current graphics hardware can render at

reasonable frame rates. With each new generation of hardware this limit grows, but so does

the size of the scenes to be rendered: however fast the capabilities of graphics hardware

grow, the appetite for bigger models grows with them. A solution to this problem is to reduce

the complexity of the scene to be rendered. This can be done by reducing the number of

polygons sent to the graphics hardware to match its rendering capability.

Many algorithms construct off-line several levels of detail for each object in the scene,

meaning that the same object is constructed several times, each time at a different resolution. At

run-time one of these levels is chosen based on a few parameters such as view position and

angle. Objects close to the viewer are rendered at a high level of detail, while objects far from

the viewer are rendered at a low level of detail.

Two main approaches to level-of-detail (LOD) rendering have been developed: discrete

and continuous.

1.2.1 Discrete Levels of Detail

Discrete levels of detail are obtained by generating off-line a fixed number of distinct levels

of detail for each object. At run-time the most appropriate level of detail is selected for each

object in the scene. The polygons representing the chosen level of detail are sent to the

graphics hardware for rendering.


Figure 1.8: Discrete levels of detail for the Armadillo model: a) 249,924 triangles, b) 62,480 triangles, c) 7,809 triangles, d) 975 triangles. Courtesy of [5].
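Run-time selection among precomputed levels can be as simple as a table lookup. The sketch below picks one of the four Armadillo levels of Figure 1.8 from the distance to the viewer. The switch distances are hypothetical, and a practical system might use projected screen size instead of raw distance.

```python
import bisect

# Triangle counts of the four Armadillo levels in Figure 1.8, finest first.
ARMADILLO_LEVELS = [249_924, 62_480, 7_809, 975]


def select_lod(distance, switch_distances):
    """Index of the level to render: 0 is the finest level.

    switch_distances must be sorted in ascending order; each threshold that
    the distance reaches moves the selection one level coarser.
    """
    return bisect.bisect_right(switch_distances, distance)


def triangles_sent(distance, switch_distances=(10.0, 50.0, 200.0)):
    """Number of triangles sent to the hardware for an object at this distance."""
    return ARMADILLO_LEVELS[select_lod(distance, switch_distances)]
```

With the assumed thresholds, an object 5 units away is rendered with all 249,924 triangles, while the same object 1000 units away is rendered with only 975.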


One way to generate the various levels of detail for an object off-line is to use the vertex

removal technique introduced by Schroeder, Zarge, and Lorensen [25]. Their technique

removes a vertex along with its adjacent triangles, and then triangulates the resulting hole.

Starting from the original representation of the object which is the highest level of detail,

vertices are removed one by one until all requested levels of detail are created for the object.


Figure 1.9: Vertex removal technique: a) before vertex removal, b) after vertex v is removed.
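A single vertex removal step can be sketched on an indexed triangle list as follows. This is a simplified illustration, not the method of [25]: it assumes that the removed vertex is an interior vertex of a consistently oriented manifold mesh, and it fills the hole with a simple triangle fan, which is only guaranteed to be free of foldovers when the hole is convex.

```python
def remove_vertex(triangles, v):
    """Remove vertex v and its adjacent triangles, then re-triangulate the hole.

    triangles is a list of index triples with consistent winding.
    """
    kept = [t for t in triangles if v not in t]
    # For each adjacent triangle, the edge opposite v is one boundary edge of the hole.
    next_on_loop = {}
    for t in triangles:
        if v in t:
            i = t.index(v)
            next_on_loop[t[(i + 1) % 3]] = t[(i + 2) % 3]
    if not next_on_loop:
        return kept
    # Walk the boundary edges to recover the ordered loop around the hole.
    start = next(iter(next_on_loop))
    loop = [start]
    while True:
        following = next_on_loop.get(loop[-1])
        if following is None or len(loop) > len(next_on_loop):
            raise ValueError("vertex is not surrounded by a closed loop of triangles")
        if following == start:
            break
        loop.append(following)
    # Fan triangulation of the hole, preserving the original winding.
    fan = [(loop[0], loop[i], loop[i + 1]) for i in range(1, len(loop) - 1)]
    return kept + fan
```

Removing an interior vertex shared by n triangles leaves n - 2 triangles in its place, so each removal reduces the triangle count by two.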

A detailed overview of various discrete level-of-detail rendering algorithms is

given by Cignoni, Montani, and Scopigno [6].

These algorithms are suitable for complex scenes that consist of many objects. However,

if an object is highly detailed and viewed from close range, the algorithms will have to

choose a high level of detail representation of that object, implying little or even no

simplification of the scene. If an algorithm chooses a lower level of detail for that close and

highly detailed object, it will result in a poor representation of the object.

Another problem with this kind of algorithm is that adjacent objects can be represented at

different levels of detail. This difference is often visible to the viewer, thus undermining the

visual consistency of the entire scene.

These drawbacks make discrete level-of-detail rendering algorithms poorly suited to large, highly detailed models.


1.2.2 Continuous Levels of Detail

Continuous levels of detail rendering algorithms are designed to deal with the problems that

the previous discrete algorithms introduced. These algorithms allow various levels of detail

to co-exist along different regions of the same object.

The changes in the rendered model between frames are very subtle due to the continuous

nature of these algorithms. Also, the simplification operator must have a dual (inverse)

operator, ensuring that a model that is continuously simplified can also be continuously

refined until it reaches its original geometric structure.

Hoppe [14] has introduced the progressive meshes scheme that uses the edge collapse

operation for continuously simplifying the model. The edge collapse operation unites a

chosen pair of adjacent vertices into one new vertex, thus removing the edge between them.

The dual operation for the edge collapse is vertex split. A vertex split separates a vertex into

two vertices and inserts back the edge that was originally between them.

Figure 1.10: Edge collapse and vertex split operations.

These two dual operators enable continuous change in the level of detail of the model.

Whenever simplification of the model is wanted, a sequence of edge collapse operations takes place,

and when the model needs to be refined again these edges are restored in reverse order by

matching vertex split operations. The edges to collapse are chosen carefully to avoid edge

collapse operations that generate geometric errors such as foldovers.
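The duality of the two operators can be sketched on an indexed triangle list. For brevity the sketch uses a half-edge collapse, which merges vertex u into the existing vertex v instead of creating a new vertex as in [14], and it handles connectivity only (no vertex positions and no foldover checks). The record format is our own.

```python
def edge_collapse(triangles, u, v):
    """Collapse edge (u, v) by merging u into v.

    Returns the simplified triangle list and a record that lets
    vertex_split undo the operation exactly.
    """
    new_triangles, removed, rewritten = [], [], []
    for index, t in enumerate(triangles):
        if u in t and v in t:
            removed.append((index, t))              # degenerates, so it is dropped
        elif u in t:
            rewritten.append(len(new_triangles))    # remember where u was replaced
            new_triangles.append(tuple(v if w == u else w for w in t))
        else:
            new_triangles.append(t)
    record = {"u": u, "v": v, "removed": removed, "rewritten": rewritten}
    return new_triangles, record


def vertex_split(triangles, record):
    """Inverse of edge_collapse: restore u and the triangles that were dropped."""
    u, v = record["u"], record["v"]
    restored = list(triangles)
    for i in record["rewritten"]:
        restored[i] = tuple(u if w == v else w for w in restored[i])
    for index, t in record["removed"]:              # ascending order restores positions
        restored.insert(index, t)
    return restored
```

Applying a sequence of collapses while pushing the records onto a stack, and later popping them for the matching splits, reproduces the reverse-order refinement described above.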


1.2.3 View-Dependent Rendering

In the previous section, we have shown a scheme that achieves continuous LOD rendering.

However, we did not explain how to choose the regions that are to be simplified. View-

dependent rendering algorithms choose the appropriate level of detail of each region in the

model with respect to the view parameters in real time. Most of these algorithms rely on an

off-line construction of the continuous levels of detail. At run-time, an adaptive level is selected

for each region according to some or all of the following parameters:

Distance – the distance between the object and the viewer. Regions close to the

viewer are represented by a higher resolution than those farther from the viewer.

Visibility – back-facing polygons and polygons that are outside the view frustum

have a very coarse representation.

Illumination – illuminated regions are more detailed than regions in shadow.

Silhouette – vertices on the silhouette are very important for the reliability of an

image. Therefore, the silhouette is represented with a very high resolution.

Screen-space projection – objects that contribute only a few pixels to the final image

are represented in a much lower resolution than objects that cover most of the image.

Figure 1.11: A view-dependent representation of a sphere. Its adaptive levels of detail are based on the distance, visibility, and silhouette parameters. Courtesy of [26].
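The distance and screen-space projection parameters listed above are often combined into a single test: an object-space size or error is projected onto the screen and compared with a tolerance in pixels. The sketch below shows one common formulation for a perspective camera. The default field of view, viewport height, and one-pixel tolerance are assumptions of ours.

```python
import math


def projected_pixels(size, distance, fov_y_deg, viewport_height):
    """Approximate on-screen extent, in pixels, of an object-space length."""
    return size * viewport_height / (2.0 * distance * math.tan(math.radians(fov_y_deg) / 2.0))


def needs_refinement(error, distance, fov_y_deg=60.0, viewport_height=1080, tolerance_px=1.0):
    """A region is refined when its projected error exceeds the pixel tolerance."""
    return projected_pixels(error, distance, fov_y_deg, viewport_height) > tolerance_px
```

The same object-space error therefore triggers refinement when the region is close to the viewer and is tolerated when the region is far away, which is exactly the behavior the distance parameter is meant to capture.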


These parameters usually do not change drastically between consecutive frames; therefore,

view-dependent rendering exhibits significant frame-to-frame coherence. Many view-dependent rendering

algorithms use this coherence to calculate only the changes between frames, and avoid

recalculating the levels of detail of the entire model in each frame.

1.2.4 Vertex Hierarchy

As mentioned in the previous section, view-dependent rendering algorithms usually rely on

an off-line construction of the levels of detail. Most of these algorithms use a hierarchical

data structure called a vertex hierarchy. The vertex hierarchy is actually a tree of vertices. The

vertices close to the root correspond to a coarse representation, while vertices further

away from the root represent a higher level of detail. The leaves of the vertex hierarchy are

the vertices of the original model. At run-time a cut of the vertex hierarchy defines the

vertices that will be rendered at each frame. These vertices are called the active nodes.

Figure 1.12: Active nodes in a vertex hierarchy.

View-dependence trees, presented by El-Sana and Varshney [9], are a good example of a

vertex hierarchy. Usually several trees are used to represent a complex model, where

each tree represents a single object or a group of objects. A view-dependence tree is

constructed bottom-up by recursively applying the edge collapse operation starting from the

unsimplified original representation of the object.

Figure 1.13: Edge collapse and vertex split operations on view-dependence trees for the mesh in figure 1.10.

At run-time the active nodes of the tree are chosen. A list of active vertices that are

represented by the chosen active nodes is sent to the graphics hardware for rendering. Along

with the active vertices, a list of active triangles is also sent to the graphics hardware to create

the image. Due to the coherent nature of this algorithm, the active nodes are not recalculated

every frame. Instead, the algorithm traverses the active-nodes cut of the previous frame.

For each active node it decides whether it needs to be simplified or refined based on the

viewing parameters introduced in the previous section. If a simplification is required in the

level of detail of a node, then its parent replaces it in the active nodes. On the other hand, if a

refinement is needed, then the node’s children replace it in the active nodes.
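The incremental update of the active-nodes cut can be sketched as follows. This is a simplified illustration, not the algorithm of [9]: each node stores a fixed error value, a single tolerance stands in for the view-dependent criteria, and a parent replaces its children only when all of them are currently active. The class and function names are ours.

```python
class Node:
    """A vertex-hierarchy node; leaves are vertices of the original model."""

    def __init__(self, name, error, children=()):
        self.name = name
        self.error = error            # error introduced by rendering this node
        self.parent = None
        self.children = list(children)
        for child in self.children:
            child.parent = self


def update_cut(active, tolerance):
    """One pass over the previous frame's cut, moving it up or down by one level."""
    new_active = set()
    for node in active:
        parent = node.parent
        can_simplify = (
            parent is not None
            and parent.error <= tolerance
            and all(sibling in active for sibling in parent.children)
        )
        if can_simplify:
            new_active.add(parent)                 # simplify: parent replaces its children
        elif node.error > tolerance and node.children:
            new_active.update(node.children)       # refine: children replace the node
        else:
            new_active.add(node)                   # keep the node unchanged
    return new_active
```

Since the simplification test depends only on the parent and on the current cut, all siblings reach the same decision, so the result is again a valid cut. Calling `update_cut` once per frame moves the cut by at most one level per node, which matches the small frame-to-frame changes that coherence implies.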

1.2.5 Terrain Rendering

Terrain models are usually bigger and more complex than other types of models. Their size

forces them to be aggressively simplified in order to reach interactive frame rates when

rendering them. The advantage of terrain models is that they are not full 3D models and

sometimes are referred to as 2.5D models. The reason for that is because a terrain can be

represented as a 2D elevation map. This unique representation enables the usage of special

simplification algorithms that cannot be used on a general 3D model.

One such algorithm is the ROAM (Real-time Optimally Adapting Meshes) algorithm

presented by Duchaineau et al. [7]. ROAM treats a terrain model as a rectilinear elevation

map. Off-line, the algorithm builds a binary triangle tree data structure, in which each node

represents a triangle on the elevation map. The two children of a triangle in the binary tree

are formed by splitting the triangle from its apex to the midpoint of its base edge.

At run-time the preprocessed binary tree is used to build the adaptive triangle mesh for

each frame. The ROAM algorithm takes the distance of a region from the view point as a

parameter for defining the level of detail that this region will be presented in. It also uses the

planarity of the region as a parameter. A flat region will be represented more coarsely than a

region with a rough neighborhood.
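
The binary triangle tree split can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: `Tri`, `split`, `refine`, and the `should_split` callback are hypothetical names, and the callback stands in for the distance and planarity tests. The sketch also ignores the forced splits of neighboring triangles that a full ROAM implementation needs to keep the mesh free of cracks.

```python
from typing import Callable, List, NamedTuple, Tuple

Point = Tuple[float, float]


class Tri(NamedTuple):
    apex: Point   # vertex opposite the base edge
    left: Point   # base edge runs from left to right
    right: Point


def split(t: Tri) -> Tuple[Tri, Tri]:
    """Split a triangle from its apex to the midpoint of its base edge."""
    mid = ((t.left[0] + t.right[0]) / 2, (t.left[1] + t.right[1]) / 2)
    # The midpoint becomes the apex of both children.
    return Tri(mid, t.apex, t.left), Tri(mid, t.right, t.apex)


def area(t: Tri) -> float:
    (ax, ay), (bx, by), (cx, cy) = t
    return abs((bx - ax) * (cy - ay) - (by - ay) * (cx - ax)) / 2


def refine(t: Tri, should_split: Callable[[Tri], bool], max_depth: int) -> List[Tri]:
    """Recursively split while the (hypothetical) criterion asks for detail."""
    if max_depth == 0 or not should_split(t):
        return [t]
    a, b = split(t)
    return refine(a, should_split, max_depth - 1) + refine(b, should_split, max_depth - 1)
```

A criterion that returns `True` only for rough or nearby regions would leave flat or distant regions as large triangles, which is the behavior described above.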

Figure 1.14: ROAM terrain. Courtesy of [7].

1.2.6 Cluster Hierarchy

Level-of-detail rendering algorithms often fail to select the appropriate level of detail for

very large datasets within the span of one frame. Such limitations occur due to the often

overloaded CPU, which becomes the main bottleneck with traditional level-of-detail

rendering algorithms. To overcome this problem researchers have developed aggressive

refinement operators based on cluster hierarchy.

The Quick-VDR algorithm introduced by Yoon et al. [30] represents the dataset as a

clustered hierarchy of progressive meshes. The cluster hierarchy is used for coarse-grained

selective refinement, whereas fine-grained local refinement is obtained using progressive

meshes [14]. Using a cluster hierarchy instead of a vertex hierarchy reduces the refinement

cost for view-dependent rendering by more than an order of magnitude.

Figure 1.15: The right inset image shows clusters in color from a 64K cluster decomposition of Michelangelo’s St. Matthew model. Courtesy of [30].

Chapter 2

Related Work

Level-of-detail rendering algorithms appeared in the 1990s. Because of the very limited

abilities of graphics hardware back then, all these algorithms do their entire work on the CPU

and treat the hardware as some kind of a “black box”. These algorithms are based on the

now-obsolete assumption that graphics hardware is unprogrammable, and therefore their most important

task is to try and reduce the geometry sent to the hardware. That is the reason why there are

very few works that combine level-of-detail rendering with hardware issues.

In the late 1990s, after the AGP [1] was introduced by Intel, the possibility of geometry

caching appeared. A few algorithms that combine level-of-detail rendering with geometry

caching using the AGP have been introduced since then. In 2004, after vertex texture fetches

were enabled, displacement mapping in the vertex shaders became practical. Because this

ability is so new, very little work has been done to combine displacement mapping with LOD

rendering. We will review the limited number of algorithms that combine level-of-detail

rendering with either geometry caching or displacement mapping.

2.1 Geometry Caching

Little work has been done on geometry caching, because of the restrictions of the

hardware until recently. Upon the introduction of AGP, the ability to cache geometry became

realistic. We will review a work that covers techniques for accelerating real-time graphics.

This will determine the effectiveness of caching using AGP. Next, we will see three

strategies to manage the cached geometry, and finally we will review an algorithm for level-

of-detail rendering that harnesses the ability of geometry caching.

2.1.1 Accelerating Real-Time Graphics

Perrson [24] reviewed several acceleration methods for real-time graphics.

Vertex Arrays is the first method introduced in this work. A block of vertex data (vertex

coordinates, texture coordinates, normals, RGBα colors, color indices, and edge flags) may

be stored in an array and then used to specify multiple geometric primitives through the

execution of a single OpenGL command. This simple method proves to be up to 3 times

faster than the immediate mode that uses the glBegin and glEnd functions. Several OpenGL

commands can be preprocessed and packed together into a Display List in order to achieve

improved efficiency. However, when packing Vertex Arrays into Display Lists the overall

efficiency hardly improves, and it even declines in some cases due to the small overhead

incurred when using Display Lists.

Another acceleration method is the VBO (Vertex Buffer Object) extension introduced by

NVIDIA. These Vertex Buffer Objects use high-performance graphics memory, mainly

AGP, instead of standard memory that uses the regular bus. VBO actually caches the vertices

in AGP memory, and by doing so it not only reduces the memory operations for every frame,

but it also uses a faster bus (AGP) to transfer the data.

The tests appearing in Perrson’s [24] work show that VBO performs up to 15 times faster

than the immediate mode, and about 5 times faster than the Vertex Array method.

Figure 2.1: Megavertices rendered per second with immediate mode, VA, and VBO. Courtesy of [24]

2.1.2 Cached Geometry Manager

As we have seen, caching geometric data using the AGP is feasible and effective. However,

sometimes the geometric data is too large to fit into the fast-access memory. For

that reason Lario, Pajarola, and Tirado [16] have introduced the Cached Geometry Manager

(CGM). New vertices need to be displayed whenever the level of detail of some part of

the model changes. Some of these vertices might not already be cached, and therefore they

should be transferred from main memory. A removal operation is needed to free up space for

new vertices when AGP memory is full. Inactive but cached vertices are the prime candidates

to be removed, and three strategies are given to choose the best vertices for removal.

The first and simplest strategy is First-Available (FA). The cached memory is used as a

linear list of slots, and when a slot is needed the first available slot is chosen. An available

slot is a slot that stores a vertex that was not used in the current or last frame. The problem

with this strategy is that it does not consider for how many frames the vertex has gone unused.

The second strategy maintains an ordinary Least-Recently-Used (LRU) list, where the

vertex chosen to be removed from cache is the one that was used least recently among all the

vertices stored in cache at the moment.

The third strategy (LRU + Error-PriorityQueue) uses the hierarchical nature of vertices in

level-of-detail algorithms. Statistics show that vertices belonging to a coarse level of detail

are more likely to be used than vertices belonging to a finer level of detail. This fact is used to

maintain a Priority-Queue instead of the last 10% of the LRU list. When needed, the queue

takes the least recently used vertex from the list. From all the vertices in the priority queue,

the one with the greatest priority, representing the finest level of detail, is chosen for

removal.
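
The third strategy can be sketched in Python as follows. This is a simplified illustration, not the data structure of [16]: the class name `LruErrorCache`, the method `use`, and the convention that a larger `lod_level` means a finer level of detail are assumptions. Instead of maintaining an incremental priority queue, the sketch simply scans the least recently used 10% of the list at eviction time, and it does not check whether a vertex is still active.

```python
from collections import OrderedDict
from itertools import islice


class LruErrorCache:
    """LRU list whose oldest tail is treated as a priority queue on LOD level."""

    def __init__(self, capacity, tail_percent=10):
        self.capacity = capacity
        self.tail_percent = tail_percent
        self.entries = OrderedDict()  # vertex id -> LOD level, oldest first

    def use(self, vertex_id, lod_level):
        """Mark a vertex as used; return the evicted vertex id, or None."""
        if vertex_id in self.entries:
            self.entries.move_to_end(vertex_id)  # now the most recently used
            return None
        evicted = self._evict() if len(self.entries) >= self.capacity else None
        self.entries[vertex_id] = lod_level
        return evicted

    def _evict(self):
        # Candidates: the least recently used tail_percent of the list.
        tail_size = max(1, len(self.entries) * self.tail_percent // 100)
        tail = islice(self.entries.items(), tail_size)
        # Among the candidates, remove the finest level of detail
        # (assumed here to be the largest lod_level).
        victim, _ = max(tail, key=lambda item: item[1])
        del self.entries[victim]
        return victim
```

With `tail_percent=100` every cached vertex is a candidate and the finest one is always removed, while a very small percentage makes the sketch behave like a plain LRU list.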

The three strategies were tested on two different view-dependent level-of-detail

frameworks. The FastMesh [22] framework was used for arbitrary meshes, while QuadTIN

[23] was used to achieve better results with terrain models.

2.1.3 Cached Aggregated Binary Triangle Trees

Caching data is always possible, but if large portions of the data remain constant through

several frames, the caching becomes much more effective. When using VBO on a regular

level-of-detail terrain algorithm, we find improved results due to the caching of the vertices.

Nevertheless, we still have continuous changes to the list of displayed vertices, and therefore

we do not fully harness the clear advantages of caching.

Levenberg [17] introduced the Cached Aggregated Binary Triangle Trees (CABTT)

algorithm for that reason. The CABTT algorithm uses the same binary tree data structure as

the ROAM [7] algorithm. The difference is that instead of using a single triangle in each

node of the binary tree, CABTT uses a cluster of geometry called aggregate triangle. CABTT

uses a fixed triangulation for every cluster or aggregate triangle. Each cluster edge is divided

into 2^N segments. Having this number of segments per cluster edge ensures that no T-junctions

appear when aggregate triangles of different levels are adjacent. The best results were found

when each cluster edge was divided into 16 segments. This particular triangulation yields 206

triangles per aggregate triangle.

Figure 2.2: A 2049 X 2049 height field rendered with CABTT. An example of an aggregate triangle is highlighted. Courtesy of [17].

Because each triangle is replaced by a fixed triangulation of 206 triangles, the binary

triangle tree becomes much shallower, hence less work is done on the CPU for a given level

of detail. This comes at a small cost in the precision of the triangles in each

cluster. When there is a change in the level of detail, instead of individual triangles changing,

there are clusters of triangles moving out of cached memory and different clusters replacing

them. This scheme enables very good caching using the AGP. The cached geometry manager

shown before can be used to handle the caching.

2.2 Displacement Mapping with Various Levels of Detail

Displacement mapping [28] is a technique that modifies the vertices of an object so that

during the rendering process, the object's geometry is altered to create a bumpy surface.

Unlike regular bump mapping, the edges are actually raised and can cast shadows. The

roughness of a two-dimensional texture is used to adjust the degree to which the geometry of

the object is displaced. Displacement mapping only alters the object's geometry in the

rendered image and not the scene, so highly complex objects can be created without having

to actually model them.

Figure 2.3: Example of displacement mapping: (a) original mesh, (b) displacement map, (c) mesh with displacement. Courtesy of [28].

For years displacement mapping was a feature exclusive to high-end rendering systems, while

real-time APIs, such as OpenGL and DirectX, lacked this capability. One of the reasons for this

absence is that the original implementation of the displacement mapping required an adaptive

tessellation of the surface in order to obtain micro-polygons whose size matched the size of a

pixel on the screen.

With the newest generation of graphics hardware, displacement mapping can be

interpreted as vertex texture mapping, where the values of the texture map do not alter the

pixel color, but change the position of the vertex instead. Displacement mapping can in this

way produce a genuine rough surface. It has to be used in conjunction with adaptive

tessellation techniques that increase the number of rendered polygons according to the

current viewing settings. This is done in order to produce highly detailed meshes, and to give

a more 3D feel and a greater sense of depth and detail to textures that displacement mapping

is applied to.

By using these tessellation techniques various levels of detail could be achieved for the

displacement maps. The great latency of fetching vertex textures in current hardware

suggests that the use of displacement maps for level-of-detail rendering is not a very efficient

scheme. However, the fact that when using displacement mapping almost all CPU

computations can be avoided implies that the overall effectiveness will not decrease, and the

results will even improve. Another clear advantage of using displacement mapping is that

highly complex objects can be created without actually being modeled. Therefore, we

conclude that the combination of displacement maps with level-of-detail rendering will

become more and more popular.

Taking into account that the latency of fetching vertex textures should decrease

dramatically with every new graphics card that enters the market, we understand that this

scheme will be even more useful in the near future. Moreover, the height maps used for the

displacement mapping can be easily compressed and decompressed, so huge terrains can be

rendered with real-time results, without using special out-of-core techniques.

Because the displacement mapping is a very new feature for real-time APIs, only one

algorithm that combines it with level-of-detail rendering has been published so far. We will

review an earlier version of this algorithm, which is still very CPU-focused, and its second version,

which is much more GPU-based.

2.2.1 Geometry Clipmaps

The geometry clipmap framework introduced by Losasso and Hoppe [18] treats the terrain as

a 2D height map, pre-filtering it into a pyramid of m levels. The levels represent nested grids

centered over the viewer at successive power-of-two resolutions. Starting from a square that

represents the closest level to the view-point, each level is represented by a hollow ring

wrapping the smaller square of its preceding level. This actually creates view-dependent

levels-of-detail, where the finer levels-of-detail are in the grids closest to the camera.

Figure 2.4: Clipmap levels of a terrain rendered using a coarse geometry clipmap. Courtesy of [18].

The grids are stored as vertex buffers in AGP memory, and are incrementally refilled as

the viewpoint moves. For each level only a thin L-shaped region is changed per frame,

therefore there is great coherency and the caching is effective. To prevent cracks and popping

effects along the boundaries of different levels, zero-area triangles and transition regions are

added respectively.
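
The nesting of clipmap levels can be illustrated with a small Python sketch. It assumes that level 0 is the finest level with unit grid spacing, that each coarser level doubles the spacing, and that every level is an n x n grid centered on the viewer. The function name `clipmap_level` and the default values of `n` and `num_levels` are hypothetical choices for illustration only.

```python
def clipmap_level(dx, dy, n=255, num_levels=11):
    """Return the finest level whose grid covers the offset (dx, dy)
    from the viewer, or None if the point lies outside all levels.

    Level l is an n x n grid with spacing 2**l centered on the viewer,
    so it extends (n - 1) / 2 * 2**l units in each direction.
    """
    half_extent = (n - 1) // 2
    distance = max(abs(dx), abs(dy))  # square (Chebyshev) distance
    for level in range(num_levels):
        if distance <= half_extent * 2 ** level:
            return level
    return None
```

A point is therefore drawn from level l only when it falls outside the square of level l - 1, which corresponds to the hollow ring of that level.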

The height maps used by the algorithm are calculated in the CPU rather than in the much

faster GPU. Nevertheless, the algorithm provides a good rendering rate, and it is simple and

coherent. The height maps also have a coherent nature and therefore enable compression of

the base dataset by a factor of up to 100.

2.2.2 GPU-Based Geometry Clipmaps

The geometry clipmaps algorithm uses height maps, but calculates them in the CPU and

sends them to the GPU as vertex buffers. With the new developments in graphics hardware,

the height maps can be easily moved to the GPU and serve as displacement maps in the

vertex processor.

Asirvatham and Hoppe [2] presented a GPU-based implementation of geometry clipmaps

that uses the new vertex texture fetch ability to move the calculation of the height maps to the

GPU. In their GPU-based implementation, the height maps serve as displacement maps in

the vertex processor. Consequently, constant vertex and index buffers are sent by the CPU,

thereby relieving the CPU of most of its workload.
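
The division of labor described here can be emulated on the CPU with a short Python sketch. It is not shader code and not the implementation of [2]: `make_grid` and `vertex_stage` are hypothetical names, and a nested list stands in for the elevation texture. The point of the sketch is that the (x, y) grid is built once and never changes, while the height is looked up per vertex.

```python
def make_grid(n):
    """Constant vertex buffer: integer (x, y) positions of an n x n grid."""
    return [(i, j) for j in range(n) for i in range(n)]


def vertex_stage(xy, height_map, scale=1.0):
    """Emulates the vertex processor: the list lookup stands in for a
    vertex texture fetch that supplies the z coordinate."""
    i, j = xy
    z = height_map[j][i]
    return (i * scale, j * scale, z)


def render_positions(grid, height_map, scale=1.0):
    """Displace every vertex of the constant grid by the height map."""
    return [vertex_stage(xy, height_map, scale) for xy in grid]
```

When the viewer moves, only `height_map` has to be updated; the list returned by `make_grid` can be reused unchanged for every frame.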

In the original implementation each level was represented by a hollow ring. To take full

advantage of the GPU’s computational power, this version of the algorithm divides each ring

into 12 smaller square blocks. The small squares reduce memory costs, because displacement

mapping works with full squares. If the entire clipmap was used as a displacement map, then

the inner part of the ring would have been passed to the GPU along with the rest of the ring.

In addition, the small square blocks also enable view frustum culling.

Figure 2.5: Top view of a terrain, showing each nested grid is composed of 12 square blocks. Courtesy of [2].

Figure 2.6: View frustum culling. Courtesy of [2].

Chapter 3

Motivation

As the title of the thesis states, the goal of this work is to introduce several approaches to

level-of-detail rendering. These approaches should leverage the great computational power of

the GPU, and overcome the limitations of the system.

In this chapter we will first look at what a GPU-based level-of-detail rendering

scheme could look like in an ideal world, meaning a world where a GPU can do everything

that a CPU does, but much faster of course. Next, we will see how the architecture of the

GPU prevents reaching this utopian goal. Finally, we will show what the GPU enables us to

achieve despite these limitations.

The overall motivation of our work is derived from these observations: we see what

we want to achieve, we understand why we cannot fully achieve it, and we set our goals to

try using all the relevant features that the current GPUs do allow us.

3.1 Utopia

Consider the possibility that a GPU is capable of doing anything a CPU can, and on top of

that it does everything faster. If this was the case, then for each frame the CPU would only

have to send the new position and angle of the camera to the GPU. The GPU, being so

versatile, could easily get these changed details of the camera and calculate the new levels of

detail by itself. The GPU could run any level-of-detail rendering algorithm designed for

CPU, but just much faster. All the geometric data will be kept in the internal memory of the

GPU, or in the worst case only a part of it will be cached on the GPU, depending on the size

of the GPU’s internal memory. Having minimal or even no geometrical data at all

transmitted to the GPU each frame, the overall efficiency of the level-of-detail rendering

algorithms will grow by orders of magnitude.

3.2 Limitations

A GPU is not a CPU. It is a processing unit designed for rendering images upon receiving

a stream of vertices. Due to its unique nature, the GPU must get a list of vertices as its input.

We can send the GPU a list of “blank” vertices on which it will run its LOD algorithm, but

some problems might show up when doing that.

If we look back at the important operations that should be done in a level-of-detail

algorithm, we can detect two problematic issues – changing number of vertices and

neighboring geometry.

A level-of-detail algorithm changes the number of vertices; this is actually the main goal

of LOD algorithms. In order to have an LOD algorithm run on the GPU, the GPU must

generate or discard vertices to reach the desired level of detail. Current vertex shaders do not

possess these abilities. These problems can be bypassed, but at a high cost:

In order to discard a vertex we can set its position to be the same as one of its

neighboring vertices. This way a degenerate polygon will be created, and the vertex

will not be seen.

We can bypass the generation of new vertices, by starting with the maximal number

of vertices that theoretically could be chosen by the LOD algorithm, meaning the

highest level of detail.

Both of these workarounds are wasteful. Discarding a vertex in this way does not

actually remove it: the vertex still goes through the entire GPU pipeline,

relieving none of the GPU's workload. Furthermore, the degenerate polygons created

when using this solution might obstruct the rendering process and create abnormalities in the

resulting image. The solution for creating vertices is also problematic because it forces the

entire model to start with the highest level of detail, which is very costly in time and completely

denies the possibility of coherency.

In order to decide on the best level of detail, the GPU must have information about

neighboring geometry. Current vertex shaders do not have any information about their

neighboring geometry. This kind of information can be theoretically stored in texture

memory, so the vertex shader could calculate the level of detail of its vertex using this data.

In this case the vertex shader will be forced to perform numerous accesses to the texture

memory for each vertex. The problem is that fetching data from the texture memory is an

operation that comes with great latency on current hardware, not to mention numerous

operations of this sort for each vertex.

Taking into account all these limitations, we understand that full level-of-detail rendering

within the GPU cannot be achieved unless significant structural and conceptual changes are

made to the graphics hardware.

3.3 Reality

We have seen that the specialized nature of the GPU prevents us from achieving level-of-

detail rendering implemented solely on the GPU side. However, we can try and move more

work to the GPU within its restrictions. The new Shader Model 3.0 [10] introduces some

major improvements to both vertex and fragment shaders. One of the new features introduced

in the vertex shader is the ability of vertex texture fetches [12]. This is the most interesting

feature, because it allows the vertex processor to use memory in the form of texture

memory.

We use this new feature to cache geometric data in the GPU's on-board texture memory.

Geometric data is usually smaller than texture data, so geometric data caching is actually a

realistic goal considering the new texture fetch ability and the growing size of texture

memory in current graphics hardware.

We use the vertex texture fetching ability to cache geometric data, but this is not what this

feature was meant to do in the first place. The reason graphics card manufacturers enabled

this feature was to support displacement mapping in the GPU. We will introduce an

algorithm that uses displacement mapping to achieve level of detail rendering with almost all

the work done within the GPU.

The caching scheme and the displacement mapping algorithm are two different

approaches with one common goal - leveraging the computational power of the GPU to

improve the performance of level-of-detail rendering.

Chapter 4

Caching Data on the GPU On-Board Texture Memory

Current level-of-detail rendering frameworks that use geometry caching have very little

control over how the data is actually cached. These algorithms use extensions provided by the APIs,

such as VBO, to cache their data. However, these algorithms do not have control over how the

data is physically cached, and usually these API extensions cache the data on AGP memory

rather than on-board graphics memory. This limited caching ability, although it significantly

improves performance, does not fully cache the data, because the geometry still has to be

passed at each frame from AGP memory to the GPU.

As seen in the related works section, some algorithms use these extensions to achieve

better results for level-of-detail rendering and some even try to manage the cached data

within the restrictions of the extensions.

The newest generation of graphics hardware allows fetching vertex textures [12] in the

vertex processor. While this ability was originally planned for enabling displacement

mapping in the vertex shader, it actually turned the on-board texture memory into a

general-purpose memory that the vertex shader can use. We will describe a scheme that uses this

memory to cache geometric data on the GPU itself, while maintaining full control of the

caching management.

4.1 Storing Geometric Data in Textures

The geometric data sent from the CPU to the GPU for each vertex contains its position

coordinates (x,y,z) and usually its color (RGBα). Some applications may also need the

normal information of the vertex (x,y,z) and possibly some texture coordinates (x,y) for the

texture mapping stage later in the fragment shader. The position coordinates consist of three

floats, the color occupies a single float (four components of one byte each), normals are three floats, and

texture coordinates are two floats.

If we sum up all of the above data, a graphics application may need to send up to 9 floats

per vertex from the CPU to the GPU. Our technique cuts the amount of data sent to the GPU

by up to 75%.

To achieve that, we store all the needed data for a vertex in a texture. This texture consists

of the geometric data of all the vertices. The level-of-detail algorithm running on the CPU

stores for each vertex some additional data. This additional data consists of two vertex

texture coordinates that map the vertex to the place in the vertex texture where its data is

stored. When the level-of-detail algorithm decides on a vertex to be sent to the GPU, it sends

the two vertex texture coordinates instead of its x,y position coordinates. The z coordinate

can be ignored by sending 2D vertices, or can alternatively contain any other kind of data. The

GPU’s vertex shader fetches all the data belonging to this vertex from the texture upon the

arrival of the two vertex texture coordinates. We show several storage scenarios, so every

application can use its optimal scenario depending on the number of floats it has to send for

each vertex.
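
The mapping between a vertex and its place in the vertex texture can be sketched in Python as follows. This is an illustration only: `pack_vertices` and `fetch` are hypothetical names, a nested list stands in for the texture, and the sketch assumes row-major packing with normalized coordinates that point at texel centers.

```python
def pack_vertices(vertices, width):
    """Store one 4-float tuple per texel, row by row.

    Returns the texture (list of rows) and, for each vertex, the two
    normalized texture coordinates that would be sent instead of its data.
    """
    height = -(-len(vertices) // width)  # ceiling division
    texture = [[(0.0, 0.0, 0.0, 0.0)] * width for _ in range(height)]
    coords = []
    for index, vertex in enumerate(vertices):
        row, col = divmod(index, width)
        texture[row][col] = vertex
        # Address the texel center to avoid sampling a neighboring texel.
        coords.append(((col + 0.5) / width, (row + 0.5) / height))
    return texture, coords


def fetch(texture, uv):
    """Emulates a nearest-texel vertex texture fetch."""
    height, width = len(texture), len(texture[0])
    col = min(int(uv[0] * width), width - 1)
    row = min(int(uv[1] * height), height - 1)
    return texture[row][col]
```

In this sketch the CPU side keeps only `coords`, two floats per vertex, while the four floats of each vertex live in `texture`.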

4.1.1 4 Floats Scenario

We first show the simplest scenario where our technique halves the data size sent each frame.

We consider a model with just the three position coordinates and a color for each vertex.

Without our caching technique, for each vertex chosen by the level-of-detail algorithm 4

floats of data are sent to the GPU. Instead, we store all these 4 floats of data for each vertex

in the texture, and just send two floats per vertex, one float for each vertex texture

coordinate.

The vertex shader can fetch from the texture up to 4 floats at a time, therefore for each

vertex, all the data can be retrieved from the texture with a single fetch operation. This is

extremely important due to the great latency that vertex texture fetches incur in current

graphics hardware.

The 4 floats scenario implies that normals cannot be used. Without normals, real-time

lighting calculation is not possible. However, pre-calculated lighting can be used in the case

that the light source is static throughout the scene.

Figure 4.1: 4 floats scenario without texture caching.

Figure 4.2: 4 floats scenario with texture caching.


4.1.2 8 Floats Scenario

In a different scenario our technique can reduce the data size sent from the CPU to the GPU

by 75%. We consider a model with three position coordinates, three normal

directions and two texture coordinates for the fragment shader. Without our caching

technique, for each vertex chosen by the level-of-detail algorithm 8 floats of data are sent to

the GPU. Instead, we store these 8 floats of data for each vertex in two groups of 4 floats that

are kept in two different textures, but in the same coordinates so just two floats are sent per

vertex, one float for each vertex texture coordinate.

The lighting problem with the 4 floats scenario is not relevant in this scenario, because the

normal data for each vertex is cached in the vertex textures. The vertex shader can calculate

the lighting in real time using this data.

Because we use two textures instead of a single texture as in the 4 floats scenario, we need

two texture fetches for each vertex with this scenario. Some applications may need several

sets of texture coordinates for the fragment shader, or any other additional data. With our

technique we can store 12, 16 or more floats for each vertex to support these applications.

Note that the more floats are needed for each vertex, the more our technique reduces the

communication between the CPU and the GPU.



Due to the great latency that fetching vertex textures implies, we prefer storing multiples of 4

floats for each vertex.

Consider the scenario introduced in the beginning of the section - three position

coordinates, three normal directions, two texture coordinates and a color together sum to 9

floats per vertex, which is not a multiple of 4 floats. In this case we can send one of the 9

floats as the z coordinate instead of omitting the z coordinate, so in the textures we store only

8 floats for each vertex. Now the CPU sends 3D vertices, where the x,y coordinates of each

vertex are the vertex texture coordinates, and the z coordinate stores one of the floats, for

example the color information. By sending 3 floats instead of 9 floats per vertex, we reduce

the CPU/GPU communication by two thirds in this scenario.
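
The savings quoted for the three scenarios follow from simple arithmetic, shown here as a short Python check. The function name `reduction` and the `scenarios` table are introduced only for this illustration; each entry pairs the floats sent per vertex without caching with the floats sent when the data is cached in textures.

```python
def reduction(floats_without_caching, floats_sent):
    """Fraction of per-vertex CPU/GPU traffic saved by texture caching."""
    return 1 - floats_sent / floats_without_caching


# (floats per vertex without caching, floats per vertex with caching)
scenarios = {
    "4 floats": (4, 2),  # position + color -> two texture coordinates
    "8 floats": (8, 2),  # position + normal + texcoords -> two coordinates
    "9 floats": (9, 3),  # as above + color, with color sent in z
}

for name, (before, after) in scenarios.items():
    print(f"{name}: {reduction(before, after):.1%} less data per vertex")
```

The 9 floats case saves slightly less than the 8 floats case because one float still travels with every vertex as the z coordinate.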


4.2 Caching Management

We have seen how the geometric data is stored in the on-board texture memory, but we

have not shown yet how it actually gets there. By turning the texture memory into a cache

memory, we must take the responsibility of managing it.

The largest texture supported by current graphics hardware contains 4096 X 4096 pixels. Each of

these pixels can be represented in the form of RGBα, where each color component is a float.

Therefore each pixel can consist of 4 floats, which is exactly the storage size we need for

each vertex when using our texture caching technique. Having 4K X 4K pixels means we

have storage space for 2^24 vertices, which is over 16 million vertices. With the 8 floats

scenario we use two different textures, so this limit stays the same and does not

become stricter.

If we refer to the 8 floats scenario for a static model, the two textures containing the

vertices of the entire model are created at preprocessing. Before the rendering begins, they

are uploaded once to the texture memory, and from that point no additional management is

required. However, when the model has more vertices than the capacity of a texture or when

the model is dynamic, some extra management is required in the form of vertex switching in

the texture. Even when the vertices of the entire model can be stored in a texture, there still

might be a need for vertex switching in the textures, because the total size of texture memory

is limited too and the application might want to use the texture memory to store color

textures for texture mapping in the fragment processor. The texture we use for caching in the

4 floats scenario may contain up to 2^24 vertices with 4 floats per vertex. If we sum it up we

get exactly 256 Megabytes of memory, which is the total size of texture memory in some

current graphics hardware. In this case, there is no texture memory left for other textures.

This problem might be solved by the fact that the newest graphics hardware available today

has double the amount of texture memory – up to 512 Megabytes, but then again the 8 floats

scenario leads to the same problem. Moreover, the need for other textures may leave us no

choice but to use a smaller caching texture. That again leads to the conclusion that

caching management is a problem that cannot be ignored.
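
The capacity and memory figures used in this section can be verified with a few lines of Python. The constant and function names are chosen only for this illustration, and the sketch assumes 4-byte floats.

```python
TEXTURE_SIDE = 4096      # texels per side of the largest texture
FLOATS_PER_TEXEL = 4     # one float per RGBA component
BYTES_PER_FLOAT = 4      # assumption: 32-bit floats


def texture_capacity(side=TEXTURE_SIDE):
    """Number of vertices (texels) one square caching texture can hold."""
    return side * side


def texture_bytes(side=TEXTURE_SIDE, textures=1):
    """Memory occupied by the given number of caching textures."""
    return textures * texture_capacity(side) * FLOATS_PER_TEXEL * BYTES_PER_FLOAT


MEGABYTE = 2 ** 20
print(texture_capacity())                       # 4 floats scenario capacity
print(texture_bytes() // MEGABYTE)              # one texture, in megabytes
print(texture_bytes(textures=2) // MEGABYTE)    # 8 floats scenario, two textures
```

The last line shows why the 8 floats scenario runs into the same problem even on hardware with twice the texture memory: two full caching textures again occupy all of it.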

We can use any of the caching strategies introduced in the Cached Geometry Manager

(CGM) [16]. For instance, LRU is a strategy that achieves good results with almost every

framework. Whenever the CGM wishes to cache a new vertex, it just chooses the vertex in

the texture to be removed according to the LRU strategy, and places the data of the new

vertex instead of the data of the obsolete vertex. This can be done easily because with the

texture caching technique we have full control of the memory.
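
Replacing an obsolete vertex in the caching texture under the LRU strategy can be sketched as follows. The class `TexelSlotCache` and its `request` method are hypothetical names for this illustration, not part of the CGM of [16]. Each slot stands for one texel of the caching texture, and the sketch only tracks which vertex occupies which slot; uploading the vertex data into the texel is left to the caller.

```python
from collections import OrderedDict


class TexelSlotCache:
    """Maps vertex ids to texel slots and reuses the LRU slot when full."""

    def __init__(self, num_slots):
        self.free = list(range(num_slots - 1, -1, -1))  # pop() yields slot 0 first
        self.slot_of = OrderedDict()  # vertex id -> slot, oldest first

    def request(self, vertex_id):
        """Return (slot, needs_upload) for a vertex chosen for rendering."""
        if vertex_id in self.slot_of:
            self.slot_of.move_to_end(vertex_id)  # mark as most recently used
            return self.slot_of[vertex_id], False
        if self.free:
            slot = self.free.pop()
        else:
            # Evict the least recently used vertex and take over its texel.
            _, slot = self.slot_of.popitem(last=False)
        self.slot_of[vertex_id] = slot
        return slot, True
```

When `needs_upload` is true, the application would write the four floats of the new vertex into the returned texel, which is possible because the texture is fully under its control.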

4.3 Level-of-Detail Rendering using Cached Geometry

A level-of-detail rendering algorithm usually holds a vertex hierarchy data structure that

contains about twice as many vertices as the original model. On the one hand, this

doubled number of vertices might fit entirely in the caching textures. This is the simple case of

uploading the textures once before the rendering begins. On the other hand, there might not

be enough texture memory available for the doubled number of vertices. This can be the case

when the doubled number of vertices is combined with any other reason for insufficient

texture memory as described in the previous section.

Using a level-of-detail framework that supports geometry caching is the best solution for these cases of insufficient texture memory. The CABTT algorithm [17] is a good example of such a framework. The clusters of triangles created in the CABTT algorithm can be cached effectively, with fewer changes in the caching textures than with a framework that is not designed to cache geometry. The CABTT can run a CGM [16] to decide which cluster to replace whenever a new cluster should be cached. The best CGM strategy in this case is the LRU + Error-PriorityQueue strategy.

Figure 4.7: Data structure for the LRU + Error-PriorityQueue strategy. Courtesy of [16].

This particular CGM strategy exploits the fact that vertices belonging to a coarse level of detail are more likely to be displayed than vertices belonging to a finer level of detail. The strategy replaces the last 10% of the LRU list with a priority queue. When a new cluster is needed, the queue takes the least recently used cluster from the list. Of the clusters in the queue, the one with the greatest priority, meaning the finest level of detail, is chosen for removal.
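The selection rule can be illustrated with a simplified sketch. Instead of maintaining a separate priority queue as in [16], the code below scans the least recently used tail of the list directly, which gives the same choice for a single eviction. The names `Cluster`, `chooseVictim` and `tailPercent`, and the convention that a larger `lodLevel` means a finer level of detail, are assumptions made for the illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Cluster {
    int id;
    int lodLevel;  // assumed convention: larger value = finer level of detail
};

// `lruOrder` lists the cached clusters from most to least recently used.
// The eviction candidates are the last `tailPercent` percent of that list;
// among them the cluster with the finest level of detail is chosen.
// Ties go to the less recently used cluster. Returns an index into `lruOrder`.
std::size_t chooseVictim(const std::vector<Cluster>& lruOrder, std::size_t tailPercent = 10) {
    assert(!lruOrder.empty() && tailPercent <= 100);
    std::size_t tail = lruOrder.size() * tailPercent / 100;
    if (tail == 0) tail = 1;  // always keep at least one candidate
    const std::size_t first = lruOrder.size() - tail;
    std::size_t victim = lruOrder.size() - 1;
    for (std::size_t i = lruOrder.size() - 1; i-- > first;) {
        if (lruOrder[i].lodLevel > lruOrder[victim].lodLevel) victim = i;
    }
    return victim;
}
```

A recently used fine cluster is never evicted by this rule, since only the tail of the list is considered.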

4.2 Optimizations

4.2.1 Triangle Strips

A triangle strip is a series of connected triangles, so the application does not have to repeatedly specify all three vertices for each triangle. Instead, it can use the fact that every pair of connected triangles shares two vertex references to reduce the overall number of references.

Figure 4.8: A triangulation to be triangle stripped (vertices labeled A to G).

For example, the above triangulation consists of 5 triangles, so without triangle strips it is represented in the following way:

ABC BCD CDE DEF EFG

Each triangle is represented by 3 vertex references, so 15 references are needed to represent this triangulation without triangle strips. In contrast, only seven vertex references are needed to define the triangle strip of these same 5 triangles:

ABCDEFG
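The relation between the two representations can be sketched in code. The function `expandStrip` below is a hypothetical helper that expands a strip of n references into its n - 2 triangles. It keeps the reference order unchanged, as in the listing above, and ignores the winding flip that graphics APIs apply to every second triangle.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Expands a triangle strip into an explicit triangle list.
// A strip with n >= 3 references encodes n - 2 triangles.
template <typename Ref>
std::vector<std::array<Ref, 3>> expandStrip(const std::vector<Ref>& strip) {
    std::vector<std::array<Ref, 3>> triangles;
    for (std::size_t i = 0; i + 2 < strip.size(); ++i) {
        // Each new reference forms a triangle with the two before it.
        triangles.push_back(std::array<Ref, 3>{{strip[i], strip[i + 1], strip[i + 2]}});
    }
    return triangles;
}
```

For the strip ABCDEFG this yields the five triangles ABC, BCD, CDE, DEF and EFG, that is 15 references expanded from 7.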

A model that is triangle stripped therefore uses less memory. Triangle strips are also supported by the API and the hardware, so processing time improves as well when using them. For these reasons, most objects in current 3D scenes are composed of triangle strips.

The biggest problem with triangle strips, however, is their creation. Creating an optimal set of triangle strips from an arbitrary mesh is an NP-complete problem. Therefore, a heuristic algorithm is needed to create the triangle strips in reasonable time. We can use Terdiman’s [27] stripifier to create efficient triangle strips.

While this pre-processing stripping works fine for static meshes, it is not applicable to dynamic meshes whose topology changes over time. Such meshes are used in level-of-detail rendering schemes, so a run-time stripping solution is needed. Skip Strips [8] provide a solution to this problem by updating triangle strips on the fly for dynamic meshes.

As we have seen, triangle strips offer a substantial optimization for any 3D rendering scheme, and can therefore be used with our algorithm. Furthermore, triangle strips benefit our algorithm even more than other algorithms, because the biggest drawback of our texture caching algorithm is the high latency of vertex texture fetches in current hardware. Triangle strips reduce the number of vertex references by a factor of almost three, leading to almost three times fewer vertex texture fetches.

4.2.2 Geometry Instancing

Geometry instancing is a scheme for efficiently rendering the same object multiple times

with only small differences such as position, color and orientation.

The Sanjusangendo temple in eastern Kyoto, Japan is a good example of the need for geometry instancing. This temple contains 1001 virtually identical Buddha statues of Kannon, the goddess of mercy. In such a case, there is no need to store a model that consists of 1001 identical statues. Instead, a model of a single statue can be instanced 1001 times.

Figure 4.9: The Sanjusangendo temple in eastern Kyoto, Japan.

Display lists can be used to instance the same object several times, each in a different

world-space position. However, all the instances have the exact same color, since display

lists use identical commands for each instance with only the world-space projection

changing.

Our texture caching algorithm can also be used as a kind of geometry instancing scheme. Using our algorithm, once an object is cached in the vertex textures, it can be instanced multiple times. This can be achieved by giving each vertex an additional instance index parameter.

If a certain object is to be instanced several times, we could send along with each of its

vertices an additional parameter, the instance index. This is actually a case of the 9 floats

scenario of our algorithm. Let us say the vertex texture holds 8 floats for each vertex

(position, normals and texture coordinates for the fragment shader). Now, instead of only

sending the two vertex coordinates we also add the z coordinate that stores the instance index

parameter. This parameter lets the vertex shader know to which instance this particular vertex

belongs. According to this data, the vertex shader can change for example the position or the

color of the vertex. If the entire model to be rendered consists solely of multiple instances of

the same object, then the vertex processor can handle all the instancing by itself.

For example, several copies of the same object can be obtained by adding a multiplication

of the instance index to the position of each vertex. This way, a row of instances of the same

object would be displayed.
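The row of instances can be illustrated with a CPU-side emulation of the per-vertex step. This is only a sketch: the type `Float4`, the function `instancePosition` and the `spacing` parameter (the distance between neighbouring instances) are hypothetical, and the real computation would run in the vertex shader.

```cpp
#include <cassert>

struct Float4 { float x, y, z, w; };

// Offsets the position fetched from the vertex texture along X by a
// multiple of the instance index, producing one row of instances.
Float4 instancePosition(Float4 fetched, float instanceIndex, float spacing) {
    fetched.x += instanceIndex * spacing;
    return fetched;
}
```

Instance 0 stays at the cached position, and each following instance is shifted by one more `spacing` along X.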

If the instances of the object do not represent the entire model, or if any instance needs changes that cannot be derived from the instance index parameter, then the Vertex Constants Instancing method introduced by Carucci [4] can be used. This method uses the vertex constants available in current vertex shaders to store the instancing data. The problem with this method is that the number of vertex constants is limited to 256 in current graphics hardware. Therefore, the number of instances of the object is limited too.

This method can be used along with our texture caching technique. The data shared by all

the instances can be cached in the vertex textures, while the unique data for each instance can

be stored in the vertex constants. Both types of data will be derived at run-time by the vertex

shader, while the CPU does very little work in this process.

Figure 4.10: Texture caching combined with vertex constants instancing.

In the example in figure 4.10, the texture caches the data as in a regular 8 floats scenario, while the vertex constants hold the X,Y offsets and the color of each instance. The final position of each vertex is obtained in the vertex shader by adding the X,Y offsets to the X,Y position coordinates that were fetched from the texture. The combination of texture caching with vertex constants instancing helps achieve load balancing between the CPU and the GPU, and relieves the biggest bottleneck, the CPU.

4.2.3 Short Coordinates

Each vertex texture coordinate sent from the CPU to the GPU with our texture caching algorithm can take only up to 4096 different values, which is the limit on texture size in current graphics hardware. This limit means that we need only 12 bits to hold a texture coordinate, but we actually send a 32-bit float for each coordinate. Instead, we can use a short for every coordinate, and halve the data sent for each vertex. A short has 16 bits, so a possible future growth in texture size of up to 16 times per axis will still allow the use of a short. By using short coordinates, we need only 2 shorts (4 bytes) instead of 8 floats (32 bytes) in the 8 floats scenario, reducing the data sent for each vertex by 87.5%.
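The size argument can be sketched as follows. The struct `ShortCoord` and the helper `slotToCoord` are hypothetical names; the sketch assumes unsigned 16-bit coordinates, 4-byte floats and a square caching texture addressed row by row.

```cpp
#include <cassert>
#include <cstdint>

// Two texel indices packed as 16-bit values: 4 bytes per vertex.
struct ShortCoord {
    std::uint16_t u;
    std::uint16_t v;
};

// Converts a linear slot number into (u, v) texel indices for a square
// caching texture with `textureSize` texels per axis.
ShortCoord slotToCoord(std::uint32_t slot, std::uint32_t textureSize = 4096) {
    assert(textureSize > 0 && textureSize <= 65536);  // indices must fit in 16 bits
    assert(slot < static_cast<std::uint64_t>(textureSize) * textureSize);
    return ShortCoord{static_cast<std::uint16_t>(slot % textureSize),
                      static_cast<std::uint16_t>(slot / textureSize)};
}

static_assert(sizeof(ShortCoord) == 4, "2 shorts occupy 4 bytes");
static_assert(8 * sizeof(float) == 32, "8 floats occupy 32 bytes");
```

With 4 bytes sent instead of 32, the saving is 1 - 4/32 = 87.5%, and the 65536 limit in the first assertion corresponds to the 16-fold growth per axis that a short can still accommodate.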

4.3 Implementation

We have implemented our algorithm in C++ with OpenGL as our API, and Cg as the

programming language for the GPU shaders. All the geometry is sent using the VBO

extension of the API. This way, the results of our algorithm can be compared to the fastest

rendering scheme available.

The code for the vertex shader in the 4 floats scenario:

Listing 4.1: Vertex shader code for the 4 floats scenario.

The code for the vertex shader in the 8 floats scenario:

Listing 4.2: Vertex shader code for the 8 floats scenario.

Our fragment shaders in all the scenarios just set the color of the fragment according to the data received from the vertex shader. We implemented them to prevent any additional work that might be done when using the default fragment shaders.

All the optimizations introduced in the previous section were implemented. The implementations of the triangle strips and the short coordinates do not change the code of the vertex shader. However, the geometry instancing optimization requires an addition to the vertex shader (listing 4.3). In order to change the X position of each instance, as for example in one line of Buddhas in the Sanjusangendo temple, the code for calculating the position should be changed:

// Compute the position, adding the instance number to the X position
v_out.oPosition = mul(modelViewProj, fetchedVec1 + float4(instance, 0, 0, 0));

Listing 4.3: Addition of geometry instancing to the vertex shader code.

All the scenarios were implemented in full-mesh rendering, with no level-of-detail rendering. The reason for this was to check the caching scheme itself, without it being affected by the implementation of any level-of-detail rendering scheme. Nevertheless, our caching scheme can be used with any level-of-detail rendering due to its generalized nature.

Results concerning all the scenarios and optimizations are shown in the next section.

4.4 Results

We have tested our implementation on a 3 GHz Pentium 4 machine with 1 GB of RAM and an NVIDIA GeForce 6600 GT graphics card (128 MB texture memory). We compare the results of our algorithm with a fully optimized rendering scheme using the VBO extension, which caches the data in AGP memory. We chose to compare our algorithm with VBO because it is currently the fastest way of rendering, and comparing our scheme to the fastest one available gives a trustworthy comparison. Moreover, our algorithm runs on top of the VBO extension anyway, so it would be misleading to compare it with anything but VBO.

We compare the results of our texture caching algorithm (TC) to the results of the VBO extension for both the 4 floats and 8 floats scenarios. All these tests are made on three models of various sizes. The shark model is the smallest model tested, the Buddha model is over 12 times bigger than the shark, and the teeth model is about 4 times bigger than the Buddha.

Model     Vertices   Triangles   Frames Per Second (FPS)
                                 TC-4     TC-8     VBO-4    VBO-8
Shark     2,560      5,116       100      100      100      100
Buddha    32,328     67,240      100      50       100      100
Teeth     131,685    263,350     33.38    16.68    100      50

Table 4.1: Results of the Texture Caching and VBO schemes in 4 floats and 8 floats scenarios for various models.

We can see that for small models consisting of a few thousand triangles such as the shark

model, both texture caching and VBO schemes obtain 100 frames per second which is the

frame rate limit of the screen.

Figure 4.11: Buddha model with texture caching: a) 4 floats scenario with no lighting, b) 4 floats scenario with pre-calculated lighting, c) 8 floats scenario with real-time lighting.

For bigger models such as the Buddha model, both schemes reach the 100 frames per second limit in the 4 floats scenario. However, the texture caching scheme fails to reach the 100 FPS limit in the 8 floats scenario, and achieves only half that rate. The frame rate drops because the 8 floats scenario needs two fetches per vertex, instead of the single fetch of the 4 floats scenario. This result reflects the relatively high latency of fetching vertex textures, as mentioned in Kilgariff and Fernando’s [15] review of the shader model 3.0 programming model.

To see the real comparison between the schemes, a larger model should be checked. The

teeth model consists of over a quarter of a million triangles, so even the VBO fails to reach

the 100 FPS limit for the 8 floats scenario when rendering it. The 4 floats scenario just

reaches the 100 FPS limit. Comparing these results to the texture caching scheme yields a

three to one ratio in rendering speed in favor of the VBO scheme, which is actually not

surprising due to the latency of fetching vertex textures.

To confirm that this latency is the main issue slowing our algorithm, we implemented the short coordinates optimization. Although this optimization halves the CPU/GPU communication, no change at all is detected in the overall rendering speed. Such a result can only occur when a process has a clear bottleneck that virtually cancels the effect of improvements in other parts of the process. The only thing that might cause such a bottleneck is the vertex texture fetch, because this is the only operation that appears in the texture caching scheme and not in the traditional VBO scheme. However, according to the graphics hardware manufacturers, this latency will decrease considerably in future graphics hardware.

To show the results of the geometry instancing optimization we use instancing of 300 Buddha models that are positioned similarly to one line of Buddhas in the Sanjusangendo temple. To do that, we use only one Buddha model that is cached in the textures. The position of each Buddha instance is shifted according to its instance number.

Figure 4.12: 300 Buddhas using instancing in the 8 floats scenario.

To the 300 Buddhas we also added the triangle strips optimization. We triangle strip the Buddha model and obtain the same vertex and triangle counts. This way a true comparison can be made between a regular model and a triangle stripped model (TS).

Model             Vertices      Triangles     Frames Per Second (FPS)
                                              TC-4     TC-8     VBO-4    VBO-8
300 Buddhas       ~10 million   ~20 million   0.479    0.234    1.39     0.61
300 TS Buddhas    ~10 million   ~20 million   1.43     0.704    4        1.69

Table 4.2: Results of the Texture Caching and VBO schemes in 4 floats and 8 floats scenarios for 300 Buddhas with and without triangle strips.

The results still point to the fact that with the current latency of fetching vertex textures,

the traditional VBO scheme achieves better results than our texture caching scheme.

However, the ratio between VBO and texture caching rendering speeds is reduced when

using instancing and triangle strips.

Figure 4.13: FPS ratio between VBO and texture caching in 4 floats and 8 floats scenarios for the teeth model, the 300 Buddhas instancing and the 300 triangle stripped Buddhas instancing.

Triangle strips reduce the number of vertices going through the graphics pipeline on each frame by a factor of about three. With a reduced number of vertices per frame, we also reduce the number of vertex texture fetch operations. These fetches cause the greatest latency in the 8 floats scenario, which is why the best ratio is achieved with the 300 triangle stripped Buddhas in the 8 floats scenario.
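The ratios discussed here can be recomputed from the frame rates in Tables 4.1 and 4.2. The sketch below does only that; the helper `fpsRatio` and the constant names are hypothetical, and the rounded values in the comments follow from the tabulated numbers.

```cpp
#include <cassert>
#include <cmath>

// Ratio between the VBO and texture caching (TC) frame rates.
// A value closer to 1 means a smaller gap between the two schemes.
double fpsRatio(double vboFps, double tcFps) {
    assert(tcFps > 0.0);
    return vboFps / tcFps;
}

// Frame rates copied from Tables 4.1 and 4.2.
const double kTeeth4 = fpsRatio(100.0, 33.38);      // about 3.00
const double kTeeth8 = fpsRatio(50.0, 16.68);       // about 3.00
const double kBuddhas4 = fpsRatio(1.39, 0.479);     // about 2.90
const double kBuddhas8 = fpsRatio(0.61, 0.234);     // about 2.61
const double kStripped4 = fpsRatio(4.0, 1.43);      // about 2.80
const double kStripped8 = fpsRatio(1.69, 0.704);    // about 2.40
```

The smallest ratio, about 2.40, belongs to the 300 triangle stripped Buddhas in the 8 floats scenario.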

Chapter 5

GPU-Based Terrain Level of Detail Rendering using Displacement Mapping

We present a novel GPU-based algorithm for run-time rendering of large terrain models. Similarly to the GPU-based version of geometry clipmaps [2], our algorithm sends constant vertex and polygon lists from the CPU to the GPU. Displacement mapping is used to derive the elevation data from 2D vertex textures representing height maps. As with geometry clipmaps, our algorithm requires very little computation on the CPU side, with most of the work done on the faster and more powerful GPU. However, contrary to geometry clipmaps, our algorithm does not have to use zero-area triangles and transition regions to deal with problems such as cracks and popping effects. Our algorithm overcomes these limitations by using progressive levels of detail, as opposed to the discrete levels of detail used in the geometry clipmaps algorithm. Our algorithm is based on extracting the elevation data of each vertex from a displacement map that resides in the texture memory of the GPU.

The CPU part of our algorithm calculates the intersections of the terrain with the view frustum at the beginning of each frame with respect to the position and angle of the camera (the viewpoint). We refer to the surface between these intersections as the frustum surface. The CPU sends to the GPU four points that define the frustum surface and a constant rectilinear grid that includes constant vertex and polygon lists. We refer to this grid as the rectilinear grid. The elevation data of the entire terrain might be too big to fit in one texture, so the CPU also has to manage an out-of-core scheme that resolves this problem.

The rectilinear grid received from the CPU is mapped by the GPU to the frustum surface.

The x and y coordinates of each vertex in the grid received by the GPU are mapped to their

relative position in the frustum surface using simple algebraic calculations. The new x,y

coordinates are also used to extract the elevation value of the vertex from the displacement

map.

Figure 5.1: Mapping a rectilinear grid of 9 X 9 vertices to the frustum surface. The mapped grid on the right is not triangulated for clarity.

In the frustum surface, the area closer to the camera is narrower than the areas further from the camera. Mapping the rectilinear grid to such a surface results in vertices near the camera being closer to each other, and thus denser. The higher density of vertices closer to the camera yields a higher level of detail near the camera. The level of detail progressively decreases for areas in the frustum surface that are further away from the camera.

Such a framework displays a constant number of vertices in continuous and progressive levels of detail. The algorithm ensures a constant frame rate regardless of the size or complexity of the terrain.

5.1 CPU

Level-of-detail rendering algorithms usually rely on per-vertex computations on the CPU. Even when the computations are simple, they are repeated for each vertex, so the overall calculation burdens the CPU. Our algorithm relieves the CPU of most of the workload by moving all per-vertex calculations to the GPU. The only tasks the CPU has to perform each frame are defining the current frustum surface and sending a rectilinear grid to the GPU.

Defining the current frustum surface involves some calculations, in the form of intersections of vectors with planes. It is done only once at the beginning of each frame, so its influence on the CPU’s workload is negligible.

Sending the rectilinear grid to the GPU is an even simpler task in terms of CPU workload. The grid is constant, so no calculations at all are made on the CPU for this task. Because the grid is constant, it can also be efficiently cached in AGP memory using the API’s VBO extension. Therefore, it does not overload the CPU/GPU communication.

The out-of-core version of the algorithm does, however, need to partition the grid into several sub-grids. Nevertheless, the vertices of the grid still remain constant, so the partitioning only slightly affects the overall rendering time.

5.1.1 Frustum Surface

Defining the frustum surface is simply performed by a few algebraic calculations, similar to

view frustum calculation. The intersection points of the view frustum’s top and bottom

planes with the terrain plane are the four points needed for defining the frustum surface.

As with the lens of a hand-held video camera, the computer graphics camera (viewpoint) also has a virtual screen in front of it, on which the image is displayed. This virtual screen is called the viewport, and its exact position and size are calculated using parameters such as the focal length of the camera. The viewport can also be thought of as the window on which the image is displayed.

To find the intersection points of the view frustum with the terrain plane, we shoot rays (or vectors) from the viewpoint through the four corners of the viewport. The ray going to the bottom-left corner of the viewport is called $\vec{A}$, the ray going to the bottom-right corner is called $\vec{B}$, and the rays going to the top corners are called $\vec{C}$ and $\vec{D}$ respectively. The four rays continue until they intersect the terrain plane at four points that are named A, B, C, and D respectively.
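Each of the four points results from a ray-plane intersection. The following is a minimal sketch, assuming the base terrain plane is z = 0 and elevation runs along z; the type `Vec3` and the function `intersectTerrainPlane` are hypothetical names.

```cpp
#include <cassert>
#include <cmath>

struct Vec3 { double x, y, z; };

// Intersects the ray origin + t * dir (t > 0) with the terrain plane z = 0.
// Returns false when the ray is parallel to the plane or meets it behind
// the viewpoint; these are the cases in which a point must be repositioned.
bool intersectTerrainPlane(const Vec3& origin, const Vec3& dir, Vec3& hit) {
    const double eps = 1e-12;
    if (std::fabs(dir.z) < eps) return false;  // parallel to the plane
    const double t = -origin.z / dir.z;
    if (t <= 0.0) return false;                // intersection behind the viewpoint
    hit = Vec3{origin.x + t * dir.x, origin.y + t * dir.y, 0.0};
    return true;
}
```

Only four such intersections are computed per frame, which is why this step places a negligible load on the CPU.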

Figure 5.2: Defining the frustum surface.

The four intersection points usually define the frustum surface, but there are some cases in which some of the A, B, C, and D points should be repositioned. Such cases occur when the C and D points fall behind the viewpoint, or when the A and B points are too far from the viewpoint. Next we review these cases.

When the horizon is visible, the top plane of the view frustum does not intersect the terrain plane in front of the viewpoint. In this case the top plane of the view frustum actually intersects the terrain plane behind the viewpoint, opposite to the viewing direction of the camera. In such a scenario the C and D points should be repositioned to the far side of the view frustum. The far side of the view frustum is a parameter that can be defined according to the application or the scene, but it must be far enough from the camera to ensure that no visible geometry is culled. When the $\vec{C}$ and $\vec{D}$ rays are almost parallel to the terrain plane and intersect it further than the far side of the view frustum, the C and D points are again repositioned to the far side of the view frustum.
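One possible reading of this repositioning rule is sketched below. The text does not give the exact geometry, so the choices here are assumptions: the terrain plane is z = 0, the camera is above it, the far side is measured as a horizontal distance `farDistance` from the viewpoint, and a repositioned point is placed at that distance along the horizontal direction of the ray. The names `Vec3` and `topCornerPoint` are hypothetical.

```cpp
#include <cassert>
#include <cmath>

struct Vec3 { double x, y, z; };

// Ground point for a top-corner ray, limited to `farDistance` (measured
// horizontally from the viewpoint). Assumes eye.z > 0 and a ray with a
// non-zero horizontal component.
Vec3 topCornerPoint(const Vec3& eye, const Vec3& dir, double farDistance) {
    const double horiz = std::sqrt(dir.x * dir.x + dir.y * dir.y);
    assert(horiz > 0.0 && farDistance > 0.0);
    double reach = farDistance;             // default: reposition to the far side
    if (dir.z < 0.0) {                      // the ray descends towards the plane
        const double t = -eye.z / dir.z;    // ray parameter at z = 0
        reach = std::fmin(t * horiz, farDistance);
    }
    return Vec3{eye.x + dir.x / horiz * reach, eye.y + dir.y / horiz * reach, 0.0};
}
```

A ray that points upwards (the horizon is visible) and a ray that meets the plane beyond the far side both end up at the same far distance, which corresponds to cases (a) and (b) of figure 5.3.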

Figure 5.3: Repositioning of points C and D: a) the top plane of the view frustum intersects the terrain plane behind the viewpoint, b) the top plane of the view frustum intersects the terrain plane very far from the viewpoint, c) C and D points repositioned according to the far plane of the view frustum for both (a) and (b) scenarios.

In some cases the $\vec{A}$ and $\vec{B}$ rays could be problematic. When calculating the frustum surface we refer only to the base terrain plane and completely ignore the elevation data, so a high mountain right in front of the camera could be left out. We should try not to involve the elevation data in the frustum surface calculation, because that would oblige the CPU to perform per-vertex calculations and therefore contradict the GPU-based nature of our algorithm.

To prevent such cases, we have to choose the A and B points much closer to the camera, so that a high mountain right in front of the camera is rendered and not ignored. However, this is view and scene dependent; for instance, in a flight simulator that flies high above the terrain such a scenario will hardly occur. Therefore, the A and B points could be positioned anywhere between the point right beneath the camera and the original intersection points of the $\vec{A}$ and $\vec{B}$ rays with the terrain plane. This is defined by a pre-calculated near parameter that changes according to the application or the scene.

There is a possibility of calculating the near parameter at run-time using heuristics that depend on pre-computed data, to avoid extensive computations on the CPU. These heuristics must guarantee that the parameter changes smoothly, with no jumps, to ensure that no popping effects occur. If height averages over constant-sized regions are calculated in preprocessing, then a heuristic that determines the parameter based on these pre-computed values can be used.

Figure 5.4: Mount Everest, Nepal: a) the $\vec{A}$ and $\vec{B}$ rays go right through Mount Everest, but the mountain is left outside the frustum surface and hence is not rendered, b) A and B points repositioned, so the mountain enters the frustum surface and is rendered correctly.

Figure 5.5: Side view of figure 5.4: a) the mountain is outside the frustum surface, b) A and B points repositioned, so the mountain top enters the frustum surface.

5.1.2 Constant Rectilinear Grid

After defining the frustum surface the CPU has to send the geometry to the GPU. The CPU just creates a constant rectilinear vertex grid which is sent to the GPU. The grid consists of constant vertex and polygon lists, so it can be efficiently cached in AGP memory using the VBO extension.

The size of the grid is determined once at the beginning of the rendering, and from that point on the exact same grid is sent to the GPU every frame, relieving the CPU from almost any work in each frame. The size of the grid is mainly derived from the performance of the GPU. A bigger grid gives a better representation of the model, but a slower frame rate. Therefore, the size of the grid is chosen as the maximal size that allows an interactive frame rate of at least 24 frames per second.
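A constant grid of this kind could be built once at start-up along the following lines. This is a sketch under assumptions: the vertex data is taken to be just the (row, column) grid indices, and the polygon list is stored as one triangle strip per row of quads. The names `GridMesh` and `buildGrid` are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct GridMesh {
    std::vector<std::uint16_t> coords;                  // (row, column) pairs, one per vertex
    std::vector<std::vector<std::uint32_t>> rowStrips;  // one triangle strip per row of quads
};

// Builds an m x n vertex grid (m rows, n columns). Each strip zig-zags
// between two neighbouring rows of vertices.
GridMesh buildGrid(std::uint32_t m, std::uint32_t n) {
    assert(m >= 2 && n >= 2 && m <= 65536 && n <= 65536);
    GridMesh mesh;
    for (std::uint32_t r = 0; r < m; ++r) {
        for (std::uint32_t c = 0; c < n; ++c) {
            mesh.coords.push_back(static_cast<std::uint16_t>(r));
            mesh.coords.push_back(static_cast<std::uint16_t>(c));
        }
    }
    for (std::uint32_t r = 0; r + 1 < m; ++r) {
        std::vector<std::uint32_t> strip;
        for (std::uint32_t c = 0; c < n; ++c) {
            strip.push_back(r * n + c);        // vertex on row r
            strip.push_back((r + 1) * n + c);  // vertex on row r + 1
        }
        mesh.rowStrips.push_back(strip);
    }
    return mesh;
}
```

Storing one strip per row keeps every strip within a single horizontal band, so the grid can later be split between any two rows without cutting a strip.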

5.1.3 Grid Partitioning

As stated in the previous section, the vertex grid remains constant throughout the running of

our algorithm. However, due to out-of-core considerations, sometimes the grid has to be

partitioned. The grid is only horizontally partitioned, so that the resulting sub-grids will not

have any connectivity problems. This means that the grid is partitioned along horizontal lines

of vertices. This way, triangle strips can be used freely without the possibility that a triangle

strip ends in the middle of a line of triangles because of some partitioning.

In some cases, mainly when using the out-of-core version of the algorithm, the usage of a

single displacement map (vertex texture) for rendering the entire terrain is not enough. In

these situations, different areas of the grid need to use different vertex textures, therefore

unbinding and binding of these textures must be done while the grid is being sent to the GPU.

In order to switch the vertex textures, the CPU must first horizontally partition the grid to

several sub-grids, in a way that the texture bind/unbind operations occur between the sending

of two sub-grids.

Horizontally partitioning the grid seems like a simple and safe task. However, this is not exactly the case, because cracks may appear on the borderline between two sub-grids. This problem occurs because each vertex in the rectilinear grid is sent to the GPU with two lines of triangles, first as a top vertex for the triangles beneath it, and later as a bottom vertex for the triangles above it. A problem arises when the vertex texture is changed between the two lines of triangles: two different elevation values could be fetched by the GPU for practically the same vertex, resulting in a crack.

To solve this problem the CPU can send zero-area triangles along the sub-grid borderlines, as done in the GPU-based version of geometry clipmaps [2]. There, the zero-area triangles are essential because they deal with T-junctions created by borders of different clipmap sizes. In our algorithm, we can solve this problem in a much more elegant way, by using two bound textures at a time: the main displacement map and an auxiliary one. A simple branching mechanism in the vertex processor chooses the auxiliary displacement map for a borderline vertex, and the main displacement map for the rest of the vertices. In such a scheme, the vertices of the first line of each sub-grid have the same elevation as the same vertices in the last line of the previous sub-grid, which resolves the cracks problem.

This solution can be very time-costly due to the known latency of branching in the GPU, an operation that stalls the GPU pipeline and thus contradicts the parallel SIMD (Single Instruction, Multiple Data) nature of the GPU’s components. However, as stated in the work by Harris and Buck [13], SIMD branching is very useful in cases where the branch conditions are fairly spatially coherent. In our solution this is exactly the case, since the path to the auxiliary texture is taken only on borderline vertices, so the great majority of the vertices take the ordinary path in a very spatially coherent way.

Another issue that arises is that the grid is no longer constant throughout the running of our algorithm. However, this is not a critical problem, because the vertices of the grid still remain constant, so the partitioning only slightly affects the overall rendering time. Moreover, sending the relatively small grid is not very time consuming in the first place.

5.2 GPU

At the beginning of each frame, the GPU receives four points from the CPU. These points define the four corners of the frustum surface. Then, the GPU expects a constant rectilinear grid from the CPU. All the vertices in the grid go through the vertex shader, where they are mapped to their relative place in the frustum surface. For each vertex, the vertex shader also fetches the elevation data of the vertex from the displacement map.

Fetching a texture implies great latency, but in spite of that it is worthwhile to fetch a texture even for every vertex in the grid. The superior computational power of the GPU compared to the CPU, as expressed by the mapping operation made for each vertex, compensates for the fetching latency. Furthermore, this calculation is performed on the GPU during the idle time that the texture fetching latency implies, so it does not slow the GPU down beyond the texture fetching latency itself. In addition, it is also possible to use regular RGB textures along with the vertex textures. The RGB textures are used by the fragment processor.

5.2.1 Mapping the Rectilinear Grid to the Frustum Surface

The first operation done by the vertex shader is repositioning each received vertex, which is

done by mapping the position of the vertex in the constant rectilinear grid to its position in

the frustum surface.

Consider an m X n sized rectilinear grid, and a frustum surface defined by four points A,

B, C, and D. The 0,0 vertex in the grid is mapped to point A, the 0,n-1 vertex is mapped to

point B, the m-1,0 vertex is mapped to point C, and the m-1,n-1 vertex is mapped to point D.

Each inner vertex in the grid is mapped to its corresponding place in the frustum surface by

applying a set of calculations.

Figure 5.6: Mapping the corners of the rectilinear grid to the corners of the frustum surface.

Consider the x,y vertex in the grid. We calculate its corresponding position in the frustum surface using vectors, because vector calculations are more suitable for the GPU. The A, B, C, and D points are treated as the vectors $\vec{A}$, $\vec{B}$, $\vec{C}$, and $\vec{D}$ respectively. The vector $\vec{AC}$ represents the relative position of the vertex along the A-C edge. It is calculated by adding the fraction $\frac{x}{m-1}$ of $\vec{C} - \vec{A}$ to vector $\vec{A}$:

$$\vec{AC} = \vec{A} + \frac{x}{m-1}\,(\vec{C} - \vec{A})$$ (Equation 5.1)

Similarly, the vector $\vec{BD}$ that represents the relative position of the vertex along the B-D edge is calculated:

$$\vec{BD} = \vec{B} + \frac{x}{m-1}\,(\vec{D} - \vec{B})$$ (Equation 5.2)

If we refer to the AC-BD edge, created by equations 5.1 and 5.2, then the final position $\vec{P}$ represents the relative position of the vertex along this edge:

$$\vec{P} = \vec{AC} + \frac{y}{n-1}\,(\vec{BD} - \vec{AC})$$ (Equation 5.3)

The vertex shader repositions each vertex using the three equations, so eventually all the

constant rectilinear grid is mapped to the frustum surface.
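The mapping amounts to two interpolations along the A-C and B-D edges followed by one interpolation between their results. The sketch below is a CPU-side illustration of that computation, not the shader code itself. It assumes, following the corner assignment given earlier, that the first grid index (written `i` here) runs along the A-C edge and the second (`j`) runs from the A-C edge to the B-D edge; the names `Vec2`, `mix` and `mapToFrustumSurface` are hypothetical.

```cpp
#include <cassert>

struct Vec2 { double x, y; };

// Linear interpolation: a + t * (b - a).
Vec2 mix(const Vec2& a, const Vec2& b, double t) {
    return Vec2{a.x + t * (b.x - a.x), a.y + t * (b.y - a.y)};
}

// Maps grid vertex (i, j) of an m x n grid onto the quad A, B, C, D, with
// (0,0)->A, (0,n-1)->B, (m-1,0)->C and (m-1,n-1)->D.
Vec2 mapToFrustumSurface(int i, int j, int m, int n,
                         const Vec2& A, const Vec2& B, const Vec2& C, const Vec2& D) {
    assert(m > 1 && n > 1 && i >= 0 && i < m && j >= 0 && j < n);
    const double s = static_cast<double>(i) / (m - 1);  // progress from A,B towards C,D
    const double t = static_cast<double>(j) / (n - 1);  // progress from the A-C edge to the B-D edge
    const Vec2 ac = mix(A, C, s);   // position along the A-C edge
    const Vec2 bd = mix(B, D, s);   // position along the B-D edge
    return mix(ac, bd, t);          // final position between the two
}
```

Because the near edge A-B is shorter than the far edge C-D, equally spaced grid rows produce vertices that are denser near the camera.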

5.2.2 Extracting Elevation Data from the Displacement Map

After a vertex is mapped to the frustum surface, it has the correct real-world x,y coordinates, but it still lacks the elevation data, which is extracted from the displacement map. An x,y point in the displacement map contains the elevation data of a vertex located at the same x,y point in the real world, or the frustum surface. This means that the x,y coordinates derived from the final position vector (equation 5.3) are exactly the coordinates in the displacement map where the elevation data of the vertex is stored. The result of equation 5.3 can be used to extract the elevation data from the displacement map using the vertex texture fetch operation.

There is a problem though, when the x,y coordinates derived from vector are placed

outside the terrain model, and therefore outside the displacement map. In such a case, the x

coordinate of the displacement map is clamped to the closest value available in the map, and

so does the y coordinate. One coordinate is clamped when only that coordinate is outside the

displacement map. In such a scenario, the other coordinate remains with its value as derived

from vector . This clamping implies that elevation data of vertices outside the terrain model

is copied from the elevation in the edges of the displacement map, creating an inaccurate

representation of the terrain.

In order to overcome this limitation, the values on the frame of the displacement map are set to zero. With this approach, every vertex outside the terrain receives a zero elevation value, and the terrain is accurately represented.
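
The effect of the clamping and of the zeroed frame can be sketched as follows. The helper names `zero_frame` and `fetch_clamped` are hypothetical, the displacement map is modeled as a plain list of rows, and a nearest-texel lookup stands in for the vertex texture fetch.

```python
import math


def zero_frame(height_map):
    """Return a copy of the map whose outermost rows and columns are zero."""
    rows, cols = len(height_map), len(height_map[0])
    out = [row[:] for row in height_map]
    for c in range(cols):
        out[0][c] = 0.0
        out[rows - 1][c] = 0.0
    for r in range(rows):
        out[r][0] = 0.0
        out[r][cols - 1] = 0.0
    return out


def fetch_clamped(height_map, x, y):
    """Nearest-texel lookup; each coordinate is clamped independently."""
    rows, cols = len(height_map), len(height_map[0])
    cx = min(max(math.floor(x), 0), cols - 1)
    cy = min(max(math.floor(y), 0), rows - 1)
    return height_map[cy][cx]


# Hypothetical 4 x 4 terrain with non-zero elevation on its border.
terrain = [[5.0, 5.0, 5.0, 5.0],
           [5.0, 9.0, 9.0, 5.0],
           [5.0, 9.0, 9.0, 5.0],
           [5.0, 5.0, 5.0, 5.0]]
framed = zero_frame(terrain)

print(fetch_clamped(terrain, -3, 1))  # outside: copies the border elevation
print(fetch_clamped(framed, -3, 1))   # outside: reads the zeroed frame
```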

5.3 Out-of-Core

Out-of-core techniques are used to efficiently support view-dependent simplification for datasets much larger than main memory. In our framework, out-of-core refers to a much more fundamental problem than main memory capacity, because the maximal texture size is much smaller than main memory. On current hardware the maximal texture size is 4096 X 4096. Using our algorithm in-core, meaning with a single displacement map, therefore limits the terrain to a maximum of 4096 X 4096 = 16,777,216 vertices (16M). Some datasets are much larger than this limit.

We introduce an out-of-core technique to support large datasets in our framework. At

preprocessing we create a height map pyramid, where each level in the pyramid includes all

the dataset in its corresponding level of detail. The base of the pyramid stores the original

height map, whereas the top of the pyramid stores the coarsest height map. At run-time, the

constant rectilinear grid is partitioned into smaller sub-grids, where each sub-grid is

associated with a different level of the height map pyramid based on geometry clipmaps. The

clipmaps are incrementally updated as the viewpoint moves.

5.3.1 Height Map Pyramid

We construct a pyramid of height maps from the original height map of the entire dataset.

The base of the pyramid is the original height map. The pyramid construction process builds

the rest of the pyramid level by level. Each new level uses the height map of its previous

level and a geometric approximation metric to select a coarser data representation. The data in each level is coarser than in its previous level, since it consists of only half the control points along each axis, and it therefore requires only a quarter of the memory.

Figure 5.7: Height map pyramid (unscaled).

Each point in a height map corresponds to the elevation value of a vertex in a level of

detail fitting the level of the height map in the pyramid. During the construction of a

particular pyramid level, the elevation value for every vertex in that level is calculated based

on the elevation values of vertices from the previous level. A single vertex value in the new

pyramid level is based on a 5 X 5 vertex matrix representing its adjacent geometry in the

previous pyramid level. Any naive metric such as the average metric can be used here,

although we use a novel metric to achieve better results. The basic idea of our metric is to

find a 3 X 3 vertex matrix that approximates the 5 X 5 matrix with minimal geometric error.

The returned value of the metric is the elevation value of the central vertex in the 3 X 3

matrix.
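
A minimal sketch of the pyramid construction, assuming the naive average metric mentioned above (our own metric is not reproduced here). The names `coarsen` and `build_pyramid`, the clamping of the 5 X 5 window at the map borders, and the rounding of odd sizes are assumptions.

```python
def coarsen(height_map):
    """Build the next pyramid level with half the points along each axis."""
    rows, cols = len(height_map), len(height_map[0])
    new_rows, new_cols = (rows + 1) // 2, (cols + 1) // 2
    coarse = []
    for i in range(new_rows):
        coarse_row = []
        for j in range(new_cols):
            total = 0.0
            # 5 x 5 neighborhood around the matching vertex of the finer level.
            for di in range(-2, 3):
                for dj in range(-2, 3):
                    r = min(max(2 * i + di, 0), rows - 1)
                    c = min(max(2 * j + dj, 0), cols - 1)
                    total += height_map[r][c]
            coarse_row.append(total / 25.0)  # average metric (assumption)
        coarse.append(coarse_row)
    return coarse


def build_pyramid(height_map):
    """Return all levels, from the original map (base) to the coarsest (top)."""
    levels = [height_map]
    while len(levels[-1]) > 1 or len(levels[-1][0]) > 1:
        levels.append(coarsen(levels[-1]))
    return levels


flat = [[3.0] * 8 for _ in range(8)]  # hypothetical 8 x 8 flat terrain
pyramid = build_pyramid(flat)
print([len(level) for level in pyramid])
```

Any other metric can be substituted by replacing the averaging step, as long as it returns a single elevation value per 5 X 5 neighborhood.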

5.3.2 Grid Partitioning based on Clipmaps

When rendering huge terrains, a single displacement map is not able to hold all the elevation

data of the terrain. In order to cover the entire terrain, we must therefore use a coarser

representation of the terrain. To maintain a fairly high resolution to the elevation data of the

areas near the viewpoint, and at the same time cover the entire terrain, we must use height

maps of various levels of detail. For that purpose we use a simple version of geometry

clipmaps [18]. Contrary to the original geometry clipmaps, our clipmaps are in the shape of

full rectangles without the holes corresponding to the clipmaps of previous levels.

We use the intersection points of the clipmaps with the frustum surface to determine the

vertex grid partition. Our algorithm starts from the finest clipmap that is placed around the

viewpoint, and finds the maximal sub-grid that is fully covered by that clipmap. We continue

this process until the entire frustum surface is covered by clipmaps. Since our partitioning is

only horizontal, some vertices might be covered by an outer and coarser clipmap even though

they are placed within the area of a finer clipmap. This is the reason why our clipmaps are

full rectangles with no holes in their centers, contrary to the ring-shaped geometry clipmaps

[2].

Figure 5.8: Grid partitioning based on clipmaps.
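
The horizontal partitioning can be sketched as a greedy pass over the grid rows, ordered from the camera outwards. The sketch assumes square clipmaps centered on the viewpoint whose half-size doubles with each coarser level, and it summarizes each row by the largest axis-aligned distance of its vertices from the viewpoint. The names `row_extent` and `partition_rows` and the sample numbers are hypothetical.

```python
def row_extent(row_positions, viewpoint):
    """Largest axis-aligned distance of a row's vertices from the viewpoint."""
    vx, vy = viewpoint
    return max(max(abs(px - vx), abs(py - vy)) for px, py in row_positions)


def partition_rows(row_extents, finest_half_size, num_levels):
    """Split rows into bands (level, first_row, last_row), finest level first."""
    bands = []
    start = 0
    for level in range(num_levels):
        half = finest_half_size * (2 ** level)  # assumed clipmap half-size
        end = start
        # Extend the band while the whole row is covered by this clipmap.
        while end < len(row_extents) and row_extents[end] <= half:
            end += 1
        if end > start:
            bands.append((level, start, end - 1))
            start = end
        if start == len(row_extents):
            break
    if start < len(row_extents):
        raise ValueError("coarsest clipmap does not cover the frustum surface")
    return bands


# Hypothetical extents of seven grid rows, nearest row first.
extents = [1.0, 2.0, 3.0, 5.0, 7.0, 12.0, 15.0]
bands = partition_rows(extents, finest_half_size=4.0, num_levels=3)
print(bands)
```

Since a row is assigned to a level only when all of its vertices are covered, a row that reaches outside a fine clipmap is handled entirely by the next coarser one, even if some of its vertices lie inside the finer clipmap.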

The clipmaps are also used by the GPU as the displacement maps from which the elevation data is fetched. At any given time exactly two clipmaps are bound: one as the main displacement map and another as the auxiliary displacement map.

Following the partitioning of the grid, the resulting sub-grids are sent to the GPU. After a particular sub-grid is sent, its associated clipmap is re-bound as the auxiliary displacement map, while the previous auxiliary displacement map is unbound. The clipmap associated with the next sub-grid is then bound as the main displacement map.

5.3.3 Updating Clipmaps

As the viewpoint moves, each clipmap translates within its pyramid level in order to remain

centered about the viewpoint. Since the motion of the viewpoint is usually coherent, only a

small L-shaped region of the window needs to be incrementally processed in each frame.

Furthermore, the relative motion decreases exponentially at coarser levels, therefore coarse

level clipmaps seldom require updating.

Figure 5.9: L-shaped region created between sequential frames (t and t+1) within a clipmap.

Rendering to the textures (clipmaps) using the fragment shader makes it possible to modify the L-shaped regions of the clipmaps. Rendering to textures is sometimes a time-costly operation, but because we update only a thin L-shaped region of each clipmap, the overall updating time does not overload the system.
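
To illustrate how small the updated region is, the following sketch lists the texels of a clipmap window that become newly covered after the window moves. It assumes a square window addressed by the integer coordinates of its corner within the pyramid level; `l_shaped_region` and the sample values are hypothetical.

```python
def l_shaped_region(old_origin, new_origin, size):
    """Texels covered by the new window but not by the old one."""
    ox, oy = old_origin
    nx, ny = new_origin
    region = []
    for y in range(ny, ny + size):
        for x in range(nx, nx + size):
            inside_old = ox <= x < ox + size and oy <= y < oy + size
            if not inside_old:
                region.append((x, y))
    return region


# Hypothetical 8 x 8 window that moves by 1 texel in x and 2 texels in y.
region = l_shaped_region((0, 0), (1, 2), 8)
print(len(region), "of", 8 * 8, "texels need updating")
```

For a window of side s that moves by (dx, dy), the region contains s*s - (s - |dx|)*(s - |dy|) texels, which stays small as long as the motion between frames is coherent.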

5.4 Optimizations

5.4.1 Terrain Compression

Storing the geometric data as images in the form of height maps (or displacement maps) instead of in traditional API storage enables the use of image compression techniques. Moreover, height maps are remarkably coherent in practice, significantly more so than typical color images, and thus offer a huge opportunity for compression. For example, Losasso and Hoppe [18] managed to compress the 40.4 GB U.S. dataset to just 355 MB while maintaining a small error.

5.4.2 RGB Textures

When using only vertex textures, the coloring of the terrain is done automatically by the fragment processor, which interpolates vertex colors. Such a method yields poor results, since sharp changes in color disturb the human eye much more than geometrical changes do. For that reason, RGB textures are used by the fragment processor. RGB textures are usually more detailed than vertex textures (height maps), so a 1K X 1K height map will usually be accompanied by a matching 2K X 2K RGB texture. Both types of textures are placed in texture memory, so together they must not exceed the texture memory limit.
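
A rough budget check can be sketched as follows. The texel formats are assumptions made only for illustration: one 32-bit float per height map texel and three 8-bit components per RGB texel, with no padding or mipmaps. The names `texture_bytes` and `footprint_mib` are hypothetical.

```python
MIB = 1024 * 1024


def texture_bytes(side, channels, bytes_per_channel):
    """Size of a square texture without padding or mipmaps."""
    return side * side * channels * bytes_per_channel


def footprint_mib(height_side, rgb_side, height_channels=1):
    """Combined size of a height map and its RGB texture, in MiB."""
    height = texture_bytes(height_side, height_channels, 4)  # 32-bit floats assumed
    rgb = texture_bytes(rgb_side, 3, 1)  # 8 bits per color component assumed
    return (height + rgb) / MIB


small = footprint_mib(1024, 2048)  # 1K X 1K height map with 2K X 2K RGB texture
large = footprint_mib(2048, 4096)  # 2K X 2K height map with 4K X 4K RGB texture
print(small, large)
```

Under these assumptions the RGB texture dominates the footprint, so it is the RGB resolution that first approaches a given texture memory limit.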

Bear in mind that when using RGB textures with the out-of-core version of our algorithm,

a pyramid of RGB data is also used along with the height map pyramid. The RGB

pyramid is built using a metric that calculates each color component separately. In addition,

whenever the L-regions of the clipmaps (displacement maps) are updated due to movement

of the viewpoint, the corresponding RGB textures are updated accordingly.

5.4.3 Linear Sampling

Taking the elevation data of a vertex in the frustum surface from a single value in the

displacement map may result in popping effects. This happens when the frustum surface

moves slightly forcing the elevation data to be taken from a different texel in the vertex

texture. The problem can be solved by using linear sampling when fetching the elevation data

from the displacement map.

Linear sampling interpolates the elevation data from four texels in the vertex texture that

surround the position of the vertex, instead of just taking the data from the closest single

texel. Such a method ensures smooth changes instead of popping effects when the elevation

data is suddenly taken from a different texel in the vertex texture.

Figure 5.10: Linear sampling of vertex P between texels P1, P2, P3, and P4.

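The interpolated elevation data of P is calculated using the following equation, where x' and y' are the horizontal and vertical offsets of P from P1, measured in texels (between 0 and 1):

$$P = (1-x')(1-y')\,P_1 + x'(1-y')\,P_2 + (1-x')\,y'\,P_3 + x'\,y'\,P_4$$ (Equation 5.4)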

Each vertex has to fetch the elevation data of the four adjacent texels. This yields four

vertex texture fetch operations per vertex, which is time-costly in current hardware.

However, with a single fetch operation four floats of data can be fetched, so if we place the

elevation data of all the neighbors in each texel of the vertex texture, then we can obtain all

the needed elevation data with just a single fetch. Therefore, the first float in each texel

contains the elevation value of the texel itself, the second float contains the value from the

texel to the right, the third float contains the value from the texel below, and the fourth float

contains the value from the bottom-right texel. Now a texel in the vertex texture is actually a

quad of texels, where the top-left corner of the quad is the texel itself, and the rest are

neighboring texels.

To ensure that the vertex fetches the correct quad of texels, we must make sure that the closest texel in the vertex texture is its top-left texel. When fetching the texel, we move the vertex half a texel left and half a texel up to ensure that the correct quad is fetched.

Figure 5.11: Vertex P moves half a texel left and half a texel up to ensure that P1 is the closest texel, thus the correct quad is fetched.
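
The packing of neighbors and the single-fetch interpolation can be sketched on the CPU as follows. The sketch assumes texel coordinates in which texel (c, r) has its center at (c + 0.5, r + 0.5), clamps neighbors at the map border, and uses standard bilinear weights. The names `pack_quads` and `sample_linear` are hypothetical.

```python
import math


def pack_quads(height_map):
    """Store (self, right, below, bottom-right) elevations in every texel."""
    rows, cols = len(height_map), len(height_map[0])
    packed = []
    for r in range(rows):
        r1 = min(r + 1, rows - 1)  # clamp at the bottom border (assumption)
        packed_row = []
        for c in range(cols):
            c1 = min(c + 1, cols - 1)  # clamp at the right border (assumption)
            packed_row.append((height_map[r][c], height_map[r][c1],
                               height_map[r1][c], height_map[r1][c1]))
        packed.append(packed_row)
    return packed


def sample_linear(packed, u, v):
    """Interpolate the elevation at (u, v) from a single packed texel."""
    rows, cols = len(packed), len(packed[0])
    # Move half a texel left and up so that the fetched texel is the top-left one.
    su, sv = u - 0.5, v - 0.5
    c = min(max(math.floor(su), 0), cols - 1)
    r = min(max(math.floor(sv), 0), rows - 1)
    fx = min(max(su - c, 0.0), 1.0)
    fy = min(max(sv - r, 0.0), 1.0)
    p1, p2, p3, p4 = packed[r][c]  # the single "fetch"
    return ((1 - fx) * (1 - fy) * p1 + fx * (1 - fy) * p2
            + (1 - fx) * fy * p3 + fx * fy * p4)


heights = [[0.0, 10.0],
           [20.0, 30.0]]
packed = pack_quads(heights)
print(sample_linear(packed, 1.0, 1.0))  # midway between the four texel centers
```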

Keep in mind that linear sampling implies at least a doubled texture memory footprint, since each texel has to store the data of adjacent texels in addition to its own. Linear sampling for RGB textures is enabled by setting a texture parameter (through an API call) when the texture is uploaded.

5.4.4 Explicit Level-of-Detail Coarsening

Linear sampling helps to solve popping effects when two consecutive elevation data fetches

of a vertex are from neighboring texels in the vertex texture. However, if the used

displacement map is too detailed, two consecutive fetches might result in non-neighboring

texels, and in that case linear sampling may not help. Therefore, we should use a coarser

displacement map in such cases, in order to explicitly lower the level of detail even when the

texture size allows a higher level of detail.

When using the out-of-core version of our algorithm, the clipmaps implicitly solve this

problem because the level of detail of each clipmap is subject to its distance from the

viewpoint. We need to explicitly coarsen the level of detail when the entire terrain is stored

in a single displacement map, or when we want to achieve finer changes in the level of detail

than the clipmaps dictate.

Both vertex textures and RGB textures may use explicit level-of-detail coarsening, since

both pyramids were built using “smart” metrics that use neighborhood data. Therefore,

choosing a texel of a lower level of detail ensures that a larger area was considered when the

value of the texel (both elevation and RGB data) was calculated.

5.4.5 Utilizing Parallelism

The GPU is a stream processor, and not a serial processor like the CPU [3]. A serial processor, also known as a von Neumann architecture, executes instructions sequentially, updating the memory as it goes. In contrast, a stream processor executes a function (for example a vertex shader) on a set of input records (vertices), producing a set of output records (transformed vertices) in parallel. This kind of processor is also known as a SIMD (Single Instruction, Multiple Data) processor.

Not only can the same vertex shader instructions be executed in parallel on several vertices, but the shader can also perform algebraic operations on vectors of four floats and on 4 X 4 matrices in parallel [29].

We can take advantage of this fact by uniting the computations of equations 5.1 and 5.2 into one operation of matrix-vector multiplication. First, we can calculate the $\vec{C} - \vec{A}$ and $\vec{D} - \vec{B}$ vectors once in the CPU, because their values remain the same for all the vertices in a particular frame. We will name these vectors $\vec{U}$ and $\vec{W}$ respectively.

Since we do not deal with elevation data in these calculations, a vector $\vec{V}$ can be represented by its x and y components, $V_x$ and $V_y$ respectively. The following matrix-vector multiplication returns a vector of four floats, where the vector’s first two floats are the components of the vector along the A-C edge, $\vec{V}_{AC}$ (result of equation 5.1), and the remaining floats are the components of the vector along the B-D edge, $\vec{V}_{BD}$ (result of equation 5.2):

$$\begin{pmatrix} V_{AC,x} \\ V_{AC,y} \\ V_{BD,x} \\ V_{BD,y} \end{pmatrix} = \begin{pmatrix} A_x & U_x \\ A_y & U_y \\ B_x & W_x \\ B_y & W_y \end{pmatrix} \begin{pmatrix} 1 \\ \frac{x}{m-1} \end{pmatrix}$$ (Equation 5.5)

The results are used by equation 5.3 to compute the final $F_x$ and $F_y$ coordinates in the frustum surface.
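
One possible arrangement of this combined computation is sketched below. The matrix layout is an assumption: each row holds one component of an edge start point together with the matching component of the precomputed difference vector, and the multiplied vector is (1, s), where s is the fraction along the A-C and B-D edges. The function name `edge_points_combined` and the sample corners are hypothetical.

```python
def edge_points_combined(A, B, C, D, s):
    """Evaluate the A-C and B-D edge points with one matrix-vector product."""
    # Difference vectors: constant for a frame, so they can be computed once.
    u = (C[0] - A[0], C[1] - A[1])
    w = (D[0] - B[0], D[1] - B[1])
    matrix = [
        (A[0], u[0]),  # x component of the point on the A-C edge
        (A[1], u[1]),  # y component of the point on the A-C edge
        (B[0], w[0]),  # x component of the point on the B-D edge
        (B[1], w[1]),  # y component of the point on the B-D edge
    ]
    vec = (1.0, s)
    return tuple(row[0] * vec[0] + row[1] * vec[1] for row in matrix)


# Hypothetical frustum surface corners.
A, B, C, D = (-1.0, 0.0), (1.0, 0.0), (-4.0, 10.0), (4.0, 10.0)
result = edge_points_combined(A, B, C, D, 0.5)
print(result)  # (x, y) on the A-C edge followed by (x, y) on the B-D edge
```

On the GPU the four rows would be evaluated together by the vector hardware, whereas here they are evaluated one after another only to show the arithmetic.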

5.5 Implementation

We have implemented our algorithm in C++ with OpenGL as our API, and Cg as the

programming language for the GPU shaders.

The basic version of our algorithm is implemented along with the RGB textures optimization, where all the geometry is sent using the VBO extension of the API. The algorithm was tested on a terrain with a 1K X 1K height map and a 2K X 2K RGB texture map, and also on a terrain with a 2K X 2K height map and a 4K X 4K RGB texture map.

Our vertex shader utilizes the parallelism of the GPU by using operations on two-float

vectors to map the vertices from the rectilinear grid to the frustum surface. The code of the

vertex shader is listed at the end of this section. The fragment shader just uses the same

texture coordinates as the vertex shader to fetch RGB color data.

We have also implemented a CPU-based version of the algorithm for performance

comparisons. The CPU-based version maps the rectilinear grid to the frustum surface in the

CPU, thus no displacement mapping or any other special operation is performed in the vertex

shader. The mapped vertices are sent to the GPU in a straightforward way using Vertex

Arrays. VBO cannot be used in the CPU-based version, because the vertices change dynamically every frame, so no caching whatsoever is possible in such a case.

To compare the results of our algorithm to a naive scheme, we have also implemented a

VBO-based straightforward version that renders the full terrain each frame.

The code for the vertex shader:

Listing 5.1: Vertex shader code.

5.6 Results

We have tested our implementation on a Pentium IV 3 GHz machine with 1 GB of RAM and an NVIDIA GeForce 6600 GT graphics card (128 MB texture memory). We compare the results of our basic algorithm with the results of the CPU-based version for constant rectilinear grids of various sizes, from 128 X 128 vertices up to 600 X 600 vertices. All grid sizes were tested on two terrains: one with a 1K X 1K height map and a 2K X 2K RGB texture map, and another with a 2K X 2K height map and a 4K X 4K RGB texture map. Even though RGB textures were added to our algorithm and not to the CPU-based and naive versions, our algorithm matches or outperforms them both in all the test cases.

Algorithm                Size of constant    Vertices    FPS,               FPS,
                         rectilinear grid    in grid     1K X 1K terrain    2K X 2K terrain

Our algorithm            128 X 128           16,384      100 (max)          100 (max)
                         256 X 256           65,536      100 (max)          100 (max)
                         400 X 400           160,000     50                 50
                         512 X 512           262,144     34                 34
                         600 X 600           360,000     25                 25

CPU-based version        128 X 128           16,384      100 (max)          100 (max)
                         256 X 256           65,536      46                 46
                         400 X 400           160,000     24                 24
                         512 X 512           262,144     21                 21
                         600 X 600           360,000     18                 18

Naive version with VBO   -                   -           22                 Up to 5.5 (1.0 using VA)

Table 5.1: Results of our algorithm, the CPU-based version, and the naive version for various sizes of the terrain and the rectilinear grid.

Figure 5.12: A rendered terrain using our algorithm with a 128 X 128 rectilinear grid: a) wire frame, b) with RGB texture.

Figure 5.15: Frustum surface defined by the rays shot through the corners of the viewport.

The results show that the size of the terrain does not affect our algorithm: as expected, the frame rates remain the same for both terrain sizes, even though the second terrain is four times bigger than the first. The size of the large terrain exceeds the AGP memory of the machine. In contrast to our algorithm, the naive version reacts drastically to the size of the terrain. It fails to use VBO for the large terrain and falls back to Vertex Arrays performance. Dividing the terrain into patches can solve this problem and enable at least partial use of VBO. Nevertheless, the rendering speed does not exceed 5.5 frames per second.

In our algorithm the latency of vertex texture fetches practically dictates the frame rates. This is less harmful than it sounds, because the fetch latency “hides” all the other work done by the GPU due to its parallel nature. Consequently, adding RGB textures did not change the frame rates. Removing the calculation that maps each vertex from the rectilinear grid to the frustum surface did not change the frame rates either, so we suspect that even more GPU calculations can be “hidden” by the fetch latency. This is the reason why we only partially harness the parallelism of the GPU in our implementation, using operations on two-float vectors instead of matrix operations. We did so for code clarity, but if fetch latency decreases drastically in future technologies, we can use matrix operations that fully utilize the GPU’s parallelism.

The mapping calculations did not change the frame rates on the GPU, whereas in the CPU-based version the same calculations took about as long as rendering the resulting vertices. This gives an idea of how fast and parallel the GPU is compared to the CPU.

For huge terrains we still render the same number of vertices, so the rendering time is not affected at all by the size of the terrain. For the out-of-core version we have to create the height map pyramid at preprocessing, whereas the run-time addition is the clipmap updating. With this addition there is a correlation between the terrain size and the frame rates, since a larger terrain means additional clipmaps to be updated. However, this correlation is only logarithmic, because even when the terrain size grows by a factor of four, the algorithm only needs to update one additional outer clipmap. Outer clipmaps usually need even less updating than clipmaps near the viewpoint, because their L-shaped region is usually so thin that no updates are needed in most frames.
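
The logarithmic relation can be illustrated with a small calculation. The sketch assumes that every clipmap has the same texture size and that each coarser level covers twice the extent of the previous one along each axis. The function name `clipmap_levels` and the 256-texel clipmap size are hypothetical.

```python
def clipmap_levels(terrain_side, clipmap_side):
    """Number of clipmap levels needed to cover a square terrain."""
    levels = 1
    covered = clipmap_side  # extent covered by the finest clipmap
    while covered < terrain_side:
        covered *= 2  # each coarser level doubles the covered extent
        levels += 1
    return levels


for side in (4096, 8192, 16384):
    print(side, "X", side, "->", clipmap_levels(side, 256), "levels")
```

Doubling the terrain side multiplies the number of terrain vertices by four but adds only a single level.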

Our clipmaps are updated exactly as the original geometry clipmaps introduced by

Asirvatham and Hoppe [2], so we can refer to the updating times from their implementation.

They tested worst-case updating times, when the entire clipmap is updated instead of just a

thin L-shaped region as in the average case. The updating time for a 255 X 255 clipmap is

1.6 ms without terrain compression, and 9.6 ms with on-the-fly decompression of the height

map.

On top of the updating time, they also have the rendering time, which logarithmically

grows as the terrain grows, while our rendering time is completely constant. Our algorithm

also allows great adaptivity to the abilities of the machine, because the quality of our

rendered image is directly derived from the size of the rectilinear grid. The size can be

adjusted per machine to give maximal image quality at an interactive frame rate.

It is important to emphasize that our algorithm cannot be outperformed by the GPU-based

geometry clipmaps [2] because the clipmap updating in both algorithms is precisely the

same, while the rendering time of both algorithms is derived from the latency of a single

vertex texture fetch made for each vertex.

The geometry clipmaps algorithm uses zero-area triangles to prevent T-junctions on the borderline between two clipmap levels. Such triangles may result in visual artifacts and should therefore be avoided. In contrast, we use progressive levels of detail, so no zero-area triangles are created by our algorithm.

Chapter 6

Conclusions and Future Work

We have presented two novel approaches for utilizing the growing power of GPUs for level-

of-detail rendering. The first approach enables caching of geometry data on the texture

memory of the GPU, providing full control on the caching management. The second

approach allows view-dependent level-of-detail rendering of terrain models with adaptive performance that takes maximal advantage of the machine’s computational capabilities.

Our texture caching technique reaches reasonable results and enables the rendering of small models at interactive frame rates. However, it falls behind the VBO extension of the APIs, which performs the caching over the AGP memory. The main reason it trails VBO is the latency involved in fetching vertex textures on current graphics hardware. The ability to fetch vertex textures is very new, and therefore this latency is expected to decrease drastically in the future. Nevertheless, until this happens in practice, texture caching remains inefficient.

Our terrain rendering algorithm enables view-dependent rendering of terrains with continuous and progressive levels of detail. It ensures a constant and interactive frame rate regardless of the size or complexity of the terrain by channeling all the computational power of the hardware into rendering only vertices that are inside the view frustum. The main bottleneck of the algorithm is again the latency of the vertex texture fetches used by the GPU for each vertex. Nevertheless, unlike the texture caching technique, the terrain rendering algorithm achieves results that do not fall behind those of similar algorithms, and even outperform them.

We see the scope for future work in improving the quality of the image displayed, by

using an improved technique for mapping each vertex to the frustum surface. In our current

algorithm, we map the vertices evenly throughout the frustum surface. This way, the vertices

become implicitly denser near the camera. By using a function that relies on the ratio

between the near and the far edges of the frustum surface, we can map the vertices in a way

that they become explicitly denser near the camera.

Our current out-of-core scheme often uses a coarse texture (clipmap) where a finer texture can be used, because of the horizontal partitioning of the grid. By selecting the clipmap level of each vertex in the GPU shaders, this loss of quality can be avoided. Moreover, with this method the grid no longer has to be partitioned, which improves efficiency as well. Selecting the clipmap level within the shaders can be done by using mipmapping.

Future improvement in the efficiency of fetching vertex textures will drastically improve the results of both our approaches. This work is a call to the graphics hardware vendors to cut down the latency of fetching vertex textures, and relieve our approaches and many other algorithms of their main bottleneck.

Bibliography

[1] Anonymous. AGP 8X A Closer Look. In Dev Hardware Website, October 2003.

http://www.devhardware.com/c/a/Video-Cards/AGP-8X-A-Closer-Look.

[2] A. Asirvatham and H. Hoppe. Terrain rendering using GPU-based geometry clipmaps. In

GPU Gems 2, edited by M. Pharr and R. Fernando, pages 27-45. Addison-Wesley, March

2005.

[3] I. Buck and T. Purcell. A toolkit for computation on GPUs. In GPU Gems, edited by R.

Fernando, pages 621-636. Addison-Wesley, March 2004.

[4] F. Carucci. Inside Geometry Instancing. In GPU Gems 2, edited by M. Pharr and R.


[5] UNC Chapel Hill. Armadillo model, 1998. http://www.cs.unc.edu/~geom/APS.

[6] P. Cignoni, C. Montani, and R. Scopigno. A comparison of mesh simplification

algorithms. In Computers & Graphics, 22(1):37-54, February 1998.

[7] M. Duchaineau, M. Wolinsky, D. E. Sigeti, M. C. Miller, C. Aldrich, and M. B. Mineev-

Weinstein. ROAMing terrain: real-time optimally adapting meshes. In IEEE Visualization

’97 Proceedings, pages 81-88. ACM/SIGGRAPH Press, October 1997.

[8] J. El-Sana, E. Azanli, and A. Varshney. Skip strips: maintaining triangle strips for view-dependent rendering. In Proceedings IEEE Visualization ’99, pages 131-138. IEEE Computer Society and ACM, October 1999.

[9] J. El-Sana and A. Varshney. Generalized view-dependent simplification. In Computer

Graphics Forum, volume 18, pages C83-C94. Eurographics Association and Blackwell

Publishers Ltd 1999, 1999.

[10] R. Fernando. Shader model 3.0 unleashed. In SIGGRAPH 2004 Presentation, August 2004. http://download.nvidia.com/developer/presentations/2004/SIGGRAPH/Shader_Model_3_Unleashed.pdf.

[11] R. Fernando and M. J. Kilgard. The Cg tutorial: the definitive guide to programmable real-time graphics. Addison-Wesley, February 2003.

[12] P. Gerasimov, R. Fernando, and S. Green. Shader model 3.0: using vertex textures. NVIDIA white paper DA-01373-001_v00, June 2004.

[13] M. Harris and I. Buck. GPU flow-control idioms. In GPU Gems 2, edited by M. Pharr and R. Fernando, pages 547-555. Addison-Wesley, March 2005.

[14] H. Hoppe. Progressive meshes. In Proceedings SIGGRAPH ’96, pages 99-108. ACM SIGGRAPH, ACM Press, August 1996.

[15] E. Kilgariff and R. Fernando. The GeForce 6 series GPU architecture. In GPU Gems 2, edited by M. Pharr and R. Fernando, pages 471-491. Addison-Wesley, March 2005.

[16] R. Lario, R. Pajarola, and F. Tirado. Cached Geometry Manager for view-dependent LOD rendering. In WSCG 2005 Conference Proceedings, pages 9-16. UNION Agency – Science Press, February 2005.

[17] J. Levenberg. Fast view-dependent level-of-detail rendering using cached geometry. In IEEE Visualization 2002, pages 259-266. IEEE Computer Society, November 2002.

[18] F. Losasso and H. Hoppe. Geometry clipmaps: terrain rendering using nested regular grids. In ACM Transactions on Graphics: Proceedings SIGGRAPH 2004, 23(3):769-776, August 2004.

[19] W. R. Mark, R. S. Glanville, K. Akeley, and M. J. Kilgard. Cg: a system for programming graphics hardware in a C-like language. In ACM Transactions on Graphics: Proceedings ACM SIGGRAPH, 22(3):896-907, July 2003.

[20] G. Moore. Cramming more components onto integrated circuits. In Electronics, 38(8), 1965.

[21] NVIDIA website. http://www.nvidia.com.

[22] R. Pajarola. FastMesh: Efficient view-dependent meshing. In Pacific Graphics 2001, pages 22-30. IEEE Computer Society, October 2001.

[23] R. Pajarola, M. Antonijuan, and R. Lario. QuadTIN: quadtree based triangulated irregular networks. In IEEE Visualization 2002, pages 395-402. IEEE Computer Society, November 2002.

[24] E. Perrson. Accelerating real-time graphics with high level shading languages. Master’s thesis, Lulea University of Technology, Computer Science and Electrical Engineering Department, November 2003.

[25] W. J. Schroeder, J. A. Zarge, and W. E. Lorensen. Decimation of triangle meshes. In Computer Graphics: Proceedings SIGGRAPH ’92, 26(2):65-70, July 1992.

[26] N. Sokolovsky. Combining occlusion culling within the framework of view-dependent rendering. Master’s thesis, Ben-Gurion University of the Negev, Department of Computer Science, April 2002.

[27] P. Terdiman. Creating efficient triangle strips. In Coder Corner Website, 2000. http://www.codercorner.com/Strips.htm.

[28] Wikipedia. Displacement mapping. In Wikipedia the Free Encyclopedia Website, August 2005. http://en.wikipedia.org/wiki/Displacement_mapping.

[29] C. Woolley. GPU program optimization. In GPU Gems 2, edited by M. Pharr and R.

[30] S. E. Yoon, B. Salomon, R. Gayle, and D. Manocha. Quick-VDR: interactive view-dependent rendering of massive models. In Proceedings IEEE Visualization 2004, pages 131-138, October 2004.
