
7/31/2019 Preview of Massive Model Rendering

Real-Time Massive Model Rendering

Jean-Daniel Nahmias
Supervised by Anthony Steed

Computer Science Department, University College London
Gower Street, London WC1E 6BT, UK
[email protected]

Masters Thesis, MSc Vision, Imaging & Virtual Environments

September 2002

    Massive Model Rendering http://www0.cs.ucl.ac.uk/research/equator/papers/Documents20...

    of 91 7/5/12 1:27 A


    Abstract

This report discusses various techniques to speed up the rendering of large and complex virtual environments. It explores algorithms such as Hierarchical Spatial Subdivision as well as Occlusion Culling. The report analyses these algorithms and investigates ways of implementing them using recent consumer level Graphics Processing Units.

Disclaimer: This report is submitted as part requirement for the MSc Degree in Vision, Imaging and Virtual Environments at University College London. It is substantially the result of my own work except where explicitly indicated in the text. The report may be freely copied and distributed provided the source is explicitly acknowledged.



    1 Introduction

    1.1 Organisation

    2 Background Work

2.1 Visibility Determination
2.1.1 View Frustum Culling
2.1.2 Spatial Coherence
2.1.3 Temporal Coherence
2.1.4 Occlusion Culling

2.2 Level of Detail
2.2.1 Progressive Meshes [8]

2.3 Impostors
2.3.1 Per-Object Image Warping with Layered Impostors [5]

    Figure 13 Example of remaining gap artefacts

    3 Development Life Cycles

    3.1 Requirements Analysis

    3.1.1 Application Feature Set

    3.1.2 Rendering Feature Set

    4 Design

4.1 Code Base Design
4.1.1 Scene Graph

4.1.2 3DS Max Plugin
4.1.3 Application Code

4.2 Algorithmic Design
4.2.1 1st Pipeline
4.2.2 2nd Pipeline

4.2.3 3rd Pipeline
4.2.4 4th Pipeline
4.2.5 5th Pipeline

    5 Implementation

    5.1 Choice of Language

    5.2 OpenGL Issues & Getting the most out of the GPU

    5.3 Oct-Tree Implementation Details

    5.4 Other details addressed after testing

    6 Testing and Evaluation

    6.1 Testing


6.2 Evaluation
6.2.1 Windows Timing Functions
6.2.2 Evaluating the Rendering Pipeline
6.2.3 Finding Bottlenecks

    7 Results

    7.1 Benchmark System

    7.2 Tuning Pipeline 3-5 by Balancing Oct-Tree

    7.3 Tuning Pipeline 5 with Temporal Coherence

    7.4 Rendering Pipeline Comparisons

    8 Conclusion and Further Work

    References

    Appendix

    User Requirements

    User Manual

    Code


    1 Introduction

There are various fields, including Computer Aided Design and Scientific Visualization, that require the interactive display of large and complex 3D virtual environments. There is a very strong motivation for the performance of these display systems to be high (i.e. real-time). The performance of display systems is usually measured in frames per second. This is the total number of images sent to the display in one second and is a good indicator of the performance level of a virtual reality system. It is not the refresh rate of the display. Various studies have been carried out to find what the lowest acceptable performance should be in order for users not to be impeded. Different studies come to different conclusions; performance levels that are considered reasonable for certain applications are considered insufficient for others. However, most developers of such systems would agree that, in general, 30 fps per eye is considered adequate and 60 fps is considered excellent [45]. When these frame rates are not achieved, the results can vary from a user wearing a Head Mounted Display feeling motion sickness to the user not achieving the desired productivity from the system and experiencing frustration.

Another motivation for these frame rates comes from interaction. The simplest form of interaction would be an environment that the user can navigate (i.e. walkthrough or flythrough). For this interaction to feel continuous, the system lag must be below a certain threshold [10]. This allows the user to interact with the system without having to wait for confirmation of his/her actions. One factor determining lag is frame rate; the others are input sampling rate, output sampling rate, and software overhead. Unfortunately, computer systems cannot display infinitely complex environments infinitely quickly. Therefore a compromise has to be made between visual quality and rendering speed. A multitude of solutions have been utilized to solve this problem. One example of a very naïve solution that has been used time and time again in the domain of computer games is for the renderer to simply draw as much as it can in its allocated time slot, starting from the objects closest to the user, and to ignore the rest. Developers using this technique try to mask it with an effect called fog. This gives the user a very short viewing distance and is very noticeable.

If a user were to try and navigate a vast virtual environment, it would be very difficult to make any kind of navigational decision when he/she could not see the target. This shortfall is simply unacceptable today, considering the current state of technology in this field.

The last few years have seen an incredible development of high performance consumer level Graphics Processing Units. The development of these GPUs has exceeded Moore's Law. These GPUs, as they are currently referred to (also known as graphics cards),


are now capable of theoretically drawing millions of triangles per second. However, in reality other factors come into play and prevent one from achieving the GPU's theoretical limit. Nevertheless, they are still very capable ASICs that have turned what would have seemed impossible just a few years ago into reality today. Even though these cards are continuously being updated and improved, there is still the need to optimize rendering algorithms to achieve even greater performance. Users still wish to visualize 3D environments that are more complex and have greater levels of detail. There is still a lot of room to improve rendering performance and, by doing so, enable more detailed and more complex 3D environments to be displayed in real-time.


1.1 Organisation

This report will examine ways of improving the rendering performance of general 3D environments. It will do so by looking at techniques for high level as well as low level optimization. The former was achieved with the use of carefully chosen algorithms, while the latter was obtained by careful use of hardware as well as implementation details. Section two will discuss high level optimization; this includes Visibility Determination, Level of Detail, and Impostors. Optimizing is an iterative procedure and therefore the development cycle of the software produced for this report was also iterative. Section three will describe the various cycles of development. The design decisions that were made along the way are discussed in section four. The design section is partitioned in two: firstly the Code Base Design is described and secondly the Algorithmic Design is explained. Section five discusses some of the various implementation details. This section is followed by an overview of the Testing procedures as well as the techniques that were used to evaluate performance. Section seven presents the results of the different rendering algorithms as well as some of the parameter tuning. The report then concludes with suggestions for further improvement in section eight.


    2 Background Work

This section of the report will focus on some of the current state-of-the-art algorithms used to optimize visibility determination, level of detail and impostors.

Visibility Determination: The fastest polygons are the ones that are not displayed. These are the polygons that are not viewed by the user: those that lie outside the viewing frustum or are occluded by other objects. Extensive research has been carried out to find algorithms that can quickly discard these polygons and therefore speed up rendering, since fewer triangles or polygons need to be drawn.

Level of Detail: Another tool that can be used to improve rendering performance is Level of Detail. Level of Detail refers to simplifying complex 3D geometric objects. There are various algorithms that can achieve this. However, once an object has been simplified one loses visual quality. This is why these algorithms are generally applied to objects positioned far away from the user. When objects are at a great distance from the viewer, the difference between the original object and its simplified counterpart is greatly diminished: when these objects are actually displayed they occupy only a small number of pixels, so the lower detail version of the object is indistinguishable from its high detail counterpart.
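The distance-based selection implied above can be sketched in a few lines. The thresholds and the idea of an indexed set of pre-simplified meshes are illustrative assumptions, not taken from the report:

```python
# Hypothetical sketch: pick a discrete level of detail from the viewer distance.
# Threshold values are illustrative.

def select_lod(distance, thresholds=(10.0, 50.0, 200.0)):
    """Return an LOD index: 0 = full detail, larger = coarser mesh."""
    for level, limit in enumerate(thresholds):
        if distance < limit:
            return level
    return len(thresholds)  # beyond the last threshold: coarsest mesh

print(select_lod(5.0))    # nearby object -> full detail (0)
print(select_lod(500.0))  # distant object -> coarsest mesh (3)
```

In a renderer the returned index would select one of several pre-simplified versions of the same object each frame.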

Impostors: Impostors are used to fake geometric detail. There are various forms of impostors. One common impostor is a billboard: a 2D representation of a 3D object that is placed into the scene as a textured quad. This quad is continuously rotated as the viewer navigates so that it always faces the viewer. This type of impostor falls into a general category referred to as textured impostors. There are also other forms, such as point based impostors. Triangles located far from the viewer may have a projection onto the screen smaller than a pixel. If this is the case it is redundant to represent the geometric surface with a triangle; one can simply use a point based representation of the surface. This may not only speed up rendering performance but may also improve visual quality, due to the aliasing of triangles.
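The billboard rotation mentioned above can be sketched as a yaw computation. The coordinate convention (y up, quad initially facing +z) and the function name are assumptions for illustration only:

```python
import math

def billboard_yaw(object_pos, viewer_pos):
    """Yaw angle (radians) rotating a quad about the vertical axis so its
    normal points back at the viewer (a cylindrical billboard)."""
    dx = viewer_pos[0] - object_pos[0]
    dz = viewer_pos[2] - object_pos[2]
    return math.atan2(dx, dz)

# Viewer directly along +z from the object: no rotation needed.
print(round(billboard_yaw((0, 0, 0), (0, 0, 10)), 6))  # 0.0
```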

    2.1 Visibility Determination

    Visibility determination has been an active research topic that has produced many


algorithms over the past decades [30]. It is the process of determining which objects are visible and which are not. The 3D virtual environment objects presented in this report are represented with the use of triangles; therefore this report will focus on visibility algorithms that determine which sets of triangles are visible at any given point in time. They are visible because they are inside the camera's View Frustum and are not occluded by other objects.

One very simple solution to the visibility problem is hidden surface removal with the Z-buffer [23]. The Z-buffer is a buffer that stores the depth values of the currently projected triangles. Before a triangle is rasterized it is projected onto the Z-buffer, and if any of its projected pixels have a greater depth value than the current values, those pixels are not overwritten and subsequently not drawn. The Z-buffer offers a simple way of determining which pixels of which triangles should be drawn, at the cost of a small amount of memory. However, this operation is performed almost at the very end of the rendering pipeline. This means that should a triangle not be visible, it would still have to be sent down most of the rendering pipeline, wasting resources and impacting performance. This is why the Z-buffer is often used in conjunction with various other algorithms. Computing the exact visibility can be very expensive, more so than sending the extra triangles through the rendering pipeline; therefore most algorithms compute the potentially visible set of triangles (PVS). These usually fall into three categories.
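The per-pixel depth test described above can be sketched in a few lines; the buffer size and pixel values are illustrative:

```python
# Minimal sketch of the Z-buffer test: a pixel is written only if its depth
# is nearer than the value already stored for that pixel.

W, H = 4, 4
FAR = float("inf")
zbuf = [[FAR] * W for _ in range(H)]     # depth buffer, initialised to "far"
color = [[None] * W for _ in range(H)]   # colour buffer

def write_pixel(x, y, depth, c):
    if depth < zbuf[y][x]:   # nearer than what is already there?
        zbuf[y][x] = depth
        color[y][x] = c
        return True
    return False             # occluded: the pixel is discarded

write_pixel(1, 1, 5.0, "far triangle")
write_pixel(1, 1, 2.0, "near triangle")    # nearer: overwrites
write_pixel(1, 1, 9.0, "hidden triangle")  # farther: rejected
print(color[1][1])  # near triangle
```

Note how the "hidden triangle" was still processed all the way to the depth test before being rejected, which is exactly the wasted work the report's later culling algorithms aim to avoid.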


Exact Visibility: This is the most precise form of visibility computation; exact visibility means that all visible triangles are sent down the pipeline, clipped if necessary, and that all non-visible triangles are culled. This type of visibility algorithm is usually very slow, and requires a vast amount of pre-processing if used for real-time applications.

Conservative Visibility: These algorithms determine a superset of the visible triangles. They remove most but not all of the non-visible triangles. The Z-buffer can then be used to remove the remaining triangles. These algorithms are not as precise as exact visibility algorithms but usually perform better in terms of speed.

Approximate Visibility: Approximate visibility algorithms are neither exact nor conservative and can therefore, in certain situations, actually remove triangles that would in fact be visible. If these algorithms are well designed, the errors that they produce are barely noticeable. The performance gain of these algorithms can sometimes more than justify their errors.

Classification of Visibility Culling Techniques

Cohen-Or, Chrysanthou, and Silva [30] use a set of features to classify various visibility algorithms; these include:

    Conservative vs. Approximate:

Most visibility algorithms are conservative and are used in combination to produce exact visibility. Techniques such as the Z-buffer are often referred to as hidden surface removal algorithms and are used as a back end for other conservative or approximate culling algorithms.

    Point vs. Region:

Certain techniques calculate the visibility for a certain view point while others perform visibility calculations for certain regions of space. Point based algorithms tend to be more flexible than region based algorithms at the cost of performance or pre-processing.

    Pre-computed vs. Online:

Online algorithms actually perform the visibility calculations during each rendering loop (they may, however, use certain pre-computed data structures), while pre-computed algorithms compute all the visibility offline. One example of pre-computed visibility is the algorithm used by the Quake engine [46]. These


algorithms usually require a great amount of pre-processing and usually impose great constraints on the user's navigation and the 3D environment.

    Image space vs. Object space:

This refers to where the visibility calculations are performed. Certain algorithms perform their calculations on the 2D projections of a 3D environment while others perform their calculations on the 3D objects before they are projected into 2D.

    Software vs. Hardware:

Certain techniques are implemented on specialized hardware, giving them a considerable speed increase. Usually a less efficient algorithm implemented in hardware will outperform a more efficient algorithm implemented on a general purpose processing unit.

    Dynamic vs. Static scenes:

Static scenes are environments that can be navigated by the user but have the restriction that their objects cannot move in relation to each other. This constraint allows spatial coherence to be taken advantage of with a certain amount of pre-processing. Dynamic scenes have no such restrictions.

    Individual vs. Fused occluders:

Both these terms are used when discussing occlusion culling. Occlusion culling is the process of determining which objects (occludees) are obscured by which other objects (occluders). However, certain objects that are not occluders individually can become occluders when combined. Occluder fusion is a technique that takes advantage of this fact.

    2.1.1 View Frustum Culling

Visibility algorithms are usually designed around one fundamental principle: remove as much as possible as soon as possible. This reduces the amount of unnecessary processing of triangles along the rendering pipeline. One good method of achieving this is with the use of View Frustum Culling. The View Frustum is defined by the virtual camera. It represents a volume or region of space that the camera can see. It is therefore a region based visibility algorithm.


    Figure 1 View Frustum

Sutherland and Hodgman [23] developed a 2D algorithm that is very efficient at clipping polygons against clipping boundaries. This algorithm can trivially be extended to 3D; for more detailed descriptions please refer to [23]. It is still unfeasible to clip every triangle against the view frustum, no matter how efficient the algorithm might be, with scenes containing millions of triangles. However, this limitation can be easily overcome using spatial coherence. The following section will describe various hierarchical data structures enabling one to exploit spatial coherence to speed up view frustum culling. All of these techniques are based more or less on the same principles. They work by testing regions or volumes: if a volume is discarded by the clipping algorithm then all objects contained by that volume can also be discarded. If the volume being tested is fully contained by the view frustum then all of its contents can be included; however, if the volume intersects the view frustum then further testing is required to determine exactly which objects or triangles are to be clipped. This last case is the worst case. Since most of the scene is usually either clipped or visible, the penalty for this worst case is greatly outweighed by its benefits. Spatial coherence usually reduces the number of clipping tests. Most of the data structures presented in the following section were developed to speed up ray-tracing but have been adapted for View Frustum Culling. One can find good reviews of these data structures in the ray-tracing research literature, such as [41].
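The three-way outcome described above (volume discarded, volume fully included, volume needing further tests) can be sketched with bounding spheres against inward-facing frustum planes. The plane representation (normal plus offset, with n·p + d >= 0 meaning "inside") and the single-plane example are illustrative assumptions:

```python
# Sketch of volume-vs-frustum classification for one bounding sphere.
# planes: list of (nx, ny, nz, d) with inward-facing normals.

def classify_sphere(center, radius, planes):
    """Return 'outside', 'inside', or 'intersect'."""
    result = "inside"
    for (nx, ny, nz, d) in planes:
        dist = nx * center[0] + ny * center[1] + nz * center[2] + d
        if dist < -radius:
            return "outside"      # whole volume culled: children skipped too
        if dist < radius:
            result = "intersect"  # the worst case: children need more tests
    return result

# A single plane x >= 0 standing in for a full six-plane frustum:
planes = [(1.0, 0.0, 0.0, 0.0)]
print(classify_sphere((5, 0, 0), 1.0, planes))    # inside
print(classify_sphere((-5, 0, 0), 1.0, planes))   # outside
print(classify_sphere((0.5, 0, 0), 1.0, planes))  # intersect
```

An 'outside' result discards the volume's entire contents at once, which is where the hierarchical data structures below get their speedup.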

    2.1.2 Spatial Coherence


The following data structures fall into one of two categories: the top-down and the bottom-up approach. The former is based on selecting a region of space and recursively subdividing it, while the latter works by starting at the bottom of the scene hierarchy and recursively grouping objects. This section will present some of the techniques and data structures used to exploit spatial coherence, starting with the simplest.

    Bounding Volumes

These are the smallest geometric volumes encapsulating an object, usually spheres or boxes. Because these are simple geometric shapes they are very easy to test against the View Frustum; one can therefore quickly discard objects that are outside the View Frustum without the need to test each triangle of the contained object. Spheres have the advantage of a quick intersection test, while boxes have the advantage of fitting geometric objects better (i.e. a tighter fit), thereby reducing the number of redundant intersection tests. Boxes come in two main varieties: the Axis Aligned Bounding Box (AABB) and the Object Oriented Bounding Box (OOBB) [22]. Again there is a trade-off between the cost of intersection and a tighter fit. Generally OOBBs are able to fit objects more tightly than AABBs. An OOBB can be calculated using Principal Component Analysis (PCA), a statistical tool that involves building a covariance matrix and then finding its eigensolution. The eigenvectors can then be used to find the orientation of the OOBB. The size of the box can then be inferred by finding the largest distance from the centre of the object to any vertex along each of the eigenvectors; for more information on PCA see [18].
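The PCA step described above can be sketched in pure Python: build the covariance matrix of the vertices, then extract its dominant eigenvector, which gives the longest axis of the OOBB. Power iteration is used here purely as a simple, dependency-free way to get that eigenvector; it is an illustrative choice, not the report's method:

```python
# Hedged sketch: dominant principal axis of a point cloud via covariance
# matrix + power iteration.

def dominant_axis(points, iters=100):
    n = len(points)
    mean = [sum(p[i] for p in points) / n for i in range(3)]
    # 3x3 covariance matrix of the centred points
    cov = [[sum((p[i] - mean[i]) * (p[j] - mean[j]) for p in points) / n
            for j in range(3)] for i in range(3)]
    v = [1.0, 1.0, 1.0]
    for _ in range(iters):  # power iteration converges to the top eigenvector
        w = [sum(cov[i][j] * v[j] for j in range(3)) for i in range(3)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v

# Points stretched along x: the dominant axis should be (+/-1, 0, 0).
pts = [(-4, 0, 0), (4, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 0.5), (0, 0, -0.5)]
axis = dominant_axis(pts)
print([round(abs(a), 3) for a in axis])  # [1.0, 0.0, 0.0]
```

A full OOBB construction would repeat this for all three eigenvectors and then measure the object's extent along each, as the text describes.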

    Figure 2 Bounding Boxes

    Hierarchical Bounding Volumes


This again is an extension of bounding volumes. This is a bottom-up approach where groups of objects are formed recursively. Objects close together can be grouped and encapsulated by another bounding volume, and so on. This can yield advantages in the sense that whole groups of objects can quickly be discarded. However, this comes at a cost: if a group intersects one of the view frustum's clipping planes, more intersection tests are required. The deeper the hierarchy, the greater the cost of the worst case. Balancing these data structures can be difficult [24].

    Figure 3 Bounding Sphere Hierarchy

    BSP Trees

Binary Space Partitioning Trees were introduced by Fuchs et al. [47]. The BSP tree for a scene is constructed recursively; this is done by selecting a partitioning plane against which the scene's triangles are tested. This splits the triangles into two groups: those that are outside the plane and those that are inside. If a polygon intersects the plane it is split into two and each part is allocated to its correct section of the tree. This process is then repeated for each subsection recursively. Usually the partitioning planes contain a polygon from the scene. The end result is a tree where each node represents a partitioning plane and the leaves represent polygons. This data structure can be traversed in front-to-back order: starting at the root node, the tree is traversed recursively, visiting first the polygons on the same side of the node's plane as the centre of projection, then those on the far side. This is based on the idea that objects on the same side of the plane as the centre of projection, or view point, cannot be obscured by polygons on the far side. BSP trees can yield significant speedups for static geometry, such as architectural walkthroughs [13].
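The front-to-back traversal described above can be sketched with a toy tree whose planes are axis aligned for simplicity (the node layout and polygon names are illustrative assumptions):

```python
# Sketch of front-to-back BSP traversal: always descend into the viewer's
# side of the partitioning plane first.

class Node:
    """Interior node of a toy BSP tree split by the plane x = split."""
    def __init__(self, split, front, back):
        self.split, self.front, self.back = split, front, back

def front_to_back(node, eye_x, out):
    if isinstance(node, list):          # leaf: a list of polygon names
        out.extend(node)
        return
    if eye_x >= node.split:             # viewer is on the front side
        near, far = node.front, node.back
    else:
        near, far = node.back, node.front
    front_to_back(near, eye_x, out)     # viewer's side first
    front_to_back(far, eye_x, out)

tree = Node(0.0, front=["right wall"], back=["left wall"])
order = []
front_to_back(tree, 5.0, order)
print(order)  # ['right wall', 'left wall']
```

Reversing the near/far order at each node gives back-to-front order, the classic painter's-algorithm use of BSP trees.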


    Oct-Trees

The Oct-tree is the 3D equivalent of a Quad-tree [24]. It is constructed by taking an axis aligned cubic volume that encompasses the whole virtual environment. This cube is recursively subdivided into eight identically sized axis aligned cubes. If any of the child cubes satisfy certain conditions they are in turn subdivided, and so on. These conditions are met if a cube contains a certain number of polygons, the cube itself has not reached a minimum size threshold, and the tree has not reached a maximum depth. If any polygon intersects an axis aligned cube's boundary it can either be split or allocated to both children. Once the tree is constructed, view frustum culling is performed recursively starting with the root node. If a node intersects the view frustum then each of its children is tested.

    Figure 5 Oct-Tree

    Kd-Trees

Kd-trees, Oct-trees and BSP-trees can be seen as generalizations or specializations of each other. A Kd-tree can be viewed as a BSP-tree where each partitioning plane is axis aligned. It is also recursively constructed, by selecting each axis in turn and then selecting a partitioning plane along that axis to subdivide the space. In practice Kd-trees look very similar to Oct-trees but with much better optimized bounding volumes.


    2.1.3 Temporal Coherence

As a user navigates a virtual environment in real-time, the difference between one frame and the next is usually marginal. One can exploit this temporal coherence to achieve culling optimizations. This is done by assuming that if an object is visible in one frame it is most likely to be visible in the next. Therefore each object that fails to be culled by the view frustum can be assumed to be visible for the next n frames, where n is a threshold. However, we cannot make the same assumption about objects that did get culled; therefore all objects that were culled in the previous frame must also be tested in the current frame. The worst possible case for this assumption is if the user rotates 180 degrees: all of the previously visible objects are no longer visible but would still be sent down the rendering pipeline. One could, however, vary the n threshold depending on the user's movement, cancelling the previously made assumption if the user moves beyond a certain threshold.
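The n-frame assumption above can be sketched as a small cache. The class name, the frame-counter interface, and the value of n are illustrative assumptions:

```python
# Sketch of temporal coherence: an object that passed the cull test is assumed
# visible for the next n frames and skips re-testing until then.

class VisibilityCache:
    def __init__(self, n=4):
        self.n = n
        self.visible_until = {}  # object id -> last frame it is assumed visible

    def needs_test(self, obj, frame):
        return self.visible_until.get(obj, -1) < frame

    def mark_visible(self, obj, frame):
        self.visible_until[obj] = frame + self.n

cache = VisibilityCache(n=4)
print(cache.needs_test("teapot", 0))  # True: never tested yet
cache.mark_visible("teapot", 0)
print(cache.needs_test("teapot", 3))  # False: still within the n-frame window
print(cache.needs_test("teapot", 5))  # True: the assumption has expired
```

Note that only objects that *passed* the test are cached; culled objects get no entry and are re-tested every frame, exactly as the text requires.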

    2.1.4 Occlusion Culling

In the previous section we examined View Frustum Culling and how to improve it by taking advantage of spatial and temporal coherence. This section will examine another area of visibility determination, based on occlusion culling, through one of the more recent occlusion culling algorithms, developed by Zhang et al. [4] and known as HOM (Hierarchical Occlusion Maps). This algorithm makes no assumptions about the scene and works very well for densely occluded scenes. The algorithm can be broken down into two parts:

Construction of the HOMs: this part of the algorithm involves View Frustum Culling of the bounding volume hierarchy of the occluder database. This is followed by occluder selection. The selected occluders are then rendered into an image buffer; this forms the highest resolution of the occlusion map. From the image buffer a depth estimation map is constructed. The occluders are rendered without any lighting or texturing to speed up the process. The buffer is then recursively filtered to produce an image hierarchy. This process can be hardware accelerated with the use of mipmap filtering.
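The recursive filtering step above can be sketched as repeated 2x2 averaging, the same operation mipmap hardware accelerates. The map values (opacity in [0, 1]) and resolution are illustrative:

```python
# Sketch of building one coarser occlusion-map level by averaging 2x2 blocks.

def downsample(level):
    h, w = len(level), len(level[0])
    return [[(level[y][x] + level[y][x + 1] +
              level[y + 1][x] + level[y + 1][x + 1]) / 4.0
             for x in range(0, w, 2)]
            for y in range(0, h, 2)]

finest = [
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 1.0, 1.0],
]
coarser = downsample(finest)
print(coarser)  # [[1.0, 0.0], [0.25, 1.0]]
```

A fully opaque coarse texel (value 1.0) means the whole 2x2 region below it is covered, which is what lets the overlap test terminate early at coarse levels.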


    Figure 6 Rendering Pipeline of HOM algorithm

Visibility Culling with the use of the HOMs: Once the HOMs are constructed, the algorithm starts View Frustum Culling the scene database. It then takes the resulting objects and projects their bounding volumes. The algorithm then performs an overlap test between the projected bounding box of a potential occludee and the hierarchical occlusion maps. If a projection falls completely within an opaque region of the HOMs, a depth comparison is then performed to determine whether the potential occludee is actually occluded.

    Figure 7 Example of Hierarchical Occlusion Map

For a more detailed explanation of this algorithm please refer to [4]. This algorithm has some interesting properties. It is an image space algorithm that implicitly performs occluder fusion with the HOMs. The overlap test is optimized by the hierarchy and it also supports conservative early termination. Again, many of the previously mentioned


techniques to speed up View Frustum Culling can be applied here. Temporal coherence can also be applied for the occluder selection. HOM was developed as an alternative to Hierarchical Z-buffers; for a detailed comparison between the two algorithms please refer to [4], and for an overview of Hierarchical Z-buffers please refer to [12].

    Figure 8 Occluder Fusion

    2.2 Level of Detail

When rendering incredibly large datasets, even the best visibility determination algorithms may not suffice. In these situations one must look for alternatives. One such alternative is Level of Detail algorithms. These algorithms' purpose is to simplify 3D meshes (Oliveira et al. [44]). Level of Detail algorithms work by taking the highest detail mesh and reducing the number of polygons used to represent that mesh in such a way as to preserve the topology of the mesh. The topology of the mesh defines its general appearance. Good algorithms achieve this by using a combination of error metrics. If the topology of the mesh is preserved then the lower detail mesh will appear almost indistinguishable from its higher detail equivalent when viewed from a distance. Like visibility algorithms, Level of Detail algorithms fall into various categories. Erikson et al. [35] classified these algorithms into three basic categories.

Geometry removal

This refers to simplifying a mesh by removing vertices or collapsing edges from its representation.

    Sampling


    Here a lower level of detail mesh is reconstructed from the sampled data of thehigher detail mesh.

    Adaptive subdivision

    The different level of detail meshes are constructed by starting with a rough modelthat is recursively refined by adding detail locally where needed.

When examining various Level of Detail algorithms one can deduce that there are also other categories that these algorithms can fall into; these include:

Continuous vs. Discrete

The most noticeable artefacts created by level of detail algorithms are usually exhibited when swapping the representation of an object from one level of detail to another; this can create popping or gaps. One approach to minimizing this effect involves blending from one level of detail to another; continuous algorithms such as progressive meshes [8] avoid this problem. When we refer to continuous Level of Detail with regard to a polygonal representation, we generally mean that any level of detail can be obtained by sampling at any point in the space domain of the algorithm. Therefore an infinite number of meshes can be created between any two distinct levels of detail. In practice, however, we are limited to displays that are discrete approximations.

View Dependent vs. View Independent

Certain level of detail algorithms produce different results based on the viewing position while others don't. View dependent algorithms can produce more pleasing results at the cost of extra resources.

Pre-processed vs. Online

Some algorithms, such as progressive meshes [8], pre-calculate all the possible mesh simplifications while others perform their calculations online.

Screen-Error Bounded vs. Screen-Error Unbounded

Screen errors are the projected errors: the difference between the projected simplified mesh and the projected high detail mesh. This error metric is much more accurate with respect to the perceived result. It also has the advantage of allowing users to select their own preference with respect to the visual quality vs. performance trade-off. It can also be used to reduce temporal aliasing artefacts.

Global vs. Local

Global algorithms apply a change globally across the whole mesh while local algorithms apply changes to local features.


    Figure 10 Mesh at different levels of detail


    2.3 Impostors

Impostor algorithms exploit image data from recently rendered frames. This data is reused in subsequent frames to avoid the expensive re-rendering of the complete 3D scene. This effectively fakes geometric surfaces with the use of an approximate representation. One of the most common representations used to achieve the desired result is an opaque image textured onto a transparent polygon. These types of impostors are known as textured impostors. There are also other types, such as point-based impostors [7] or line-based impostors. There are many different impostor algorithms and, like most algorithms examined in this report, they can be classified into various categories.

    View dependent vs. View independent

View-independent impostors can be used from any given view point, while view-dependent impostors need to be updated for different view points.

    Dynamically generated vs. Pre-calculated

Dynamically generated impostors are calculated on the fly (as they are needed), while pre-calculated impostors, as their name implies, can be computed offline.

    2.3.1 Per-Object Image Warping with Layered Impostors [5]

Single-layer view-dependent impostors have the disadvantage of not containing any depth information. This makes them very vulnerable to errors. They therefore cannot be used for points of view that vary to any significant extent from where the image was taken. To avoid serious visual degradation one must either dynamically update them very often, or pre-calculate a great number of impostors from various view points and swap between them frequently. Another alternative is to use multi-layered impostors. As opposed to a single-layer impostor, which consists of a single transparent polygon onto which an image is mapped, multi-layered impostors consist of many of these transparent polygons. Every layer represents a slice of the object at a varying distance from the viewer. Using a greater number of layers produces a better approximation.


    Figure 11 Example of Multi-Layered Impostors

As opposed to single-layered impostors, multi-layered impostors are stored using RGBz, where z is the depth information. Most of today's graphics hardware supports RGBA, where A is the alpha component; this alpha component can be used to store the depth information. When these impostors are rendered, multiple semi-transparent polygons are rendered as a pyramidal stack with the view point at the apex. For each layer, only the areas where the stored z component closely matches the distance from the view point to the layer are drawn. To avoid cracks, slightly overlapping depth intervals are drawn onto every layer. See figure 12.

    Figure 12 Example of Overlap in Multi-Layered Impostors

The advantage that multi-layered impostors have over single-layered ones is accuracy. Multi-layered impostors have a much longer life span, because the view point can move greater distances without producing as much error as with single-layered impostors. They do, however, have the drawback of taking longer to generate. Another disadvantage of this algorithm is that it produces aliasing artefacts in the form of gaps for occluding objects. Please refer to [5] for a more detailed description of this algorithm.


    Figure 13 Example of remaining gap artefacts


    3 Development Life Cycles

    Figure 14

This project focuses on optimizing real-time rendering, using various techniques to achieve this goal. However, optimization is an iterative process. One must optimize the appropriate areas of the rendering pipeline in order to achieve the maximum possible benefit. Please refer to section 4.2 to see the different rendering pipelines that were created during the development cycles. The optimization was initially performed at a high level and then progressively refined at lower levels. It would be very costly in terms of time to focus on optimizing a certain algorithm's implementation only to come to the conclusion that another algorithm altogether would achieve far superior results. Due to the nature of this problem, the development of this software went through several cycles, or iterations. Each cycle can be broken down into the same steps:

    Requirements Analysis

    Design

    Implementation

    Testing

    Evaluation

The following sections of this report will focus on each of these steps and their various iterations. Each step produces information that is passed on to the following step. Once the last step was reached, one iteration was completed and the results were then used to start the next iteration, until satisfactory results were achieved. Note that the first iterations had a tendency to focus more on design (these were usually high-level optimizations), while the last iterations, which were low-level in nature, focused more on implementation. This type of life cycle is often referred to by software engineers as the prototyping approach to developing software. This is often used when the initial requirements


are not clearly defined. In this case the high-level requirement is simple: to render large and complex 3D environments in real time. However, the lower-level requirements are not initially defined. For research-oriented projects prototyping approaches are often employed, and this suited the needs of this project perfectly.

    3.1 Requirements Analysis

Before embarking on any form of design, a feature set for the application had to be established. One subset of the application feature set is the rendering feature set. Because the goal of this project was to explore how to improve rendering performance for geometrically complex environments, the rendering feature set was kept to an adequate minimum.

    3.1.1 Application Feature Set

    Rendering Large and Complex 3D Environments

    Navigation of these Environments

    Benchmarking Capabilities

    3.1.2 Rendering Feature Set

    Simple Lighting (Point & Directional Lights)

    Material Properties

    Texturing

    Geometric Objects

This was the initial rendering feature set; as development progressed, it was expanded to encompass all the acceleration techniques.


    4 Design

This section of the report will not only look at the design of the software but will also study some of the design decisions affecting implementation. Due to the nature of this work, not all of the design could be completed initially and some of it changed during development. Good design at the beginning of a project can save a lot of unnecessary work later on. The design can be broken down into two parts: Algorithmic Design and Code Base Design. Most of the code base design could be done at the outset, while the algorithmic design went through a few iterations and changed along the way. Certain considerations that could potentially influence the design had to be taken into account during this phase of development, and these had a strong impact on the choices made during this project. One of these was the target platform. The target platform for this project was consumer-level hardware, not a high-end special-purpose SGI Onyx. To be more precise, the target platform was a PC with a good consumer-level OpenGL-capable 3D GPU (GeForce3 or higher), a Pentium IV or equivalent CPU, and 512MB of RAM. It would be unrealistic to expect to achieve anything near real-time performance when rendering models containing 1 million+ polygons without the use of hardware acceleration. For this project this hardware acceleration came in the form of a GeForce3. However, we are still waiting for a fully programmable GPU and have to rely on the features provided by the current crop of GPUs. This imposes certain limitations in terms of which algorithms can be hardware accelerated. With this in mind, certain algorithms were chosen over others on the merit that they could be hardware accelerated or could leverage the hardware acceleration. Another underlying thought that influenced certain design decisions was portability. This software is not portable to other platforms; however, it was designed and implemented with care in order to potentially ease the transition to another platform.

    4.1 Code Base Design

It was important to develop a strong code base that could be built upon and easily extended. The code base was also designed with simplicity in mind. This code base included the data structures of the application, certain basic tools, as well as the main application functionality. An object-oriented approach was used to design the code base. The core of the code base was the scene graph. This is the set of objects that represent the


3D environment. Scene graphs are common to all 3D rendering applications and usually share a lot in common. Instead of using one of the more commonly available open-source scene graphs, such as Open Scene Graph, a simple one was developed specifically for this project. This added the benefits of flexibility as well as simplicity. Benchmarking was also a fundamental part of the code base. To improve the software, appropriate benchmarks had to be performed. The benchmarks were performed by recording camera paths and then playing them back while recording the time spent on each frame; see the Testing and Evaluation section (Section 6). Another important tool that was developed as part of the code base was a 3D Studio Max plugin. There are a lot of commonly used 3D file formats (VRML, DXF, 3DS, OBJ, etc.); supporting most of them directly would be a great undertaking. 3D Studio Max is a very successful 3D authoring package that supports most 3D file formats. Therefore, by developing a 3DS Max export plugin that could export any scene, one would implicitly support all the formats supported by this package.


    4.1.1 Scene Graph

The scene graph was kept to a minimum and extended as needed. It supports lights, cameras, geometric objects, and materials. The following describes all the classes used to represent the scene graph through their iterations of design.

    Figure 15 Scene Graph Hierarchies

Camera
Vector3f View Reference Point
Vector3f View Up Vector
Vector3f View Plane Normal
VF View Frustum

This class represents the view of the user. It also supports writing its state to a file; this was used to record camera paths to files that were later used for benchmarking. User navigation is supported by this class's methods. This class also contains the View Frustum class.

View Frustum
Plane TOP, BOTTOM, FRONT, BACK, LEFT, RIGHT

This class represents the view frustum used for certain visibility calculations. It contains six planes to define the viewing volume.

    GeoObject


Vector3f* Vertices
Vector3f* Vertex Normals
Vector3f* Bounding Box Vertices
Vector2f* UVs
Integer* Triangle Indices
Material* Material
Matrix4f Current Transformation Matrix

This class is used to represent the geometric objects occupying the virtual world. The geometric information is represented by indexed triangles. There are many other data structures that can be used for this purpose, such as the winged-edge data structure [23]. For the purpose of this report indexed triangles sufficed. The object's location can be determined by the use of the current transformation matrix [23]. This data structure also contains bounding boxes; see section 4.2.2.

Material
Colour Specular
Colour Diffuse
Colour Ambient
Float Transparency
Float Shininess
Char* Map Path
Integer Id
Char* Material Name

This class is used to describe the material properties of each object. It contains all the necessary data to implement Phong shading [23]. The map path is used if the material has a texture associated with it.

Light
Vector3f Light Position
Vector3f Light Direction (used if the light is directional)
Colour Light Colour


Integer Type (i.e. Directional or Point)
Float Intensity

One of the rendering features was simple lighting; this class provides support for it.

Oct-Tree Node
Vector3f* Boundary
Octree Node Children Nodes (pointer to eight child nodes)
Integer Intersecting Object Index
Integer** Intersecting Triangle Index
Integer Depth of tree

This class was used to represent the Oct-Tree. The Boundary defines the cube for that node. Each node contains a pointer to its children, should it have any. Nodes also contain pointers to all intersecting geometric objects, and for each of these geometric objects the node contains an index to all of its intersecting faces. The depth of each node is also stored; see section 2.1.2.

Scene Graph
Light Lights
Material Materials
GeoObject Geometric Objects
Oct-tree Node Oct-Tree
Camera Cameras

The scene graph contains the lights, cameras, materials, geometric objects and the Oct-Tree. This class is used to manage all these objects and provide a few necessary utilities. It can read and write all its data to a file store. For a much more detailed description of all these classes please refer to the Appendix.


    4.1.2 3DS Max Plugin

In order to view various 3D models stored in a multitude of different file formats, a 3DS Max plugin was developed. 3DS Max is a professional 3D authoring package that supports a very large feature set. It also comes with an SDK that allows developers to extend its capabilities by developing plugins. These plugins range from advanced renderers to animation tools. The plugin developed for this report was an exporter plugin that allows any scene currently loaded in 3DS Max to be saved to a file directly supported by the scene graph of this renderer. The plugin supports a subset of the Max scene graph. It supports materials as well as one diffuse texture map. It also supports basic lights and cameras as well as geometric objects. 3DS Max, however, supports a wide variety of different representations for geometric objects. These include indexed face sets and curved surfaces in the form of NURBS, Bezier curves and various others. This plugin takes any one of these representations and converts it to an indexed triangle list before saving it to disk. The plugin is very simple and is used by entering the file menu and selecting export. The user then selects the format, filename and location to which he/she wants to save the file. Having selected the format designed for this renderer, the user is presented with a very simple options menu. This menu allows the user to select which parts of the scene to export; it also has a debug option that was implemented for testing purposes.


    Figure 16 Screen Shot from 3DS Max Plugin

    4.1.2.1 3DS Max Plugin File Format

The file format used to store the Scene Graph was designed along with the 3DS Max plugin. It is an ASCII format; this was chosen because ASCII files can easily be modified in a standard text editor, no care has to be taken with regard to the endianness of the platform, and they can very easily be debugged. The drawback of ASCII is that the space required to store these files is greater than for a binary equivalent, and they take a greater amount of time to read and write. The structure of the file is, however, very simple and is as follows.

Example File *.JDN
Number of Materials = 1
Material Name = Sky
Ambient = 0 0 0
Diffuse = 0.2 0.552941 0.176471
Specular = 0.9 0.9 0.9
Shine = 0.25
Shine Strength = 0.05
Transparency = 0
Material ID = 0
Map Path = C:\3dsmax4\maps\Skies\Duskcld1.jpg

    :::


Number of Lights = 1
Pos = -53.9952 455.163 0
Direction = 0 0 0
Colour = 1 1 1
Intensity = 1
Fall = 45
Type = 2

:::

Number of Objects = 1
Material ID = 0
Num Vertices = 8
Num Faces = 12
Num textured vertices = 12
Vert = 44.89 92.909 0

    :::

Vert = 78.8883 171.119 60
Face = 0 2 3

    :::

Face = 4 6 2
Textured Vert = 0 0 0

    :::

Textured Vert = 1 1 0
Textured Face = 9 11 10

    :::

Textured Face = 3 2 0
Face Normal = 0 0 -1

    :::

Face Normal = -1 0 0
Vert Normal = -1 0 0

    :::

    Vert Normal = 0 1 0


    4.1.3 Application Code

The application code handles the application initialisation and the user interface, and also contains the main rendering loop.

Application Initialization

This part of the application consists of initializing the hardware and the Scene Graph. Once the Scene Graph has been instantiated, it reads the file containing the 3D scene data. Once this has been completed, the necessary data structure pre-processing occurs.

User Interface

The user interface is a simple one. The environment is navigated with fly-through interaction. The mouse controls the viewing direction while the keyboard allows the user to move forwards or backwards along the viewing direction. Other keys allow a strafing action. The user can also enable certain viewing cues, including the visualization of certain data structures used to perform visibility operations. These structures are not actually part of the 3D environment. Lastly there are two more keys. One of them allows the user to record a camera path that is subsequently stored to the file system. The other allows the user to play back a recorded path while engaging a benchmarking feature that stores the results in a benchmark file. The rendering loop and the pre-processing are discussed in the algorithmic design section 4.2 of this report, while more detail with regard to benchmarking is presented in the evaluation and results sections 6 & 7.

    Figure 17 Initialization

    4.2 Algorithmic Design


    Figure 18 Rendering Pipeline


The above diagram shows the various iterations of the rendering pipeline. Please note that this is a very simplified representation of each pipeline and is by no means complete. However, it does illustrate the differences between the pipelines as well as the evolution the pipeline went through in order to achieve real-time performance. Each rendering pipeline is split in two: the first part represents the calculations performed by the CPU while the second part represents the calculations performed by the GPU. As GPUs evolve they are taking more of the burden off the CPU. This section will review every iteration of the rendering pipeline, discuss the different algorithms selected to perform different pipeline tasks, and justify all the decisions that were made. These decisions were made with the careful use of evaluation tools. The evaluation techniques will be covered in greater depth in section 6.

4.2.1 1st Pipeline

The GPU can perform visibility with view frustum culling. The GeForce3 also supports a certain level of occlusion culling, hardwired into its architecture. Finally, hidden surface removal is performed with the Z-buffer. The first step was simply to read the scene graph, throw all the triangles across the AGP bus and see what happened. As expected, this yielded very poor results. It was quite obvious from the benchmarks (see Appendix) that too many triangles were being passed across the AGP bus and into the GPU. Care was taken to minimise the state changes that would have to occur in the GPU in order to set up materials; therefore all the objects were sorted by material.


4.2.2 2nd Pipeline

One very simple solution to reducing this burden was to remove as much as possible as soon as possible. However, there was no point in implementing an exact view frustum clipping algorithm when the GPU could perform this task much more efficiently than the CPU. The easiest solution was to implement conservative view frustum culling (see Section 2.1.1) with the use of bounding volumes. For this task an axis-aligned bounding volume [section 2.1.2] was created for every geometric object in the scene graph. These bounding volumes were then tested against the view frustum. If the intersection test with an AABB succeeded, the whole geometric object was passed on to the GPU. This produced much better results than the previous version of the rendering pipeline (see Section 7). This was an improvement; however, it was nowhere close to achieving adequate real-time performance. The target scene contained in excess of six thousand objects, and testing each one of these objects' bounding volumes was too costly. The evaluation of this pipeline led to the conclusion that the rendering loop was CPU limited (see Section 6).

4.2.3 3rd Pipeline

Having come to the conclusion that the pipeline was CPU limited, the CPU-side conservative VFC had to be optimized. It was time to start taking advantage of spatial coherence. This could potentially reduce the CPU burden greatly. A decision had to be made between spatial subdivision structures. With a little foresight it was clear that even spatial subdivision would not suffice and occlusion culling would eventually have to be used; the spatial subdivision structure would therefore have to be useful to both VFC and occlusion culling. Two structures satisfied both those needs perfectly: Oct-Trees and k-d trees. The Oct-Tree data structure was selected because of its simplicity and elegance. The construction of an Oct-Tree data structure [Section 2.1.2] is very simple and the only parameter that has to be set is the depth at which the tree stops. This depth can depend on two factors: the minimum number of triangles contained by any one node and the minimum size that a node can reach. These two factors can affect the performance of both the VFC and the occlusion culling; more on this in Section 7. There are also small implementation decisions that can be made when implementing Oct-Trees [Section 5]. The Oct-Tree creation was performed as a pre-processing step. This occurred straight after the scene


graph had been read. A conservative hierarchical view frustum culling algorithm would walk the tree and remove all the non-visible nodes. This produced very good results, though with a variable frame rate: in certain situations it yielded very adequate frame rates while in others the frame rate dropped considerably. At this point in the development, most of the triangles not contained by the view frustum were culled very early in the pipeline. However, there were still triangles that were not visible due to occlusion that were being sent down most of the pipeline until they were removed by the Z-buffer.

4.2.4 4th Pipeline

It was clear from the previous evaluation that the geometric load was still too great; however, the previous rendering pipeline was not fill-rate limited [see Section 6]. This indicated that some of the fill rate could be sacrificed in order to reduce the geometry throughput. This is referred to as balancing the rendering pipeline between fill rate and geometry throughput [see Section 7]. The goal of this iteration was to achieve occlusion culling [see Section 2.1.4] and thereby further reduce the geometric load on the GPU. An occlusion culling algorithm had to be employed. The first choice was which type of occlusion algorithm to select. Because of this extra fill rate, an image-space occlusion algorithm was selected over an object-space one. This not only enables the rendering pipeline to be balanced, but image-based algorithms also have a tendency to be much more flexible in terms of the information they can be given, as well as implicitly supporting occluder fusion. There was still a limiting factor imposed by the architecture of the target platform: the AGP bus. Unfortunately, the AGP bus can only achieve high throughput in one direction. This means that one can feed a lot of data to the GPU, but it is incredibly slow at retrieving data from the GPU. The problem with some of the previously discussed image-based occlusion culling algorithms is that they require reading back the Z-buffer from the GPU. Doing this for each frame would have a very serious impact on performance. An alternative had to be found. This alternative came in the form of an OpenGL extension supported by NVidia and various other manufacturers that will soon make it into the OpenGL 2 standard. This extension allows queries to be sent to the graphics card; these queries return the number of Z-buffer pixels that were written to by a set of primitives. That is, one can create a query, then proceed to draw a triangle, and subsequently obtain the number of pixels that were projected from that triangle onto the depth buffer. For a pixel to be written to the depth buffer it would have to have a lower depth value (i.e. be in


front). With this extension in mind the following algorithm was implemented. The algorithm consisted of first performing hierarchical view frustum culling of the Oct-Tree. The Z-buffer from the previous frame was kept. Occlusion queries were then generated; these queries were used in conjunction with the rendering of every Oct-Tree node that had not been culled previously. Once these nodes had been rendered, the queries would return exactly which nodes were visible (the Z-buffer is then cleared), and the triangles contained in this subset of nodes were then selected and rendered. It is important to note that this is an approximate visibility algorithm. This means that in certain situations it will not render objects that in actual fact should be visible. This sounds worse than it actually is. The worst possible case is for the camera to rotate 180 degrees about its view up vector, for all the objects contained within its newly placed viewing frustum to be further away than all of the previously rendered objects, and for the previously rendered objects to have taken up most of the Z-buffer. This would create an almost empty frame where very few objects would be visible. However, it would only happen for one frame; the following frame would clear the previous Z-buffer and all the visible objects would reappear. In practice this very rarely occurs. The faster the frame rate, the less time that worst-case frame stays on the screen and the less noticeable the aliasing. This potential error could also be capped to a maximum by introducing a pre-emptive mechanism that detects very extreme movement from one frame to the next and clears the Z-buffer should such an event occur. The advantage this approach has over a more conservative equivalent is that the algorithm does not need to perform any kind of occluder selection. Therefore no occluder pre-processing steps are required and no resources are utilised in determining which objects should be selected as occluders. I would also like to mention that this algorithm was originally implemented with some skepticism but was found to produce very good results.

4.2.5 5th Pipeline

The evaluation of the 4th pipeline yielded very good results: real-time performance with the test scenes was achieved. However, it did introduce a fill-rate limitation. This was due to the fact that all the Oct-Tree nodes not removed by the hierarchical view frustum culling were being rendered for each and every frame. One way to improve this was to use temporal coherence [see Section 2.1.3]. This can be achieved by considering an Oct-Tree node visible for a certain number of frames. By doing so, it does not need to be rendered for occlusion queries for the specified number of frames. If its contents are in actual fact not visible, it is the responsibility of the view frustum culling on the GPU to discard them. This allows


    greater tuning of the graphics pipeline and enables a much more fine grain control whenbalancing fill rate vs. geometry throughput [see Section 7]. Another very simple way todouble the fill rate was to introducing Back Face Culling, this is the process of removing alltriangles that are facing away fro the camera. This very easily implemented with the use oneOpenGL function call. It does put the constraint that double sided surfaces are no longer

    supported. Unfortunately the nature of the scenes used to test this pipeline is architectural inorigin and subsequently the geometry describes more the topology of the objects and to amuch lesser extent surface detail. This has a tendency of making level of detail less usefulbecause further reductions of triangles alter the topology. A little experimentation was carriedout and produced very poor results. An alternative had to be found. The only alternative leftwas impostors. Texture impostors are very prone to serious aliasing artefacts and for thatreason a point based approach was selected. This was also very well suited to the previouspipeline. As one may recall extensive use was made of one OpenGL extension that enabledthe number of Z-buffer pixels that where touched to be retrieved. This could also be used todetermine which object representation to use. A threshold could be set; this would determinewhich representation of an object to use. If the threshold was exceeded the original objectrepresentation would be used otherwise a point impostor would replace it. However due totime constraints the Point Based impostor algorithm was not completed and therefore could

    not be tested. Therefore the only difference separating the 4 th and 5 th rendering pipeline wasthe temporal coherence technique. Pseudo Code for Final Rendering Loop

    ClearColourBuffer();
    pCamera->look();                                        // Update camera location
    pCamera->m_pFrustum->calculateFrustum();                // Calculate new view frustum
    pOctree->CullOctree(pptrNodes, *(pCamera->m_pFrustum)); // Hierarchical view frustum culling
    glGenOcclusionQueriesNV(g_NumVisibleNodes, ptrOcclusionQueries); // Generate occlusion queries
    glDisable(GL_LIGHTING);                                 // Disable lighting
    glColorMask(GL_FALSE, GL_FALSE, GL_FALSE, GL_FALSE);    // Disable colour buffer writes
    glDepthMask(GL_FALSE);                                  // Disable writing to Z-buffer
    for (int i = 0; i < g_NumVisibleNodes; i++)
    {
        glBeginOcclusionQueryNV(ptrOcclusionQueries[i]);
        pptrNodes[i]->DrawAABB();    // Render bounding box if the node was not visible 3 frames ago
        glEndOcclusionQueryNV();     // End occlusion query
    }
    glEnable(GL_LIGHTING);                                  // Enable lighting
    glColorMask(GL_TRUE, GL_TRUE, GL_TRUE, GL_TRUE);        // Enable colour buffer writes
    glDepthMask(GL_TRUE);                                   // Enable write access to Z-buffer
    glClear(GL_DEPTH_BUFFER_BIT);                           // Clear Z-buffer
    for (int j = 0; j < g_NumVisibleNodes; j++)
    {
        GLuint pixelCount;
        glGetOcclusionQueryuivNV(ptrOcclusionQueries[j], GL_PIXEL_COUNT_NV, &pixelCount);
        if (pixelCount > 0 && pixelCount < 20)
        {
            pptrNodes[j]->DrawImpostor(); // Fewer than 20 pixels touched: draw impostor
        }
        else if (pixelCount >= 20)
        {
            pptrNodes[j]->DrawNode();     // Otherwise draw true representation
        }
    }
    glDeleteOcclusionQueriesNV(g_NumVisibleNodes, ptrOcclusionQueries); // Delete queries
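The "not visible 3 frames ago" rule in the query loop above can be sketched as a small helper. The struct and field names below are illustrative assumptions, not the thesis's actual code.

```cpp
#include <cassert>

// Temporal coherence for occlusion queries: a node confirmed visible by a
// query is trusted for 'coherenceFrames' frames before it must be
// re-tested with another occlusion query.
struct NodeVisibility
{
    int lastVisibleFrame = -1000; // frame on which the node last passed a query
};

bool needsOcclusionQuery(const NodeVisibility& n, int currentFrame, int coherenceFrames)
{
    return currentFrame - n.lastVisibleFrame >= coherenceFrames;
}
```

With coherenceFrames = 3, a node that passed a query on frame 10 is drawn without a query on frames 11 and 12 and only re-tested on frame 13, trading query fill rate against occlusion accuracy.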

    5 Implementation

This section will focus on implementation details. It is very important to take extreme care while implementing any kind of system that requires high levels of performance. The key to the success of this project was selecting good algorithms that could truly harness the power of the GPU. It is therefore necessary to fully understand the underlying architecture of the platform. With this understanding one can specifically tailor certain aspects of the implementation to take advantage of the hardware or, in other cases, avoid some of its pitfalls.

    5.1 Choice of Language

Gone are the days when highly optimised rendering loops were completely implemented in assembly. Developing in assembly allows the developer full, unrestricted control over the hardware. However, as compilers have evolved and can produce optimised assembly, there is no longer the need to develop at such a low level most of the time. The average developer would not be capable of matching a compiler's performance when it came to developing a full-blown application in assembly, although there are some exceptions to the rule. This renderer was fully developed in C++. This language was selected because it is object oriented, which matches the design; it is easier to port; it has now reached a certain level of maturity; and there are many good compilers for it that can produce optimised code. Some may argue that standard C can yield better performance, but it is my opinion that, if care is taken when implementing software in C++, this is generally not the case. The advantages that C++ offers in terms of productivity greatly outweigh its disadvantages.


    5.2 OpenGL Issues & Getting the most out of the GPU

To harness the power of the GPU one must be able to interface with it. This functionality is provided by an API. There are really only two to choose from: OpenGL and DirectX. The distinction between them is gradually fading away; DirectX has finally reached a level of maturity that OpenGL obtained a while ago. The choice was simple; once again portability comes to mind, and OpenGL has much wider support for a variety of different platforms. However, with the very rapid evolution of GPUs, the OpenGL standard has not had the time to catch up. This has led to a multitude of different OpenGL extensions, each supporting different GPUs. Unfortunately, using these extensions has become unavoidable. There is hope with the new OpenGL 2 standard, which aims to unify most of these conflicts. To achieve good performance levels one must be very careful in choosing which OpenGL functions to use when drawing triangles. OpenGL supports a whole variety of different ways of drawing triangles; these include:

Immediate Mode
Display Lists
Vertex Buffers
Compiled Vertex Buffers
Vertex Array Range (NVidia Extension)

All of these methods, with the exception of compiled vertex buffers, were implemented, with very different results. Immediate mode is the easiest to use and probably the most flexible; however, this comes at a very high price: its performance is slow [see Section 7]. The reason for this is that too much information has to be sent across the AGP bus. This includes many OpenGL commands, at least three and usually four per triangle: one for each vertex and one for the vertex normal. This can be greatly reduced by wrapping these commands into a display list that can then be cached by the GPU. Display lists improve performance to a small extent, at the cost of only being usable for static geometry.

Vertex buffers improved the performance very substantially. This is because much less information has to be sent across the AGP bus, and shared vertices only have to be sent once. However, the fastest way to draw triangles on NVidia's GPUs is to use one of their extensions, VAR (Vertex Array Range). This allows one to cache all the vertices either on the GPU itself or in AGP memory. Because this project's focus was on rendering large datasets, it was not possible to cache everything on the GPU, so AGP memory was used.


Once all the vertices were cached, the indices of the visible triangles were sent to the GPU. This allowed the GPU to DMA what it needed directly from AGP memory and yielded substantial performance boosts [see Section 7].
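A back-of-the-envelope calculation shows why caching vertices and sending only indices cuts the per-frame AGP traffic so dramatically. The vertex layout assumed below (three floats of position plus three of normal) and the 16-bit indices are illustrative assumptions, not measured from the thesis's renderer.

```cpp
#include <cstddef>

// Bytes crossing the bus per frame if every triangle resends its three
// vertices (position + normal) in immediate mode.
std::size_t immediateModeBytes(std::size_t triangles)
{
    const std::size_t bytesPerVertex = 6 * sizeof(float); // xyz position + xyz normal
    return triangles * 3 * bytesPerVertex;
}

// Bytes per frame when vertices are cached in AGP/GPU memory and only
// 16-bit indices for the visible triangles are sent (ignoring the
// one-off vertex upload).
std::size_t indexedArrayBytes(std::size_t triangles)
{
    return triangles * 3 * sizeof(unsigned short);
}
```

For a one-million-triangle frame this works out to roughly 72 MB of vertex data per frame versus 6 MB of indices, and index reuse of shared vertices widens the gap further.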

    5.3 Oct-Tree Implementation Details

There were certain details that had to be addressed when implementing the hierarchical Oct-Tree culling. Firstly, during the pre-processing stage when the Oct-Tree is constructed, certain triangles intersect two or more nodes, and a decision had to be made on how to deal with them. They could be split into two or more triangles, each associated with its appropriate node. This would, however, greatly increase the number of scene triangles, which is not desirable. The alternative is to associate the triangle with more than one node. To avoid drawing the same triangle multiple times, such triangles can be flagged when first rendered within a frame, thereby avoiding overdraw. When building the Oct-Tree there is one crucial decision that has to be made: determining when to stop subdividing the tree. One trivial case is to stop subdividing if a node does not contain any triangles. However, it would be very costly to build a tree where each node contained one triangle; this would also defeat the purpose of the tree. Oct-Trees are usually subdivided until each node contains no more than a certain number of triangles or has reached a minimum size. The thresholds used determine how the final tree is balanced. Obtaining a good balance is crucial: it is a trade-off between VFC and Occlusion Culling efficiency as well as GPU cache hits. This balance is quite tricky to achieve and also very dependent on the type of scene being rendered [see Section 7]. Sparse scenes, where the objects are quite spread out, will favour a coarser Oct-Tree, while very dense scenes will favour a finer-grained Oct-Tree. Then again, if the Oct-Tree is too fine-grained this will slow down the VFC and minimise vertex cache hits on the GPU, while obtaining more accurate Occlusion Culling. Since building the Oct-Tree is an offline process, the best way to balance the tree would be to perform some kind of cost-benefit analysis for various parameters and a particular scene. There are also small implementation details with regard to the Hierarchical View Frustum Culling of the Oct-Tree that can yield minor performance advantages. Obtaining a front-to-back ordering of the tree will improve rendering performance, due to the fact that modern GPUs also have some kind of occlusion culling in their pipeline. Some other small advantages can be gained by not testing the children of a node that falls completely within the view frustum, as well as by testing only a subset of a node's children depending on which vertices of the parent node were contained within the frustum. All of these are small examples of simple implementation details that help squeeze that little bit of extra performance out of the culling algorithms.
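The stopping rule described above (a triangle budget per node plus a minimum node size) can be sketched as a single predicate; the function name and thresholds are illustrative assumptions.

```cpp
#include <cstddef>

// Subdivide only while the node still exceeds its triangle budget and a
// child (half the parent's size) would not fall below the minimum node
// size. These two thresholds are the tuning parameters that balance the
// tree.
bool shouldSubdivide(std::size_t triangleCount, float nodeSize,
                     std::size_t maxTriangles, float minSize)
{
    return triangleCount > maxTriangles && nodeSize * 0.5f >= minSize;
}
```

Raising maxTriangles or minSize yields the coarser trees that suit sparse scenes; lowering them gives the finer partitioning that dense scenes favour, at the cost of VFC time and GPU vertex-cache hits.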

    5.4 Other details addressed after testing

This part of the report will focus on the little details that had to be addressed during the implementation. Most of these only became apparent after testing. To test the system, a model of the Millennium Dome was used. This model was more or less polygon soup, which presented its own challenges. Firstly, the ordering of the triangles was not consistent. This meant that consistent back face culling could not be achieved. Given the size of this model (1 million+ triangles) it was not feasible to fix it by hand. Fortunately, [26] developed an algorithm that could automate this process. This could also be used to fix normals that were pointing in opposite directions and therefore producing certain lighting artefacts. Another issue that quickly came to attention was numerical accuracy. Careful consideration had to be taken when constructing bounding boxes. Numerical accuracy also created tearing artefacts, caused by the low precision of Z-buffers on consumer-level graphics cards. This could be remedied in various ways. Firstly, the viewing volume should be kept as small as possible while still containing all the visible objects. This can be achieved by dynamically moving the front and back clipping planes. It could also be achieved by first rendering the front half of the scene and subsequently rendering the second half after having appropriately changed the clipping planes; this would, however, turn the rendering pipeline into a two-pass pipeline. It was found that moving the front clipping plane as far back as possible, combined with moving the back clipping plane as close as possible, used in conjunction with the NVidia VAR extension and back face culling, eliminated most if not all of the tearing artefacts.
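The dynamic clipping-plane adjustment can be sketched as follows, assuming each visible Oct-Tree node's minimum and maximum depth along the view direction are already known; the function and parameter names are illustrative.

```cpp
#include <algorithm>
#include <vector>

struct DepthRange { float nearPlane, farPlane; };

// Tighten the near and far planes to the depth extents of the visible
// nodes; 'minNear' stops the near plane collapsing to zero, which would
// destroy Z-buffer precision.
DepthRange fitClippingPlanes(const std::vector<float>& nodeMinDepths,
                             const std::vector<float>& nodeMaxDepths,
                             float minNear)
{
    float n = *std::min_element(nodeMinDepths.begin(), nodeMinDepths.end());
    float f = *std::max_element(nodeMaxDepths.begin(), nodeMaxDepths.end());
    return { std::max(n, minNear), f };
}
```

Shrinking the [near, far] interval spreads the fixed number of Z-buffer bits over a smaller depth range, which is what suppresses the tearing artefacts.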

Figure 19: above, an example of tearing artefacts; below, an example of triangle ordering artefacts


    6 Testing and Evaluation

    6.1 Testing

Various tests were performed to verify the algorithms and their implementations. Two test scenes were utilised for this project. The first was a relatively simple scene of islands and a sailing boat. This was a small scene that could be quickly loaded; it was used mainly to test all the small changes that were progressively made to the code base. If everything worked with this scene, the large Millennium Dome scene was then used. This scene was stored in a large file that would take a few minutes to load and pre-process. Some of the testing was performed with visual cues: for example, when generating AABBs and Oct-Trees they would be displayed on the screen. Other tests involved rendering various camera positions with different algorithms and comparing their results. Various debugging tools were also employed when necessary, as well as output of various data to the console during the rendering loop. Because this project was of a prototyping and research nature, it was not tested as rigorously as a commercial product would be, but the necessary care was taken to ensure that the results are accurate.

Figure 20: Simple terrain test scene


Figure 21: Screenshot of the development environment displaying the rendering window containing AABBs and the console window indicating the number of objects passing the VFC

Figure 22: Screenshot showing the Oct-Tree partitions


    6.2 Evaluation

Evaluation is absolutely crucial in order to optimise any type of algorithm, and the techniques used to evaluate have to be relevant to the type of problem. Benchmarks therefore have to be carried out on the various algorithms; these benchmarks involve being able to consistently perform timing analysis on the same scene with the same set of camera locations. The timing functions used also have to be very accurate; sub-millisecond accuracy is therefore needed. With the correct evaluation techniques, such as profiling, one can pinpoint which algorithms are currently the limiting factor.

    6.2.1 Windows Timing Functions

The appropriate timing functions have to be used. These have to be very accurate, have low latency, and must not require too much system resources. Various timing functions were tested so that the appropriate one could be selected. When timing, it is very important to know how much time the actual timing function itself takes. The following are the results of the timing function benchmarks; please note that all these timing functions were benchmarked consistently. The output below covers the various timing functions available on the Windows PC platform and measures the performance of each over different frequencies of use. For each method it reports the number of times the function was called, the total time taken for all the iterations to be performed, and the average time each call takes.

QueryPerformanceFrequency() freq = 0 1193182

method 0: QueryPerfCntr..() 100 times
tot: 0 760
avg time: 6.36952e-006

method 0: QueryPerfCntr..() 500 times
tot: 0 3842
avg time: 6.43992e-006

method 0: QueryPerfCntr..() 1000 times
tot: 0 11492
avg time: 9.63139e-006

method 0: QueryPerfCntr..() 10000 times
tot: 0 98118
avg time: 8.22322e-006

method 1: GetTickCount() 100 times
tot: 0 10
avg time: 8.38095e-008

method 1: GetTickCount() 500 times
tot: 0 20
avg time: 3.35238e-008

method 1: GetTickCount() 1000 times
tot: 0 72
avg time: 6.03428e-008

method 1: GetTickCount() 10000 times
tot: 0 233
avg time: 1.95276e-008

method 2: TimeGetTime() 100 times
tot: 0 30
avg time: 2.51429e-007

method 2: TimeGetTime() 500 times
tot: 0 111
avg time: 1.86057e-007

method 2: TimeGetTime() 1000 times
tot: 0 214
avg time: 1.79352e-007

method 2: TimeGetTime() 10000 times
tot: 0 2634
avg time: 2.20754e-007

method 3: Pentium internal high-freq cntr() 100 times
tot: 0 11
avg time: 9.21905e-008

method 3: Pentium internal high-freq cntr() 500 times
tot: 0 20
avg time: 3.35238e-008

method 3: Pentium internal high-freq cntr() 1000 times
tot: 0 38
avg time: 3.18476e-008

method 3: Pentium internal high-freq cntr() 10000 times
tot: 0 320
avg time: 2.6819e-008
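A portable sketch of the benchmark above, measuring the overhead of one timer call by invoking it in a tight loop. std::chrono stands in for the Windows-specific counters that were actually tested; the iteration-count parameter is an assumption.

```cpp
#include <chrono>

// Average cost in seconds of a single clock query, estimated over
// 'iterations' calls. Accumulating into 'sink' keeps the compiler from
// optimising the timed calls away.
double averageTimerCallSeconds(int iterations)
{
    using clock = std::chrono::high_resolution_clock;
    clock::duration sink{};
    const auto start = clock::now();
    for (int i = 0; i < iterations; ++i)
        sink += clock::now().time_since_epoch(); // the call under test
    const double elapsed = std::chrono::duration<double>(clock::now() - start).count();
    (void)sink;
    return elapsed / iterations;
}
```

As the table above shows, the measured average only stabilises at higher iteration counts, which is why each method was benchmarked at 100 through 10000 calls.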


    6.2.2 Evaluating the Rendering Pipeline

Again, various techniques were used to evaluate the rendering pipeline. Evaluating the pipeline as a whole was achieved with camera paths that were recorded and played back. During playback the exact time spent on each frame was recorded and written to a file. This data was used to tweak certain parameters such as the Oct-Tree balancing [see Section 7]. Timing the frame rate only gives a certain idea of the performance levels: it allows the rendering pipeline to be evaluated as a combination of algorithms, but it does not give a good indication of each individual algorithm's performance. Other techniques had to be employed for this. The camera paths were also carefully created. Different algorithms work more or less well in a variety of situations, and different camera paths were created to try to explore the strengths and weaknesses of each algorithm or a particular rendering pipeline. The test model of the Millennium Dome had the advantage of containing a whole variety of different situations that could challenge the pipeline. It contained dense as well as sparse separation of objects, and it had small and very large objects, from the thin wire structures used to hold up the roof to the large towers used to hold the lighting equipment.


    6.2.3 Finding Bottlenecks

The key to refining the rendering pipeline is to find bottlenecks. These are the specific stages of the pipeline that are limiting the overall pipeline. When discussing rendering pipeline bottlenecks, the terms fill-rate limited, CPU limited and geometry limited are usually used.

Fill-rate limited: This can be thought of as the pixel rate. A GPU can write a maximum number of pixels to its buffers, and various factors, such as the GPU's memory architecture, influence this. This limitation occurs at the GPU. Detecting whether a rendering pipeline is fill-rate limited is very simple: one just has to increase the display resolution and monitor the frame rate. If the rate drops at higher resolutions, then the pipeline is fill-rate limited. The causes are rasterisation and resolution. For example, textures use up fill rate: the higher the resolution of the texture, the more fill rate is used. Image-based occlusion culling also utilises fill rate, because more objects, such as the Oct-Tree nodes, need to be rasterised in order to test them for occlusion. Fill-rate limitation usually occurs towards the end of the pipeline.

CPU limited: This is a very general term that refers to any bottleneck occurring at the CPU stage of the rendering pipeline. One example would be using a conservative View Frustum Culling algorithm on AABBs with a scene containing thousands of objects; the time taken by the CPU to perform the culling stalls the rest of the pipeline. Detecting CPU limitation is usually straightforward: profiling the application can pinpoint which part of the rendering pipeline calculated by the CPU is using up the most resources. CPU limitation usually occurs at the beginning of the pipeline.

Geometry limited: This refers to the maximum amount of geometry, or triangles, that can be sent down a rendering pipeline for a given frame rate. The causes of this limitation are much more difficult to pinpoint: they can be at the CPU, the GPU, or the bus that links them. Because of the nature of this type of bottleneck, it can occur more or less anywhere in the rendering pipeline.

During every iteration of development the appropriate tools were employed to find the bottlenecks and remove them. There are various tools on the market that can help this process. One example is VTune, a tool designed by Intel that specifically analyses code for their range of CPUs. It produces information indicating which portions of code the CPU spends most of its time executing. It also comes with a very specialised compiler that optimises code to maximise its performance on the Pentium processor. AMD also has an equivalent tool. One can use such tools to help speed up the process of finding bottlenecks.

An alternative to using such tools is to time very specific portions of the rendering pipeline, which was done for this project. By doing so, one can determine the exact percentage of the frame rendering time that is spent at each stage of the pipeline. This can be a very powerful tool in establishing where to focus attention. For example, the second rendering pipeline [see Section 4.2] used conservative view frustum culling on bounding boxes; with the use of specific timing, one could clearly see that a disproportionate amount of the pipeline's time was spent at this stage. With this in mind, the next pipeline concentrated on optimising this culling procedure with the use of spatial subdivision. All of these tools were used and reused time and time again with each generation of the rendering pipeline. They were also used to optimise certain parameters such as the Oct-Tree balancing. The following section of this report presents in much greater detail some of the results that were achieved, all of which were obtained with such tools.
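The per-stage timing described above can be sketched as a small accumulator that reports each stage's share of the frame time; the class and stage names are illustrative assumptions.

```cpp
#include <map>
#include <string>

// Accumulates seconds per pipeline stage and reports each stage's
// percentage of the total frame time.
class StageProfiler
{
public:
    void add(const std::string& stage, double seconds) { m_times[stage] += seconds; }

    double percentage(const std::string& stage) const
    {
        double total = 0.0;
        for (std::map<std::string, double>::const_iterator it = m_times.begin();
             it != m_times.end(); ++it)
            total += it->second;
        std::map<std::string, double>::const_iterator found = m_times.find(stage);
        if (total <= 0.0 || found == m_times.end())
            return 0.0;
        return 100.0 * found->second / total;
    }

private:
    std::map<std::string, double> m_times;
};
```

Feeding it, say, the VFC, occlusion-query and draw timings of one frame immediately shows which stage dominates; this is the kind of observation that motivated moving the culling onto a spatial subdivision in the following pipeline.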


    7 Results

This section of the report presents some of the different benchmark results that were obtained when exploring different algorithms and their implementations, as well as some of the effects different parameters had on performance. Please note that during the development cycle [see Section 3] the algorithms were changed, but so was the implementation: different OpenGL routines were used and various GPU-specific extensions were utilised. These extensions added substantial performance to the application. Before taking the final benchmarks some of the pre