
TRANSGAMING INC. WHITE PAPER: SWIFTSHADER TECHNOLOGY
JANUARY 29, 2013

Copyright © 2013 TransGaming Inc. All rights reserved.

For some time now, it has been clear that there is strong momentum for convergence between CPU and GPU technologies. Initially, each technology supported radically different kinds of processing, but over time GPUs have evolved to support more general purpose use while CPUs have evolved to include advanced vector processing and multiple execution cores. In 2013, even more key GPU features will make their way into mainstream CPUs.

At TransGaming, we believe that this convergence will continue to the point where typical systems have only one type of processing unit, with large numbers of cores and very wide vector execution units available for high performance parallel execution. In this kind of environment, all graphics processing will ultimately take place in software. Through our SwiftShader software GPU toolkit, TransGaming has long been a world leader in software based rendering, with widespread adoption of our technology by some of the world’s top technology companies, and a US patent on key techniques issued in late 2012.

This whitepaper explores the past, present, and future of software rendering, and why TransGaming expects that the technology behind SwiftShader will be a critical component of future graphics systems.

SwiftShader Today

In 2005, TransGaming launched SwiftShader, for the first time providing a software-only implementation of a commonly used graphics API (Microsoft® Direct3D®), including shader support, at performance levels fast enough for real-time interactive use. Since then, SwiftShader has found an important niche in the graphics market as a fallback solution – ensuring that even in cases where available hardware or graphics drivers are inadequate, out of date, or unstable, our customers’ software will still run.

SwiftShader: Why the Future of 3D Graphics is in Software


This fallback case is a critical one for software that needs to run no matter what system an end user has in place. TransGaming has licensed SwiftShader to companies such as Adobe for use with Flash® as a fallback for the Stage3D® API, and to Google to implement the WebGL® API within Chrome® and Native Client®. Beyond this, SwiftShader has found customers in markets as diverse as medical imaging and the defense industry. All of these customers require a solution that will put the right pixels on the screen 100% of the time.

Another important area where SwiftShader is being used today is in cloud computing and virtualization systems. Servers in data centers that include GPU capabilities are currently substantially more expensive than normal servers. Using SwiftShader and software rendering thus allows substantial savings and flexibility for developers with server-oriented applications that require some degree of graphics capability.

A key part of the reason that SwiftShader is useful as a fallback option in situations where a hardware GPU is not available or not reliable is that it is capable of achieving performance that approaches that of dedicated hardware. With a 2010-era quad-core CPU, SwiftShader scores 620 points in the popular 3DMark06 DirectX 9 benchmark; this is higher than the scores for many previous generation integrated GPUs.

Software Rendering Future Advantages

While today’s software rendering results are good enough for some applications, current generation integrated GPUs still have a substantial performance advantage. Why then does TransGaming believe that software rendering will have a more important role in the future, beyond a reliable fallback?

The answers are straightforward. As CPUs continue to increase their parallel processing performance, they become adequate for a wider range of graphics applications, thus saving the cost of additional unnecessary hardware. Hardware manufacturers can then focus resources into optimizing and improving hardware with a single architecture, and thus avoid the costs of melding separate CPU and GPU architectures in a system. As graphics drivers and APIs get more complex and diverse, the issues of driver correctness and stability become ever more important. In today’s world, software developers must test their applications on an almost infinite variety of different GPUs, drivers, and OS revisions. With a pure software approach, these problems all go away. There are no feature variations to worry about, other than performance, and developers can always ship applications with a fully stable graphics library, knowing that it will work as expected no matter what. Software rendering thus saves time and money for all participants in the platform and ecosystem during development, testing and maintenance.

Figure 1: SwiftShader running 3DMark06

Beyond cost savings, software rendering has numerous additional advantages. For example, graphics algorithms that today use a combination of CPU and GPU processing must split the workload in a suboptimal way, and developers must deal with the complexity of handling different bottlenecks to ensure that each pipeline remains balanced. Software rendering also simplifies optimization and debugging by using a single architecture, allowing the use of well established CPU-side profilers and debuggers. A simpler, uniform memory model also liberates developers from having to deal with multiple memory pools and inconsistent data access characteristics, creating additional freedom for developers to explore new graphics algorithms.

Most importantly, however, software rendering allows for unlimited new capabilities to be used at any time. New graphics API releases can always be compatible with existing hardware, and developers can add new functionality at any layer of their graphics stack. The only limits become those of the developer’s imagination.

All of this however can only become true if software rendering can close the performance gap. At TransGaming, we believe that this is very achievable, and that upcoming hardware advances will prove this out. To understand why requires a deeper dive into the technical side of SwiftShader.

SwiftShader: The State of the Art in Software Rendering

This section highlights some of the key technologies that differentiate SwiftShader from other renderers, and illustrates how the challenges posed by software rendering can be overcome.

One of the seemingly major advantages of dedicated 3D graphics hardware is that it can switch between different operations at no significant cost. This is particularly relevant to real-time 3D graphics because all of the graphics pipeline stages depend on a certain ‘state’ that determines which calculations are performed. For instance, take the alpha blending stage used to combine pixels from a new graphics operation with previously drawn pixels. This stage can use several different blending functions, each of which takes various input arguments. Handling this kind of work is a challenge for traditional software approaches that use conditional statements to perform different operations. The resulting CPU code ends up containing more test and branch instructions than arithmetic instructions, resulting in slower performance compared to code that has been specialized for just a single combination of states, or hardware with separate logic for each blending function. A naïve software solution that includes pre-built specialized routines for every combination of blending states is not feasible, because combinatorial explosion would result in excessive binary code size.

The practical solution that SwiftShader uses for this type of problem is to compile only the routines for state combinations that are needed at run-time. In other words, SwiftShader waits for the application to issue a drawing command and then compiles specialized routines which perform just those operations required by the states that are active at the time the drawing command is issued. The generated routines are then cached to avoid redundant recompilation. The end result is that SwiftShader can support all the graphics operations supported by traditional GPUs, with no render-state dependent branching code in the processing routines. This elimination of branching code also comes with secondary benefits such as much improved register reuse.
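The compile-once, cache-by-state-key scheme described above can be sketched in a few lines of C++. This is an illustrative stand-in, not SwiftShader’s implementation: instead of emitting machine code at run time, it “compiles” a routine for a hypothetical blend state by selecting a specialized, branch-free function, then caches it so that later draw calls with the same state reuse it:

```cpp
#include <cstdint>
#include <unordered_map>

// Hypothetical blend state: which function combines source and destination.
enum class BlendOp : uint32_t { Add, Multiply, Max };

using BlendFn = float (*)(float, float);

// Specialized, branch-free routines -- stand-ins for the routines a real
// implementation would emit with a JIT at the first draw call.
static float blendAdd(float s, float d) { return s + d; }
static float blendMul(float s, float d) { return s * d; }
static float blendMax(float s, float d) { return s > d ? s : d; }

// "Compile" a routine for a state key, then cache it so subsequent draw
// calls with the same active state pay no recompilation cost.
BlendFn getBlendRoutine(BlendOp state) {
    static std::unordered_map<uint32_t, BlendFn> cache;
    auto key = static_cast<uint32_t>(state);
    auto it = cache.find(key);
    if (it != cache.end()) return it->second;  // cache hit: reuse routine

    BlendFn fn = nullptr;
    switch (state) {                           // one-time specialization
        case BlendOp::Add:      fn = blendAdd; break;
        case BlendOp::Multiply: fn = blendMul; break;
        case BlendOp::Max:      fn = blendMax; break;
    }
    cache.emplace(key, fn);
    return fn;
}
```

The returned routine itself contains no state-dependent branches; all the branching happens once, at “compile” time.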

This technique of dynamic code generation with specialization has proven to be invaluable in making software rendering a viable choice for many applications today, and naturally extends to the run-time compilation of current types of programmable shaders. Most importantly, it opens up huge opportunities for future techniques.

In addition to using dynamic code generation, SwiftShader also achieves some of its performance through the use of CPU SIMD instructions. SwiftShader pioneered the implementation of Shader Model 3.0 software rendering by using the SIMD instructions to process multiple elements such as pixels and vertices in parallel. By contrast, the classic way of using these instructions is to execute vector operations for only a single element. For example, other software renderers might implement a 3-component dot product using the following sequence of Intel x86 SSE2 instructions:

mulps xmm0, xmm1
movhlps xmm1, xmm0
addps xmm0, xmm1
pshufd xmm1, xmm0, 1
addss xmm0, xmm1

Note that this sequence requires five instructions to compute a single 3-component dot product - a common operation in 3D lighting calculations. This is no faster than using scalar instructions, and thus many legacy software renderers did not obtain an appreciable speedup from the use of vector instructions. SwiftShader instead uses them to compute multiple dot products in parallel:

mulps xmm0, xmm3
mulps xmm1, xmm4
mulps xmm2, xmm5
addps xmm0, xmm1
addps xmm0, xmm2

The number of instructions is the same, but this sequence computes four dot products at once. Each vector register component contains a scalar variable (which itself can be a logical vector component) from one of four pixels or vertices. Although this is straightforward for an operation like a dot product, the challenge that TransGaming has solved with SwiftShader is to efficiently transform all data into and out of this format, while still supporting branch operations.
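The structure-of-arrays layout behind this sequence can be illustrated with SSE intrinsics (a sketch assuming an x86 target; the function and variable names are ours, not SwiftShader’s). Each `__m128` argument holds one coordinate from four different vectors:

```cpp
#include <immintrin.h>  // x86 SSE intrinsics

// Four 3-component dot products at once. ax holds the x components of four
// vectors A0..A3, ay their y components, and so on; the returned register
// holds dot(A0,B0), dot(A1,B1), dot(A2,B2), dot(A3,B3).
// This mirrors the mulps/addps sequence shown in the text.
__m128 dot3_x4(__m128 ax, __m128 ay, __m128 az,
               __m128 bx, __m128 by, __m128 bz) {
    __m128 r = _mm_mul_ps(ax, bx);          // x products  (mulps)
    r = _mm_add_ps(r, _mm_mul_ps(ay, by));  // + y products (mulps/addps)
    r = _mm_add_ps(r, _mm_mul_ps(az, bz));  // + z products (mulps/addps)
    return r;
}
```

The arithmetic per element is unchanged; the win comes entirely from keeping every SIMD lane busy with an independent pixel or vertex.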

Earlier versions of SwiftShader made use of our in-house developed dynamic code generator, SwiftAsm, which used a direct x86 assembly representation of the code to be generated. This offered excellent low-level control, but at the cost of the burden of dealing with different sets of SIMD extensions, and of determining code dependencies based on complex state interactions. We’ve since taken things to the next level by abstracting the operations into a high-level shader-like language, which integrates directly into C++. This layer, which we call Reactor, outputs an intermediate representation that can then be optimized and translated into binary code using a full compiler back-end. We chose to use the well-known LLVM framework due to its excellent support of SIMD instructions and straightforward use for run-time code generation. The combination of Reactor and LLVM forms a versatile tool for all dynamic code generation needs, exploiting the power of SIMD instructions while abstracting the complexities.

A simple example of how Reactor is used in the implementation of the cross product shader instruction illustrates this well:

void ShaderCore::crs(Vector4f &dst, Vector4f &src0, Vector4f &src1)
{
    dst.x = src0.y * src1.z - src0.z * src1.y;
    dst.y = src0.z * src1.x - src0.x * src1.z;
    dst.z = src0.x * src1.y - src0.y * src1.x;
}

This looks exactly like the calculation to perform a cross product in C++, but the magic is in the use of Reactor’s C++ template system. The Reactor Vector4f data type is defined with overloaded arithmetic operators that generate the required instructions for SIMD processing in the output code.
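The mechanism can be demonstrated with a toy version of the idea (our sketch, not the real Reactor API): a C++ type whose overloaded operators record instructions into a stream instead of computing values. A real system would hand such a stream to a back-end like LLVM for optimization and binary code generation:

```cpp
#include <string>
#include <vector>

// Records emitted "instructions" and allocates virtual registers.
struct Emitter {
    std::vector<std::string> code;
    int nextReg = 0;
};

// A value in the generated program: just an emitter plus a register id.
struct Float4 {
    Emitter* e;
    int reg;
};

Float4 newReg(Emitter& e) { return {&e, e.nextReg++}; }

// Append "op rDst, rA, rB" to the instruction stream.
static Float4 emit(const char* op, Float4 a, Float4 b) {
    Float4 r = newReg(*a.e);
    a.e->code.push_back(std::string(op) + " r" + std::to_string(r.reg) +
                        ", r" + std::to_string(a.reg) +
                        ", r" + std::to_string(b.reg));
    return r;
}

// Overloaded operators generate instructions rather than computing values.
Float4 operator*(Float4 a, Float4 b) { return emit("mulps", a, b); }
Float4 operator-(Float4 a, Float4 b) { return emit("subps", a, b); }
```

With this type, writing `Float4 r = a * b - c * d;` records two `mulps` and one `subps` into the stream; the cross product code above works the same way, line for line.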

In addition to eliminating branches and making effective use of SIMD instructions, SwiftShader also achieves substantial speedups through the use of multi-core processing. While this may at first seem obvious and relatively straightforward, it poses both challenges and opportunities. Graphics workloads can be split into concurrently executable tasks in many different ways. One can choose between sort-first or sort-last rendering, or a hybrid approach. Each task can also be divided into more tasks through data parallelism, function parallelism, and/or instruction parallelism. Subdividing tasks and scheduling them onto cores/threads comes with some overhead, and these processes are typically fixed once a specific approach is chosen. TransGaming has identified opportunities to minimize the overhead and in some cases even exceed the theoretical speedup of using multiple cores, by combining dynamic code generation with the choice of subdivision/scheduling policy. Information about the processing routines, obtained during run-time code generation, can be used during task subdivision and scheduling, while information about the subdivision/scheduling can be used during the run-time code generation.

We believe this advantage to be unique to software rendering, because only CPU cores are versatile enough to do dynamic code generation, intelligent task subdivision/scheduling, and high throughput data processing.

Further information about these techniques can be found in TransGaming’s patent filing, Patent #8,284,206: General purpose software parallel task engine. While the patent was granted in late 2012, the original provisional patent was filed in early 2006, well before other modern software rendering efforts such as Intel’s Larrabee became public.

Convergence and Trends

The previous sections of this whitepaper show some of the substantial advantages of software rendering, and demonstrate that the technology to use the full computing power of a modern CPU as efficiently as possible is already here. In order to fully understand the coming convergence between CPU and GPU however, we must also consider the evolution of the GPU side of the equation.

Firstly, we must understand what makes modern GPUs exceptionally fast parallel computation engines, and what limits may constrain the approaches used to provide that speed.

Modern GPUs have two critical features that enable the majority of their performance: they provide a large number of heavily pipelined parallel computation units, and they drive many execution threads through these units simultaneously. This allows them to hide the long latencies that frequently occur when executing operations such as texture fetching. While one thread is waiting for a texture fetch result, another thread occupies the computation units. Context switches are therefore designed to be very efficient on a GPU.

Keeping many threads active simultaneously requires a large number of registers to be available. The more registers a given instruction sequence requires, the fewer threads can be run simultaneously.

The lowest organizational level of computation on a GPU is known as a ‘warp’ on NVIDIA GPUs, and a ‘wavefront’ on AMD GPUs. This is similar to the SIMD width in a CPU vector unit. Current generation GPU hardware typically uses 1024-bit or 512-bit wide SIMD units, compared to the 256-bit wide SIMD units used by current generation CPUs.

The wide SIMD approach also has some important limitations. One is that control statements within a given instruction sequence cause divergence, which requires evaluating multiple code paths. With a wider SIMD width, this divergence becomes more common, eliminating some of the execution parallelism. Another limiting factor for graphics processing is that pixels are processed in rectangular tiles, so rendering triangles regularly results in leaving some lanes unused.
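The divergence cost can be made concrete with a scalar C++ simulation of a 4-lane SIMD unit (illustrative only; real hardware uses per-lane execution masks). When lanes disagree about a branch, both paths are computed for every lane and a mask selects each lane’s result:

```cpp
#include <array>

using Lanes = std::array<float, 4>;
using Mask  = std::array<bool, 4>;

// Per-lane select: the SIMD equivalent of a branch. Both inputs must
// already have been computed in full.
Lanes laneSelect(const Mask& m, const Lanes& ifTrue, const Lanes& ifFalse) {
    Lanes out{};
    for (int i = 0; i < 4; ++i) out[i] = m[i] ? ifTrue[i] : ifFalse[i];
    return out;
}

// abs() over four lanes, written the way a wide SIMD unit executes it:
// the negate path runs for ALL lanes even if only one lane is negative.
// That wasted work is exactly the parallelism lost to divergence.
Lanes vabs(const Lanes& x) {
    Mask negative{};
    Lanes negated{};
    for (int i = 0; i < 4; ++i) negative[i] = x[i] < 0.0f;  // compare
    for (int i = 0; i < 4; ++i) negated[i]  = -x[i];        // taken path, all lanes
    return laneSelect(negative, negated, x);                // mask picks results
}
```

With a 32-lane warp the same pattern applies, but the chance that all lanes agree on a branch (and one path can be skipped entirely) shrinks as the width grows.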

Another limitation is the number of registers available. Larger register files lower computational density, so GPU manufacturers must balance that against stalls caused by running out of storage for covering RAM access latencies.

By contrast, CPUs are optimized for low-latency operation. On a CPU core, a significant amount of die area is devoted to scheduling logic that allows many functional units to be used simultaneously through out-of-order execution. Branch prediction units and shorter SIMD widths reduce the penalties for branch-heavy code, and more die space is devoted to caches and memory-management functionality. CPUs typically support running at significantly higher clock frequencies as well.

CPUs are now evolving to support increased parallelism at the SIMD width level as well as with additional execution units available to simultaneous threads, and larger numbers of CPU cores per die. Intel’s Haswell chips, available later this year, will include three 256-bit wide SIMD units per core, two of which are capable of a fused multiply-add operation. This arrangement will process up to 32 floating-point operations per cycle: with four cores on a mid-range version of this architecture, this provides about 450 raw GFLOPS at 3.5 GHz.
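The quoted figure follows from simple peak-throughput arithmetic (our reading of the public Haswell details, not an official Intel number): each fused multiply-add counts as two floating-point operations, and two of the three SIMD units can execute FMAs.

```cpp
// Peak throughput sketch for a hypothetical 4-core, 3.5 GHz Haswell part:
// 2 FMA units x 8 single-precision lanes (256-bit) x 2 ops per FMA.
constexpr double kFlopsPerCycle = 2 * 8 * 2;   // 32 FLOPs per cycle per core
constexpr double kClockGHz      = 3.5;
constexpr int    kCores         = 4;
constexpr double kPeakGflops    = kFlopsPerCycle * kClockGHz * kCores;  // 448
```

448 GFLOPS is the “about 450” in the text; sustained throughput is of course lower once memory traffic and non-FMA instructions are accounted for.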

Intel’s AVX2 instruction set offers room to increase the SIMD width size to 1024 bits, which would put the raw CPU GFLOPS at similar levels to the highest end GPUs currently available.

At the same time, GPUs are becoming more and more like CPUs, adding more advanced memory management features such as virtual memory and the corresponding MMU complexity that is required. GPU instruction scheduling is becoming more complex as well, with out-of-order features such as register scoreboarding, and ILP extraction features such as superscalar execution. Furthermore, GPU-vendor sponsored research suggests that running fewer threads simultaneously might lead to better performance in many cases [1].

Die-level Integration and Bandwidth

One of the trends that displays the clearest indications of convergence between CPUs and GPUs is the increasing frequency of die-level integration of current-generation differentiated CPU and GPU units. This trend has become more and more important with the rise of mobile devices, which require both graphics and CPU performance in a single low-power chip. Most desktop chips sold today also include an on-die GPU.

The very existence of this important market shows the value of CPU / GPU convergence. While today the market is served by chips that integrate separate units on the same die, the potential advantages of a fully unified chip are clear – hardware manufacturers would be able to manufacture simpler macro-level designs with computation cores that could be used for either general purpose or graphics workloads as needs arise in the system.

While one of the traditional hallmarks of strength with GPU computing has been the use of high bandwidth dedicated memory, this advantage becomes moot in environments where the GPU must share memory with the CPU.

[1] Better Performance at Lower Occupancy: http://www.cs.berkeley.edu/~volkov/volkov10-GTC.pdf

While there is no question that the availability of high bandwidth memory will continue to be a strength of discrete GPU computing, there are ways in which both integrated CPU / GPU packages as well as potential future unified chips can offset this distinction.

One approach to provide increased performance for these chips is on-package memory. This approach has already proven useful in Microsoft’s Xbox 360, which includes a 10 MB eDRAM framebuffer. Intel’s Haswell integrated GPU will optionally include 128 MB of high performance memory for this purpose as well. AMD’s next generation Kaveri Fusion architecture chip is designed to fully share memory and address space between its CPU and GPU components.

Clearly, any future unified architecture chip will not suffer from bandwidth limitations any more than existing integrated designs might.

Computational Efficiency

One argument often raised in favor of GPU computing is that GPUs have greater computational efficiency than CPUs.

Device                                     Speed (MHz)   Area (mm²)   GFLOPS   Power (W)   GFLOPS/Watt   GFLOPS/mm²

Discrete GPUs
NVidia Kepler GK104 (GTX 680)                     1006          294   3090.4         195         15.85        10.5
NVidia Kepler GK107 (GTX 650)                      928          118    812.5          64         12.7          6.88
ATI Radeon HD 7870 XT                              975          365   2995.2         185         16.19         8.21

Integrated GPUs
NVidia Tegra 4 GPU only (est.)                     520          ~30     74.8        ~3.8        ~19.68        ~2.49
Intel Ivy Bridge HD 4000 GPU only (est.)          1150          ~57    294.4        14.8         19.89        ~5.16

CPU Hardware
Intel Haswell quad-core, no GPU (est.)            3100          ~96   ~396.8         ~65         ~6.2         ~4.13
Intel Haswell single core (est.)                  3100          ~24    ~99.2         ~16         ~6.2         ~4.13
Intel Haswell ULX single core (est.)              1500          ~24      ~48          ~4        ~12.0         ~2.0

Table 1: GFLOPS per Watt and GFLOPS per mm²


While this is true today, the advantage is much less than one might think, and there is every reason to believe that it will disappear with future CPU designs as the convergence trends described above continue.

Table 1 summarizes information about the performance per unit area and performance per Watt of discrete GPUs, integrated GPUs, and Intel’s Haswell CPU. Estimates are drawn from the websites listed in footnote [2], with the following additional assumptions:

• Tegra 4 GPU area is estimated as 37.5% of overall SoC area

• Tegra 4 GPU TDP is estimated as 50% of overall SoC TDP, based on reported battery size and time estimates for NVidia’s Shield device (38 Watt-hour battery, ~5 hour battery life)

• Ivy Bridge HD 4000 GPU size is estimated as 31% of the overall die, based on visual estimates from die photographs

• Ivy Bridge HD 4000 GPU power consumption is taken from AnandTech measurements

• Haswell core size is estimated as 13% of the overall die, based on visual estimates of die photographs

• Haswell 3.1 GHz single-core power is estimated based on data from the Tom’s Hardware article in footnote [2]

• Haswell ULX 1.5 GHz power is estimated as 50% of the 10 Watt TDP reported in the AnandTech article in footnote [2]

While the data above clearly shows GPUs as more efficient in raw GFLOPS performance, the result is hardly overwhelming. Given the potential for scaling raw GFLOPS on a CPU-style architecture at relatively low power cost by providing wider SIMD vectors, there is no reason to imagine that computational efficiency is a bar to future unified architectures.

[2] Data for Table 1 was compiled from the sites below:
http://en.wikipedia.org/wiki/GeForce_600_Series
http://www.zdnet.com/nvidia-claims-tegra-4-gpu-will-outperform-the-ipad-4s-a6x-7000009888/
http://www.anandtech.com/show/6550/more-details-on-nvidias-tegra-4-i500-5th-core-is-a15-28nm-hpm-ue-category-3-lte
http://i1247.photobucket.com/albums/gg628/mrob27/HSW-4c-GT2-rev2_zps3212a12b.png
http://www.extremetech.com/computing/144778-atom-vs-cortex-a15-vs-krait-vs-tegra-3-which-mobile-cpu-is-the-most-power-efficient
http://en.wikipedia.org/wiki/Comparison_of_AMD_graphics_processing_units#Southern_Islands_.28HD_7xxx.29_series
http://www.nordichardware.com/CPU-Chipset/intel-core-i7-3770k-ivy-bridge-and-the-3d-transistor-is-here/New-graphics-the-biggest-news-in-Ivy-Bridge.html
http://www.anandtech.com/show/5878/mobile-ivy-bridge-hd-4000-investigation-realtime-igpu-clocks-on-ulv-vs-quadcore
http://www.tomshardware.com/gallery/haswell-665x269,0101-364392-0-2-3-1-jpg-.html
http://www.anandtech.com/show/6655/intel-brings-core-down-to-7w-introduces-a-new-power-rating-to-get-there-yseries-skus-demystified
http://www.chip-architect.com/news/2012_04_19_Ivy_Bridges_GPU_2-25_times_Sandys.html

Transitioning to Unified Computing

As we have seen above, there are strong trends towards convergence between the GPU and CPU, and no obvious obstacles to unified computing. Are there any other important factors required to complete a transition to fully unified architectures?

One caveat in the above compute density comparisons is that so far we’ve only looked at the programmable hardware. The relative amount of fixed-function GPU hardware has been shrinking, but it is worth considering the question of whether anything changes if we implement these functions in software.

It is a common misconception that replacing fixed-function hardware would require significant additional programmable hardware in order to achieve the same peak throughput. To illustrate why this is incorrect, we’ll focus on the texture units, which remain the most prominent fixed-function logic on today’s GPUs. These texture units are at any given time either a bottleneck or underutilized. GPU manufacturers try to prevent them from being a bottleneck by having more texture units than what is needed by the average shader (bandwidth and area permitting). As a result, these additional texture units are, more often than not, idle.

By contrast, with unified hardware, the additional programmable hardware required would be based on the average utilization. We have confirmed this experimentally by collecting statistics through the use of run-time performance counters in SwiftShader. The average TEX:ALU ratio in observed shaders is lower than the ratio of TEX:ALU hardware in contemporary GPUs. Software rendering on unified hardware thus has the potential to outperform dedicated hardware, by having more programmable logic available that does not suffer from underutilization. Even in the case where the GPU’s texture units are a bottleneck, software rendering on a unified CPU may outperform it for simple sampling operations with high cache locality.

The second factor that makes it feasible to replace fixed-function texture samplers with programmable hardware is the fact that texture sampling is by nature pipelined, consisting of several logical stages, most of which are configurable: address generation, mipmap level of detail (LOD) determination, texel gather, and filtering. Different functional units inside a CPU core can perform work at each of these stages. For example, texel gathering can be performed by a load/store unit while the SIMD FP ALUs are completing filtering on a previous sample.
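These stages can be sketched in code for the simplest case: non-mipmapped bilinear filtering of a single-channel 2D texture. This is a simplified illustration with clamp-to-edge addressing, not SwiftShader code; the stage comments map back to the pipeline described above.

```cpp
#include <cmath>
#include <vector>

// Minimal single-channel 2D texture with clamp-to-edge addressing.
struct Texture {
    int width, height;
    std::vector<float> texels;  // row-major, width * height entries
    float at(int x, int y) const {  // texel gather with clamping
        x = x < 0 ? 0 : (x >= width ? width - 1 : x);
        y = y < 0 ? 0 : (y >= height ? height - 1 : y);
        return texels[y * width + x];
    }
};

float sampleBilinear(const Texture& t, float u, float v) {
    // Stage 1: address generation (normalized coords -> texel space).
    float x = u * t.width - 0.5f;
    float y = v * t.height - 0.5f;
    int x0 = (int)std::floor(x), y0 = (int)std::floor(y);
    float fx = x - x0, fy = y - y0;
    // Stage 2: texel gather (four neighbours; a load/store unit's job).
    float c00 = t.at(x0, y0),     c10 = t.at(x0 + 1, y0);
    float c01 = t.at(x0, y0 + 1), c11 = t.at(x0 + 1, y0 + 1);
    // Stage 3: filtering (lerp horizontally, then vertically; SIMD FP work).
    float top = c00 + fx * (c10 - c00);
    float bot = c01 + fx * (c11 - c01);
    return top + fy * (bot - top);
}
```

Because each stage uses different functional units, a software pipeline can overlap the gather for one sample with the filtering arithmetic of the previous one.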

Furthermore, SwiftShader’s dynamic code generation can completely eliminate the LOD determination stage when mipmapping is inactive - either when disabled explicitly or when the texture only has one level. Similarly, filtering can range from none at all, to trilinear anisotropic filtering, and beyond. Modern GPU hardware provides support for one trilinearly filtered sample per clock cycle, implementing anisotropic filtering using multiple cycles. This means that during anisotropic filtering, some of the other stages are idle. To implement more advanced filtering, a shader program is required, i.e., software. Likewise on the CPU we can generate specialized routines for 1D, 2D and 3D texture addressing. All this leads to the general observation that as the usage diversifies, bottlenecks and underutilization can be addressed by new forms of programmability and unification.

This brings us to a third argument. Performing texture operations on unified hardware enables global optimizations. We have observed that many shaders sample multiple textures using the same or similar texture coordinates. This enables SwiftShader’s run-time compiler back-end to eliminate common sub-expressions. For instance, if a shader uses a regular grid of sample locations to implement a higher order filter, fewer addresses have to be computed than if the address of each sample were computed independently.

Unifying the CPU and GPU also means that some portions of the GPU’s fixed-function hardware could be added to the CPU’s architecture as new instructions, and these could then be used for new purposes as well. One major example of this is the addition of ‘gather’ support to commodity CPUs, which will become available with Intel’s Haswell CPU later this year. This will speed up software rendering considerably, by transforming serial texel fetches into a parallel operation. But the gather instruction can also speed up a multitude of other graphics operations, such as vertex attribute fetches, table lookups for transcendental functions, arbitrarily indexed constant buffer accesses, and more. The possible uses go far beyond graphics. The decoupling of the gather operation from filtering also enables the optimization of texture sampling operations that require no filtering. In recent years the use of graphics algorithms requiring non-texture related memory lookups has increased, spawning the addition of a generic gather intrinsic in shader languages. Besides gather, several other generic instructions could be added to the CPU to ensure efficient texture sampling in software. For example, to efficiently pack and unpack small data fields, a vector version of the bit manipulation instructions (BMI1/2) could be added.
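What gather does can be written out as the scalar loop it replaces (an illustrative sketch; a hardware gather such as Haswell’s AVX2 instructions performs these per-lane indexed loads as a single vector operation):

```cpp
#include <array>
#include <cstddef>

// One indexed load per SIMD lane -- the access pattern of a texel fetch,
// a transcendental lookup table, or an indexed constant buffer read.
// A hardware gather instruction executes the whole loop at once.
template <std::size_t N>
std::array<float, N> gatherLanes(const float* table,
                                 const std::array<int, N>& indices) {
    std::array<float, N> out{};
    for (std::size_t i = 0; i < N; ++i)
        out[i] = table[indices[i]];  // serial fetches a gather parallelizes
    return out;
}
```

Decoupled from any fixed-function filter, the same operation serves every lookup-style workload the text lists, graphics-related or not.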

Note that our analysis above has covered the decoupling and unification of every major stage in texture sampling. This approach would eliminate both bottlenecks and underutilization, and would also enable new optimizations. It may seem like a large number of instructions would still be required to implement texture sampling, but it is important to keep in mind that on a unified architecture each core can execute any operation. When GPU hardware first became programmable and spent multiple cycles per pixel, vendors were still able to improve performance by adding additional arithmetic units running in parallel. Likewise, breaking texture sampling up into simpler operations allows the work to be spread over more CPU functional units and cores.

A similar analysis can be made for fixed-function raster output operations (ROP). In fact, for GPU hardware that supports OpenGL’s GL_EXT_shader_framebuffer_fetch extension [3], colour blending is already performed by the shader units. At the time of writing this is only supported by mobile GPUs, a fact that also illustrates that replacing dedicated hardware with programmable hardware doesn’t have to be detrimental to performance and power consumption.
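What the shader does with the fetched destination colour is ordinary arithmetic; the sketch below shows classic source-over alpha blending as it might be written once framebuffer fetch moves blending into software. This is an illustrative formula, not any particular shader’s code.

```python
def blend_over(src, dst):
    """Source-over blend: out = src * src.a + dst * (1 - src.a).
    With framebuffer fetch, the shader reads dst back and applies this
    arithmetic itself instead of relying on a fixed-function ROP unit.
    Colours are (r, g, b, a) tuples in [0, 1]."""
    sr, sg, sb, sa = src
    dr, dg, db, _ = dst
    return (sr * sa + dr * (1 - sa),
            sg * sa + dg * (1 - sa),
            sb * sa + db * (1 - sa),
            1.0)
```

Once blending is ordinary shader code, any blend formula becomes possible, not just the modes the ROP hardware happens to offer.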

The ROP units are typically also responsible for anti-aliasing (AA). Interestingly, this functionality was broken in some versions of AMD’s R600 architecture [4], but this did not prevent them from launching the product, as the drivers were able to implement anti-aliasing using the shader units. Note that compute density has since increased significantly, so not having dedicated AA hardware would now have an even lower impact. Moreover, replacing dedicated hardware with more general-purpose compute units allows the ROP units’ die area to be used for many other purposes.

[3] An OpenGL ES extension: http://www.khronos.org/registry/gles/extensions/EXT/EXT_shader_framebuffer_fetch.txt
[4] Reported here: http://www.theinquirer.net/inquirer/news/1046479/ati-r600-manage-pixels-clock


Finally, there has recently been a great deal of successful research into screen-based AA algorithms, which do not require any dedicated hardware [5].

This brings us to a more general discussion of dedicated versus programmable hardware. For decades, computing power has increased at a faster rate than memory bandwidth. It’s easy to see why this will remain a universal truth as long as Moore’s Law holds: with every halving of the semiconductor feature size, four times more logic fits in the same area, but the perimeter can only accommodate two times more wires. Furthermore, the pin count of a chip does not scale linearly with semiconductor technology. The inevitable “memory wall” has been staved off several times by adding more metal layers, by building a hierarchy of caches, and by increasing the effective frequency per pin.
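The arithmetic behind this argument is simple enough to spell out. The sketch below compounds the two scaling rates over several process generations, under the idealized assumption that die area stays fixed.

```python
def scaling_gap(generations):
    """Each feature-size halving: 4x the logic in a fixed area (density
    scales as 1/s^2) but only 2x the wires along the edge (perimeter
    scales as 1/s). Returns (logic, wires, compute-to-bandwidth ratio)."""
    logic, wires = 1, 1
    for _ in range(generations):
        logic *= 4
        wires *= 2
    return logic, wires, logic // wires

# After five generations: 1024x the compute but only 32x the off-chip
# wires, so the compute-to-bandwidth gap has widened by 32x.
```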

Going forward, other techniques will become essential to keep scaling the effective bandwidth. An extensive study [6] points out the best candidates, but probably the most interesting result is what won’t work well: shrinking the core sizes. This is the least effective approach because when the majority of the die space is occupied by storage and communication structures to feed the execution logic, smaller execution logic only leads to a marginal increase in compute density. In essence, this is an argument against both simple programmable cores and fixed-function logic in the long term. It has been one of the driving forces that has enabled graphics hardware to become programmable thus far, and it shows that even more programmability can be achieved in the future at a low cost. Eventually there won’t be a significant advantage in using small GPU-like cores, and every core can instead have a more versatile CPU-like architecture.

Conclusion

TransGaming believes that an eventual convergence of CPU and GPU computing into a revolutionary unified architecture is inevitable. This merger will give developers and end users the best of both worlds: highly parallel programming environments that interface easily with scalar code, full confidence that end users will always see the right pixels on the screen, regardless of drivers or operating systems, and the limitless potential for innovation that comes with software-based approaches [7].

[5] See: http://iryoku.com/aacourse/downloads/Filtering-Approaches-for-Real-Time-Anti-Aliasing.pdf and http://software.intel.com/en-us/articles/mlaa-efficiently-moving-antialiasing-from-the-gpu-to-the-cpu
[6] Rogers et al.: http://www.ece.ncsu.edu/arpers/Papers/isca09-bwwall.pdf

Some of this convergence is already apparent in upcoming hardware such as Intel’s Haswell processor. AMD’s Heterogeneous System Architecture is another proof point on this roadmap, integrating CPU and GPU elements into a single processor, controlled through dynamically generated code. Non-graphics domains are also seeing benefits from similar dynamic, software-based approaches; for example, NVidia’s Tegra 4 processor includes a fully software-controlled radio.

TransGaming’s SwiftShader technology offers a pioneering approach to software rendering, backed by powerful IP. SwiftShader is uniquely suited to providing TransGaming’s customers with the ability to navigate the transition from today’s mixed hardware through to future architectures that we can only speculate about. SwiftShader’s dynamic code generation approach allows TransGaming to implement commonly used graphics APIs such as Direct3D 9 and OpenGL ES 2.0 on a variety of contemporary systems, while paving the way towards a future where a graphics library is simply a set of building blocks that developers make use of on a piece-by-piece basis.

Many challenges remain to be overcome in order for this vision of unified computing to become a reality. TransGaming aims to play a key role in meeting these challenges and in helping to deliver on the resulting innovations.

[7] Some interesting ideas well suited to pure software approaches can be found here: http://graphics.cs.williams.edu/archive/SweeneyHPG2009/TimHPG2009.pdf


Copyright Notice

This document © 2013 TransGaming Inc. All rights reserved.

Trademark Notice

SwiftShader, SwiftAsm, Reactor, and the SwiftShader logo are trademarks of TransGaming, Inc. in Canada and other countries. Other company and product names may be trademarks of the respective companies with which they are associated.

Disclaimer and Limitation of Liability

ALL DESIGN SPECIFICATIONS, DRAWINGS, PROGRAMS, SAMPLES, AND OTHER DOCUMENTS (TOGETHER AND SEPARATELY, “MATERIALS”) ARE BEING PROVIDED “AS IS.” TRANSGAMING MAKES NO WARRANTIES, EXPRESSED, IMPLIED, STATUTORY, OR OTHERWISE WITH RESPECT TO THE MATERIALS, AND EXPRESSLY DISCLAIMS ALL IMPLIED WARRANTIES OF NONINFRINGEMENT, MERCHANTABILITY, AND FITNESS FOR A PARTICULAR PURPOSE.

Information furnished is believed to be accurate and reliable. TransGaming, Inc. assumes no responsibility for the consequences of use of such information or for any infringement of patents or other rights of third parties that may result from its use. No license is granted by implication or otherwise under any patent or patent rights of TransGaming Inc. Specifications mentioned in this document are subject to change without notice. SwiftShader and Reactor technologies are not authorized for use in devices or systems without express written approval or license from TransGaming, Inc.

For More Information: http://transgaming.com/swiftshader

About TransGaming Inc.

TransGaming Inc. (TSX-V: TNG) is the global leader in developing and delivering platform-defining social video game experiences to consumers around the world. From engineering essential technologies for the world’s leading companies, to engaging audiences with truly immersive interactive experiences, TransGaming fuels disruptive innovation across the entire spectrum of consumer technology. TransGaming’s core businesses span the digital distribution of games for Smart TVs, next-generation set-top boxes, and the connected living room, as well as technology licensing for cross-platform game enablement, software 3D graphics rendering, and parallel computing.

Website: http://transgaming.com