8/14/2019 Graphics Processing Units Paper.pdf
http://slidepdf.com/reader/full/graphics-processing-units-paperpdf 1/14
Graphics Processing Units GPUs
Róisín Howard
Bachelor of Engineering in Computer Engineering, Limerick
Abstract—This paper discusses the use of GPUs to carry out
general-purpose computing as well as graphics acceleration.
Support for general-purpose computing on GPUs is discussed
along with details of the internal architectures of GPUs.
The challenges and opportunities presented by this architecture
for high-performance computing are outlined. The evolution
of GPUs and GPU languages has been prompted by the need for
graphics processing in games. These languages are outlined
along with their similarities and differences. GPUs and multi-core
CPUs are also coming to the fore in mobile devices.
Keywords-graphics processing unit; GPU; CUDA; OpenCL;
DirectCompute; OpenGL; Cg; NVidia; Khronos Group; Apple;
Intel; AMD; Microsoft; Tegra; HLSL; GLSL; GPGPU
I. INTRODUCTION
User-programmable Graphics Processing Units (GPUs) for
mainstream computing and scientific use are a hot topic in
computer architecture. These GPUs are specialized processors
that accelerate graphics processing on both desktops and
laptops. OpenCL and CUDA are the main contender languages
available for GPU programming. Shading languages such as Cg,
HLSL and GLSL are available for programming the GPU's
programmable rendering pipeline.
NVidia is one of the main companies behind the GPU
and its programming. The nVidia GeForce GPU card is
compatible with many graphics APIs, among them OpenGL and
Microsoft’s DirectX. NVidia is also branching into the mobile
space with the Tegra chips for smartphones and tablets.
Multi-core CPUs are important for multi-tasking and for
lowering power consumption.
GPGPU is a newer concept: general-purpose computing
on GPUs. Here the GPU is utilized to exploit the data
parallelism available in some applications and to perform
non-graphics processing. The GPU takes on some of the
mathematically intensive tasks, leaving the CPU free to deal
with other user tasks.
II. GRAPHICS PROCESSING UNITS
A. The history of GPUs
Intel made the iSBX 275 Video Graphics Controller
Multimode Board in 1983. This was for industrial systems
based on the Multibus standard. The card accelerated the
drawing of lines, arcs, rectangles and character bitmaps, and
was based on the 82720 Graphics Display Controller. Direct
memory access (DMA) was used to load the framebuffer,
which accelerated it. It was intended that this board would be
used with Intel’s line of Multibus industrial single-board
computer plug-in cards.[1, 2]
Texas Instruments released the first microprocessor
with on-chip graphics capabilities, the TMS34010, in 1986. It
had a very graphics-oriented instruction set and could also run
general-purpose code. The IBM 8514 graphics system,
released in 1987, was one of the first video cards to implement
fixed-function 2D primitives in electronic hardware for IBM
PC compatibles.[1]
B. The purpose of a Graphical Processing Unit
A GPU (Graphics Processing Unit) manipulates and alters
memory so as to accelerate the building of images. A GPU is
primarily used for the computation of three-dimensional (3D)
functions. Lighting effects, transformations and 3D motion are
some of the computations required. These are mathematically
intensive tasks which would put a strain on the CPU.[1-3]
Embedded systems, mobile phones, personal
computers, workstations and game consoles are some devices
in which GPUs are used. Computer graphics can be
manipulated very efficiently by modern GPUs. Due to their
highly parallel structure they are more effective than
general-purpose CPUs for algorithms where large blocks of
data are processed in parallel. Using GPUs frees up more CPU
time for other tasks.[1, 3]
In 1999 the term GPU was popularized by nVidia
who marketed “the world’s first ‘GPU’, or Graphics Processing
Unit, a single-chip processor with integrated transform,
lighting, triangle setup/clipping, and rendering engines that is
capable of processing a minimum of 10 million polygons per
second”[4], the GeForce 256. This GPU is capable of billions
of calculations per second and has over 22 million transistors,
compared to the 9 million found on the Pentium III. The
Quadro is the workstation version, designed for CAD
applications; it can process over 200 billion operations a
second and deliver up to 17 million triangles per second.[1, 3]
C. The benefits of GPUs
GPUs process large blocks of data in parallel because of their
highly parallel structure. This processing could take the form
of fast sorting of large lists, or 2D fast wavelet transforms,
and it makes GPUs more effective than general-purpose CPUs
for such workloads. They are used alongside CPUs for this
purpose: by performing the mathematically intensive tasks the
GPU relieves the strain that would otherwise be put on the
CPU, which is freed up to perform other tasks.[1, 5]
III. COMPUTE UNIFIED DEVICE ARCHITECTURE
NVidia developed the Compute Unified Device Architecture,
CUDA, for graphics processing. CUDA is the computing
engine in nVidia GPUs. By harnessing the power of the GPU,
an increase in computing performance is facilitated. CUDA
shares a range of computational interfaces with two
competitors, the Khronos Group and Microsoft, whose
architectures are OpenCL and DirectCompute respectively.[5, 6]
Access to the virtual instruction set and memory of
the parallel computational elements in CUDA GPUs is given
to developers through CUDA. Using CUDA, the latest nVidia
GPUs become accessible for computation in the way CPUs are.
However, GPUs have a parallel throughput architecture that
emphasises executing many threads slowly, unlike CPUs,
which are designed to execute a single thread very quickly.
Solving general-purpose problems on GPUs in this way is
known as GPGPU.[5, 7]
A. Strengths of CUDA
There are several advantages of CUDA over traditional
GPGPU approaches using graphics APIs. CUDA offers full
support for integer and bitwise operations, including integer
texture lookups. Scattered reads are also implemented, meaning
that code can read from arbitrary addresses in memory. CUDA
has the advantage of faster downloads and readbacks to and
from the GPU. A shared memory region is also offered;
memory can be shared amongst threads. As a result a
user-managed cache can be availed of, which enables higher
bandwidth than is possible using texture lookups. CUDA also
supports a wide range of libraries and tools which are of use to
developers, Figure 1.[5, 8]
Figure 1. CUDA Libraries and Tools[8]
B. CUDA’s weaknesses
However, there are also some limitations. Unlike OpenCL,
CUDA-enabled GPUs are only available from nVidia. A
performance hit due to system bus bandwidth and latency may
be incurred by copying between host and device memory.
Asynchronous memory transfers handled by the GPU’s DMA
engine can partially alleviate this. Valid C/C++ may sometimes
be flagged and prevented from compiling because of the
optimisation techniques the compiler must employ to cope
with limited resources.[5]
C. CUDA programming model
In the CUDA programming model the CPU is known as the
host and the GPU is the compute device, acting as a
coprocessor to the CPU. Data needs to be shared between the
two devices as each has its own memory. The kernel is a
program that runs on the GPU; when it is launched it is
executed as an array of parallel threads. This execution is
shown in Figure 2. A block can only contain a certain number
of threads, so threads are grouped into blocks and the blocks
form a grid.[9, 10]
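The thread hierarchy above can be emulated serially to show how each thread finds its data element. The index formula blockIdx.x * blockDim.x + threadIdx.x is CUDA's standard global-index computation; the serial launcher and the vector-add kernel body below are purely illustrative.

```python
# CPU sketch of a CUDA kernel launch: a grid of thread blocks,
# each block holding block_dim threads. The global-index formula
# mirrors CUDA's blockIdx.x * blockDim.x + threadIdx.x.
def launch_kernel(kernel, grid_dim, block_dim, *args):
    """Emulate kernel<<<grid_dim, block_dim>>>(*args) serially."""
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            kernel(block_idx, block_dim, thread_idx, *args)

def vec_add(block_idx, block_dim, thread_idx, a, b, out):
    i = block_idx * block_dim + thread_idx  # global thread index
    if i < len(out):                        # guard: the grid may overshoot n
        out[i] = a[i] + b[i]

n = 10
a, b, out = list(range(n)), list(range(n)), [0] * n
block_dim = 4
grid_dim = (n + block_dim - 1) // block_dim  # ceil(n / block_dim) blocks
launch_kernel(vec_add, grid_dim, block_dim, a, b, out)
```

On the GPU every iteration of the two loops would run as its own hardware thread; the guard on `i` is needed because the grid is rounded up to whole blocks.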
Figure 2. CUDA kernel threads[10]
D. CUDA architecture
Figure 3 shows an example of the CUDA architecture. Here it
can be seen that OpenCL and DirectCompute are supported
on the CUDA platform for nVidia hardware; along with CUDA
they form the device-level API support offered by nVidia.
Language integration is also possible: the CUDA run-time
application can be in C, C++, Fortran, Python, Java, etc. The
CUDA architecture consists of the parallel compute engines
inside nVidia GPUs (1). It also contains OS kernel-level
support for hardware initialization and configuration (2). The
user-mode driver, which provides a device-level API for
developers (3), and a PTX instruction set architecture (4) for
parallel computing kernels and functions are also shown.[11]
Figure 3. CUDA Architecture[11]
IV. OPEN COMPUTING LANGUAGE
Open Computing Language, OpenCL, is a cross-platform,
parallel programming framework. It is the first truly open and
royalty-free programming standard for general-purpose
computations on heterogeneous systems. OpenCL provides a
uniform programming environment for software developers to
write efficient, portable code for devices using a diverse mix
of multi-core CPUs, GPUs, and other parallel processors such
as DSPs. OpenCL includes a language for writing kernels and
APIs (Application Programming Interfaces) that are used to
identify and then control the platforms. OpenCL provides
parallel computing using task-based and data-based
parallelism.[12]
OpenCL is maintained by the non-profit technology
consortium Khronos Group[13]. It has been adopted by
Intel[14], Advanced Micro Devices (AMD)[15], nVidia[16],
ARM Holdings[17] and IBM[18]. OpenCL gives any
application access to the graphics processing unit for non-
graphical computing, extending the power of the GPU beyond
graphics.[12]
Apple Inc. initially developed OpenCL and holds the
trademark[19]. OpenCL was refined into an initial proposal in
collaboration with technical teams at nVidia, Intel, AMD and
IBM. The proposal was submitted to the Khronos Group in
2008. The goal was to have a cross-platform environment for
general-purpose computing on GPUs. Representatives from
software companies and from CPU, GPU and embedded-processor
vendors joined together to form the Khronos Compute Working
Group to finish the technical details of the specification for
OpenCL 1.0. Once the specification was reviewed by Khronos
members and approved, it was released to the public by the end
of 2008. The world’s first conformant GPU implementation
of OpenCL for both Windows and Linux was shipped in June
2009.[12, 16]
A. Strengths of OpenCL
OpenCL is an open and royalty-free language, and the fact that
code is portable across devices is a big advantage. It is a
C-like language for heterogeneous devices. It can be used on
parallel CPU architectures and it is not vendor specific.
OpenCL provides a common language for writing
computational “kernels”, and a common API for managing
execution on target devices. OpenCL implementations
already exist for nVidia and AMD GPUs and for x86 CPUs.
B. Weaknesses of OpenCL
OpenCL has some limitations. It is a low-level API, which
means that developers are responsible for a lot of plumbing,
with many objects and handles to keep track of. They are also
responsible for thread safety; certain types of multi-accelerator
codes are much more difficult to write than in CUDA. There
is a need for OpenCL middleware and libraries, such as the
libraries and tools available for CUDA. OpenCL code must
deal with hardware diversity. Many features are optional and
are not supported by certain devices. Due to the diversity of
hardware on which OpenCL must operate, a single kernel is
unlikely to achieve peak performance on all device
types.[20]
C. OpenCL architecture
The main part of the OpenCL framework handles passing data
to and from the processing environment, and compiles the
OpenCL code. The main stages of execution are shown below
in Figure 4. The key parts OpenCL handles are setting up and
coordinating the host environment (with n processors,
including multiple GPUs) so that it can then distribute the data
and compile the code efficiently. OpenCL then has control of
each process. This means it can track the progress of the code
until the completion of all the desired operations, when either
more operations can be performed or the data from the GPU
can be handed back to main memory on the CPU. OpenCL
depends on the driver provided by the hardware in order for it
to be supported.[21]
Figure 4. OpenCL Architecture[21]
D. OpenCL programming model
Similarly to CUDA, OpenCL has kernels. One or more
kernels make up an application or program that runs on the
GPU; when a kernel is launched it is executed as an array of
parallel work-items. Work-groups contain the array of parallel
work-items. Kernels run over a global dimension index range
known as an NDRange, shown in Figure 5.[22-24]
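The NDRange decomposition can be sketched in the same serial style. With a zero global offset, OpenCL defines get_global_id(0) as get_group_id(0) * get_local_size(0) + get_local_id(0); the squaring kernel below is a hypothetical example.

```python
# CPU sketch of a 1D OpenCL NDRange: work-items grouped into
# work-groups. global_id = group_id * local_size + local_id,
# matching get_global_id(0) when the global offset is zero.
def run_ndrange(kernel, global_size, local_size, *args):
    assert global_size % local_size == 0  # OpenCL 1.x requirement
    num_groups = global_size // local_size
    for group_id in range(num_groups):
        for local_id in range(local_size):
            global_id = group_id * local_size + local_id
            kernel(global_id, local_id, group_id, *args)

def square(global_id, local_id, group_id, src, dst):
    dst[global_id] = src[global_id] ** 2  # illustrative per-item work

src = list(range(8))
dst = [0] * 8
run_ndrange(square, 8, 4, src, dst)  # 8 work-items in 2 work-groups of 4
```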
Figure 5. OpenCL NDRange[24]
V. OPENCL VERSUS CUDA
Since the inception of OpenCL there have been many
comparisons between it and CUDA. A correct implementation
of OpenCL for the target architecture performs “no worse”
than CUDA. Portability is the key feature of OpenCL. It is
not vendor specific like CUDA, which only runs on nVidia
devices. This has both advantages and disadvantages
associated with it.[5, 12]
CUDA is limited to nVidia hardware and is thus more
acutely aware of the platform on which it will be executing.
More mature compiler optimizations and execution techniques
are provided as a result. This gives CUDA the upper hand, as
OpenCL code needs to be prepared to deal with much greater
hardware diversity, and GPU-specific technologies cannot be
used directly by the programmer.[5, 12]
CUDA has a much larger userbase and codebase than
OpenCL due to its maturity. The developer can add
optimizations manually to the kernel code. OpenCL has less
mature compilation techniques. As the OpenCL toolkit
matures, the gap between it and the CUDA toolkit will
narrow.[5, 12]
The Scalable Heterogeneous Computing Benchmark
Suite (SHOC) was used to compare CUDA and OpenCL
kernels on nVidia GPUs. According to the tests, CUDA
performs better on nVidia GPUs than OpenCL. The tests
measure the number of floating point operations per second,
in GFLOPS, for the kernels. The graph of results is shown in
Figure 6.[25]
Figure 6. CUDA vs OpenCL on nVidia GPU[25]
A. Similarities between CUDA and OpenCL
The programming model used by CUDA is similar to that
used by OpenCL. Figure 7 shows a comparison of terms for
the data parallelism models. In both models the CPU is the
host, and the kernels that are executed form the application,
which in CUDA terms contains parallel thread blocks in a
grid.[23]
Figure 7. Mapping of terms for data parallelism models - OpenCL to CUDA[23]
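The standard vocabulary mapping between the two models can be written out directly; the sample index values below are arbitrary, chosen only to show that the two global-index formulas coincide.

```python
# The conventional OpenCL-to-CUDA term mapping for the data
# parallelism models.
opencl_to_cuda = {
    "kernel":     "kernel",
    "host":       "host",
    "NDRange":    "grid",
    "work-group": "thread block",
    "work-item":  "thread",
}

# The index queries correspond as well: OpenCL's
#   get_group_id(0) * get_local_size(0) + get_local_id(0)
# plays the role of CUDA's
#   blockIdx.x * blockDim.x + threadIdx.x.
group_id, local_size, local_id = 2, 64, 5   # arbitrary sample values
opencl_global_id = group_id * local_size + local_id
cuda_global_id = 2 * 64 + 5                 # same computation, CUDA names
```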
B. Differences between CUDA and OpenCL
CUDA is hardware specific whereas OpenCL is not vendor
specific. Due to this fact CUDA knows the hardware on
which it runs and can be optimized for it. OpenCL has to be
adapted for each different hardware vendor and may not
perform as well as CUDA as a result.[5, 12]
OpenCL is an open language, very portable and
maintained by the Khronos Group; CUDA is not an open
language. CUDA has been around for longer than OpenCL,
thus it has a large code and user base; it is a more mature
language. OpenCL’s compilation techniques are less mature
and the programmer needs to do a lot more low-level
programming than with CUDA.[5, 12]
VI. ALTERNATIVE LANGUAGES
A. An overview of DirectCompute
Microsoft developed DirectCompute. This is an API that
supports GPGPU on Microsoft Windows Vista and Windows
7. DirectCompute is part of the DirectX collection of APIs.
Although it was initially released with the DirectX 11 API, it
runs on both DirectX 10 and DirectX 11 GPUs.[26-28]
According to nVidia’s DirectCompute programming
guide, DirectCompute is a new type of shader which exposes
the compute functionality of the GPU. This compute shader
has much more general-purpose processing capabilities than
a normal shader.[29]
With DirectCompute there does not have to be a fixed
mapping between the data being processed and the threads
doing the processing. This means that one thread can process
one or many data elements, and the number of threads used
to perform the computation is controlled directly by the
application.[29]
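This flexible thread-to-data mapping can be sketched with a strided loop, a common compute-shader idiom in which each thread walks the data with a stride equal to the total thread count; the doubling workload here is purely illustrative.

```python
# Sketch of one thread processing many data elements: the
# application picks num_threads independently of the data size,
# and each thread covers every num_threads-th element.
def strided_thread(thread_id, num_threads, data, out):
    for i in range(thread_id, len(data), num_threads):
        out[i] = data[i] * 2  # illustrative per-element work

data = list(range(10))
out = [0] * 10
num_threads = 3               # deliberately fewer threads than elements
for tid in range(num_threads):
    strided_thread(tid, num_threads, data, out)
```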
DirectCompute has thread group shared memory
which allows groups of threads to share data, and can reduce
bandwidth requirements significantly. Similarly to other
Compute APIs, Compute Shaders do not directly support any
fixed-function graphics features with the exception of
texturing.[29]
B. Advantages of DirectCompute
There are several advantages of DirectCompute over other
GPU computing solutions. Direct3D is integrated with
DirectCompute which means it has efficient interoperability
with D3D graphics resources. All texture features are
included but LOD must be specified explicitly. The HLSL
shading language is used by DirectCompute. A single API is
provided across all graphics hardware vendors on Windows
platforms; as a result there is some guarantee of
consistent results across different hardware.[29]
C. An overview of OpenGL
The Open Graphics Library, OpenGL, is an API for GPUs.
The procedures and functions used to specify the objects and
operations needed to produce 3D images are contained in this
interface. Silicon Graphics Incorporated designed OpenGL in
1992, and the Khronos Group now manages it.[30-33]
OpenGL is designed to be window-system and
operating-system independent, and it is also network-transparent.
High-performance, visually compelling graphics software
applications can be created using OpenGL on PCs,
workstations or supercomputers. It is used in applications
such as CAD and video games.
All the features of the latest graphics hardware are
exposed by OpenGL. Shown in Figure 8 is the OpenGL
8/14/2019 Graphics Processing Units Paper.pdf
http://slidepdf.com/reader/full/graphics-processing-units-paperpdf 7/14
client-server model. This model guarantees consistent
presentation on any compliant hardware and software
configuration.[30-34]
Figure 8. OpenGL client-server model[34]
D. Advantages of OpenGL
Because OpenGL is a C-based API it is extremely
portable and widely supported. OpenGL provides functions
for an application to generate 2D or 3D images and allows
these rendered images to be copied to the application’s own
memory or displayed on the screen. Every implementation of
OpenGL adheres to the OpenGL specification and must pass
a set of conformance tests, so implementations are reliable.
Similarly to OpenCL, OpenGL’s specification is controlled by
the Khronos Group. This guarantees industry acceptance, as
the members of this industry consortium include many of the
major companies in the computer graphics industry.[30-34]
VII. SHADERS
A shader is a computer program that is used to calculate
rendering effects on graphics hardware. A shader is used to
program the GPU programmable rendering pipeline.
Programming languages adapted to map onto shader
programming are known as shading languages. Instructions
are sent to the GPU by the CPU in the form of a compiled
shading language program.[35, 36]
The geometry is transformed and lighting calculations are
performed within the vertex shader. Some changes to the
geometry in the scene are performed if a geometry shader is
present in the GPU. The calculated geometry is subdivided
into triangles, which are then broken down into pixel quads.
Transformation of 3D data into useful 2D data for display by
the frame buffer is done by the graphics pipeline using the
above steps from the shader program.[35, 36]
The GPU is able to function as a stream processor
since all fragments can be thought of as independent, making
the graphics pipeline well suited to the rendering process.
All stages of the pipeline can be used simultaneously
for different vertices or fragments; this independence allows
the graphics processor to use parallel processing units. By
using parallel processing units, multiple vertices or fragments
can be processed in a single stage of the pipeline at the same
time.[35, 36]
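This independence means a shader stage behaves like a pure function mapped over a stream of inputs, which is exactly what lets the hardware process many of them at once. The brightness-halving fragment shader below is a hypothetical example of such a function.

```python
# A shader stage as a pure, order-independent function mapped
# over a stream of fragments.
def fragment_shader(frag):
    # Hypothetical effect: halve the brightness of an RGB fragment.
    r, g, b = frag
    return (r // 2, g // 2, b // 2)

fragments = [(255, 128, 64), (10, 20, 30)]
# map() applies the shader to each fragment independently; on a
# GPU these applications would run on parallel processing units.
shaded = list(map(fragment_shader, fragments))
```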
A. OpenGL shading language
The OpenGL shading language is known as GLSL. It is a
high-level shading language designed to allow application
programmers to express the processing that occurs at the
programmable points of the OpenGL pipeline. Vertex and
fragment processing are unified by GLSL in a single
instruction set, which allows branches and conditional loops.
GLSL has five shader stages: vertex, geometry, fragment,
tessellation control, and tessellation evaluation.[37, 38]
OpenGL has the benefit of cross-platform
compatibility on multiple operating systems. Shaders that are
written can be used on any hardware vendor’s graphics card,
provided GLSL is supported. Each hardware vendor can create
code optimised for their particular graphics card’s architecture
because the GLSL compiler is included in their driver.[37, 38]
B. Cg programming language
NVidia developed Cg (C for graphics), a high-level
shading language. It was developed in close collaboration
with Microsoft for programming pixel and vertex shaders.
Cg is not a general programming language; it is only suitable
for GPU programming. Microsoft has a similar shading
language called HLSL.[39]
Cg features API independence, and a variety of free
tools to improve asset management are available. It was
designed for easy and efficient production-pipeline integration.
Connectors are special data structures used in Cg to link the
various stages of processing. They define the input from the
application to the vertex processing stage and the attributes to
be used as inputs to fragment processing.[39]
C. DirectX High-Level Shader Language
HLSL is the high-level shader language developed by
Microsoft for DirectX and Xbox. It is a C-style shader
language supported by DirectX and the Xbox game consoles.
Shaders for the Direct3D pipeline can be created using HLSL.
There are three shader stages in HLSL: the vertex shader,
the geometry shader and the pixel shader.[40, 41]
VIII. THE EVOLUTION OF GPUS
GPUs are extensively used in the computer games market.
This is a booming market and it drives the sale of GPUs.
This means that the future of the GPU is greater than that of
the general-purpose CPU. The CPU will still remain the
main processor, but there is much more potential for expanding
the computing experience using the GPU. The GPU is much
better at parallelism than the CPU, thus complex problems,
both graphical and non-graphical, can be more easily solved
by the GPU.[42]
Due to the high volumes of GPUs sold to PC
gamers, GPUs are relatively inexpensive. The trade-off of
having high-cost special-purpose hardware is thus less of a
factor. Whereas CPU performance, following Moore’s Law,
doubles roughly every 18 months, GPU performance has been
doubling roughly every 6 months. This makes it impossible
for CPU manufacturers to keep up with the rapid growth of
GPU advancement; it would prove too expensive to
manufacture a new CPU every time a new GPU chip is
released. Figure 9 shows how GPUs are outpacing Moore’s
Law while CPUs are being left behind. “The graphical
processing unit is visually and visibly changing the course of
general purpose computing”[43].[42, 44]
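Compounding these two growth rates over a hypothetical three-year window shows how quickly the gap opens.

```python
# Compounding the quoted growth rates over 36 months:
# doubling every 6 months vs doubling every 18 months.
months = 36
gpu_growth = 2 ** (months / 6)    # 2**6 = 64x in three years
cpu_growth = 2 ** (months / 18)   # 2**2 = 4x in three years
ratio = gpu_growth / cpu_growth   # the GPU pulls ahead 16x
```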
Figure 9. Comparison of GPUs and CPUs[44]
GPU hardware architecture is moving from a single-core
hardware pipeline implementation for graphics
processing to highly parallel and programmable cores for more
general-purpose computing. By adding more programmability
and parallelism, the GPU core architecture is evolving
towards a general-purpose CPU-like core.[45]
IX. GEFORCE
GeForce is a brand of GPUs designed by nVidia. The
GeForce logo is shown below in Figure 10. There have been
over 10 generations of the GeForce design. The original
release was the GeForce 256 in 1999. The first GeForce
products were intended for the high-margin PC gaming
market and were designed to be used on add-on graphics
boards as discrete GPUs. All tiers of the PC gaming market
were covered in subsequent designs. NVidia’s embedded
application processors, designed for mobile phones, include
the most recent GeForce technology.[46, 47]
Figure 10. GeForce logo[47]
A. The GeForce 6 Series
The sixth generation of GeForce is the GeForce 6 Series. It
was released in 2004. This series can have a 4, 8, 12, or 16
pixel-pipeline GPU architecture. It contains an on-chip video
processor with full MPEG-2 encoding and decoding, and
advanced adaptive de-interlacing called PureVideo. This
design also has High Precision Dynamic Range technology
and 8 times more shading performance than previous designs.
It also supports DirectX 9 Shader Model 3.0 and includes
OpenGL 2.0 optimizations and support.[47-49]
B. Architecture of the GeForce 6 Series
The GPU memory interface has an available bandwidth of
35 GBps. The CPU memory interface has 6.4 GBps of available
bandwidth and the PCI Express bus has 8 GBps. This shows
that there is a vast amount of internal bandwidth available on
the GPU. More dramatic performance improvements can be
made by making sure that algorithms running on the GPU take
advantage of this bandwidth.[50]
Figure 11 shows the block diagram of the GeForce 6
Series architecture. It shows the graphics process by
which input arrives from the CPU (host) and is output as
pixels drawn to the frame buffer. The CPU writes a command
stream which sets and modifies state, references the vertex
and texture data, and sends rendering commands. These
states, commands and vertices flow down through the block
diagram, where they are used in subsequent pipeline
stages.[50]
Figure 11. GeForce 6 Series Architecture[51]
The vertex shaders/processors, shown in Figure 12, allow a
program to be applied to each vertex in the object.
Transformations, skinning and other per-vertex operations are
performed here. All operations in this processor are done in
32-bit floating-point (fp32) precision. There can be up to six
vertex units on high-end models, and there may be two on
low-end models.[50]
The vertex programs can fetch texture data. The
texture cache is shared between the fragment processor and
the vertex processor due to the fact that the vertex processor
can perform texture access. There is also a vertex cache to
store all data before and after the vertex processor.[50]
Primitives are points, lines or triangles, and the vertices
are grouped into these primitives. Three blocks perform the
per-primitive operations: cull, clip and setup. Primitives
that aren’t visible are removed (cull), primitives that intersect
the view frustum are clipped (clip), and edge and plane
equation setup is performed on the data for the rasterization
process (setup).[50]
Figure 12. GeForce 6 Series Vertex Processor[50]
The calculation of which pixels are covered by each
primitive is done in the rasterization block, which uses the
z-cull block to discard pixels. A fragment then passes through
the fragment processor, where tests are performed on it.
Once it passes the tests, it carries depth and colour
information to a pixel in the frame buffer.[50]
The fragment processor and texel pipeline is also
known as the pixel shader, Figure 13. This unit applies a
shader program to each fragment independently. There can be
a varying number of fragment pipelines on the GeForce 6
Series GPUs. Texture data is cached on-chip, similarly to the
vertex processor. This reduces bandwidth requirements and
increases performance.[50]
Figure 13. GeForce 6 Series Fragment Processor and Texel Pipeline[50]
Quads are squares of four pixels. The texture and
fragment-processing units operate on quads. This allows
direct computation of derivatives for calculating texture level
of detail. The texture unit fetches data from memory for the
fragment processor and returns it in fp16 or fp32 format.
The texture unit can read a 2D or 3D array of data, and 16-bit
floating-point filtering is supported by this design.[50]
There are two fp32 shader units per pipeline in the
fragment processor. Before the fragments re-circulate through
the pipeline to execute the next set of instructions, they are
passed through both shader units and the branch processor.
This happens once every clock cycle.[50]
Once the fragments have passed through the
fragment-processing unit they are sent to the z-compare and
blend units in the order in which they were rasterized. Stencil
operations, alpha blending, depth testing and the final colour
write to the target surface are performed in these units.[50]
The memory system is divided among four DRAMs, all of
which are independent. The memory subsystem can operate
efficiently by having smaller, independent memory partitions,
regardless of whether small or large blocks of data are
transferred. Streaming 32-byte memory accesses near the
physical limit of 35 GBps is possible because the four
independent memory partitions give the GPU a wide, flexible
memory subsystem of roughly 256 bits.[50]
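The partition arithmetic can be illustrated as follows. Four independent partitions making up a roughly 256-bit subsystem implies 64 bits each; the round-robin interleaving scheme in the sketch is a simplifying assumption used for illustration, not the documented address mapping.

```python
# Four independent partitions form the ~256-bit memory subsystem
# (4 * 64 = 256). Assumed sketch: consecutive 32-byte accesses
# rotate across partitions so all four can work in parallel.
NUM_PARTITIONS = 4
ACCESS_BYTES = 32  # the 32-byte access granularity quoted above

def partition_for(address):
    """Hypothetical round-robin mapping of an address to a partition."""
    return (address // ACCESS_BYTES) % NUM_PARTITIONS

addresses = [i * ACCESS_BYTES for i in range(8)]
hits = [partition_for(a) for a in addresses]
# A streaming access pattern touches partitions 0,1,2,3,0,1,2,3,
# keeping every partition busy.
```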
C. Challenges and Opportunities for High-Performance
Computing
To achieve optimal performance from these devices there are
some techniques that can be applied. The z-cull block, shown
in Figure 11, is used to discard pixels. It avoids work that
doesn’t contribute to the final result. To conclude early that a
computation doesn’t contribute, the z-values for all objects
can be rendered first, before shading. For example, in
general-purpose computing the z-cull can be used to select
which parts are still active in the computation: it will cull the
computational threads that have already been resolved.[50]
The texture unit’s math can be exploited when loading data.
This unit filters data before it is returned to the fragment
processor, reducing the total data needed by the shader.
The total work done by the shader can be reduced if
the texture unit’s bilinear filter is used more frequently.
Similarly, when performing compares, work can be offloaded
from the processor by using the filtering support in
shadow buffering; the result can then be filtered.[50]
Branching can be very beneficial, provided the work it
avoids outweighs its cost. The fragment processor operates on
many fragments simultaneously. Fragments in a group may
take different branches; in this case both branches have to be
executed by the fragment processor. This can reduce the
performance of branching in programs. Where branching is
not an effective choice, conditional writes can be used
instead.[50]
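The divergence penalty can be modelled simply: a group pays one pass through the pipeline if all of its fragments agree on a branch, and two passes if they diverge. The group size and the cost model here are simplified assumptions for illustration.

```python
# Simplified model of branch divergence: fragments execute in
# lockstep groups, and a group that disagrees on a branch must
# execute both paths.
def passes_for(branch_mask):
    """Cost in passes for one group: 1 if all fragments take the
    same branch, 2 if the group diverges."""
    return 1 if len(set(branch_mask)) == 1 else 2

uniform  = [True, True, True, True]   # all fragments take one branch
diverged = [True, False, True, True]  # one fragment takes the other
```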
A full-speed fp16 normalize instruction, executed in parallel,
is supported by this design. Using fp16 intermediate values
reduces the internal storage and datapath requirements.
Instead of using fp32 intermediate values everywhere, full
precision can be saved for cases where it is needed;
performance is increased by using fp16 intermediate
values.[50]
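The storage saving from fp16 intermediates can be shown with Python's struct module, which supports the half-precision format via the 'e' format character.

```python
import struct

# fp16 intermediates halve storage and datapath width relative
# to fp32.
fp16_bytes = struct.calcsize("e")  # half precision: 2 bytes
fp32_bytes = struct.calcsize("f")  # single precision: 4 bytes

# The trade-off is reduced precision; 1.5 happens to be exactly
# representable in fp16, so it round-trips unchanged.
packed = struct.pack("e", 1.5)
(roundtrip,) = struct.unpack("e", packed)
```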
There is a fixed amount of register space per fragment, which lets the shader pipeline keep hundreds of fragments in flight. If this register space is exceeded, fewer fragments remain in flight, reducing the latency tolerance for texture fetches and thus adversely affecting performance. If the register file uses fp32x4 values exclusively, it may run out of read and write bandwidth to feed all units; reading fp16x4 values leaves enough bandwidth to keep all units busy.[50]
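The relationship between per-fragment register footprint and fragments in flight is a simple division, which can be sketched as a back-of-the-envelope calculation (the register-file size and per-fragment register count below are hypothetical, not the actual figures for this design):

```python
def fragments_in_flight(register_file_bytes, regs_per_fragment, bytes_per_reg=16):
    """How many fragments fit in a fixed-size register file.

    bytes_per_reg defaults to an fp32x4 register (4 components x 4 bytes);
    fp16x4 halves this to 8 bytes, doubling the fragments in flight.
    """
    return register_file_bytes // (regs_per_fragment * bytes_per_reg)

# Hypothetical 64 KB register file, 4 registers per fragment:
fp32 = fragments_in_flight(64 * 1024, regs_per_fragment=4)
fp16 = fragments_in_flight(64 * 1024, regs_per_fragment=4, bytes_per_reg=8)
```

Halving the register width doubles the fragments that can be kept in flight, which is exactly the latency-hiding headroom the text describes.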
This new design delivers extraordinary new levels of performance, streamlining the creation of stunning effects in games and other real-time 3D applications. Thanks to the new architecture, the hardware power needed to create such detailed and vibrant images will not be too demanding on the PC.[52]
The new superscalar shader architecture in this design doubles the number of operations expected per cycle, giving a significant performance increase. Full 32-bit floating-point precision is provided to deliver higher-quality images, so developers can implement stunning visual effects; there is no compromise of speed for quality in this design.[52]
X. COMPARISON OF GPU PROGRAMMING MODELS TO
DESKTOP MULTICORES
A. The difference between CPU and GPU
The CPU is the central processing unit; it is the brain of the computer system. The GPU is a graphics processing unit: a complementary processor which handles the computationally intensive graphics processing, while the rest of the application still runs on the CPU. From a user's perspective the application runs faster, because the processing power of the GPU boosts performance. Using the GPU as a co-processor to the CPU in this way is known as hybrid computing. Graphics processing is inherently parallel, so it can be easily parallelized and accelerated.[2, 53]
B. How a multicore system differs from a GPU
A CPU is designed with a few cores; it can consist of 4 to 8
cores. These cores can handle a few software threads which
can be exploited in an application program. Figure 14 shows
an example of a CPU with multiple cores. Compared to singlecore predecessor, multi-core CPUs can operate lower
first mobile super chip…with the first mobile dual-core CPU”[58]. The new Tegra 3, which has quad-core processing, has the 4-PLUS-1™ battery-saver technology, which provides great mobile performance.[57, 58]
LG, Motorola and Samsung are among the makers of phones powered by Tegra 2[59]. There is a long list of tablets powered by Tegra 2; the most popular among these are the Samsung Galaxy Tablet, Sony Tablet and Toshiba Thrive[60].
The challenges that HD video playback, video streaming, 3D gaming and the like pose for power consumption and performance have previously been faced by desktop and notebook CPUs. Now mobile application processors face the same challenge, which stretches the capabilities of current single-core mobile processors. To increase their performance while staying within mobile power budgets, mobile processors need to be multi-core.[54, 61]
The Tegra 2 was designed to harness the power of Symmetric Multiprocessing, which delivers higher performance and lower power consumption. It offers faster Web page loading times, higher-quality game play with faster multitasking, and tremendous battery life improvements.[58, 61]
XII. DISCUSSION
GPUs are more effective than general-purpose CPUs at processing large blocks of data in parallel, and they are used alongside CPUs for this purpose. By performing the mathematically intensive tasks, the GPU relieves the strain that would otherwise be put on the CPU, freeing it to perform other tasks.[1, 5]
CUDA and OpenCL are the main contender languages for GPU programming, along with DirectCompute from Microsoft. When implemented correctly for the target architecture, OpenCL performs "no worse" than CUDA. CUDA is the more mature language thanks to its code and user base; OpenCL lacks the middleware tools and libraries that CUDA has. As the OpenCL toolkit matures, the gap between it and the CUDA toolkit will close.[5, 12, 26]
The GPU has begun to evolve from a single-core, fixed-function hardware pipeline built purely for graphics rendering into a set of highly parallel, programmable cores for more general-purpose computation. The architecture of many-core GPUs is starting to look more and more like that of multi-core, general-purpose CPUs.[45]
A single-core CPU runs at higher clock frequencies and voltages than a multi-core CPU, and takes longer to complete a given task. Distributing the workload across multiple CPU cores, known as workload sharing, allows each core of a multi-core CPU to run at lower frequencies and voltages while completing multi-threaded tasks. Because of these lower operating frequencies and voltages, each core consumes significantly less power, and the processor offers higher performance per watt than a single-core CPU.[61]
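This argument follows from the classic CMOS dynamic-power model, P = C·V²·f: lowering the clock frequency permits a lower supply voltage, and power falls with the square of voltage. A sketch with purely illustrative numbers (not measurements of any real CPU):

```python
def dynamic_power(capacitance, voltage, frequency):
    """Classic dynamic-power model for CMOS logic: P = C * V^2 * f."""
    return capacitance * voltage ** 2 * frequency

# One core at full speed versus four cores at half frequency and a
# reduced voltage sharing the same workload (hypothetical values):
single = dynamic_power(capacitance=1.0, voltage=1.2, frequency=2.0e9)
quad = 4 * dynamic_power(capacitance=1.0, voltage=0.8, frequency=1.0e9)
```

With these numbers the four half-speed cores deliver twice the aggregate clock throughput of the single core while drawing less total power, which is the higher performance per watt the text describes.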
NVidia developed the Tegra to harness the power of multi-core CPUs to deliver higher performance and lower power consumption on mobile devices. The result is tremendous battery life improvements, along with extreme multitasking features, a better game-playing experience and faster Web browsing.[54, 61]
REFERENCES
[1] Wikipedia. (2012, 4th April 2012). Graphics processing unit. Available: http://en.wikipedia.org/wiki/Graphics_processing_unit
[2] nVidia. (2012, 12th April 2012). What is GPU computing? Available: http://www.nvidia.com/object/GPU_Computing.html
[3] TechTerms. (2012, 6th April 2012). GPU. Available: http://www.techterms.com/definition/gpu
[4] nVidia. (2012, 6th April 2012). GeForce 256. Available: http://www.nvidia.com/page/geforce256.html
[5] Wikipedia. (2012, 5th April 2012). CUDA. Available: http://en.wikipedia.org/wiki/CUDA
[6] nVidia. (2012, 6th April 2012). What is CUDA. Available: http://developer.nvidia.com/what-cuda
[7] nVidia. (2012, 6th April 2012). CUDA FAQ. Available: http://developer.nvidia.com/cuda-faq
[8] J. Cohen. (2009, 13th April 2012). CUDA Libraries and Tools. Available: http://gpgpu.org/wp/wp-content/uploads/2009/11/SC09_CUDA_Tools_Cohen.pdf
[9] DAAC. (2009, 13th April 2012). CUDA programming model. Available: http://www.visualization.hpc.mil/wiki/CUDA_Programming_Model
[10] M. F. Ahmed. (2010, 6th April 2012). CUDA - Computer Unified Device Architecture. Available: http://mohamedfahmed.wordpress.com/2010/05/03/cuda-computer-unified-device-architecture/
[11] nVidia. (2009, 13th April 2012). NVidia CUDA Architecture. Available: http://developer.download.nvidia.com/compute/cuda/docs/CUDA_Architecture_Overview.pdf
[12] Wikipedia. (2012, 5th April 2012). OpenCL. Available: http://en.wikipedia.org/wiki/OpenCL
[13] Khronos. (2012, 5th April 2012). OpenCL. Available: http://www.khronos.org/opencl/
[14] Intel. (2012, 5th April 2012). Intel OpenCL SDK. Available: http://software.intel.com/en-us/articles/vcsource-tools-opencl-sdk/
[15] AMD. (2011, 5th April 2012). OpenCL Zone. Available: http://developer.amd.com/zones/openclzone/Pages/default.aspx
[16] nVidia. (2012, 5th April 2012). OpenCL. Available: http://developer.nvidia.com/opencl
[17] ARM. (2012, 5th April 2012). Khronos Standards. Available: http://www.arm.com/community/multimedia/standards-apis.php
[18] IBM. (2012, 5th April 2012). OpenCL. Available: http://researcher.ibm.com/view_project.php?id=1835
[19] Apple. (2012, 5th April 2012). OpenCL. Available: https://developer.apple.com/softwarelicensing/agreements/opencl.html
[20] C.-R. Lee. (2010, 13th April 2012). CUDA Programming. Available: http://www.cs.nthu.edu.tw/~cherung/teaching/2010gpucell/CUDA06.pdf
[21] B. Alun-Jones. (2010, 13th April 2012). A Quick Introduction to OpenCL. Available: http://www.mat.ucsb.edu/594cm/2010/benalunjones-rp1/index.html
[22] W. W. Hwu and J. Stone. (2010, 13th April 2012). The OpenCL Programming Model. Available: http://www.ks.uiuc.edu/Research/gpu/files/upcrc_opencl_lec1.pdf
[23] W. W. Hwu and J. Stone. (2010, 13th April 2012). The OpenCL Programming Model. Available: http://www.ks.uiuc.edu/Research/gpu/files/upcrc_opencl_lec2.pdf
[24] DrZaius. (2009, 13th April 2012). Matrix Multiplication 2 (OpenCL). Available: http://gpgpu-computing4.blogspot.com/
[25] NERSC. (2011, 13th April 2012). Performance and optimization. Available: http://www.nersc.gov/users/computational-systems/dirac/performance-and-optimization/
[26] Wikipedia. (2012, 5th April 2012). DirectCompute. Available: http://en.wikipedia.org/wiki/DirectCompute
[27] nVidia. (2012, 6th April 2012). DirectCompute. Available: http://developer.nvidia.com/directcompute
[28] Microsoft. (2010, 6th April 2012). DirectX11 DirectCompute. Available: http://www.microsoftpdc.com/2009/P09-16
[29] nVidia. (2010, 15th April 2012). DirectCompute Programming Guide. Available: http://developer.download.nvidia.com/compute/DevZone/docs/html/DirectCompute/doc/DirectCompute_Programming_Guide.pdf
[30] (15th April 2012). OpenGL Programming Guide. Available: http://www.glprogramming.com/red/chapter01.html
[31] Khronos. (2012, 15th April 2012). OpenGL - The Industry's Foundation for High Performance Graphics. Available: http://www.khronos.org/opengl
[32] M. Segal and K. Akeley. (2011, 15th April 2012). The OpenGL Graphics System: A Specification. Available: http://www.opengl.org/registry/doc/glspec42.core.20110808.pdf
[33] Wikipedia. (2012, 15th April 2012). OpenGL. Available: http://en.wikipedia.org/wiki/OpenGL
[34] Apple. (2012, 15th April 2012). OpenGL Programming Guide for Mac OS X. Available: https://developer.apple.com/library/mac/#documentation/GraphicsImaging/Conceptual/OpenGL-MacProgGuide/opengl_intro/opengl_intro.html
[35] Wikipedia. (2012, 15th April 2012). Shading language. Available: http://en.wikipedia.org/wiki/Shading_language
[36] Wikipedia. (2012, 15th April 2012). Graphics pipeline. Available: http://en.wikipedia.org/wiki/Graphics_pipeline
[37] J. Kessenich, D. Baldwin, and R. Rost. (2011, 15th April 2012). The OpenGL Shading Language. Available: http://www.opengl.org/registry/doc/GLSLangSpec.4.20.8.clean.pdf
[38] Wikipedia. (2012, 15th April 2012). GLSL. Available: http://en.wikipedia.org/wiki/GLSL
[39] Wikipedia. (2012, 12th April 2012). Cg (programming language). Available: http://en.wikipedia.org/wiki/Cg_(programming_language)
[40] Microsoft. (2012, 15th April 2012). Programming Guide for HLSL. Available: http://msdn.microsoft.com/en-us/library/windows/desktop/bb509635(v=vs.85).aspx
[41] Wikipedia. (2012, 15th April 2012). High Level Shader Language. Available: http://en.wikipedia.org/wiki/High_Level_Shader_Language
[42] T. S. Crow, "Evolution of the Graphical Processing Unit," Master of Science, Computer Science, University of Nevada, Reno, 2004.
[43] M. Macedonia, "The GPU Enters Computing's Mainstream," Computer, vol. 36, pp. 106-108, 2003.
[44] nVidia. (2011, 16th April 2012). NVIDIA CUDA C Programming Guide. Available: http://developer.download.nvidia.com/compute/DevZone/docs/html/C/doc/CUDA_C_Programming_Guide
[45] C. McClanahan, "History and Evolution of GPU Architecture," 2010.
[46] Wikipedia. (2012, 15th April 2012). GeForce 256. Available: http://en.wikipedia.org/wiki/GeForce_256
[47] Wikipedia. (2012, 15th April 2012). GeForce. Available: http://en.wikipedia.org/wiki/GeForce
[48] Wikipedia. (2012, 15th April 2012). GeForce 6 Series. Available: http://en.wikipedia.org/wiki/GeForce_6_Series
[49] nVidia. (2012, 15th April 2012). GeForce 6 Series. Available: http://www.nvidia.com/page/geforce6.html
[50] E. Kilgariff and R. Fernando, "Chapter 30. The GeForce 6 Series GPU Architecture," in GPU Gems 2, M. Pharr, Ed., 2005.
[51] G. Chunev. (2009, 15th April 2012). Graphics Processing. Available: http://www.cs.indiana.edu/~gnchunev/files/Lecture.pdf
[52] nVidia. (2012, 16th April 2012). High-Performance, High-Precision Effects. Available: http://www.nvidia.com/object/feature_HPeffects.html
[53] R. Ragel. (2011, 13th April 2012). Difference Between CPU and GPU. Available: http://www.differencebetween.com/difference-between-cpu-and-vs-gpu/
[54] nVidia, "The Benefits of Quad Core CPUs in Mobile Devices," 2011.
[55] nVidia. (2012, 17th April 2012). Getting Started with Tegra. Available: http://developer.nvidia.com/tegra-start
[56] R. Pogson. (2011, 17th April 2012). Nvidia Tegra2 block diagram. Available: http://mrpogson.com/2011/04/03/
[57] nVidia. (2012, 17th April 2012). NVidia Tegra Mobile Processor Features. Available: http://www.nvidia.com/object/tegra-features.html
[58] nVidia. (2012, 17th April 2012). NVIDIA Tegra 2. Available: http://www.nvidia.com/object/tegra-2.html
[59] nVidia. (2012, 17th April 2012). Tegra Super Phones. Available: http://www.nvidia.com/object/tegra-superphones.html
[60] nVidia. (2012, 17th April 2012). Tegra Super Tablets. Available: http://www.nvidia.com/object/tegra-supertablets.html
[61] nVidia, "The Benefits of Multiple CPU Cores in Mobile Devices," 2011.