
CUDA Algorithms for JAVA

Kosta Krauth

B.Sc. in Computing Science

Department of Computing Science

Griffith College Dublin

May 2009

Submitted in partial fulfilment of the requirements of the Degree of Bachelor of Science at Griffith College Dublin


Disclaimer

I hereby certify that this material, which I now submit for assessment on the programme of study leading to the Degree of Bachelor of Science in Computing at Griffith College Dublin, is entirely my own work and has not been submitted for assessment for an academic purpose at this or any other academic institution other than in partial fulfilment of the requirements of that stated above.

Signed: Date:


Abstract

This paper describes in detail a method of invoking NVIDIA CUDA algorithms from within JAVA. The points of focus are performance, ease of use, feasibility of the approach and the possibility of expanding the proposed library of algorithms through the open-source community. It also covers the implementation of an on-line self-service CUDA to JAVA library compiler.


Table of Contents

Introduction and overview
    Distributed computing and the Internet
    Parallel computing and Moore’s law
    Meanwhile, in the GPU world...
        The birth of GPGPU
    The project’s aim
Requirements analysis and specification
    CUDA API
    Platform independence
        JAVA JNI
        SWIG
        Virtualization
    The Web Page
        Library download page
        Algorithm description page
        Self-service compiler
        Community / Open Source aspects
Design Structure
    System Components
        Operating System
        CUDA Driver and Compiler
        The Virtual Machine
        LAMP
        Zend Framework
        Shell Scripting and Backing Services
    CUDA algorithms
        Parallel Sort (Bitonic)
        GPU-Quicksort
        Black-Scholes Option Pricing
        Fast Fourier Transform
        Matrix Transform
Implementation Details
    Algorithms
        SWIG & JNI
        CUDA algorithm adaptation
        Reusing the CUDA compiler configuration
        The Virtual Machine
        Test suites
    The web site
        General features
        Library download page
        Self-service compiler
        Project Tracking (TRAC)
Conclusion
    Meeting the Objectives
    Taking a different approach
Bibliography


Introduction and overview

Distributed computing and the Internet

Research in the areas of distributed computing and processing has been constantly on the rise in the last decade. Being a distributed architecture itself, the Internet has been one of the major driving forces behind advancements in the field.

Cluster computing has been actively used for processing and organizing vast amounts of data on the Internet. Google is one of the pioneers in this area with its MapReduce model, which is capable of processing terabytes of data across thousands of machines. [1] Cluster computing is also widely used on the Internet to ensure availability for high-traffic web servers and to provide scalability to hosting platforms.

Grid computing differs from cluster computing mainly in the fact that it is loosely coupled, meaning the participating nodes can all run on different hardware and software. Many popular distributed applications use this architecture, most notably projects like Folding@Home and SETI@Home. By taking advantage of computers connected to the Internet, Folding@Home has been able to create a grid with a combined computational power exceeding 8 petaflops. [2] As a comparison, the current fastest supercomputer (IBM’s Roadrunner) achieves a sustained rate of 1.1 petaflops. [3]

More recently, a completely new paradigm called “cloud computing” has emerged, sometimes used as a metaphor for the Internet itself. Clouds can be defined as large pools of easily usable and accessible virtualized resources (such as hardware, development platforms and/or services). These resources can be dynamically reconfigured to adjust to a variable load (scale), also allowing for optimum resource utilization. [4] They have been widely used to provide web and application hosting platforms such as Amazon’s EC2 service and most of Google’s services, like Gmail and AppEngine.

Because of all the advantages and benefits of using distributed computing in networked environments, as we will see below, the paradigm has successfully been applied to non-networked systems too.

Parallel computing and Moore’s law

Moore’s Law was proposed by Intel co-founder Gordon Moore, and it states that the number of transistors on a chip will double about every two years. [5] It is often said that this is not so much a law as a pace that manufacturers are constantly trying to match in any way imaginable. Intel especially has been hard at work trying to keep up with the law, considering it was proposed by one of its co-founders.


As shown in Figure 1, around 1995 the ability to decrease the size of an IC (integrated circuit) had slowed down dramatically as physical limits were being reached. The transistors were already so small that electrons had started to jump from one to the other in an unpredictable fashion, making it impossible to produce a stable and reliable processor.

Then, as Figure 1 clearly shows, in 2004 something happened and the transistor count increased by an order of magnitude, and in 2005 this behaviour occurred again. In order to keep up with Moore’s law, Intel proposed a different architecture - one in which the processor cores were multiplied instead of miniaturized. As such, on 7th September 2004, Intel introduced its first multi-core CPU and announced parallelism as its main microprocessor philosophy. [6] This happened again in early 2006 with the introduction of quad-core CPUs, hence the second jump shown in Figure 1.

This was a major shift that affected electrical engineers and computer programmers alike. The goal was no longer to squeeze every last megahertz out of the processor’s clock, but rather to parallelize programs so that they could spread their tasks into working units able to run on multiple cores. This proved to be a very successful and beneficial model, and it has remained the main driving force behind CPU development to this day.

Meanwhile, in the GPU world...

CPU manufacturers weren’t the only ones active in this field. As a matter of fact, some companies had been active in it years before Intel: the manufacturers of graphics processing units (GPUs), with Silicon Graphics International and 3dfx Interactive being the pioneers. As the demand for virtual realities that match our own increased (both in the CAD and entertainment industries), graphics cards had to employ multiple processors designed specifically for performing a single task. These were primarily graphical operations such as pixel shading, rasterization and vertex rendering.

Figure 1, source Wolfram Alpha


As time progressed, virtual reality started matching our own more and more precisely, but this came at a computational cost. With increases in resolution and the introduction of various algorithms that shaped virtual worlds to make them look more believable (most notably anti-aliasing and bilinear filtering), the manufacturers of graphics processing units found themselves adding more and more of these specialized processors that could crunch more data in less time, providing higher frame rates and better quality. All these operations were well suited to being executed in parallel, and as such the GPUs started multiplying their processors much earlier than CPUs did.

Even though these specialized processors were much slower than modern CPUs, they came in very high numbers, so in the early 2000s people started wondering if there was a more general way in which they could be used.

The birth of GPGPU

This gave rise to a whole new research field called General-Purpose computation on Graphics Processing Units (GPGPU). It started off as a very small community of scientists and researchers that explored various ways in which graphical computations could be mapped to a more general set of instructions.

Initial efforts produced simple distributed algorithms for searching, sorting and solving various problems in the scientific community, and the initial results were very promising: the advantages of using such a massively parallel architecture showed extreme benefits for certain computations. Obviously, getting texture transformation computations to perform something that even resembles operations from basic linear algebra was a long, tedious and hacky process.

Eventually the manufacturers of graphics cards realised that their architectures could be used for more than just 3D acceleration. In November 2006, both major graphics card manufacturers of the time (ATI and NVIDIA) released their own flavours of a GPGPU API for their hardware. ATI’s implementation (ATI having been acquired by AMD) was called Close To Metal (CTM). CTM gave developers direct access to the native instruction set and memory of the massively parallel computational elements in AMD Stream Processors; using CTM, stream processors effectively become powerful, programmable open architectures like today’s central processing units (CPUs). [7] NVIDIA’s API was called Compute Unified Device Architecture (CUDA). CUDA expanded on the features and ideas of CTM, offering full support for BLAS (Basic Linear Algebra Subprograms) and FFT (Fast Fourier Transform) operations. [8]

This was made possible by exploiting programmable stream processors that can execute code written in a common language like C. Architecturally these processors are very simple and have limited instruction sets, but their real power lies in numbers, not speed. As an example, NVIDIA’s current flagship graphics card, the GeForce GTX 295, contains 480 stream processors, each with an internal clock of 1242 MHz. [9]


Figure 2, source: NVIDIA

As illustrated by Figure 2, the combined computational output of these processors (measured in floating point operations per second) far exceeded that of mainstream Intel CPUs. This performance doesn’t come for free, however. Getting the maximum out of a graphics card is an involved and highly complex task. The peculiarities of the GPU architecture have a huge impact on the performance of algorithms executed on them. Therefore, in-depth knowledge of the processor and memory architecture is required in order to achieve the maximum possible efficiency and throughput, as well as to avoid any pitfalls in the process.

The project’s aim

CUDA has been a great success ever since its introduction. Many mainstream projects have started using the benefits of sharing the workload between the CPU and GPU, depending on the task at hand. Adobe’s latest Photoshop has GPGPU functionality integrated, as do the Folding@home and SETI@home projects. CUDA’s homepage is also bursting with scientific applications ranging from physics and chemistry to biology, mathematics, AI and medicine.

So the benefits are obviously there, yet CUDA remains a closed community consisting mostly of scientists and researchers. This is mostly due to the high entry barrier imposed by the specifics of the GPU architecture and by a relatively obscure API that is only accessible using C++ and that a non-mathematical person would have a hard time following and understanding.

This project’s aim is to make the benefits of CUDA available to a wider audience through an easy-to-use, ready-to-deploy JAVA library. The approach taken in order to assess the feasibility of such a solution is covered in detail. While it is more of a proof of concept than anything else, the library shows potential to grow and expand through open-source community contributions, accessible through a web page.


Requirements analysis and specification

CUDA API

As previously mentioned, CUDA is essentially an extension of C. All of the memory allocations, pre-calculations and variable and data initializations are done in regular C. CUDA then builds on top of that and introduces several specific symbols and keywords that only the CUDA compiler can process. As such, a CUDA program usually consists of two main parts – a C program that initializes the memory on the device and host and prepares the data that needs to be processed, and a CUDA file that contains the kernel which is uploaded to the graphics card and executed on the stream processors.

The kernel file is compiled by the CUDA compiler and cloned over all the available processors on the graphics card. Prior to that, the dataset needs to be uploaded to the graphics card’s main memory. Once these steps have been completed, the processors will start performing calculations on the available data and storing the results. The key to achieving maximum performance lies in optimal use of the shared memory (as opposed to global texture memory, which is much slower), good synchronization of threads and maximizing the number of busy cores at any given moment. The optimal number of threads differs from one graphics card to another, but most modern cards benefit most from thread counts running in the thousands. [8]

Figure 3, GPU memory diagram, source (10)

Figure 3 illustrates how the memory on a graphics card is divided into a grid that contains a number of blocks inside which the threads run. All threads operating within the same block can synchronize their execution. Access to global memory is much slower than access to shared memory and registers, in the same way a CPU can access its registers much faster than RAM. Therefore, global memory should only be used when retrieving new datasets and once the data in shared memory has been fully processed.

The second most time-consuming and costly operation is the transfer of memory between the host (CPU) and device (GPU). All memory needs to be pre-allocated, and data transferred, before processing starts, adding quite a bit of latency to the task. This is why GPUs are best at performing computationally intensive tasks that have a low number of memory accesses and a high number of iterations on the same dataset. This way, a dataset can be uploaded only once, and the threads can crunch data independently, without wasting time competing for resources and synchronizing with each other.

Since the CUDA API allows direct access to all of these subsystems, programmers need to be very careful when writing programs: naive implementations of algorithms can be up to 100 times slower than well-tuned ones, so this in-depth knowledge of the hardware is essential.

Platform independence

One of the first things that had to be considered during the planning stage of the project was how to solve the problem of combining JAVA’s platform independence with CUDA’s complete platform dependence. The following chapters show how this challenge was approached.

JAVA JNI

Since the CUDA API and compiler are C based, integration with JAVA is far from trivial. Therefore, the only way to consume these algorithms from JAVA is through the use of the JAVA Native Interface (JNI).

The JNI allows JAVA applications to load precompiled binary libraries written in a language like C, and to exchange data with the methods residing within them. Figure 4 shows a very high-level diagram of where and how the native applications are invoked.

Figure 4, JNI diagram, source (11)


Even though this is a great way to exploit libraries written in other programming languages, by using JNI, JAVA loses its platform independence. Since C libraries are platform dependent, a JAVA program consuming such a library can only be run on the kind of system that the library was compiled on. For example, when compiling shared libraries, Windows will produce Dynamic-Link Libraries (DLLs) and Linux will produce Shared Objects (SOs). To make it possible for an application to run on both operating systems, both of these need to be present.

Fortunately, JAVA comes with mechanisms that make this runtime loading of JNI libraries somewhat easier. As long as the library resides on the JAVA library path, we only need to provide its name, and JAVA will decide whether it should load the .so or the .dll depending on what operating system it is running on. However, in order to produce a library that is easily pluggable into existing JAVA programs and runs on a majority of operating systems, it should ship with at least Windows and Linux versions. The way this problem was solved will be described in subsequent chapters.

Another downside of JNI is that it was designed primarily for supporting legacy systems. As such, it is very verbose, and writing the “glue” is a long and tedious process involving lots of low-level code that JAVA programmers are usually not very familiar with. Because of this, one of the requirements of the project is that users be completely abstracted from JNI: the only thing they need to do is call the native method, passing the relevant data to it, and the rest is completely transparent.
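To make the intended usage concrete, the sketch below shows the general JNI loading pattern from the JAVA side. The library name "parallelsort" and the native method signature are purely illustrative placeholders, not the actual classes shipped with the library:

public class ParallelSortExample {
    static {
        // JAVA picks the platform-specific binary (libparallelsort.so on Linux,
        // parallelsort.dll on Windows) as long as it can be located at run time.
        System.loadLibrary("parallelsort");
    }

    // Declaration of the native entry point; the real signatures in this
    // project are generated automatically, as described in the next chapter.
    public static native void sort(float[] data);

    public static void main(String[] args) {
        float[] data = {3f, 1f, 2f};
        sort(data);
    }
}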

SWIG

In order to avoid writing JNI code manually, and to enable developers to quickly wrap their CUDA algorithms into JNI-compatible libraries, an automated approach was necessary. After some research, the Simplified Wrapper and Interface Generator, or SWIG for short [10], seemed to have all the functionality required. What SWIG does is generate the wrapper code (the “glue”) between JAVA and C programs.

Figure 5, SWIG architecture


As seen in Figure 5, SWIG takes an interface file as input, parses it and then, using its code generator, creates the wrappers from it. The wrappers can be generated for Python, Perl, Tcl and many other languages, but for this project’s particular use only the JAVA JNI one is needed.

All in all, SWIG was another step towards achieving complete automation between supplying the CUDA source files and receiving ready-to-use, platform-specific, JNI-compatible libraries. The generation of these interface files, and a further explanation of how the wrappers fit into the system, will be given in subsequent chapters.

Virtualization

As previously mentioned, even with the JNI wrappers generated, the problem of platform dependence still remained. The only way to get around this was to try and cover most of the platforms in one go. Windows and Linux cover over 90% of the installed platforms worldwide, so they were a good place to start. Somehow, a way to compile the source files on those two operating systems simultaneously had to be devised.

The first thing that came to mind was to have two machines, one running Linux and the other running Windows. However, for the purposes of the demo this wasn’t very practical, and maintaining two separate computers to perform that task seemed superfluous.

A much saner approach was to run two systems in parallel on one computer, using a virtual machine. Linux was chosen as the base OS, with Windows XP installed as a virtual machine inside it. This proved to be a good solution, as compilation tasks could be passed to both systems at once and executed in parallel. This approach worked very well, and the details of how it was streamlined with the rest of the process will be described in subsequent chapters.

With the JNI interface generator in place, as well as compiling environments set up on the Linux base OS and the Windows virtual machine, the system began taking shape.

The Web Page

The final steps were exposing the functionality of this automated compilation system to the outside world, and creating some basic downloadable algorithms for the library. The open-source community is a vibrant and active one, so it would be worthwhile to allow CUDA programmers to contribute their own algorithms to the library, therefore increasing its usefulness and scope.

The page consists of four main sections, outlined in the chapters below:

Library download page
Algorithm description page
Self-service compiler
Community / Open Source aspects


Library download page

The first one is the download page where users can hand-pick the library components required, and based on the choices made the system then builds a custom-made package and sends it to the user. Below is a screenshot of a portion of this page:

Figure 6, Library download page

As you can see from Figure 6, the precompiled libraries are sorted by the category they fall into. Users can check the boxes next to the ones they are interested in, or view further information by clicking the “Details...” link. Finally, by clicking the “Download selection” button, the system will fetch all the individually selected precompiled libraries, package them into an archive and send it to the user.


Algorithm description page

The algorithm description page lists all the relevant information about an algorithm, outlining details such as general information, available methods and benchmarks. The screenshot below shows one such page:

Figure 7, algorithm description page

As seen in Figure 7, general information about the algorithm is followed by a table that shows the exact methods that can be accessed within the library. It also lists the parameter data types that should be used and any special rules that need to be followed. Finally, the graph under the benchmark heading displays a line chart that compares the performance of the CUDA algorithm to its CPU equivalent, along with information about the system the benchmark was run on.


Self-service compiler

The self-service compiler exposes the automated compilation system to the outside world. It allows anyone to submit a CUDA-based algorithm and immediately receive a compiled JNI library that can be consumed from a JAVA program. The screenshot below shows the page:

Figure 8, self-service compiler

I wanted to keep the self-service compiling functionality as simple as possible, ideally asking for the least possible input. This required a convention-over-configuration approach, as there are many factors that determine how the SWIG interface file should be generated. Therefore, there are certain rules that need to be followed when submitting the source, defined by the opening paragraph of the page. These will be explained in greater detail later on.


Community / Open Source aspects

Finally, the last major component of the website is its community aspect. In order to make it accessible and intuitive, on a platform that people are used to, Trac [11] was installed. The screenshot below shows one of its pages:

Figure 9, TRAC

Trac is a well-established system for project and source code management. It consists of a wiki, an issue-tracking mechanism, a complete bug/ticket system, source control access and browsing, roadmaps and more. Overall it is a practical way to keep the development of an open-source project under control and keep it consistent with an obvious plan and roadmap.

With these major components in place, the website should serve as a good starting point for a person to get introduced to the idea behind the project, the service it offers, and the various algorithms that are precompiled and readily downloadable.


Design Structure

System Components

Operating System

The first choice that had to be made was which operating system to use as the platform. The CUDA SDK and drivers are compatible with all major operating systems – Windows, Linux and MacOS. However, since this project required various technologies that had to be streamlined into a single process flow, all running behind a web server, Linux was chosen, primarily for its versatility when it comes to shell scripting and its excellent support for scripting languages in general.

All the major Linux distributions are supported by CUDA, but we decided to use Ubuntu for a couple of reasons. Ubuntu is one of the most popular Linux distributions, and as such, almost every product and library is guaranteed to support it. Also, the community support is top-notch, as the forums and mailing lists are bursting with activity. All of this together ensured that there would be a good chance of overcoming any potential, seemingly insurmountable, problems.

Finally, Ubuntu comes with a very good software manager that makes installation of web servers, databases, libraries and programming languages a breeze. This was important, as these backing components were not a research part of the project, so their deployment needed to be as quick and painless as possible.

CUDA Driver and Compiler

Once the operating system was in place, the second required component was the CUDA driver, toolkit and software development kit.

Unfortunately, this installation is not as quick and painless as one would hope. At the time the project was started, the only Ubuntu release that CUDA supported was 8.04. The installation itself was done quickly using a Debian package, but getting the driver and compiler to run successfully was a long and tedious process. CUDA doesn’t ship with all its dependencies, and neither does Ubuntu ship with them by default, so all of these had to be tracked down manually and installed. Once they were in place, the NVIDIA display driver had to be installed, which meant modifying the module installation configuration file and replacing the default driver with the NVIDIA one.

Finally, certain environment variables had to be added to the shell profile scripts in order to successfully resolve all the dependencies at compile time. With all of this in place, it was possible to compile and run the CUDA programs that ship with the SDK.


The Virtual Machine

A choice had to be made as to which virtual machine to use in order to run the Windows compile environment. As previously mentioned, this was required in order to ship the final library with both Linux- and Windows-compatible algorithms.

The final choice was between VMware and VirtualBox. The latter was chosen mostly because VirtualBox is open source and licensed under the GNU General Public License [12]. As such, the community support is free, and it has been a tried and tested VM on the Linux platform for a long time.

Windows XP was chosen for its stability, speed and moderate demand for resources. The installation of the CUDA compile environment was quick and painless, and the Express edition of Visual C++ was installed as the CUDA backing compiler.

In order to facilitate communication between the virtual machine and the host OS, a dedicated port was opened on both ends to exchange data; the virtual machine had to share the host’s network interface card to make this possible. With all of this done, the basic building blocks were in place: a solid platform on top of which the remainder of the system could be built.

LAMP

In order to quickly get up to speed in terms of hosting the page, the LAMP environment was chosen. LAMP is an acronym for Linux, Apache, MySQL and PHP. This is a widely used, standard environment for building rich web applications.

Ubuntu is often used as a web server, so installing the LAMP environment was quick using the Synaptic package manager. Once the software was installed, Apache’s virtual hosts had to be configured in order to enable HTTP access to specific folders on the system. To simulate the on-line environment more closely, the hosts file was modified so that a working domain could be assigned to a virtual host on my local machine. This way the on-line experience could be replicated without relying on an internet connection.

MySQL would be used for storing the news, the algorithm information and the TRAC data. Finally, PHP would be used for programming the actual web application.

Zend Framework

Writing a web application by starting with a clean slate is a perfectly valid method; however, with the incredible growth of the Internet and the number of pages hosted, there have been many advancements in web application building techniques. There are countless frameworks out there that help kick-start the development process in a structured and organized way.

More than anything, the MVC (model – view – controller) design pattern has been shown to map very well to web-based applications. This clear separation of the display logic, business logic and database logic usually results in much cleaner code and overall better organization.


When choosing an appropriate framework for backing my web application, many aspects had to be considered. In the end, Zend Framework was chosen as a stable, feature-rich and extensible platform. Also, Zend Framework was developed by the makers of PHP itself, so it follows best practices when it comes to implementing and extending the PHP object model.

Although it is not as quick and easy to get an application off the ground with ZF as with some other frameworks out there, it was chosen because it offers numerous modules and libraries that would speed up deployment and development in the long run.

Shell Scripting and Backing Services

With the OS, compile environment, web server, database and virtual machine in place, all of the separate components still had to come together behind the scenes in order to provide the required functionality.

Python is used for executing the compile directives, moving files around and setting up directories. Python is an excellent choice for shell scripting because it is very minimalistic, has excellent support for executing OS commands, is backed by numerous 3rd-party libraries and has very well written documentation. Python was chosen over Perl due to personal preference, since the two offer similar functionality, although Python is considered the more modern and slick of the two.

In order to get a better understanding of the system and how it comes together, Figure 10 shows a high-level overview:

Figure 10, system overview


CUDA algorithms

With all of the above components working together, everything was in place for supporting the library of JAVA-compatible CUDA algorithms. In order to demonstrate the potential versatility of the library, we tried to cover a wide range of areas where these algorithms could be applied. A high-level overview of the algorithms, their functionality, performance and any pitfalls is given below.

Note: all tests were executed on an NVIDIA 8600M GT 512MB graphics card and a Core 2 Duo T7500 CPU (2.2 GHz). All of the GPU-labelled results show the performance of the modified algorithms when called from JAVA through the JNI interface.

Parallel Sort (Bitonic)

The first algorithm implemented was a sort, since sorting is a commonly used function in most applications. It is also a tricky algorithm to parallelize, as there are numerous ways in which this can be done, each with its own advantages and disadvantages.

The parallel sort implementation used was created by Alan Kaatz at the University of Illinois at Urbana-Champaign. It is an implementation of the bitonic sorter, a sorting algorithm designed specifically for parallel machines. A bitonic sorter is an implementation of a sorting network: it works by comparing and swapping elements in pairs, and these paired subsets are then sorted and merged.
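To illustrate the compare-and-swap structure, the sketch below is a generic sequential bitonic sort in JAVA for power-of-two array sizes (an illustration of the technique, not the CUDA implementation used here); every iteration of the innermost loop is an independent comparator, which is exactly what makes the network easy to spread across many processors:

static void bitonicSort(float[] a) {
    int n = a.length;                            // assumed to be a power of two
    for (int k = 2; k <= n; k <<= 1) {           // size of the bitonic sequences
        for (int j = k >> 1; j > 0; j >>= 1) {   // compare distance within a stage
            for (int i = 0; i < n; i++) {        // each i is an independent comparator
                int partner = i ^ j;
                if (partner > i) {
                    boolean ascending = (i & k) == 0;
                    if (ascending == (a[i] > a[partner])) {
                        float tmp = a[i];
                        a[i] = a[partner];
                        a[partner] = tmp;
                    }
                }
            }
        }
    }
}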

Figure 11, parallel bitonic sort [13]


Figure 11 clearly shows why bitonic sort, even though it is not the most efficient algorithm, is very well suited to execution in massively parallel environments. Considering that most modern GPUs contain over 100 processors, this can massively reduce the effective running time of the algorithm.

The downside of this implementation, though, is how it uses memory. More specifically, the implementation is dependent on the size of the dataset, and its efficiency approaches its maximum as the number of elements approaches a power of two. Just past a power of two, the efficiency halves, producing a step function as shown in Figure 12:

Figure 12, parallel sort performance

However, even with this seeming inefficiency, the algorithm still outperforms the fastest quicksort CPU implementation, even on a mid-range graphics card. Certain improvements could be added to this algorithm in order to maintain a linear performance curve, but for the purposes of this library, this implementation was sufficient.

Figure 13 shows the performance difference between the GPU implementation of parallel sort and the CPU implementation of quicksort in JAVA. As you can see, the GPU consistently outperforms the CPU even though quicksort runs in O(n log n) time, whereas bitonic sort runs in O(n log² n) time. This is due to the fact that each part of the sorting network can run independently on its own processor, effectively reducing the parallel running time to O(log² n) when running on n processors, yielding much better performance [14].


Figure 13, parallel sort GPU vs CPU performance

GPU-Quicksort

Quicksort is one of the most popular sorting algorithms and needs no special introduction. It is suitable for large data sets because, as opposed to many simpler sorting algorithms, it doesn’t run in quadratic time on average. There are many implementations of the quicksort algorithm for the CPU; however, parallelizing it in a manner that allows it to run on the GPU is less than trivial.

Quicksort had previously been considered an inefficient sorting solution for graphics processors, but in January 2008 Daniel Cederman from Chalmers University in Sweden published a paper on how quicksort could be mapped to and executed on the GPU architecture in an efficient manner. This algorithm demonstrates that GPU-Quicksort often performs better than the fastest known sorting implementations for graphics processors, such as radix and bitonic sort, making it a viable alternative for sorting large quantities of data on graphics processors [15].

The only downside of this implementation, when compared to parallel bitonic sort, is that it currently only supports integer sorting. Sorting of floating-point numbers was disabled due to a problem with how CUDA handles C++ templates; this is expected to be corrected in one of the upcoming releases.


Just like its CPU counterpart, GPU-Quicksort has an average complexity of O(n log n) [15]. As the number of processors increases, the expected running time when executed on p processors can be expressed as O((n/p) log n). In other words, when the number of processors becomes equal to the number of elements in the sorting set, the running time reduces to O(log n), which is extremely fast for a sorting algorithm.

Figure 14 shows the performance comparison between the JAVA implementations of CPU and GPU quicksort.

Figure 14, GPU-Quicksort vs CPU-Quicksort

The reason why the advantage is not as obvious for smaller datasets is that CUDA comes with a lot of overhead. In order to execute an algorithm on a GPU, the CUDA environment needs to be bootstrapped first, then the memory needs to be allocated on the GPU, followed by the actual computation, and finally the transfer of the results back to the host. All these operations are factored into the final time, so the benefits really only become obvious at bigger datasets. This generally holds true for all algorithms; the question is only how computationally intensive and memory-bandwidth dependent they are. Generally, more computationally intensive algorithms with minimal memory-bandwidth dependency tend to perform best, as we will see in the next few examples.
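As a rough illustration of how such a comparison is driven from JAVA, the harness below times a CPU baseline against a wrapped GPU sort; the gpuqsort module and its sort method are hypothetical placeholders for the generated wrapper, not the library’s actual API:

import java.util.Arrays;
import java.util.Random;

public class SortBenchmark {
    public static void main(String[] args) {
        int[] data = new Random(42).ints(1 << 20).toArray();
        int[] reference = data.clone();

        long t0 = System.nanoTime();
        Arrays.sort(reference);                       // CPU baseline
        long cpuMs = (System.nanoTime() - t0) / 1_000_000;

        long t1 = System.nanoTime();
        gpuqsort.sort(data, data.length);             // assumed JNI wrapper call
        long gpuMs = (System.nanoTime() - t1) / 1_000_000;

        System.out.println("CPU: " + cpuMs + " ms, GPU: " + gpuMs + " ms");
    }
}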


Black-Scholes Option Pricing

Black-Scholes is a very popular and widely used economic model, created in 1973 by Fischer Black, Myron Scholes and Robert Merton. It is a method for pricing equity options; prior to its development there had been no standard way to do this since the creation of organized option trading. In 1997, its creators were awarded the Nobel Prize in Economics.

The most common definition of an option is an agreement between two parties, the option seller and the option buyer, whereby the option buyer is granted a right (but not an obligation), secured by the option seller, to carry out some operation (or exercise the option) at some moment in the future [16].

The Black-Scholes pricing algorithm greatly benefits from a massively parallel implementation due to the numerous complex mathematical functions that can be executed simultaneously for different stocks. The CUDA kernel was taken from the NVIDIA CUDA SDK [17]. Figure 15 shows the performance of the JAVA algorithms for Black-Scholes option pricing, executed on the CPU and GPU respectively.

Figure 15, Black-Scholes option pricing, CPU vs GPU

We can see that the performance difference grows rapidly as the dataset grows; for the largest datasets tested, the CUDA implementation is 20 times faster, even on a mid-range graphics card. As previously mentioned, this is mainly because the algorithm can be implemented with a low number of memory accesses and a high number of calculations per stock.
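For reference, the per-option work being parallelized boils down to the closed-form Black-Scholes call price. The sketch below is a generic JAVA version of that formula (using the Abramowitz-Stegun approximation of the normal CDF); it illustrates the kind of CPU reference computation used in such benchmarks rather than the exact code shipped with the library:

public class BlackScholes {
    // Abramowitz-Stegun approximation of the standard normal CDF
    static double normCdf(double x) {
        double t = 1.0 / (1.0 + 0.2316419 * Math.abs(x));
        double poly = t * (0.319381530 + t * (-0.356563782
                + t * (1.781477937 + t * (-1.821255978 + t * 1.330274429))));
        double cdf = 1.0 - Math.exp(-0.5 * x * x) / Math.sqrt(2.0 * Math.PI) * poly;
        return x >= 0 ? cdf : 1.0 - cdf;
    }

    // Price of a European call option for spot s, strike k, risk-free rate r,
    // volatility sigma and time to expiry t (in years).
    static double callPrice(double s, double k, double r, double sigma, double t) {
        double d1 = (Math.log(s / k) + (r + 0.5 * sigma * sigma) * t)
                / (sigma * Math.sqrt(t));
        double d2 = d1 - sigma * Math.sqrt(t);
        return s * normCdf(d1) - k * Math.exp(-r * t) * normCdf(d2);
    }

    public static void main(String[] args) {
        // Each option can be priced independently, which is what the GPU exploits.
        System.out.println(callPrice(100.0, 100.0, 0.05, 0.2, 1.0));
    }
}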


Fast Fourier Transform

The Fast Fourier Transform (FFT) is an efficient algorithm for computing the discrete Fourier transform (DFT) and its inverse. A Fourier transform is the calculation needed to visualise a wave function not only in the time domain but also in the frequency domain [18]. By observing a wave as a function of its frequency, we can analyse it much more closely than by visualising the amplitude alone. The Fourier transform is an invaluable tool in the field of electronics and digital communications, and as such a widely used algorithm in computer science. There is a whole field of mathematics called Fourier analysis, which grew out of the study of Fourier series.

Because of the importance of Fourier analysis, NVIDIA has developed a whole sub-library containing primarily Fourier transform functions. The chosen implementation is an improved version developed by Vasily Volkov at UC Berkeley. As with many FFT implementations, dataset sizes have to be a power of two. Figure 16 shows the performance of my modified GPU implementation called through JNI, compared to an equivalent JAVA CPU implementation.

Figure 16, Fast Fourier Transform, GPU vs CPU

The Fast Fourier transform can be applied simultaneously to a number of point sets, where the point set sizes have to be expressed as a power of two. As a general mid-range value, a 512-point set was chosen to test the performance. On average, this is a good representation of the performance, since it varied depending on the chosen number of points per set. For all set sizes, the GPU implementation far outperformed the CPU equivalent, even with the JNI and CUDA overhead.
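A JAVA-side call to the wrapped transform might look roughly like the following sketch; the fft module, the transform method name and its signature are hypothetical placeholders (the float2 type used here, and the actual generated API, are described in the implementation chapter):

int n = 512;                                    // point set size, a power of two
float2[] input  = new float2[n];
float2[] output = new float2[n];
for (int i = 0; i < n; i++) {
    // real part carries a sample of a sine wave, imaginary part is zero
    input[i]  = new float2((float) Math.sin(2 * Math.PI * i / n), 0f);
    output[i] = new float2(0f, 0f);
}
fft.transform(input, output, n);                // assumed wrapper call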


Matrix Transform

As the final algorithm to adapt for JNI calls, a 2D matrix transpose program was chosen. Matrix transposition is a widely used mathematical operation, applied in various fields of science. More importantly, it is an interesting algorithm because it doesn’t perform any calculations; all it does is swap items within a two-dimensional array. As such, it is a good example of the performance benefits that can be achieved when shared memory is used efficiently within the thread blocks. Figure 3 shows the architecture clearly.

Since all threads within a block can access the shared memory concurrently, and operate on their own sets of data independently, the way this algorithm is implemented has a profound impact on its performance. A common naive implementation suffers from non-coalesced writes; with care, threads within a block can instead coalesce their writes to global memory so that they occur together. The optimized transpose algorithm, with fully coalesced memory access and no bank conflicts, can be more than 10x faster for large matrices [19].

Figure 17, matrix transpose, GPU vs CPU

Figure 17 clearly shows the benefits of using the very high-speed memory present on modern graphics cards, as well as exploiting the advanced synchronized memory access features offered by CUDA. The optimized version runs approximately five times faster than the CPU version, whereas the naive implementation would have been roughly five times slower instead. This also illustrates how important knowledge of the underlying hardware is when developing CUDA algorithms.
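For context, the CPU side of such a comparison is essentially the straightforward nested-loop transpose shown below (a generic JAVA illustration of the baseline operation, not the exact benchmark code); the GPU version performs the same element swaps but stages them through shared memory so that global-memory accesses coalesce:

// Transposes a sizeY-by-sizeX matrix stored row-major in a flat array.
static float[] transposeCpu(float[] in, int sizeX, int sizeY) {
    float[] out = new float[in.length];
    for (int y = 0; y < sizeY; y++) {
        for (int x = 0; x < sizeX; x++) {
            out[x * sizeY + y] = in[y * sizeX + x];
        }
    }
    return out;
}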


Implementation Details

As previously stated, there are many components in this system, and getting them to work together was an iterative process. Therefore, in order to make the implementation details clearer, code segments will be explained in the chronological order of their creation. Even though the entry point to the system is the OS and the compile environment, their installation process will not be detailed, as it was long, tedious and consisted of a trial-and-error approach. The following chapters will therefore cover the implementation of the previously mentioned algorithms and the JNI automation.

Algorithms

SWIG & JNI

In order to make the CUDA algorithms accessible through JAVA, JNI interfaces had to be written. In earlier chapters it was mentioned that this can be a long and tedious process in which mappings have to be created for every method and argument type that needs to be exposed. To make matters worse, in order to make these callable through a shared library, special directives and headers are required to stop the C++ compiler from mangling the function names. The C++ compiler does that internally because, when all functions from all objects are collected, there could be clashes in the names. This is fine as long as references are resolved internally to the mangled names, but it becomes a problem when a function needs to be called externally.

In order to distance ourselves from these details, and to automate the generation of the JNI wrappers as much as possible, we decided to use SWIG. SWIG does just that, through the use of an interface file that describes all the methods that need to be exposed. It’s not as simple as it sounds, though. Since JAVA doesn’t support all the types that C supports, and in most cases these are not implemented identically, mapping types between JAVA and C can be very complicated. This is especially true for complex data types like arrays, pointers, unsigned types and objects.

Thankfully, SWIG comes with some predefined mappings for arrays. Since all of the algorithms use arrays to hold the data to be processed, we took advantage of this feature. Initial prototypes were quite slow and disappointing. The segment below shows an example of an early prototype of a SWIG interface file:

%module sqrt
%include <arrays_java.i>
%apply float[] {float *};

%{
extern void csquare(float *f);
%}

extern void csquare(float *f);


This is a simple interface that builds JNI wrappers for a C function that squares every element of the float array passed to it. There are upsides and downsides to this approach. The upside is that when this type of mapping is used (through the arrays_java.i interface file), it is possible to use native JAVA types and pass them directly to the C method, as shown below:

float[] f = {1, 2, 3};
sqrt.csquare(f);
for (float x : f) {
    System.out.println(x);
}

However, the downside of this mapping is that the conversion from the native JAVA type to the C array is done internally in the generated JNI wrapper file. What this means is that once the f variable is passed to the sqrt.csquare function, the JNI wrapper will allocate a new memory segment, copy each element to the new location in a C-compatible data structure, and then pass the reference to that new memory location to the C algorithm. The same is done once the C function has completed execution, only the other way around. This introduces two overheads at the cost of flexibility: first, for each array passed this way we need double the memory we would normally need; second, copying the elements back and forth between C and JAVA also carries a significant time cost.

For small datasets this would be a perfectly acceptable solution with minimal impact on the programming logic on the JAVA side. However, in the high-performance, large-dataset environments in which CUDA is most often used, such a performance hit could outweigh any possible performance gain of using CUDA in the first place. It would also lower the limit on the maximum dataset size, since the memory requirements double. This was a major hurdle, but further research resulted in the discovery of an alternative method that mitigates all of these problems.

Instead of copying the data back and forth, SWIG provides an alternative way of mapping data structures, by allowing C and JAVA to share the same memory location. This way, we don’t have to copy the entire dataset in order to make it available to the C program, but rather just pass a pointer reference to the location where the data is stored. This solves all of the previous problems, but introduces certain issues as well. First of all, this method of mapping prevents the use of JAVA’s built-in types. Instead, in order to instantiate variables and assign values to them, we have to use specialized functions provided by the JNI wrapper. For example, for an integer array, SWIG would generate the following wrappers:

int *new_intArray(int nelements);
void delete_intArray(int *x);
int intArray_getitem(int *x, int index);
void intArray_setitem(int *x, int index, int value);

This provides us with the full functionality needed to manipulate the data inside these types; however, there is another caveat. Once the data is passed to the C function, it becomes immutable, meaning that the results of any operations performed on this array cannot be stored back in the same array, since its elements cannot be overwritten. Therefore, in order to send source data to, and receive result data from, the C program using this method, two parameters had to be used – one holding the source data, and the other simply being an empty, initialized array that stores the resulting elements.
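From the JAVA side, the function-style wrappers above are consumed roughly as in the following sketch; the module class name cudalib and the native method process are hypothetical placeholders used purely to show the two-parameter (source plus empty result) calling pattern:

int n = 4;
SWIGTYPE_p_int src = cudalib.new_intArray(n);    // source data, allocated on the C heap
SWIGTYPE_p_int dst = cudalib.new_intArray(n);    // empty array that receives the results
for (int i = 0; i < n; i++) {
    cudalib.intArray_setitem(src, i, i * 10);
}
cudalib.process(src, dst, n);                    // assumed native call
int first = cudalib.intArray_getitem(dst, 0);    // read results back in JAVA
cudalib.delete_intArray(src);
cudalib.delete_intArray(dst);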

Page 30: CUDA algorithms for JAVA

30

The modified interface file, supporting this faster but less flexible method, looks as follows:

%module transpose
%include "carrays.i"
%array_class(float, floatArray)

%{
void transpose(float* h_data, float* result, int size_x);
%}

void transpose(float* h_data, float* result, int size_x);

What this does is map every float pointer argument to the floatArray type, as defined by SWIG. This creates all the wrappers required for creating and modifying the float elements within the array, as well as casting it to the pointer representation required by the CUDA algorithm.
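Assuming the interface above, the generated library is driven from JAVA roughly as shown below; the exact class and method names depend on what SWIG emits for this module, so treat this as an illustrative sketch rather than the library’s documented API:

int n = 1024;                                   // square matrix for simplicity
floatArray in  = new floatArray(n * n);         // array on the C heap, shared with JAVA by reference
floatArray out = new floatArray(n * n);
for (int i = 0; i < n * n; i++) {
    in.setitem(i, i);                           // fill with sample data
}
transpose.transpose(in.cast(), out.cast(), n);  // no copying of the dataset
float corner = out.getitem(0);                  // read the results back in JAVA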

After testing the performance of this alternative approach, the results were very encouraging. Previously, as the number of elements increased, the speed advantage started fading, since more and more data needed to be transferred between JAVA and the CUDA algorithm; with this method we saw performance increases of more than 10 times for the first time, making the overall project feasible again.

CUDA algorithm adaptation

Being able to generate efficient JNI wrappers from simple interface files, the time had come to modify the algorithms so that they could be called through the exposed methods.

This was quite a demanding task, as a lot of code had to be refactored. What had to remain in the end was a pure algorithm that just accepts the input data, runs it on the GPU and then places the results into a referenced memory location that JAVA can use. Each CUDA algorithm consists of two main parts:

1. data & device initialization program

2. CUDA kernel (executed on the graphics card)

The actual CUDA kernels did not have to be modified in any way: they are executed on the graphics card itself and called by the setup program, so there is no direct communication between them and JAVA. The first part of the program had to be modified, though. One of the simpler ones was the matrix transpose algorithm, shown below:

void transpose( float* h_data, float* result, int size_x, int size_y)
{
    // size of memory required to store the matrix
    const unsigned int mem_size = sizeof(float) * size_x * size_y;

    // declare device variables
    float* d_idata;
    float* d_odata;

    // initialize memory on device
    cutilSafeCall( cudaMalloc( (void**) &d_idata, mem_size));
    cutilSafeCall( cudaMalloc( (void**) &d_odata, mem_size));

    // copy host memory to device
    cutilSafeCall( cudaMemcpy( d_idata, h_data, mem_size,
                               cudaMemcpyHostToDevice) );

    // setup execution parameters
    dim3 grid(size_x / BLOCK_DIM, size_y / BLOCK_DIM, 1);
    dim3 threads(BLOCK_DIM, BLOCK_DIM, 1);

    // execute the kernel
    gputranspose<<< grid, threads >>>(d_odata, d_idata, size_x, size_y);
    cudaThreadSynchronize();

    // copy result from device to host
    cutilSafeCall( cudaMemcpy( result, d_odata, mem_size,
                               cudaMemcpyDeviceToHost) );

    // cleanup memory
    cutilSafeCall(cudaFree(d_idata));
    cutilSafeCall(cudaFree(d_odata));
}

Let’s run through this quickly, as the base logic used was the same across all algorithms, even though some were a lot more complex due to the usage of structs (custom classes) and data types.

As seen at the top, the first thing we have to do is calculate the memory size required to store the data; this will be needed in order to allocate the space on the graphics card. In this particular case, the data passed in is a two-dimensional unbounded array. The h_data variable holds the input data, while the size_x and size_y integer variables specify the number of rows and columns in the matrix we are transposing. Since in C unbounded arrays are passed as pointer references, there is no way to know how big they are; in order to stop at the last element and avoid overflowing into potentially protected memory space (thus crashing the program), these two variables serve as bounds. The memory size is simply calculated as the product of the number of rows and columns in the matrix and the size of the float datatype.

The second step involves declaring the variables that will hold the source and resulting data on the actual graphics card. Once that has been done, we can use the cudaMalloc function, which allocates memory on the device; we pass it the references to both the source and destination variables, along with their size.

With the memory space allocated, we can now copy the data sent from JAVA (held in the h_data variable) to the graphics card’s main memory. This is done using the cudaMemcpy function. The reason all CUDA calls are wrapped in cutilSafeCall is so that they fail gracefully if anything goes wrong.

The block under “setup execution parameters” sets the grid size and the number of threads that run per block, depending on the matrix size. Finally, the gputranspose call uploads the kernel to the graphics card and executes it on the data previously supplied; the call passes the references to the source and result memory locations, as well as the dataset size.

Finally, the cudaThreadSynchronize() function waits for the graphics card to complete the operation. Since the kernel launch is asynchronous, this ensures that the main program waits for the results to be ready on the graphics card before proceeding further. Once this has completed, the results on the graphics card (held in the d_odata variable) are copied into the blank result variable we passed from JAVA. The last two calls perform cleanup on the card, freeing the device memory that was allocated.

At this point the results are stored starting at the memory location to which the result variable points. We can now read them in JAVA using the previously mentioned SWIG wrappers.

Even though some other algorithm implementations are a lot more complicated than

this (GPU-Quicksort and FFT), this example portrays the general logic in a concise

and easy to understand manner.
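
To illustrate how such a wrapped algorithm is consumed from the JAVA side, the sketch below calls the transpose wrapper through the SWIG-generated classes. The module class name (transpose), the floatArray helper and its cast() method follow the SWIG conventions used elsewhere in this project, but the exact generated signature depends on the interface file, so this should be read as an assumed example rather than the library's definitive API.

public class TransposeExample {
    public static void main(String[] args) {
        // load the native CUDA-backed library (libtranspose.so / transpose.dll)
        System.loadLibrary("transpose");

        int sizeX = 256, sizeY = 256;
        floatArray input  = new floatArray(sizeX * sizeY);
        floatArray result = new floatArray(sizeX * sizeY);

        // fill the input matrix row by row
        for (int i = 0; i < sizeX * sizeY; i++) {
            input.setitem(i, (float) i);
        }

        // hand both buffers to the native transpose; the GPU does the work
        transpose.transpose(input.cast(), result.cast(), sizeX, sizeY);

        // read a transposed element back through the SWIG wrapper
        System.out.println(result.getitem(5));
    }
}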

The fast Fourier transform algorithm, on the other hand, uses a non-standard type called float2. This is essentially a struct consisting of a pair of floats, used to represent two-dimensional coordinate points. Since SWIG does not come with a predefined mapping for this type, the mapping has to be defined manually. The interface file below shows how:

%include "arrays_java.i"
JAVA_ARRAYSOFCLASSES(float2)

%module fft
%{
#include "fft.h"
%}

struct float2 {
    float x, y;
};

%extend float2 {
    char *toString() {
        static char tmp[1024];
        sprintf(tmp, "float2(%f,%f)", $self->x, $self->y);
        return tmp;
    }
    float2(float x, float y) {
        float2 *f = (float2 *) malloc(sizeof(float2));
        f->x = x;
        f->y = y;
        return f;
    }
};

As you can see, we first redefine the float2 struct and tell SWIG that it simply consists of two floats, x and y. The JAVA_ARRAYSOFCLASSES directive tells SWIG to generate a wrapper that can work with arrays of this struct. Finally, we extend the basic struct to give it some more functionality. The default constructor created by SWIG takes no arguments, so each coordinate would have to be set separately; the extension adds a constructor that accepts both coordinates at once, shortening the initialization process. A toString method is also implemented for convenience, so that the contents can be printed in a simple way.

Through this extending functionality we can also throw and catch exceptions,

reducing the possibility of random and untraceable crashes. This is one of the

improvements proposed for future versions of the library.
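
Assuming the interface above (and a native library named fft), SWIG generates a JAVA float2 proxy class whose extended constructor and toString method can be used directly; the array handling comes from the JAVA_ARRAYSOFCLASSES directive. A hypothetical usage sketch:

public class Float2Example {
    public static void main(String[] args) {
        System.loadLibrary("fft");   // the SWIG module defined above

        // build a small array of complex samples (x = real part, y = imaginary part)
        float2[] signal = new float2[8];
        for (int i = 0; i < signal.length; i++) {
            signal[i] = new float2((float) Math.sin(i), 0.0f);
        }

        // the %extend toString() makes the contents easy to print
        System.out.println(signal[3]);   // e.g. float2(0.141120,0.000000)
    }
}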


The last addition to the modified algorithms was the init function. It does not have to be called in order to execute the algorithms, but it is useful when performing batch runs. One of the issues with CUDA is that it has to bootstrap the runtime environment the first time it is invoked. This usually happens at the beginning of the program, when the available graphics cards are detected and the fastest one is chosen. The start-up time varies from computer to computer, but in my particular case it was 200 milliseconds on average. This was quite a performance hit overall, and it skewed the results for smaller datasets, where the average running times were well under 50 milliseconds.

In order to exclude the bootstrapping time from the algorithm execution timings, a

helper init function was created for each algorithm present in the library. By calling it

at program start-up, the CUDA runtime environment is created straight away and

doesn’t impact further algorithm executions. Here is the body of the init() function:

void init()
{
    cudaSetDevice( cutGetMaxGflopsDeviceId() );
    float* d_data = NULL;
    cudaMalloc((void **) &d_data, 1 * sizeof(float));
}

This code segment finds the fastest graphics card present on the system and sets it as the active device. It then creates a null pointer and allocates a single float on the graphics card. This is the point at which the CUDA runtime environment is bootstrapped, and it does not happen again for as long as the program is running. This proved very useful when benchmarking the algorithms, and would be useful in any situation where the user wants to make all preparations up front so that the algorithm executes instantly when called.
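
As a sketch of how this might be used from JAVA (again assuming the SWIG-generated transpose module from earlier), the runtime can be warmed up once before any timed work:

public class WarmUpExample {
    public static void main(String[] args) {
        System.loadLibrary("transpose");

        // pay the ~200 ms CUDA bootstrap cost once, up front
        transpose.init();

        // later calls into the library no longer include the start-up overhead
        floatArray in  = new floatArray(64 * 64);
        floatArray out = new floatArray(64 * 64);
        long start = System.currentTimeMillis();
        transpose.transpose(in.cast(), out.cast(), 64, 64);
        System.out.println("transpose took " + (System.currentTimeMillis() - start) + " ms");
    }
}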

Reusing the CUDA compiler configuration

With the modified algorithms and JNI wrappers in place, it was time to compile the

source files into a usable library. To do this under Linux, GCC and CUDA compilers

had to be used. Since the CUDA compiler (nvcc) is really a driver built on top of the host C compiler, and needs a C compiler to be present in order to compile CUDA files, it is in fact fully

capable of compiling any file that GCC can compile. Therefore, in the initial compile

environment a single long compile directive was used directly from the CUDA

compiler and it looked something like this, depending on the algorithm:

nvcc -shared \
     -I$JAVA_HOME/include -I$JAVA_HOME/include/linux \
     -I/home/kosta/NVIDIA_CUDA_SDK/common/inc \
     -Xcompiler -fPIC \
     -L/usr/local/cuda/lib -lcuda -lcudart \
     -o parallel_sort.so parallel_sort.cu parallel_sort_wrap.c

This worked well for simple examples, but as the complexity of the algorithms increased, it became less and less successful. The biggest issues were keeping the functions accessible externally (avoiding name mangling) and properly linking all components into the library. Eventually this method became a matter of trial and error more than anything else, making it unusable in the automated system that had to be built for the self-service compiler.

In the second attempt to achieve full automation, a compile script was used. This called the compile directives one by one, creating a more structured approach. First, all the C files would be compiled by GCC, followed by the CUDA files, and finally the linker would be executed to collect all the intermediate object files into a standalone library. A script to do this would look as follows:

gcc -fPIC -g -c transpose_wrap.c -I$JAVA_HOME/include -I$JAVA_HOME/include/linux
nvcc -c transpose.cu -o transpose.o
ld -shared -soname libtranspose.so -o libtranspose.so transpose_wrap.o transpose.o

As you can see, the three steps are clearly visible: GCC compiles the JNI wrapper, nvcc (the CUDA compiler) compiles the CUDA sources, and finally the GNU linker (ld) is used to create a shared library from the compiled object files. This technique worked

better, but eventually it started failing as well. Several algorithms wouldn’t compile

successfully because of unresolved reference errors. This method had to be discarded

too.

Further examination of the CUDA compile directives from the SDK uncovered a

common makefile script. Makefiles are used on Linux for organizing and determining

the sequences in which files are compiled. This is important, especially for bigger

projects, and provides a logical way of precisely handling every step of the

compilation process. The CUDA SDK makefile was very detailed, and covered many

situations in which my compilation techniques failed. Therefore, as a last option this script was reused and extended by adding the SWIG JNI generation steps, as well as the final linking step during which everything is combined into a shared library. The additions to this file included the following directives:

$(TARGET): swig makedirectories $(CUBINS) $(OBJS) Makefile
        $(VERBOSE)$(CXX) $(CXXFLAGS) -o $(OBJDIR)/$(PROJECT)_wrap.cpp.o -c $(SRCDIR)$(PROJECT)_wrap.cpp
        $(VERBOSE)$(LINKLINE)
        $(VERBOSE)mv *.java java/
        $(VERBOSE)javac java/*.java
        $(VERBOSE)rm -f *.linkinfo

cubindirectory:
        $(VERBOSE)mkdir -p $(CUBINDIR)

makedirectories:
        $(VERBOSE)mkdir -p java
        $(VERBOSE)mkdir -p $(OBJDIR)

swig:
        $(VERBOSE)swig -c++ -addextern -java $(PROJECT).i
        $(VERBOSE)mv $(PROJECT)_wrap.cxx $(PROJECT)_wrap.cpp

The additions instruct the common makefile script to run the SWIG generation process first, followed by the creation of directories for the JAVA files and object files,

followed by the compilation of the generated JNI wrapper file. The rest of the process

executed normally and was unchanged. This ensured that the resulting libraries would

have no missing references and that none of the function names would get mangled in

the process. More importantly, it resulted in a more streamlined process since all that


was required to fully compile a library was a simple Make script, such as the one

below:

#####################################################################
# Build script for project
#####################################################################

PROJECT  := transpose

# Cuda source files (compiled with cudacc)
CUFILES  := $(PROJECT).cu

# CUDA dependency files
CU_DEPS  := $(PROJECT)_kernel.cu

# Rules and targets
include ../common.mk

This is a very clean script with minimal inputs. Like the rest of the project, it follows the convention-over-configuration approach: as long as the files are named correctly, all that needs to be specified is the project name, and the rest is derived from that. Generating these files is a simple matter, which greatly simplifies the automation effort and removes any intermediate steps that could cause problems in the long run. With all the source files ready and the Linux library automation

complete, it was time to implement the same process on the Windows virtual

machine, and obtain the final piece of the puzzle.

The Virtual Machine

The first obstacle that stood in the way of successful addition of the Windows VM

into the existing system was the file exchange mechanism. There are many ways in

which a VM can connect to the existing network and its host, and many ways to

transfer the files between the two, so the simplest one had to be chosen in order to

minimize the number of places where something could go wrong.

After carefully considering all options, the final choice was to use the File Transfer Protocol (FTP). It is a long-established method of transferring files, and as such would serve as a good starting point. The second obstacle was getting the host FTP client and the virtual machine's FTP server to communicate. This could have been done through the network infrastructure, but that was not necessary since the virtual machine resided on the same computer as the host. Therefore, a direct connection between the two made more sense. This was done by forwarding ports between the host and the guest: for the purposes of FTP communication, port 21 on the guest was mapped to port 8021 on the host. The following commands made this possible:

VBoxManage setextradata "WindowsXP" \
    "VBoxInternal/Devices/pcnet/0/LUN#0/Config/guestftp/Protocol" TCP
VBoxManage setextradata "WindowsXP" \
    "VBoxInternal/Devices/pcnet/0/LUN#0/Config/guestftp/GuestPort" 21
VBoxManage setextradata "WindowsXP" \
    "VBoxInternal/Devices/pcnet/0/LUN#0/Config/guestftp/HostPort" 8021


With the guest and host being able to communicate with each other, it was time to

install the necessary FTP servers. The Cerberus FTP Server [20] was used on the

Windows machine in order to accept any incoming connections.

The last step required was to write a script that would upload the source files via FTP

to the VM, call a script on the VM that would find these files and compile them, place

them back in the FTP home directory and notify the host script that the compiled

library is ready to be picked up. The host script would then download the resulting

package and extract it to the appropriate location. To do this, python scripts were used

on both the host and the VM. First, a snippet from the host script:

ftp = FTP()
FTP.connect(ftp, 'localhost', 8021)
ftp.login()
ftp.storbinary("STOR " + os.path.basename(f.name), f, 1024)
ftp.quit()
os.remove("%s/%s.zip" % (tmp, algorithm))

url = "http://127.0.0.1:8080/ccompiler/index.py?f=%s.zip" % algorithm
urllib.urlopen(url)

FTP.connect(ftp, 'localhost', 8021)
ftp.login()
ftp.retrbinary("RETR %s_compiled.zip" % algorithm,
               open("%s/%s_compiled.zip" % (tmp, algorithm), 'wb').write)
ftp.delete('%s_compiled.zip' % algorithm)
ftp.quit()

unzip("%s/%s_compiled.zip" % (tmp, algorithm),
      "%s/algorithms/%s/windows" % (cwd, algorithm))
os.remove("%s/%s_compiled.zip" % (tmp, algorithm))

This is not the complete script, but it shows the logic. The host connects to the VM's FTP server (forwarded to host port 8021) and uploads the zip package with all the required sources. Once this is done, it calls the Python script on the VM over HTTP, passing it the name of the algorithm it just uploaded. This script will be

explained in more detail later on. Once it has completed, the host connects to the FTP

server again and downloads the compiled package to a temporary area. Finally it

extracts it and cleans up any files that are no longer required. The following is the

snippet from the script running on the VM:

def handler(req):
    req.content_type = 'text/html'
    home = 'C:/wamp/www/ccompiler'
    filename = params.getfirst("f")
    algorithm = os.path.splitext(filename)[0]

    unzip('c:/ftproot/%s' % (filename), '%s/extracted/%s' % (home, algorithm))
    os.chdir('%s/extracted/%s/' % (home, algorithm))

    files = glob.glob(os.path.join('%s/extracted/%s/' % (home, algorithm), 'make.bat'))
    if(len(files) == 0):
        os.system('nvcc.exe -shared -I"C:\Program Files\NVIDIA CUDA SDK\common\inc" '
                  '-I"C:\Program Files\Java\jdk1.6.0_12\include" '
                  '-I"C:\Program Files\Java\jdk1.6.0_12\include\win32" '
                  '-IC:\CUDA\include -l"C:\CUDA\lib\cudart" -l"C:\CUDA\lib\cutil32" '
                  '-o %s.dll %s.cu %s_wrap.cpp' % (algorithm, algorithm, algorithm))
    else:
        os.system('make.bat')

    zip = zipfile.ZipFile('c:/ftproot/%s_compiled.zip' % algorithm, 'w')
    for f in os.listdir(os.getcwd()):
        if (f.find('.dll') != -1):
            zip.write('%s' % f)

As you can see from this script, the Windows environment actually uses a single compile directive by default. This was done because the makefile script from

Linux could not be reused. In case the project is too complex to compile with that

simple directive, an optional make.bat file can be supplied with the required directive.

This is used for the FFT algorithm. The rest of the script simply takes care of

extracting the file to a working directory, locating the library once the compilation has

finished, and then packaging it and moving it back to the FTP home directory, where it can be picked up by the host.

Test suites

In order to test the performance, a benchmark program was written for each algorithm. The benchmarks consist of the following components:

- a CPU equivalent of the algorithm to compare the performance against (a minimal sketch of such a reference is shown after this list)
- a comparison algorithm for measuring the difference between the two resulting datasets
- initialization of the data and timing of each run
- a CSV file storing the results
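
As a concrete example of the first component, a CPU reference for the matrix transpose can be written in plain JAVA; the GPU output is then compared against an array like the one this method returns. This is only an illustrative sketch, not the exact code used in the test suite.

// Plain JAVA reference transpose used to validate (and time against) the GPU version
public class CpuReference {
    public static float[] transpose(float[] in, int sizeX, int sizeY) {
        float[] out = new float[in.length];
        for (int y = 0; y < sizeY; y++) {
            for (int x = 0; x < sizeX; x++) {
                // element at (x, y) moves to (y, x)
                out[x * sizeY + y] = in[y * sizeX + x];
            }
        }
        return out;
    }
}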

Finally, a shell script is used in order to compile all the required files, run the

benchmarks and tests, and plot the CSV files on a graph using GNU Plot. Let us look

at one such benchmark suite. It should give us a good indication of how the algorithm

is supposed to be used in a real world example.

The first step is to load the library. This is done dynamically at runtime. As mentioned in the first chapter, JAVA has two ways of doing this: it can either resolve a library from an absolute filepath, or from its library name. The second approach is a lot more flexible because it comes with some extra functionality. If the library is loaded on a Windows platform, the JVM will automatically append the .dll extension and try to load it that way; if it is running on Linux, it will prepend "lib" and append .so to the given name, as per convention. The downside is that the library we are calling needs to reside in one of the directories specified on the java.library.path. Below is the actual statement:

System.loadLibrary("blackscholes");

This will give us instant access to any methods available within that library.
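
For comparison, the absolute-path variant described above would use System.load instead. The path shown here is only an assumed example, following the directory layout used by the benchmark script later in this chapter.

// Option 1: absolute path, no platform-specific name resolution (assumed path)
System.load("/var/www/cuda4j/algorithms/blackscholes/linux/libblackscholes.so");

// Option 2: library name only; the JVM adds "lib"/".so" on Linux or ".dll" on
// Windows and searches the directories listed in java.library.path
System.loadLibrary("blackscholes");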

Initialization of the elements is quite straightforward, but shows that we cannot use JAVA-provided types for passing the dataset to the C library:

floatArray cudaResultCall = new floatArray(size);

Instead of a standard JAVA float[], we need to use the SWIG-provided floatArray class. Once the elements have been initialized, we simply run the algorithm on the GPU and the CPU, timing both runs separately. In order to reduce the effect of background tasks influencing the execution times, each test run is executed several times and the average running time is taken.
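
A sketch of that repeat-and-average timing is shown below. The number of runs and the runGpuBenchmark() placeholder are illustrative assumptions, not names taken from the actual test suite.

public class TimingSketch {
    // placeholder for one GPU execution of the algorithm under test
    static void runGpuBenchmark() {
        // a call into the SWIG wrapper would go here
    }

    public static void main(String[] args) {
        final int RUNS = 5;                  // assumed repeat count
        long totalMs = 0;
        for (int run = 0; run < RUNS; run++) {
            long start = System.currentTimeMillis();
            runGpuBenchmark();
            totalMs += System.currentTimeMillis() - start;
        }
        System.out.println("average: " + (totalMs / RUNS) + " ms");
    }
}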

Finally, in order to ensure that we are getting correct results from the graphics card,

the returned dataset is compared to the reference result executed on the CPU. For

algorithms that deal with integers or don’t modify the actual data, a simple one to one

comparison is sufficient. However, with algorithms that perform complex

mathematical operations on floating point values, we need to ensure that the resulting

data is close enough to the reference. For the Black-Scholes and fast Fourier

transform algorithms, this was necessary. The algorithm chosen for comparison was

the L1 norm, also known as rectilinear distance. This is used because floating point operations are only expected to be accurate to within a small, generally accepted margin of error. By calculating the L1 distance between the GPU result and the reference result, we can check whether it conforms to this expected standard of precision. The algorithm for this is shown below:

// Calculate L1 (rectilinear) distance between CPU and GPU results
public static double L1norm(float[] reference, floatArray cuda){
    double sum_delta = 0;
    double sum_ref = 0;
    for(int i = 0; i < reference.length; i++){
        double ref = reference[i];
        double delta = Math.abs(reference[i] - cuda.getitem(i));
        sum_delta += delta;
        sum_ref += Math.abs(ref);
    }
    return sum_delta / sum_ref;
}

If the distance is within this margin, the GPU result is accepted and considered valid.
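
A hypothetical acceptance check around this method might look as follows; the tolerance is deliberately left as a parameter, since the exact threshold value is not reproduced here.

// Accept the GPU result only if its L1 distance from the CPU reference is
// within the chosen tolerance (value supplied by the tester).
static boolean resultsMatch(float[] reference, floatArray cuda, double tolerance) {
    return L1norm(reference, cuda) < tolerance;
}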

With all the timings for various dataset sizes completed, and all the tests passed, the

last step is to read the produced CSV file and plot a graph. GNU Plot was chosen for

this purpose since it is a versatile and powerful tool for reading CSV files and

displaying graphs. A small script is required in order to specify certain parameters, as

shown below:

set title "2D Matrix Transpose"
set xlabel "Elements"
set ylabel "Milliseconds"
set logscale x 2
set data style linespoints
set grid ytics
set terminal png size 640,480
set output "benchmark.png"
plot "benchmark.txt" using 1:2 title "CPU", \
     "benchmark.txt" using 1:3 title "CUDA" ls 3
set terminal wxt
set output
replot

Using a script like this, it is possible to define the axis labels, graph title, size of the output file, input files, type of graph and so on. The above script produces the graphs shown in the previous chapter under the CUDA Algorithms heading. To tie all of this together, a shell script was written to perform the clean-up, compiling, building, execution and graph display for any given algorithm. This was done to enable quick profiling of any changes and tweaks. It is shown below:

#!/bin/bash
if [ $# -ne 1 ]; then
    echo 1>&2 "Usage: $0 algorithm_name"
    exit 127
fi

algorithm=$1
rm -f algorithms/$algorithm/test/*.class

# Recompile the test classes if any are missing
if [ `find . -name "*.class" | wc -l` -lt `find . -name "*.java" | wc -l` ]; then
    javac -classpath algorithms/$algorithm/java/:algorithms/$algorithm/test/ \
          algorithms/$algorithm/test/*.java
fi

java -Xms128m -Xmx256m \
     -Djava.library.path=/var/www/cuda4j/algorithms/$algorithm/linux/ \
     -classpath algorithms/$algorithm/java/:algorithms/$algorithm/test/ \
     Test

gnuplot plot

The web site

General features

The website is built on top of the Zend Framework. The supporting JavaScript library used for effects and beautification is jQuery. The two in combination form a widely used and thoroughly tested platform for building feature-rich, modern web applications. A MySQL database is used for storing the algorithm information.

Getting the Zend Framework up and running is probably the most time-consuming operation, because it requires a bootstrap file. This file determines everything about the running instance, from the database configuration and directory structure to the front controller definitions and session information. The functionality of the library download page, the self-service compiler and TRAC is covered below.


Library download page

The purpose of this page is to let the user hand-pick the library components he/she is interested in. This “build it yourself” approach has been very popular with many successful frameworks and libraries on the internet, so the same approach was taken here. This way, users get exactly what they need, and clutter and bandwidth waste are kept to a minimum.

Figure 6 shows the screenshot of this page. All the individual algorithms are grouped

under category headings for easier navigation. A checkbox is positioned next to each

one for quick and easy selection. Finally, the download button submits the user's selections to the web server, which returns a zip archive with the packaged choices. The PHP code below performs this functionality:

public function downloadAction(){
    if($this->_request->getPost("lib")){
        $libs = $this->_request->getPost("lib");
        $file = "/tmp/".session_id().".zip";
        $command = "zip -q ".$file;
        foreach($libs as $lib){
            $command .= " algorithms/".$lib."/linux/*.so"
                      ." algorithms/".$lib."/windows/*.dll"
                      ." algorithms/".$lib."/java/*.class"
                      ." algorithms/".$lib."/test/Test.java";
        }
        if(file_exists($file)) @unlink($file);
        exec($command);

        header("Pragma: public");
        header("Expires: 0");
        header("Cache-Control: must-revalidate, post-check=0, pre-check=0");
        header("Cache-Control: private", false);
        header("Content-Type: application/zip");
        header("Content-Disposition: attachment; filename=\"cuda4jlib.zip\";");
        header("Content-Transfer-Encoding: binary");
        header("Content-Length: ".filesize($file));
        readfile($file);
        exit();
    }
}

The first line reads the posted form data. The method then constructs a zip command by iterating through the selected algorithms and adding each one to the

archive. Finally, it constructs all the necessary HTTP headers for sending a file to the

user, and then proceeds to send it. The user receives the archive with all the classes

and libraries needed to run the algorithms.

Self-service compiler

The self-service compiler is one of the main features of the website. Getting the

compile environment in place was a big task to start with, but most of the effort went

into the automation of the JNI wrapper generation and synchronized compiling on

both Windows and Linux platforms. Exposing this functionality to the outside world

was one of the key drivers behind the project. With the ability to perform self-service

compiling, the user can write CUDA algorithms without writing a single line of JNI glue code, and without having to install the CUDA compile environment and the backing C compilers. This is a great time saver and considerably lowers the barrier to entry for JAVA/CUDA programming. The PHP code below shows how the JNI wrapper generation and CUDA compiling is kicked off from the web application:

$formData = $this->_request->getPost();
if ($form->isValid($formData)) {
    // success - do something with the uploaded file
    $uploadedData = $form->getValues();
    $project = $uploadedData['name'];
    $exposed = $uploadedData['exposed'];
    $mappings = $uploadedData['mappings'];
    $fullFilePath = $form->file->getFileName();
    $destinationPath = APPLICATION_PATH.'/data/uploads/'.session_id().'/'.$project;
    if(!file_exists($destinationPath)) mkdir($destinationPath);
    @exec("unzip " . $fullFilePath . " -d " . $destinationPath);

    // Swig interface file generation
    $swigfile = $destinationPath."/".$project.".i";
    $fh = fopen($swigfile, 'w') or die("can't open file");
    fwrite($fh, $this->generateSwigInterface($project, $exposed, $mappings));
    fclose($fh);

    // Linux makefile generation
    $makefile = $destinationPath."/Makefile";
    $fh = fopen($makefile, 'w') or die("can't open file");
    fwrite($fh, $this->generateMakefile($project));
    fclose($fh);

    chdir(APPLICATION_PATH.'/data');
    copy("/var/www/cuda4j/algorithms/common.mk",
         APPLICATION_PATH.'/data/uploads/'.session_id().'/common.mk');
    exec("python compile.py ".session_id()." ".$project);
    $this->downloadAction($project, $destinationPath);
    exit();
}

Since most of the system was automated to begin with, all that is left is some simple code that makes system calls to the existing scripts, which do the grunt work. All the automation effort paid off in the end.

Figure 8 shows the self-service compiler form. It takes minimal input, but assumes

certain things about the input archive. The file must be submitted as a zip, containing

all the source files that need to be compiled. Also required is a C header file that

contains all the functions that are to be exposed. By exposed we mean accessible in the final JAVA-compatible library. From this header file the compiler will know how to construct the JNI wrappers around the library, so if a submitted archive does not contain one, the system will reject it. The project name field is self-explanatory, which leaves the “pointer mappings” field as the only ambiguous one. In this early prototype of


the application, only simple SWIG interface file generation is supported, and as such

only one type mapping. This is a quick way to handle arguments referenced as

pointers. For example, a float pointer might be mapped to a floatArray if that is its

intended use. Future versions should offer a more fine-grained way of editing the SWIG interface file, but as a proof of concept, the form simply accepts a comma-separated list of types to map.

Project Tracking (TRAC)

The final component of the page is the project tracker. This is a much-needed part of any open-source application or library. There is no need to reinvent the wheel in this area, as there are many excellent free applications for project management. TRAC is no exception, and is probably the most widely used platform in the open-source community. As such, it was the system of choice for our library too. Its deployment on Ubuntu is relatively straightforward. Once set up, it was connected to SVN, the Subversion repository where all the algorithm code is checked in. This

enables browsing and comparing of various revisions from within TRAC itself. It can

also tag the major release versions and streamline the roadmaps, milestones and bug

logging into one harmonious system. With TRAC in place, our system was ready for

contributions from any interested parties.

Conclusion

Meeting the Objectives

This project was a venture into the unknown. The idea was to make the benefits of NVIDIA CUDA technology available to a wider audience in a ready-to-use package. The feasibility of turning this idea into reality was unknown at the time the project was initiated. As such, it was an incremental process, with small steps being completed every week. It was quite an ambitious undertaking as well, since nothing of the sort had been done before. There are several projects, such as JaCUDA [21], that try to

achieve a similar goal, but they are complicated to set up, have numerous

dependencies without which they cannot be used and offer a very limited feature set.

Instead, we didn’t want any compromises, but a system that even a JAVA beginner

could use.

With that said, it wasn’t a smooth process either. There were many situations in which

it was necessary to take a step back and re-think parts of the system. Certain aspects

had to be re-developed from scratch several times in order to achieve the performance

and flexibility required. It is also important to note that CUDA is a brand new

technology, barely over two years old, so supporting documentation and technical papers were difficult to procure. The obtained materials mostly consisted of

official SDK examples and documentation, and some stray university courses with

on-line materials. By diving into the JAVA world and trying to unify the two, we

were left on our own.


With that in mind, I can say with confidence that most of the initial objectives were

met. Not only that, but progress was incremental and kept pace with every week's milestone meeting. Throughout, the emphasis was on making a successful

proof of concept rather than a fully polished product. There are still many areas left

for improvement in the work that has been done, such as safer handling of errors,

more advanced type mapping for the self-service compiler and a wider choice of

algorithms within the precompiled library showcase. This only shows that the project has the potential to grow, and with sufficient help and support from the open-source

community, it could turn into a versatile product that many would find useful.

Taking a different approach

The method used for achieving the project objectives was chosen early on. There were many paths that could have been taken, and only one could be chosen. As is usually the case with complex systems, each solution had its advantages and disadvantages. Looking back, several different approaches might have been taken, depending on the intended usage.

First and foremost, the fact that these libraries are platform-dependent and only

support Linux and Windows is far from perfect considering the versatility of the JVM.

Secondly, the algorithms in their current state are very vulnerable to erroneous input

and can crash quite easily considering there is presently no mechanism for throwing

exceptions. To reduce these negative effects, a library called JCUBLAS [22] could

have been used in order to re-build these algorithms from scratch, entirely in JAVA.

What JCUBLAS does is provide JNI wrappers for the entire NVIDIA CUDA BLAS

(basic linear algebra subprograms) set of functions. This way, any CUDA program

that uses BLAS functions can be rewritten in JAVA code.

There are a few downsides to this approach. As I previously mentioned in detail, this

method will copy and duplicate the memory contents between the JVM and CUDA.

This process was found to be very slow for larger datasets, and in most cases it

completely outweighed any benefits of using CUDA in the first place. Secondly, such

an implementation suffers from the same problem as our own – it is platform dependent, and the JCUBLAS library specific to the running operating system needs to be present. On the other hand, the upside of this method is that it is extremely flexible, since there are no special data types to be used. Also, no third-party compilers or wrappers are needed to compile the programs, making the end product a lot more robust and less error-prone. Considering CUDA is primarily a high-performance

application, introducing such a performance hit was deemed unacceptable, although

for certain algorithms where speedups are measured in orders of magnitude, this

approach might be better suited.

Lastly, there is always a chance that JAVA will incorporate a CUDA-consuming library into its own JRE package. Naturally, this would be the best solution, as it would remove all of the overheads associated with alternative methods, and also provide much-needed platform independence and stability. Due to licensing reasons

and the fact that CUDA is not open source and uses its own proprietary compiler, the

chances of this happening in the near future are minimal. As such, an intermediary

solution such as the one we achieved with this project will be a welcome addition to

the growing CUDA community.


Bibliography

[1]. Dean, Jeffrey and Ghemawat, Sanjay. MapReduce: Simplified Data Processing on Large Clusters. OSDI 2004.

[2]. Client Statistics by OS. Folding@Home. [Online] 16 May 2009. [Cited: 17 May 2009.] http://fah-web.stanford.edu/cgi-bin/main.py?qtype=osstats.

[3]. Fact Sheet & Background: Roadrunner Smashes the Petaflop Barrier. IBM Press room. [Online] IBM, 09 June 2008. [Cited: 17 May 2009.] http://www-03.ibm.com/press/us/en/pressrelease/24405.wss.

[4]. Vaquero, Luis M. A break in the clouds: towards a cloud definition. ACM SIGCOMM Computer Communication Review. 2009, Vol. 39, 1.

[5]. Moore's Law: Made real by Intel® innovation. Intel. [Online] Intel. [Cited: 17 May 2009.] http://www.intel.com/technology/mooreslaw/.

[6]. Intel will demo its first multi-core CPU at IDF. EE Times. [Online] United Business Media, Sept 2004. [Cited: 17 May 2009.] http://www.eetimes.com/news/semi/showArticle.jhtml?articleID=46200165.

[7]. AMD “Close to Metal”™ Technology Unleashes the Power of Stream Computing. AMD Newsroom. [Online] AMD, 14 November 2006. [Cited: 18 May 2009.] http://www.amd.com/us-en/Corporate/VirtualPressRoom/0,,51_104_543~114147,00.html.

[8]. NVIDIA CUDA Programming Guide, version 2.2. s.l. : NVIDIA, 2009.

[9]. GeForce GTX 295. NVIDIA Web. [Online] NVIDIA. [Cited: 19 May 2009.] http://www.nvidia.com/object/product_geforce_gtx_295_us.html.

[10]. SWIG. Simplified Wrapper and Interface Generator. [Online] University of Chicago. [Cited: 20 May 2009.] http://www.swig.org/.

[11]. Trac. Integrated SCM & Project Management. [Online] Edgewall. [Cited: 20 May 2009.] http://trac.edgewall.org/.

[12]. VirtualBox. Licensing FAQ. [Online] Sun. [Cited: 22 May 2009.] http://www.virtualbox.org/wiki/Licensing_FAQ.

[13]. Chalopin, Thierry and Demussat, Olivier. Parallel Bitonic Sort on MIMD shared-memory computer. Metz : Supelec, 2002.

[14]. Kider, Joseph T. GPU as a Parallel Machine: Sorting on the GPU. Philadelphia : Penn Engineering CIS, 2005.

[15]. Cederman, Daniel and Tsigas, Philippas. A Practical Quicksort Algorithm for Graphics Processors. Goteborg, Sweden : Chalmers University of Technology, 2008.

[16]. Kolb, Craig and Pharr, Matt. GPU Gems 2: Programming Techniques for High-Performance Graphics and General-Purpose Computation. s.l. : Addison-Wesley Professional, 2005.

[17]. Podlozhnyuk, Victor. Black-Scholes option pricing. s.l. : NVIDIA Corporation, 2007.

[18]. Morita, Kiyoshi. Applied Fourier transform. s.l. : IOS Press, 1995.

[19]. NVIDIA Corporation. Matrix Transpose Source Code. 2008.

[20]. Cerberus FTP Server. Cerberus Software. [Online] Cerberus LLC. [Cited: 17 April 2009.] http://www.cerberusftp.com/.

[21]. JaCUDA. SourceForge. [Online] [Cited: 24 May 2009.] http://jacuda.wiki.sourceforge.net/.

[22]. JCUBLAS. [Online] [Cited: 23 May 2009.] http://javagl.de/jcuda/jcublas/JCublas.html.

[23]. Kirk, David and Hwu, Wen-mei. CUDA Textbook. 2009.

[24]. Liang, Sheng. The Java Native Interface. s.l. : Addison-Wesley, 1999.