

Towards Rapid Prototyping of Parallel and

HPC Applications (GPU Focus)

(MSc. Project Final Report)

Submitted to the faculty of

The School of Computing at The University of Utah

in partial fulfillment of the requirements for the degree of

Master of Science

in

Computer Science

By

Mohammed S. Al-Mahfoudh, u0757220

May 1, 2013


Copyright © Mohammed Saeed Al-Mahfoudh, 2013

All Rights Reserved


THE UNIVERSITY OF UTAH GRADUATE SCHOOL

SUPERVISORY COMMITTEE APPROVAL

of a project report/thesis submitted by

Mohammed S. Al-Mahfoudh

This thesis has been read by each member of the following supervisory committee and by majority vote

has been found to be satisfactory.

Date Chair: Prof. Ganesh Gopalakrishnan

Date Prof. Mary Hall

Date Prof. Zvonimir Rakamaric


THE UNIVERSITY OF UTAH GRADUATE SCHOOL

FINAL READING APPROVAL

To the Graduate Council of the University of Utah:

I have read the thesis of Mohammed S. Al-Mahfoudh in its final form and have found that (1) its format,

citations, and bibliographic style are consistent and acceptable; (2) its illustrative materials including figures,

tables, and charts are in place; and (3) the final manuscript is satisfactory to the Supervisory Committee and

is ready for submission to The Graduate School.

Date Chair: Prof. Ganesh Gopalakrishnan

Martin Berzins

Chair/Director

Charles A. Wight

Dean of The Graduate School


Abstract

Developing for highly parallel architectures is hard, time consuming, and error prone, and it takes a great deal of developer focus and effort to produce a production-quality application. This is counterproductive, and it is not known in advance whether the result will be worth the effort. In this work we take a complete overview of prototyping the parallelization of an application from sequential to multicore and GPU architectures, with a focus on GPUs, in an effort to find a more developer-friendly means of achieving these goals.

In this project report, we share our experience and results from our efforts to find a faster way to program the prevalent accelerator devices by porting a benchmark from the PARSEC benchmark suite. Both multicore CPUs and GPUs are prevalent architectures in HPC and supercomputing, in mainstream computers, and even in mobile devices such as phones and tablets, and they are used in almost every possible workload; it is important for everyone's computing experience and investment to utilize such hardware. The difficulty with massively parallel programming is that such programs are not easy to get right within the usually allowable time frame, nor is there good enough support to program these devices in a more modular way. During our efforts to port this benchmark, we try to assess and decide which framework(s) offer the fastest way to harness such processing power, if not for production-quality applications then at least for prototyping them. Prototyping is important for establishing correct behavior and results, for assessing whether parallelization is worthwhile, and for finding subtle issues early in development rather than discovering them only later. We then document our findings and propose a workflow, built around the framework(s) of choice, to facilitate the adoption of such high-performance devices.


Table of Contents

1 Introduction
    1.1 Acknowledgment
    1.2 Introduction
        1.2.1 Motivation and Maturation of project direction
        1.2.2 The Thesis
    1.3 About this project and contribution(s)
        1.3.1 Significance of this project
        1.3.2 Industrial and Scientific Significance of this project
        1.3.3 Difficulties, Intellectual challenges and risks faced
        1.3.4 How difficulties and risks were solved or avoided
        1.3.5 Organization of the document

2 Top Frameworks Evaluation: The feature sets evaluation
    2.1 PyCUDA [1, 2]
        2.1.1 Other thoughts and possible workaround for some PyCUDA disadvantages
    2.2 CopperHead [3, 4]
    2.3 MPI for Python [5]
    2.4 Thrust - A Parallel Algorithms Library [6]
        2.4.1 Features viewed differently
    2.5 Best of all frameworks/libraries
        2.5.1 Criteria of Comparisons
        2.5.2 Why One chosen over other(s)


            jCuda vs PyCuda vs CUDA C/C++
            PyCuda vs CopperHead vs MPI for Python
            Thrust vs CopperHead
    2.6 Best prototyping-to-production Workflow Proposed
        2.6.1 C++-based and an evolutionary prototype
        2.6.2 Python based yet a throw-away prototype
        2.6.3 Optimization of both C++ and Python based Prototypes
    2.7 Related work

3 CopperHead: An Embedded Data Parallel Language
    3.1 Primer
    3.2 The internals: How it works [4, 3]
        3.2.1 Compiler Architecture Abstraction [4, 3]
    3.3 Restrictions, Constraints and Language Specifications [4]
        3.3.1 Language Specifications [4]
        3.3.2 How it determines Synchronization points [4]
        3.3.3 Shape analysis [4]
    3.4 Performance and charts [4]

4 Black-Scholes Benchmark: The Exercise
    4.1 Introduction
        4.1.1 The reason(s) behind porting this benchmark
        4.1.2 Previous implementations/references
        4.1.3 Suitability for GPU/CPU acceleration
        4.1.4 Blackscholes operation abstraction: Overview of how the algorithm works
        4.1.5 Mathematical background [7, 8]
    4.2 CUDA C Implementation (similar frameworks apply)
        4.2.1 Mapping the algorithm to GPU blocks and threads
        4.2.2 Trade offs considered
        4.2.3 Synchronization and communication between streams
    4.3 CUDA C version
    4.4 Copperhead Version
    4.5 Thrust Version
    4.6 Comparison
        4.6.1 Performance and Productivity comparisons
    4.7 Failure case(s) and solution(s)
    4.8 Summary
        4.8.1 What has been learned?
        4.8.2 What was expected and how it differs/similar to expectations?
    4.9 Conclusion

5 Conclusion, Findings and Lessons Learned
    5.1 Contributions and Findings
    5.2 Future Work


List of Tables

2.1 Summary of the most important features from several frameworks that are of interest to fast prototyping of parallel applications
4.1 Time taken to complete calculations per implementation
4.2 Lines of code required for a framework-specific implementation of the benchmark and rough time estimation of how long it took to complete and debug it


List of Figures

2.1 Proposed work-flow diagram illustrating the first major step in rapid prototyping of parallel programs using Thrust
2.2 Proposed work-flow diagram illustrating the first major step in rapid prototyping of parallel programs using Copperhead, PyCuda and MPI4py
2.3 Using an evolutionary development model, one can optimize or migrate to another parallel platform starting from a prototype (Figure 2.2)
4.2 Performance comparison of multiple implementations, each done in a target framework, smaller data sets
4.3 Performance comparison of multiple implementations, each done in a target framework, larger data sets
4.1 Mapping of blackscholes to a CUDA GPU is a straightforward process since it is an embarrassingly parallel problem


List of listings

2.1.1 Example PyCuda kernel declaration and implementation, passed to PyCuda as a string in a module
2.1.2 Example PyCuda kernel call for the above defined kernel in Listing 2.1.1
2.1.3 Example PyCuda kernel and its call [1]
2.3.1 Example MPI4Py program showing object-based communication between processes [5]
2.3.2 Example MPI4Py program showing numpy integration for higher-performance communication between processes [5]
2.4.1 Example Thrust sort showing how abstractly Thrust can be used, like any normal C++ library, without worrying about device-specific details [6]
2.7.1 OpenACC tiny example showing how directives are used to parallelize a simple program, SAXPY [9]
2.7.2 OpenHMPP tiny example showing how directives grow huge for the simplest parallel loop(s) [10]
2.7.3 A sequential matrix multiply written in C++ to be transformed into a parallel MatrixMul using HPP constructs [11]
2.7.5 Swift is a proven scientific language but its syntax is indeed confusing and is not exactly fast and easy to learn for prototyping; program taken from a beginners' tutorial on the Swift official website [12]
2.7.4 HPP example program illustrating the use of different framework constructs to achieve heterogeneous-device, task-parallel and data-parallel computations from C++ code [11]
2.7.6 STREAM Triad done in CUDA C, not including the necessary host code to run it, which may be significant too [13]
2.7.7 STREAM Triad done in Chapel and targeting GPUs, the whole code ready to run [13]


2.7.8 STREAM Triad done in Chapel and targeting a cluster, ready to run [13]
3.2.1 Copperhead example benchmark implementation with an illustration of how places is used to specify the target device(s) [3]
3.2.2 Sample cache directory structure after running the above benchmark
3.2.3 GPU source output from the CopperHead program above
3.2.4 Host source output from the Copperhead program above, used to run the GPU program in Listing 3.2.3
3.3.1 Valid and invalid examples of tuple binding/access inside a Copperhead procedure


Chapter 1

Introduction


1.1 Acknowledgment

It is an honor to acknowledge my professors' contributions to the success of this project and to raising my understanding to a higher level. I would sincerely like to thank all of my graduate advisory committee professors for their limitless support and for guiding me along the way to success. They offered me advice, resources, references, and expertise that made my life much easier than anyone could expect when tackling a hard problem in computer science. Their continuous communication and their patience in explaining, reviewing, and correcting my work were a cornerstone during, and even before, this project.

I would like to thank my family for their support, especially my wife Fatimah for her patience and non-stop encouragement. She took care of all my home needs and handled official appointments and paperwork perfectly. She kept pushing me towards success whenever I was worn down by exhaustion and the never-ending worry of missing a beat. She is also the one who kept me healthy and well dressed, getting me out of my office and giving me the best breaks from study and work on this project at exactly the times I most needed them.

I would like to express my great appreciation to The University of Utah's School of Computing for admitting me into this prestigious school in the first place. The culture of this leading school is like nothing I had encountered before: friendly in every way, so that being at school feels no different from a fun daily trip, with time and work enjoyed in a unique mixture of care and cooperation between students and professors alike. By admitting me to pursue my MS degree in Computer Science, it gave me the chance to meet the best people in my life. The support of my professors, family, and friends was, and still is, so crucial that without it I would probably not have succeeded in my graduate study. Thank you all for everything.


1.2 Introduction

Developing parallel applications is difficult and error prone, and sometimes ends with little useful acceleration and/or no deployable, production-quality application. In addition, it is very costly in terms of programmer time, as well as the effort and thought put into making such applications work. Whether parallelization is worthwhile only becomes clear after all of the above costs, and more incurred along the way, have been paid. This is counterproductive in many application fields. Faster ways to "prototype" such applications are needed in order to assess them and to learn the subtle details early. The lack, or even absence, of formal-methods aids, together with the extremely high price of advanced debuggers for massively parallel applications, makes the task even more difficult when budget is an issue. As a result, parallel hardware architectures go mostly underutilized because of these difficulties and because investing in such projects is infeasible for most application vendors. The debuggers and profilers bundled with the frameworks are helpful, but they are not suited to debugging mega-scale parallelism.

Programmers currently face a lot of problems. Programmers of massively parallel programs, myself included, either build a sequential prototype to convince themselves their programs will work correctly, or try to simulate programs they do not understand well in order to find ways to parallelize them. Most such simulations are simply wrong, either because they target too wide a problem or because of understanding and modeling mistakes made by the programmers in the first place. Simulation is mostly based on a sequential understanding of the program and will most probably produce "wishful" results rather than the results of an actual implementation. Some simulations also go far beyond what an application requires, considering possibilities that are not an issue in the actual development, precisely because the programmer's picture of the application is unclear. Sequential programs/prototypes, for their part, do not expose the issues parallel programmers mainly face, e.g. bounds checking, index calculations, caching, etc.

Neither of these two approaches provides any kind of correctness guarantee that lets developers reason about the correctness or valid behaviors of the parallel counterparts. Prototyping the parallel program itself is, in this case, the only practical, sufficiently fast, and cheapest approach to achieving the stated goals.

1.2.1 Motivation and Maturation of project direction

During my Master of Science studies, which focused on parallel processing and HPC, I developed around nine parallel applications: some CUDA C based, some MPI based, some using threads and thread pools, and some using hybrid approaches. In contrast to task parallelism, which is covered by normal threading models such as POSIX threads or the threading libraries bundled with almost all programming languages, highly parallel models are truly a pain to develop applications with in order to achieve reasonable acceleration of sequential programs. I observed through these exercises that most correctness errors/bugs come from one of the following: synchronization between threads or warps of threads; index calculations of threads versus the arrays of data to be processed in parallel; ranges used to divide data sets between, say, thread blocks in CUDA, or to balance load between multicore and/or multinode parallel machines; and some of the more advanced techniques for overlapping data transfers with computations or with other data transfers [14, 15, 16]. The more I developed, the more I became convinced it would not get any better using the high-efficiency parallel programming models, e.g. NVIDIA CUDA C, PyCuda, Swift, OpenACC, OpenCL, etc. Actually, the more generalized the framework, the more difficult it is to use and the less acceleration is achieved [14, 15]. Faster, more abstract frameworks/APIs are very much needed [4, 6]: abstract frameworks let programmers focus on the problem at hand and on how to break it into parallelizable code fragments, instead of being distracted by framework-specific details such as the error sources mentioned above. For these reasons, I initially decided to come up with a set of patterns to ease programmers' lives, and my own, during such tasks. However, patterns are already there, and parallelizing compilers pinpoint and address the very problems that parallel program developers face.

1.2.2 The Thesis

Unfortunately, even though technologies such as parallelizing compilers and parallel primitives exist, there is no clear path for how to use them to aid HPC and parallel programming efforts. Moreover, generalized frameworks tend to ignore the fact that almost all parallel programs are special purpose, so no generalization may ease the situation. This is emphasized by the sheer number of scientific computing applications that mostly do not reuse most, if not all, parts of previously developed parallel applications. An exception to this claim is, of course, mathematical computation frameworks that parallelize mathematical primitives/algorithms so that they are readily available for use, along with similar collections of field-specific algorithms for, e.g., physics, chemistry, or nuclear physics. Chances are that most research will try new approaches to solving problems, which may or may not make use of such field-specific frameworks, since research focuses on finding new ways to do things. Also, to our knowledge there are no publications that address these problems. So, a proposed work-flow using the tools and technologies under inspection is highly needed.

This project is an effort dedicated to tackling most of the above problems, if not all, by assessing currently available frameworks that are free of charge for public use. Only the top frameworks are compared, and each is demoted or promoted as part of the work-flow towards faster prototyping of massively parallel applications using the most comfortable, yet efficient and timely, approach from a programmer's point of view. The work-flow we try to develop primarily follows an "evolutionary" approach, in order to lessen the cost of arriving at a working product/prototype instead of producing a throw-away prototype. In addition, the fastest throw-away prototyping approaches are also stated, whether for the sake of assessing performance gains or for arriving at fully operational, production-quality code.

The frameworks discussed are available to the public without charge, and some focus primarily on programmer productivity; it is best to promote the most productive frameworks rather than simply those with the largest supported feature sets. More will be said about the criteria used for the feature-set evaluation in Chapter 2.

1.3 About this project and contribution(s)

This project makes several contributions. The most important is deciding on a clear path (we call it a work-flow) to rapidly prototype parallel programs and then, if possible, use that prototype to arrive at a final parallel software product/algorithm. The following subsections, namely 1.3.1 and 1.3.2, discuss the contributions in more detail.

1.3.1 Significance of this project

By doing this project, we hope to achieve the following goals:

• More adoption of parallel programming models by mainstream developers [4]

• More productivity on a relatively small time, budget, and resource scale [4]

• More understanding of parallelization techniques, by focusing on those techniques rather than on optimization details [4]

• Fewer throw-away prototypes, by learning to use the evolutionary programming model instead [17]

• Faster education in parallel programming and HPC for scholars and beginning developers, using simplified approaches instead of diving right into the intricate details of this difficult field

• For myself, it deepens my understanding and lets me explore the parallel programming field further, making me more insightful in my further studies

1.3.2 Industrial and Scientific Significance of this project

We think that by proposing faster prototyping work-flow(s), at least one of the following goals will be achieved:

• Parallel programming is considered an esoteric skill reserved for experts; this notion needs to be addressed in order to reduce the cost of hiring programmers for such applications.

• Instead of taking the risk of investing in huge parallelization projects, a small group of software engineers used to sequential programs can use the proposed work-flow to gauge the feasibility of developing HPC programs, and then decide whether to go further after gaining (or losing) confidence in the parallelization effort they put into a prototype.

• Prototyping gives a reasonable estimate of how many resources should be put into the target application to make it happen; this helps organizations and institutions budget projects, and gives stakeholders without deep knowledge of the field a comfortable level of confidence when making investment decisions in advance.

• After a successful prototype, return on investment can also be estimated and impact can be foreseen early, before too many resources are allocated.

That being said, it is hard to assess the level of success in achieving these goals. It is up to each developer to decide what works best for his or her development efforts. However, we take what we would normally prefer to achieve as a common-sense proxy for what others desire when pursuing their goals with whatever methodology they follow.

1.3.3 Difficulties, Intellectual challenges and risks faced

This effort involves many intellectual challenges and issues. While we do not assume we will face all of them, we will face at least a couple. Knowing them helps us achieve our goals and anticipate the kinds of problems we may face later in the project, so that we are prepared to solve them as early and as quickly as possible. Some of the challenges follow:

1. Trying to understand the different ways and frameworks in which a parallel program can be developed is in itself a challenge.

2. Comparing and evaluating frameworks against a feature set whose features are not always supported by all frameworks/tools, and trying to overcome such situations, is a problem.

3. Trying to integrate two or more frameworks into a unified evolutionary programming work-flow is not trivial, as it requires thorough exploration of many framework-specific features, and bending those features to our advantage is an art not easily mastered. That is, a hybridization of one or more of the frameworks has to be formulated to overcome one or more limitations of the framework(s) of choice.

4. Some frameworks are not well documented, have scarce samples, and have almost no user base to share experiences with or learn from; CopperHead [4, 3] is one example.

5. Some frameworks are just wrappers around CUDA, which may or may not be advantageous, as we will see later in Chapters 4 and 2.

6. There are a great many frameworks from which to choose a top subset. So, instead of considering an endless count of them and/or their derivatives, e.g. CUDA, PyCuda, jCuda, OpenCL, PyOpenCL, etc., we need to trim some out according to reasonable criteria related to our target: fast prototyping.

7. Some frameworks require a lot of research and code reading to understand how to use them, mostly because they are still research projects; an example is CopperHead [4, 3].

8. Some frameworks are well documented yet cumbersome to use as prototyping frameworks. However, we need to keep them, since they may complement the framework(s) of choice once we understand how they work and some, if not all, of their specifics. An example of that is PyCUDA [1].

1.3.4 How difficulties and risks were solved or avoided

There are not many options for resolving these challenges, since most frameworks are not yet documented well enough for a complete investigation. We had only four ways in mind to work through them, because no formal references exist so far to support this research project; at least, we could not find any beyond per-framework documentation and published papers, with Thrust [6] as one exception:

1. Online documentation.

2. Related published material, papers, etc.

3. Diving into the code of the provided samples, the framework/library implementations, or related work.

4. As a last resort, conservatively communicating with the maintainers of such projects when absolutely needed.

1.3.5 Organization of the document

We have covered the introduction and the background needed to set the stage for our journey and the issues we may face during our exploration. Several frameworks are then evaluated according to the criteria of interest and according to how well they may (or may not) integrate with each other to form a better-behaving whole. The framework(s) of choice are then discussed and explained in enough detail to give the knowledge necessary to evaluate them against the targeted criteria and purposes, followed by the suggested work-flows in the same chapter. A detailed description of CopperHead usage and constraints is then presented; the justification for devoting a whole chapter to this framework is deferred to the practical exercise chapter that follows, i.e. Chapter 4. That chapter presents a practical example of porting blackscholes to parallel targets and trying to extend it beyond what has been achieved using CUDA C, in order to show the work-flow's effectiveness. Blackscholes is the simplest form of parallel program: embarrassingly parallel, with no data dependencies between execution/computational threads (chosen so that we do not misjudge a framework because of a poorly modeled data-dependency resolution in the computational algorithm). This makes it our subject parallel application for experimenting with, and illustrating, the application of the work-flows. Ending remarks, findings, and appendices are then briefly stated.

Chapter 2

Top Frameworks Evaluation: The feature sets evaluation

During our survey of libraries, frameworks, and tools that help to parallelize or synthesize parallel programs, we came across many interesting projects. Many of them target goals similar to ours, which is encouraging. However, many of them end up, in our opinion, complicating the process instead of simplifying it, e.g. HPP [11]. Moreover, many of them lack the conciseness and/or abstractions needed to shield the programmer from device-specific differences. We try to filter these approaches down to a minimal set of interest to us, making it more feasible to apply them later for further assessment.

Our methodology groups each similar subset of interest into one section and discusses its advantages and disadvantages with respect to our interests. After carefully studying them, the result, i.e. the suggested work-flow, is stated. We tried to experiment with all paths of our suggested work-flow, but we could not cover all of them; it is simply infeasible to learn and apply all of them within the time frame. However, we preferred to leave out those that are similar in nature to CUDA C, since anyone who knows CUDA should find it feasible to learn, for example, PyCUDA, JCuda, and/or OpenCL. This in essence lets us explore the radically different approaches more and leave the familiar ones to end users; we assume it is relatively easy for users to carry out a practical evaluation of any part of our suggested work-flow that was not covered in this work.

2.1 PyCUDA [1, 2]

PyCUDA is a CUDA Driver API binding for Python. It is a low-level API binding that can be used to develop programs targeting NVIDIA's CUDA devices alone, with the highest achievable performance as its main goal. It provides all the facilities of the native CUDA Driver API, except that programs are mostly written in Python and the facilities are accessed in a more object-oriented manner.

PyCUDA has many features that make it a strong complement to other Python-based frameworks. It has a device interface and device properties that allow it to perform device queries and all driver-level calls (the Device Interface API is documented at http://documen.tician.de/pycuda/driver.html); it is, in fact, a driver-level API [1]. It follows Python's object-oriented programming model, provides exceptions that aid debugging, and offers abstractions to some level (with a few departures from the OO model, e.g. cudaMalloc and cudaMemcpy). For example, it has pycuda.driver and pycuda.gpuarray.GPUArray as well as other useful classes. All allocated objects are automatically cleaned up by the Python environment, and PyCUDA will not detach from a GPU context before all of that memory is freed. "Completeness. PyCUDA puts the full power of CUDA's driver API at your disposal, if you wish" [1]. All CUDA errors are translated into Python exceptions, giving automatic error checking. It has all the speed and performance advantages of its CUDA C counterpart, since the underlying layer is written in C++. Its documentation is thorough and abundant, and it is open source. It is the basis of many other Python-based frameworks, e.g. CopperHead previously (newer CopperHead versions rely on Thrust and other transformations rather than on PyCUDA) and PyOpenCL nowadays. PyCUDA integrates with numpy for scientific computing and numerical processing. It is still actively maintained and is expanding quickly. Moreover, it has a wide user base and proven efficacy, and it is popular in scientific research; many applications implemented in PyCUDA exist (see the showcase page of the official website for more examples), such as simulation of spiking neural networks, a time encoding and decoding toolkit, Sailfish (Lattice Boltzmann fluid dynamics), recurrence diagrams, LINGO chemical similarities, filtered backprojection for radar imaging, facial image database search, estimating the entropy of natural scenes, discontinuous Galerkin finite-element PDE (Partial Differential Equation) solvers, and computational visual neuroscience. PyCUDA is also interoperable with Python to some extent (see the disadvantages below for more on this). It has strong GL interoperability, support for multi-dimensional arrays on the GPU, and some support for the CUDA debugger.
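As a quick, hedged illustration of that higher-level GPUArray interface (a minimal sketch in the style of the standard PyCUDA tutorial, not code taken from this report):

import numpy
import pycuda.autoinit              # sets up a CUDA context on the first available device
import pycuda.gpuarray as gpuarray

# Move a numpy array to the GPU, operate on it with numpy-like syntax,
# and copy the result back to the host.
a = numpy.random.randn(4, 4).astype(numpy.float32)
a_gpu = gpuarray.to_gpu(a)
a_doubled = (2 * a_gpu).get()
print(a_doubled)
print(2 * a)                        # same computation on the CPU for comparison

No kernel string and no explicit memory management are involved; for element-wise work of this kind, the GPUArray layer alone is often enough for a first prototype.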

One of PyCUDA's biggest problems is that the API is not fully object oriented. For example, to declare a kernel one has to write it in a string, just as one would in CUDA C, which is not debugger friendly, at the very least while writing it. An example declaration and implementation of a kernel in PyCuda is shown in Listing 2.1.1, while an example of how to call that kernel after defining its module is shown in Listing 2.1.2.

mod = SourceModule("""
__global__ void multiply_them(float *dest, float *a, float *b)
{
    const int i = threadIdx.x;
    dest[i] = a[i] * b[i];
}
""")

Listing 2.1.1: Example PyCuda kernel declaration and implementation, passed to PyCuda as a string in a module

Many disadvantages follow from the above. It is hard to know how to debug a kernel while it is a string, if that is possible in the first place. The source of most bugs is index calculations and the way threads flock on the GPU; that said, how is one supposed to debug such bugs while using a string as the source code? Calling a function is not purely object oriented either: one has to refer back to the kernel string in order to make a kernel call. Listing 2.1.2 shows this, based on the above example.

multiply_them = mod.get_function("multiply_them")
...
multiply_them(
    drv.Out(dest), drv.In(a), drv.In(b),
    block=(400, 1, 1), grid=(1, 1))

Listing 2.1.2: Example PyCuda kernel call for the kernel defined in Listing 2.1.1

The need to learn yet another framework just to program CUDA C again in a different style, one that is neither a clean abstraction over the original API nor a simplification of the development process, is counterproductive. It may be a slower process, it may produce bugs never experienced before, and debugging may then be harder since there is no clear point where a bug starts to manifest.

Another disadvantage is that memory management for the device is not fully automatic: to perform memory operations and further code optimizations, one has to call pycuda.driver.mem_alloc(bytes), pycuda.driver.to_device(buffer), and pycuda.driver.from_device(devptr, shape, dtype, order="C"), and deal with alignment [14, 18], pinned memory [19, 1], and memory pitch [18], while the CUDA module interface takes strings as CUDA C kernels. This does not help much with fast prototyping, but it is at the same time an advantage, since it helps with optimization without resorting to another framework or another programming-language toolchain.
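To make the previous point concrete, the following sketch shows the kind of explicit, driver-level memory management PyCUDA expects when the GPUArray layer is not used. It is a minimal example in the style of the PyCUDA tutorial, reusing the multiply_them kernel of Listing 2.1.1; it is not code from the benchmark itself.

import numpy
import pycuda.autoinit
import pycuda.driver as drv
from pycuda.compiler import SourceModule

mod = SourceModule("""
__global__ void multiply_them(float *dest, float *a, float *b)
{
    const int i = threadIdx.x;
    dest[i] = a[i] * b[i];
}
""")
multiply_them = mod.get_function("multiply_them")

a = numpy.random.randn(400).astype(numpy.float32)
b = numpy.random.randn(400).astype(numpy.float32)
dest = numpy.zeros_like(a)

# Explicit allocation and host-to-device copies.
a_gpu = drv.mem_alloc(a.nbytes)
b_gpu = drv.mem_alloc(b.nbytes)
dest_gpu = drv.mem_alloc(dest.nbytes)
drv.memcpy_htod(a_gpu, a)
drv.memcpy_htod(b_gpu, b)

# Launch the kernel on the raw device pointers.
multiply_them(dest_gpu, a_gpu, b_gpu, block=(400, 1, 1), grid=(1, 1))

# Explicit device-to-host copy of the result.
drv.memcpy_dtoh(dest, dest_gpu)

Every one of these steps is boilerplate that a prototyping developer would rather not write, which is exactly the trade-off discussed above.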

2.1.1 Other thoughts and possible workaround for some PyCUDA disadvantages

Some of these disadvantages, such as the "string"-based module declarations, can be addressed in two ways. First, instead of using strings to write kernels, metaprogramming as shown in Listing 2.1.3 below can lift some of the burden placed on parallel application developers.

import numpy
from pycuda.compiler import SourceModule
from codepy.cgen import FunctionBody, \
    FunctionDeclaration, Typedef, POD, Value, \
    Pointer, Module, Block, Initializer, Assign
from codepy.cgen.cuda import CudaGlobal

# dtype, thread_block_size and block_size are assumed to be defined earlier,
# as in the full PyCUDA metaprogramming example this fragment comes from.
mod = Module([
    FunctionBody(
        CudaGlobal(FunctionDeclaration(
            Value("void", "add"),
            arg_decls=[Pointer(POD(dtype, name))
                       for name in ["tgt", "op1", "op2"]])),
        Block([
            Initializer(
                POD(numpy.int32, "idx"),
                "threadIdx.x + %d*blockIdx.x"
                % (thread_block_size*block_size)),
        ] + [
            Assign(
                "tgt[idx+%d]" % (o*thread_block_size),
                "op1[idx+%d] + op2[idx+%d]" % (
                    o*thread_block_size,
                    o*thread_block_size))
            for o in range(block_size)]))])
mod = SourceModule(mod)

Listing 2.1.3: Example PyCuda kernel and its call [1]

The second workaround is to drop back to programming in CUDA C and to import the modules from *.cu files as strings that are fed to PyCuda's module programming interface. This, though, is more a workaround for a limitation than a real solution, and it is not a productive approach: testing and debugging the kernels means writing the whole micro-application in CUDA C in order to exercise it with sample data from a file and/or a host interface to the function, plus timing. As a result, the code must afterwards be cleaned of all the extra parts, and in the middle of that process a bug may be introduced that requires going back and fixing things again before the kernel finally makes it into the PyCuda module.
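As a tiny, hedged sketch of that second workaround (the file name kernels.cu is hypothetical, not from this report), the kernels live in an ordinary CUDA C file and are only wrapped at load time:

import pycuda.autoinit
from pycuda.compiler import SourceModule

# Read the CUDA C source that was written and debugged separately, and hand
# it to PyCuda unchanged; the kernel names must match those in the file.
with open("kernels.cu") as f:
    mod = SourceModule(f.read())
multiply_them = mod.get_function("multiply_them")

The convenience of editing real .cu files comes at the cost described above: the kernels still have to be exercised from a CUDA C harness to be debugged properly.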

2.2 CopperHead [3, 4]

CopperHead is a parallelizing source-to-source compiler, a runtime, and an embedded language. Its embedded language is a subset of Python built around primitives such as map, reduce, and filter, which are elevated to parallel versions that run on the target device(s). It is discussed in detail later, in Chapter 3, since this is a neater and more abstract approach to enabling parallelism for Python developers than PyCuda, i.e. synthesizing source code for target programs using static/dynamic analyses.
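To give a feel for the programming model before the detailed discussion in Chapter 3, here is a minimal sketch in the spirit of the published CopperHead examples (it assumes a working copperhead installation and is not code taken from this report):

from copperhead import *
import numpy as np

@cu
def axpy(a, x, y):
    # An ordinary-looking Python procedure built from the data-parallel map
    # primitive; CopperHead compiles it for the selected place on first call.
    return map(lambda xi, yi: a * xi + yi, x, y)

x = np.arange(1000, dtype=np.float64)
y = np.arange(1000, dtype=np.float64)

with places.gpu0:       # execute the compiled CUDA version on the first GPU
    gpu_result = axpy(2.0, x, y)
with places.here:       # execute the same source in the Python interpreter
    interp_result = axpy(2.0, x, y)

The same unmodified source runs either in the interpreter or on the GPU, which is the abstraction the rest of this section evaluates.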

CopperHead lives up to its promise of providing a very attractive abstraction for developers in general, and for Python programmers in particular. It is a Python-based embedded language: programmers use normal Python parallel primitives, and the platform elevates them to run on parallel devices [4]. It emphasizes high productivity, with a focus on parallel primitives as the way to express data-parallel computation; that is far easier for a human being to express than manually flattening computation onto parallel target platforms and/or mapping it onto different hierarchies of parallelism. CopperHead also provides strong performance relative to well-crafted, hand-written CUDA data-parallel programs; see the benchmarks conducted in Chapter 4 to see how the performance of a CopperHead prototype correlates with a hand-written CUDA C equivalent. It integrates and interoperates seamlessly with the Python interpreter, libraries, and other frameworks, as is the case with PyCUDA. The CopperHead compiler, runtime, and all source code are open source, with the front end written in Python and the back end written in C++. Both choices are advantageous, since it is easier to add features to the front end while keeping performance reasonably fast through the high-efficiency C++ back end. Because the runtime and the produced source are C++, CopperHead can also rely on a more mature compiler infrastructure if it needs to be improved later, by exploiting the optimizations already present in C/C++ compilers.

Like everything in life, CopperHead has a couple of significant disadvantages. The first is that the source produced by the compiler is not particularly readable, modifiable, or reusable in other native C/C++ programs, even though this was advertised in an older publication; see the 2010 paper published by the CopperHead authors [4]. When it comes to optimization, there is no way to use streams, to find out how many CUDA devices are available on a single machine, or to extend to multiple nodes of a compute cluster (this is inferred from all the examples, the prelude, and the publicly available APIs; it is not strictly necessary for prototyping, but it becomes important if the developer follows the evolutionary development model proposed later). Furthermore, according to our experiments, CopperHead does not support multiple GPUs on a single host. The generated CUDA programs rely heavily on Thrust parallel primitives, e.g. map and reduce, which is itself limiting, since Thrust does not overlap data transfers with computation or with other asynchronous data transfers [14, 15, 16] (it relies heavily on the default stream, which is highly synchronous and hence much slower than using other streams to achieve such overlaps). On the other hand, and based on our experiments, we suspect that CopperHead's core performs such overlapping automatically and passes device pointers to smaller Thrust calls on streamed data. We speculate this from how the CopperHead versions of our benchmarks perform relative to a well-crafted, hand-written CUDA C version on the larger data sets only; those data sets are large, and that is exactly where asynchronous streams start to pay off in a streams-based CUDA version of the same benchmark. Returning to CopperHead's disadvantages: extending parallel algorithms to multiple GPUs/devices requires explicitly stating the place using Python's with keyword, while we still cannot query how many GPUs are present (our proposed work-flow amends this). While multiple-GPU support may appear in later versions, preparing for it as an aspect of interest is crucial for CopperHead's success. Lastly, the CopperHead compiler performs only simple transformations to achieve good enough performance; it is not an autotuning compiler.

2.3 MPI for Python[5]

MPI for Python is just what the name suggests: a small library supporting all the necessary MPI-specification functions. We leave its full exploration to the reader, since it is highly analogous to the regular MPI implementations targeting other languages. However, what is unique about this library is summarized next, to highlight some subtle differences.

MPI4Py uses Pickle and cPickle to implement its underlying, network-based communication (Pickle is Python's object-serialization framework, and cPickle is the same implementation written in C for higher performance; an object must be pickle-able, i.e. serializable, to be sent over the network, much like implementing Java's Serializable interface). It has a dual nature, supporting both pure Python objects and buffer-based MPI communication. It is also worth mentioning that its communication speed is "near C speed". A lot of the familiar operations are supported, each available in the two variations mentioned above: the buffer-based variants start with a capital letter, while the lowercase ones are for Python objects. From this point it is easy to get started using the online mpi4py documentation. An example is shown below in Listing 2.3.1 for the convenience of readers who do not want to explore further but only want a look at how it is used.

from mpi4py import MPI
comm = MPI.COMM_WORLD
rank = comm.Get_rank()
if rank == 0:
    data = {'a': 7, 'b': 3.14}
    comm.send(data, dest=1, tag=11)
elif rank == 1:
    data = comm.recv(source=0, tag=11)

Listing 2.3.1: Example MPI4Py program showing object-based communication between processes [5]


Another example showing interoperability with numpy arrays is also shown below in Listing 2.3.2

from mpi4py import MPI
import numpy
comm = MPI.COMM_WORLD
rank = comm.Get_rank()
# pass explicit MPI datatypes
if rank == 0:
    data = numpy.arange(1000, dtype='i')
    comm.Send([data, MPI.INT], dest=1, tag=77)
elif rank == 1:
    data = numpy.empty(1000, dtype='i')
    comm.Recv([data, MPI.INT], source=0, tag=77)
# automatic MPI datatype discovery
if rank == 0:
    data = numpy.arange(100, dtype=numpy.float64)
    comm.Send(data, dest=1, tag=13)
elif rank == 1:
    data = numpy.empty(100, dtype=numpy.float64)
    comm.Recv(data, source=0, tag=13)

Listing 2.3.2: Example MPI4Py program showing numpy integration for higher-performance communication between processes [5]

Assuming the example to be run is placed in a file called runnable.py, it can be launched by issuing the following command:

mpirun -n num_processes python runnable.py

The reason MPI for Python is included in the suggested work-flow is that extending to multiple nodes requires an efficient, high-level communication framework between the different machines/nodes of a compute cluster, and the best standards-compliant candidate is the MPI implementation for Python. This framework is only partially investigated here, just enough to serve our prototyping work-flow, so there may be more features that we do not cover.

Some of its best features are platform independence, efficiency (near C speed), and standards compliance. It is Python based, which integrates well with both CopperHead and PyCuda, and it uses pickle-able Python objects (i.e. objects serializable with Pickle or cPickle) as messages. Using it makes it easier to translate the code later to any language implementation of MPI, e.g. MPICH2, OpenMPI, etc., in case the final version of the prototype is ported to another language framework, e.g. CUDA C or Thrust.


Similar to its peers, it has some limitations. Only a very small subset of the MPI specification is implemented so far (however, it is sufficient for the purposes of the work-flow, and more advanced functions can be implemented manually on demand if absolutely needed). It cannot use the GPUDirect capabilities to do pipelined, optimized transfers between different nodes of a GPU cluster [20]. It also adds complexity to the developed program, because the programmer must manually stage intermediate transfers between the GPU, the host, and the MPI calls instead of passing GPU buffers directly to MPI function calls via GPUDirect RDMA (Remote Direct Memory Access) [20]; readers are highly encouraged to read Jiri Kraus' article [20] for an introductory understanding of the GPUDirect technologies and how they can be used.

2.4 Thrust - A Parallel Algorithms Library [6]

Thrust is an open-source C++ parallel algorithms library tuned to enable productivity for mainstream developers with minimal background in parallel processing. It hides the implementation details of parallel applications by providing a comprehensive set of parallel primitives that can be composed to achieve performance gains through parallelism. It originally targeted CUDA devices, but recent versions (i.e. 1.6 and later) added support for Intel's TBB and for OpenMP. It is, in essence, the STL (Standard Template Library) of parallel programming, and it is highly customizable, composable, and extensible [6].

Its position among the CUDA APIs, ordered from highest abstraction to lowest, is the following:

1. Thrust

2. CUDA C API

3. CUDA Driver API

It is important to mention at this point that while CopperHead can be placed at the same abstraction level as Thrust, its output source code is not quite human readable, nor is it optimizable. Also, PyCuda is a binding of the CUDA Driver API [1] for Python, which makes it even more difficult to transition to an optimized version of the prototype. The combination of CopperHead's limitations with PyCUDA's makes it impossible to use a prototype built on them as an "evolutionary prototype". Therefore, we recommend picking Thrust over CopperHead and PyCUDA, since a prototype resulting from CopperHead would most probably be a throw-away prototype, which is a great waste of developer productivity and a restart for the PyCUDA-optimized application. The only reason to keep CopperHead is that, as a prototype of the target parallel application, it is a stronger performance-gain indicator than Thrust. Thrust clearly has the edge in many respects, as we will see shortly, especially with regard to our evolutionary prototyping work-flow.

Thrust is packed with production-quality strengths. The heavy use of templates makes its algorithms adaptable to any data type, complex or basic, i.e. primitives like int and float or even structs and class objects. Thrust has all the good abstractions of CopperHead and more. Features missing from CopperHead, such as choosing heterogeneous target devices abstractly and explicitly, are handled by using device-specific containers and passing them to algorithms that process those inputs abstractly, without the developer worrying about implementation details. Thrust is C++ based and integrates very well with the lower-level CUDA C APIs, which makes it perfect for transitioning from a completely abstract implementation of a parallel application to an optimized and/or hybrid final version [6, 21]. Since it is C++ based, it can also be profiled and debugged using the mature set of tools available for CUDA C; this follows from its being an abstraction over the CUDA C back end as well as over other native back ends like TBB and OpenMP [6, 21]. Also, it not only has all the primitives available in CopperHead, it has more fused ones (fused primitives, e.g. transform_reduce, combine a map and a reduce in one algorithm for a speedup benefit). Thrust does not just replace CopperHead; it also replaces PyCuda, since all the capabilities PyCuda provides to CopperHead are built into Thrust and CUDA C [3, 6, 1]. Thrust takes advantage of peer-to-peer memory transfers between multiple GPUs in a single system without going back to host memory, whereas CopperHead does not, having only single-GPU-per-host support so far [6] (this feature may actually slow performance if the GPUs are not capable of peer-to-peer transfers through the UVA, Unified Virtual Address, space, because Thrust then emulates the transfer with two separate copies, GPU to host and then host to the other GPU). Support for scaling to multiple GPUs is built into the framework, and more customized distribution across multiple GPUs can be done in CUDA C if needed [6]. It is easily extensible using custom classes, functors (structs with an overloaded operator()), and iterators, and it is flexible enough to compose multiple custom algorithms from the available primitives and the inheritance features of its classes [6, 21]. Transfers between heterogeneous memory spaces are also implicit: the programmer does not need to implement such transfers manually except when there is a performance gain to be had, and even then it is optional [6]. Generally speaking, Thrust imposes no constraints on programmers, unlike CopperHead's constraints; it only recommends a set of best practices. Last but not least, Thrust functions accept pre-allocated device pointers, which makes it possible to combine the abstract algorithms with asynchronous transfers to/from the device using streams. C++ is not the most developer-friendly language compared to Python, but Thrust greatly amends that through excellent object-oriented abstractions [6].

While it has a lot of advantages, it has one shortcoming that we noticed later in the experiments: it performs much worse than CopperHead and CUDA C when asynchronous device transfers are not crafted manually by the programmer. Some may view this as a side issue, but we view it as a major one, since in our experiments the performance was slower by orders of magnitude, which forces developers to drop down to manual transfers.

An example taken directly from the Thrust website is shown below in Listing 2.4.1, to give the reader an impression of how abstract Thrust is:

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/generate.h>
#include <thrust/sort.h>
#include <thrust/copy.h>
#include <algorithm>
#include <cstdlib>

int main(void)
{
    // generate 32M random numbers serially
    thrust::host_vector<int> h_vec(32 << 20);
    std::generate(h_vec.begin(), h_vec.end(), rand);

    // transfer data to the device
    thrust::device_vector<int> d_vec = h_vec;

    // sort data on the device (846M keys per second on GeForce GTX 480)
    thrust::sort(d_vec.begin(), d_vec.end());

    // transfer data back to host
    thrust::copy(d_vec.begin(), d_vec.end(), h_vec.begin());

    return 0;
}

Listing 2.4.1: Thrust sort example showing how Thrust can be used like any normal C++ library, without worrying about device-specific details [6]

2.4.1 Features viewed differently

There are a couple of best practices advised by the Thrust developers. First, to achieve higher performance out of the box, before any need to optimize using lower-level APIs, data should be encapsulated in a Struct of Arrays, where each array holds one attribute of the tuples to be processed; this ensures coalesced memory accesses, in contrast to using an Array of Structs. [6] Second, a developer is advised to search for any "fusable" pair(s) of operations and fuse them using the fused parallel algorithms to achieve higher performance. [6] Third, best practices recommend searching for data sets that resemble the constant, counting, or other fancy iterators, and replacing such memory accesses with those iterators, to remove memory-bandwidth limitations and achieve higher performance. [6]
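As a small illustration of the fusion and iterator advice above, the following sketch (our own, with illustrative sizes, not code from the benchmark) fuses a per-element transform with a reduction via thrust::transform_reduce and replaces an explicit index array with a counting_iterator:

#include <thrust/device_vector.h>
#include <thrust/transform_reduce.h>
#include <thrust/reduce.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/functional.h>
#include <cstdio>

// Functor squaring one element; used by the fused transform_reduce below.
struct square {
    __host__ __device__ float operator()(float x) const { return x * x; }
};

int main() {
    thrust::device_vector<float> x(1000, 2.0f);

    // Fused primitive: squares and sums in a single pass over the data.
    float sum_sq = thrust::transform_reduce(x.begin(), x.end(),
                                            square(), 0.0f, thrust::plus<float>());

    // Counting iterator: sums 0 + 1 + ... + 999 without materializing an index array.
    thrust::counting_iterator<int> first(0);
    int sum_idx = thrust::reduce(first, first + 1000);

    printf("sum of squares = %f, sum of indices = %d\n", sum_sq, sum_idx);
    return 0;
}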

Again, these last three recommendations can be viewed as either advantageous or disadvantageous when looked at from different perspectives. Therefore, we choose not to characterize them either way.

2.5 Best of all frameworks/libraries

"Best" here is not based on the best performance in terms of running times, even though, at the end of the workflow proposed in Section 2.6, this may make little difference for approximation and/or deployment purposes.

Frameworks that require more coding and/or a stricter, possibly verbose programming style are also unwanted. Anything that takes time is a huge negative for rapid prototyping; it is called rapid precisely because, even at worst, it consumes little time.

For developers who try to evolve the prototype into an actual production-quality application, performance becomes an issue only at the end of the development process. Nevertheless, this was taken into account by choosing frameworks that perform near-best, if not best, so that no different workflow is needed just because of performance. Moreover, the evaluation also relies on how closely the chosen/assessed framework(s) resemble the production-quality framework of choice, in this case CUDA C. This smooths the transition to either side of the process: an actual CUDA C implementation or the elected framework(s).

In the end, it cannot be emphasized enough that time is the most important factor, so anything that can reduce the time needed to deploy a prototype or final product is considered best in this work.

2.5.1 Criteria of Comparisons

Priority is considered in the following order:

1. Ease of learning, using, and getting applications running correctly.
2. Integration with complementary frameworks to overcome some or all weaknesses.
3. Interoperability with native-environment frameworks/libraries.
4. Performance scaling that is measurable in an apples-to-apples manner, rather than predicting levels of performance from completely different baselines.
5. Programmer productivity (e.g. time taken to complete the implementation, lines of code to be written compared to other platforms/languages).

Also, manual memory management and manual index calculation/mapping should be kept minimal, if present at all; better none at all.

Table 2.1 summarizes the most important features we considered, with each framework assessed based on what it does or does not support.

Table 2.1: Summary of the most important features of several frameworks of interest for fast prototyping of parallel applications (abstraction level: 3 = most abstract, 1 = least abstract)

Feature | PyCUDA [1, 2] | CopperHead [3, 4] | Thrust [6]
Abstraction level | 1 | 3 | 3
Unified memory address space (managed) | No | Yes | Yes
Manual memory management | Yes | No | Yes (optionally)
Debuggers available | Yes (the website states cuda-gdb can be used, but it does not support host-side code debugging, because that is Python code) | Yes (only when the Python interpreter is used) | Yes
IDE integration | No (none when it comes to developing kernels) | Yes (since it is a subset of Python) | Yes
Profiling tools | No | No (not good for detecting performance bottlenecks) | Yes
Parallel primitives readily available | No | Yes | Yes
Primitives fusion (for performance) | No | No | Yes
Threads/data mapping | Explicit | Implicit | Implicit (can optionally be explicit by replacing it with CUDA C)
Prototype achievable | Evolutionary | Throw-away | Any: evolutionary and throw-away
Auto memory management | No | Yes | Yes (partially, abstract)
Object oriented | Partially | Yes | Yes (and can optionally drop to the lower CUDA C API)
Constraints imposed on programmers | No | Yes | No (only best practices)
Performance optimization potential of the prototype | Strong | Weak | Strongest (two levels of optimization: during prototyping and during the optimization workflow)
Speed/productivity of producing a working prototype | Slow | Fast | Fast
Selecting specific GPUs | Yes | No | Yes
Backends supported (parallel target devices/platforms) | CUDA | CUDA, TBB, OpenMP | CUDA, TBB, OpenMP
MPI integration for multi-node clustered setups | Yes | Yes | Yes
Documentation | Thorough | Scarce | Thorough
Sample code(s) | A lot (theoretically any CUDA C code is pluggable into PyCUDA if no headers are needed) | Limited subset of working samples bundled with its source/distribution | Abundant (even CUDA C samples integrate and interoperate very well with it)
User base | Big | Almost non-existent | Big
Low-level, device-specific, or fine-grained thread control | Yes | No | No (but possible through CUDA C when needed)

2.5.2 Why one was chosen over the other(s)

In the following subsections we state in some detail how we promoted one framework over several similar others. Frameworks that complement each other should preferably be promoted or demoted together if one of them fails. However, one of them may still play an important role, not necessarily in all steps of prototyping, and hence may be promoted into our rapid prototyping work-flow.

jCuda vs PyCuda vs CUDA C/C++

All of these are similar and related, just in different languages [1, 2, 22, 14, 15]. All of them try to achieve one goal, programming nVidia CUDA-compatible GPUs, using the exact same model as CUDA C [1, 2, 22, 14, 15].

Not CUDA C

CUDA is C-based, with lots of distractions for prototyping and lots of manual programmer work needed just to achieve small task(s). It puts much of the effort of implementing applications on the programmer [14, 15]. Tuning is not easy, is very time consuming, and is not guaranteed to achieve best or near-best performance, which makes it hard to assess whether parallelizing is worthwhile. In addition, it is not easy to predict acceleration results early on [15, 14]. It is the targeted final language in the first place, the one we started from in seeking a faster prototyping means, and it may then be used for the production-quality application.

Not jCuda

The JVM (Java Virtual Machine), with all the mysteries behind and surrounding it and how it works, makes performance unpredictable: just-in-time compilers may kick in at any time, which may make performance measurements inconsistent. Java is a very verbose, strictly typed language, whereas our focus is on the correctness and safety of massively parallel code, not on type safety and/or security. No frameworks integrate with it that could complement its weaknesses to make it fit for rapid prototyping. [22] Standards-compliant MPI implementations are not yet available for it, so it cannot be extended to multi-node clusters, nor is its performance predictable when such an implementation becomes available. It is just a bindings library around the CUDA API, which adds no value to programmer productivity over the CUDA runtime API. Among its limitations is the use of Pointer objects just to work around Java's lack of pointers compared to C.

PyCUDA is the right way to start

PyCUDA integrates well with numpy, Python, and CopperHead [4, 3, 2, 1]. It is also high performing, with a C++ backend [1, 2]. Its best use case is to complement CopperHead for device queries, since CopperHead lacks this invaluable feature [4, 3, 1, 2]. The Python language is a high-productivity, concise language with automatic memory management (memory management in PyCUDA is only partially automatic, whereas in Python and CopperHead memory is partially and completely auto-managed, respectively; also, for CopperHead to benefit from automatic memory management, a developer has to conform to the constraints and restrictions set by the framework and stated later in this report, e.g. using cuarray to wrap numpy arrays intended for use in a device computation). Python is a very well known language in scientific computing and has a lot of supporting frameworks [1]. In addition, most massively parallel applications are scientific-computing applications developed by scientists who are usually not computer science majors. As a result, it is a best-of-breed language with high flexibility that one cannot ignore as a serious choice for achieving complex computations.

PyCuda vs CopperHead vs MPI for Python

In this section, we try to decide which framework fits best in which workflow step/stage. Each of them complements the others. Stating the features that each one has and the others lack clarifies why they go hand in hand and together complement the whole proposed workflow.

PyCUDA alone is not enough for fast prototyping of GPU programs. Kernels expressed as strings inside the module object of the parallel program are not developer friendly, though they are slightly debugger friendly; please refer to Table 2.1 for more. [1] It uses a programming model exactly like CUDA C's, with all the complexities involved. Actually, it is even worse: in the CUDA C programming environment we can debug kernels and host-side code natively [14, 15], while for PyCUDA, although there is CUDA debugger support for kernels-only debugging, the host-side code has to be debugged with a separate Python debugger, since it is Python code and not C/C++ code. This makes it harder to trace, for example, a bug that starts in host code and manifests in the GPU computation. [1]

On the bright side, since PyCUDA exposes the CUDA API from Python, it has the streams and optimization capabilities normally available in CUDA C, which makes it a strong candidate for the evolutionary programming model. Furthermore, it complements CopperHead by providing a means to make device queries in order to scale CopperHead program(s) across multiple GPUs using the places construct, should this capability ever be added to CopperHead. This is all doable without having to program kernels in PyCUDA as strings, with their limited debugging ability beyond the OO exceptions that wrap errors returned from executing such string-based kernel implementations. Since PyCUDA is very similar to the CUDA C programming model, it makes it easy to produce a similar version of the prototype completely implemented in CUDA C; not to mention that experienced CUDA C programmers will not feel estranged using PyCUDA.

CopperHead, as mentioned earlier, has no notion of streams, but PyCUDA does [1, 2]. This is not complementary to CopperHead, since CopperHead's memory transfers are managed for it in order to maintain productivity and readability and to support multiple target devices simultaneously, e.g. multicore CPUs using OpenMP or TBB and massively parallel GPUs such as nVidia CUDA devices. However, PyCUDA is needed later in the workflow, when transitioning to the optimization phase. CopperHead actually used to rely on PyCUDA and the codepy library to achieve some of its functions, e.g. generating target source code from CopperHead Python code, although newer implementations of CopperHead are independent of PyCUDA. [4] CopperHead also produces the "target CUDA C++ code" which, according to the latest publication [4], can easily be copied and fed into a CUDA C program, and conceptually into PyCUDA: readily made by CopperHead and sure to be correct, without the need to debug, so that one could arrive at a working PyCUDA-based application by replacing all CopperHead functions with it. This would cover a significant part of the application's optimization phase, having already been done by CopperHead. However, according to our investigation and the maintainer's comments, this is not possible: the code generated by CopperHead is not readable, modifiable, or even reusable enough to be used in target PyCUDA and/or CUDA C programs. (The other part of the optimization phase of the evolutionary programming model is the use of streams and the manual tuning of the kernels produced by CopperHead as CUDA C output, and likewise for the other source-code outputs, i.e. OpenMP and TBB code.) CopperHead has no notion of explicit "synchronization" between threads; it handles that implicitly by relying on the inherently parallel primitives stated in its Prelude, e.g. map, reduce, scan, rscan, etc. This also simplifies the developer's work of figuring out synchronization points. Furthermore, if synchronization is needed, there are only two situations in which it arises: accessing data produced by an asynchronous kernel launch, by reading one or more of its result variables, causes the program to block until the producing kernel has finished generating results; and when multiple CopperHead functions are used to compose a whole kernel, the situation is treated similarly by the CopperHead compiler that produces the target code (i.e. C++, CUDA C code). [4]

Thrust vs CopperHead

Since Thrust is the absolute best in almost all aspects of our evaluation criteria, it is chosen as our prototyping framework of choice. This is also because it supports the "evolutionary prototyping" workflow from start to end, while PyCUDA bundled with CopperHead is somewhat disjoint in supporting such a workflow. As a result, we will present two different workflows:

1. A throw-away prototyping workflow using CopperHead, which can then be replaced by an optimized version of the prototype using PyCUDA. This is tuned towards Python developers.

2. An evolutionary prototyping workflow that keeps most, if not all, of the already implemented parts intact, with minor modifications, to come up with a possibly hybrid implementation of the overall parallel application, hence achieving ultimate programmer productivity.

2.6 Best prototyping-to-production Workflow Proposed

2.6.1 C++-based and an evolutionary prototype

Although the Thrust-based prototyping workflow seems a little longer, it really is not much longer; it is just broken into finer-grained tasks, rather than the coarse-grained tasks of Section 2.6.2, which discusses the throw-away, Python-based prototyping-to-production workflow. The steps of the "evolutionary prototyping" workflow follow:

1. Construct sequential C++ code to make sure the ordinary tasks are working, i.e. file reading, memory allocation, functors working as expected when mapped to an array of inputs, etc.

2. Identify potentially parallelizable code fragments.

3. Parallelize those fragments using Thrust primitives, and fuse those that are fusable using Thrust's fused primitives.

4. Scale to however many devices and nodes are needed/available, by utilizing peer-to-peer transfers between single-node GPUs and CUDA-aware MPI calls [20] to take advantage of RDMA (Remote Direct Memory Access) and other GPUDirect CUDA features.

5. In case asynchronous streams are needed to perform asynchronous memory transfers between host and device manually, Thrust's memory transfers can be replaced by ones utilizing pinned/mapped [21] memory, with the resulting pointers then passed to Thrust function calls. This allows smaller transfers and overlapping transfers with other transfers and/or with computation time (a small sketch follows this list).

6. Debug and test as with sequential C++ code.

7. During the optimization step, profile using the tools provided for CUDA developers to identify bottlenecks.

8. CUDA C APIs can be used to squeeze out more performance, if absolutely needed.

9. Test, debug and profile again as needed.

10. Deploy when done.
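A minimal sketch of step 5 follows (buffer names and sizes are ours, not from the benchmark): a pinned host buffer, an asynchronous copy on a CUDA stream, and the resulting device pointer handed to a Thrust algorithm.

#include <thrust/device_ptr.h>
#include <thrust/sort.h>
#include <cuda_runtime.h>

int main() {
    const size_t n = 1 << 20;
    float *h_pinned = NULL, *d_data = NULL;
    cudaHostAlloc((void**)&h_pinned, n * sizeof(float), cudaHostAllocDefault); // pinned host buffer
    cudaMalloc((void**)&d_data, n * sizeof(float));
    for (size_t i = 0; i < n; ++i) h_pinned[i] = (float)(n - i);               // descending values

    cudaStream_t stream;
    cudaStreamCreate(&stream);
    // Asynchronous copy on a stream; pinned memory is required for a truly asynchronous transfer.
    cudaMemcpyAsync(d_data, h_pinned, n * sizeof(float), cudaMemcpyHostToDevice, stream);
    cudaStreamSynchronize(stream);           // make sure the data is resident before computing on it

    thrust::device_ptr<float> d_ptr(d_data); // pass the pre-allocated pointer to a Thrust call
    thrust::sort(d_ptr, d_ptr + n);

    cudaMemcpyAsync(h_pinned, d_data, n * sizeof(float), cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);

    cudaStreamDestroy(stream);
    cudaFree(d_data);
    cudaFreeHost(h_pinned);
    return 0;
}

In a real application the workload would be split into several such chunks on multiple streams, so that one chunk's transfer overlaps another chunk's computation.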

For a more detailed workflow, please refer to Figure 2.1. The optimization step, which follows an evolutionary development model to arrive at a more optimized prototype and/or production code, is shown in Figure 2.3 and discussed in Section 2.6.3.

2.6.2 Python-based yet a throw-away prototype

The steps needed for a "throw-away" prototyping workflow based on Python frameworks follow. However, in case a developer needs to start with PyCUDA and skip prototyping with Python and CopperHead, the process is mostly like the normal CUDA C workflow.

1. Make a Python-only sequential application.

2. Parallelize code fragments using Python's normal parallel primitives, then replace the function calls with CopperHead ones to parallelize on the target device(s). This is done by adding decorations and conforming to the constraints set by the embedded language specification (e.g. variables are bound, not assigned, so they are immutable once created; all branches of execution paths must return; tuples use binding, not indexing), then dividing the work and deploying.

3. Utilize PyCUDA's metaprogramming and fine-tuning to specific hardware capabilities, e.g. by determining how many devices are present, their capabilities, cache sizes, and other device-specific properties if needed.

4. Extend the parallel application to multiple nodes of a GPU and/or multicore cluster by using MPI4Py [5].

5. The final production-quality implementation can be either CUDA C/C++ or PyCUDA, since the latter is just bindings around the CUDA C API. However, programmers must be cautious that a complete re-implementation in PyCUDA or Thrust/CUDA C is required, since the code produced by CopperHead is not usable in the final production code; only the code written in pure Python may be reused in the implementation of the application when PyCUDA is the choice for final production code.

Figure 2.2, shown below, illustrates how CopperHead, PyCUDA, and MPI for Python can be used to prototype and then extend that prototype from one device, to multiple heterogeneous ones, and on to clustering across multi-node setups. In this regard, we remind the reader that the CopperHead prototype is a throw-away, meaning most parallel code, if not all, is to be re-implemented during the optimization step in the chosen production framework/language.

2.6.3 Optimization of both C++ and Python based Prototypes

This section is best explained by Figure 2.3. The first step after getting a prototype to work is to optimize parts of it incrementally, by either replacing code that is not optimal or re-implementing parts of the program that need finer-grained control over lower-level operations using a less abstract framework/API. The lower-level framework would preferably be the CUDA Runtime API for a Thrust-based prototype and PyCUDA for a CopperHead/Python prototype.
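The sketch below (a hypothetical SAXPY hot spot, not code from our benchmark) illustrates this incremental idea for a Thrust prototype: the containers stay as they are, while one operation is re-implemented as a hand-written CUDA kernel reached through thrust::raw_pointer_cast.

#include <thrust/device_vector.h>
#include <cuda_runtime.h>

// Hand-written replacement for a fragment that profiling flagged as a hot spot.
__global__ void saxpy_kernel(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

int main() {
    const int n = 1 << 20;
    thrust::device_vector<float> x(n, 1.0f), y(n, 2.0f);    // the prototype's containers are kept

    // Drop to the CUDA Runtime API for just this fragment.
    const float *raw_x = thrust::raw_pointer_cast(x.data());
    float *raw_y = thrust::raw_pointer_cast(y.data());
    saxpy_kernel<<<(n + 255) / 256, 256>>>(n, 3.0f, raw_x, raw_y);
    cudaDeviceSynchronize();
    return 0;
}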

2.7 Related work

In our survey of abstraction frameworks for parallelizing code, we also briefly looked at the following frameworks and decided not to explore them further, for the reasons stated below for each:

GPU C-based: OpenACC [23] - directive- and function-based parallelization that is not yet production quality. It is, however, one of the cleanest directive-based parallelization frameworks. It defines a limited subset of parallel constructs, e.g. reduction, acc_async_wait_all, etc. It also defines a set of helper functions, e.g. acc_on_device, acc_malloc, acc_free, etc., to allow finer control over the parallelization process and its specifics. An example is shown in Listing 2.7.1.

OpenCL - We thought it too general, even more cumbersome to work with than plain CUDA C, and not as efficient. The laudable goal of this framework is that it targets GPUs from different vendors rather than focusing only on nVidia GPUs. Otherwise, this was a strong candidate as a device-independent prototyping-to-production-quality platform. [15, 14]

PyOpenCL - an OpenCL Python wrapper library/framework that replicates OpenCL in Python to enable interoperability with Python programs. Again, it is too generic, being based on OpenCL, and we decided against exploring it because of the limitations stated above for OpenCL.

void saxpy(int n, float a, float *x, float *y)
{
    #pragma acc kernels
    for (int i = 0; i < n; ++i)
        y[i] = a*x[i] + y[i];
}
...
// Perform SAXPY on 1M elements
saxpy(1<<20, 2.0, x, y);

Listing 2.7.1: OpenACC tiny example showing how directives are used to parallelize a simple program; SAXPY [9]

OpenMP - Still good, but only in the context of CUDA C and C/C++ programming, and Thrust covers that for us, so we do not have to deal with directives. However, it is excellent for production-quality C/C++ multicore parallel code for those who want to use it.

MPI - The same things stated for OpenMP apply, except that it is used for communication across multiple processes instead of threads. It is a great, high-performance inter-process communication library that has to be used by any C/C++ program that needs to extend to multi-node clusters. It is covered extensively in each implementation's documentation, and examples abound on the internet should the reader wish to learn about it. Furthermore, it was not really left out of our efforts; quite the opposite, we think it is a strength of CUDA that it integrates very well with MPI through GPUDirect using RDMA, that is, by passing CUDA device pointers directly to MPI calls to allow pipelined transfers between devices managed by different MPI processes.
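As a rough sketch of what this looks like (assuming a CUDA-aware MPI build, e.g. MVAPICH2 or Open MPI compiled with CUDA support; the buffer name and size are ours), a device pointer is handed straight to MPI point-to-point calls and the library picks a GPUDirect/RDMA path when the hardware allows it:

#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 1 << 20;
    float *d_buf = NULL;
    cudaMalloc((void**)&d_buf, n * sizeof(float));

    if (rank == 0) {
        // ... fill d_buf on the GPU ...
        MPI_Send(d_buf, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);    // device pointer, no host staging
    } else if (rank == 1) {
        MPI_Recv(d_buf, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}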

Heterogeneous C-Based:

OpenHMPP - Directive based and again C/C++-based. Learning all the #pragmas and how to use them, let alone the amount of extra code and directives that must be added just to parallelize code fragments "recognized by the developer", is truly a challenge in itself; Listing 2.7.2 shows how the directives grow even for a simple loop.

int main(int argc, char **argv)
{
#pragma hmpp sgemm allocate, args[vin1;vin2;vout].size={size,size}
#pragma hmpp sgemm advancedload, args[vin1;vin2;vout], args[m,n,k,alpha,beta]
    for (j = 0; j < 2; j++) {
#pragma hmpp sgemm callsite, asynchronous, args[vin1;vin2;vout].advancedload=true, args[m,n,k,alpha,beta].advancedload=true
        sgemm(size, size, size, alpha, vin1, vin2, beta, vout);
#pragma hmpp sgemm synchronize
    }
#pragma hmpp sgemm delegatedstore, args[vout]
#pragma hmpp sgemm release

Listing 2.7.2: OpenHMPP tiny example showing how the directives grow huge for even the simplest parallel loop(s) [10]

HPP - Heterogeneous Parallel Primitives [11]: a similar approach to the Thrust library, using the C++ language but targeting heterogeneous parallel platforms. Furthermore, it addresses task-parallel and data-parallel problems at once. While this may be the most interesting of the C++-based frameworks to explore, it is certainly not fit for "rapid prototyping" in the cleanest, fastest, and lowest-cost way possible. The reason is that programmers have to explicitly discover and annotate data parallelism using constructs available from the framework. While this is the same thing CopperHead programmers will do, CopperHead does not require writing additional code and/or annotations using any of the above approaches [3, 4]; CopperHead uses data-parallel Python primitives as its own primitives by intercepting the calls to them if they happen to fall inside a function annotated with @cu, i.e. a CopperHead function [3, 4]. By comparing the sequential matrix multiply implementation shown in Listing 2.7.3 with the parallel version produced using the HPP framework shown in Listing 2.7.4, we can see how much the code grows, not to mention how the use of template classes/functions complicates the code and hinders the clarity of the overall code.

Heterogeneous other language:

Swift - Parallel scripting language (we refer to SwiftScript as Swift for short; the former is the scripting language, the latter the whole environment): we believe that the need to learn a new programming language in order to rapidly prototype is counterproductive in its own right; it also has a strange, unfamiliar syntax [12] and does not integrate with other frameworks. Regardless, on its own it is a proven, excellent language for applications already developed using Swift. Moreover, an interesting scenario is how to map it back to the original frameworks, e.g. CUDA C, for further optimization; it is not a straight route as it is with CopperHead/Thrust, since deciphering Swift back is another challenge. It also has some drawbacks, among which is having to edit *.dat files to specify the external executables that form part of the instructions/processing to be done on the input data. An example application that capitalizes letters stored in input files using the Unix tr command follows in Listing 2.7.5; before the program can run correctly, a *.dat file needs to be created specifying some needed executables.


void matrixMul(
    int size,
    double * inputA,
    double * inputB,
    double * output)
{
    for (int i = 0; i < size; ++i) {
        for (int j = 0; j < size; ++j) {
            double sum = 0;
            for (int k = 0; k < size; ++k) {
                double a = inputA[i * size + k];
                double b = inputB[k * size + j];
                sum += a * b;
            }
            output[i * size + j] = sum;
        }
    }
}

Listing 2.7.3: A sequential matrix multiply written in C++ to be transformed into a parallel MatrixMul using HPP constructs [11]

type messagefile;

app (messagefile t) greeting (string s) {
    echo s stdout=@filename(t);
}

app (messagefile o) capitalise(messagefile i) {
    tr "[a-z]" "[A-Z]" stdin=@filename(i) stdout=@filename(o);
}

messagefile hellofile <"capitalise.1.txt">;
messagefile final <"capitalise.2.txt">;

hellofile = greeting("hello from Swift");
final = capitalise(hellofile);

Listing 2.7.5: Swift is a proven scientific language, but its syntax is indeed confusing and is not exactly fast and easy to learn for prototyping. Program taken from a beginners' tutorial on the Swift official website [12]

void matrixMul(
    int size,
    Pointer<double> inputA,
    Pointer<double> inputB,
    Pointer<double> output)
{
    Task<void, Index<2>> matMul(
        [inputA, inputB, output]
        (Index<2> index) __device(hpp)
    {
        unsigned int i = index.getX();
        unsigned int j = index.getY();
        double sum = 0;
        for (unsigned int k = 0; k < size; ++k) {
            double a = inputA[i * size + k];
            double b = inputB[k * size + j];
            sum += a * b;
        }
        output[i * size + j] = sum;
    });
    Future<void> future =
        matMul.enqueue(Range<2>(size, size));
    future.wait();
}

Listing 2.7.4: HPP example program illustrating the use of different framework constructs to achieve heterogeneous devices and task-parallel and data-parallel computations from C++ code [11]

Chapel: Only lately, after we were almost done with this work, did we learn that there is an elegant, abstract, object-oriented programming language with heterogeneous parallelism built into its heart, rather than being an extension of another language or a library that takes little or no advantage of device-specific performance parameters. It was not covered in our work since we discovered it only very late. However, once we knew about it, we were curious whether it had GPUs in mind. There appears to be support for GPUs, and it seems to do a great job at the performance level compared to other native frameworks, e.g. compared to PGI CUDA C on both productivity and performance grounds. Moreover, it targets different GPGPU architectures rather than being mostly nVidia specific, as our work is. The language is backed by Cray, the supercomputing company, and is highly portable between architectures. Shown below are implementations of the same STREAM Triad written in CUDA (Listing 2.7.6), in Chapel targeting a GPU (Listing 2.7.7), and in Chapel targeting multicores (Listing 2.7.8).


#define N 2000000
int main() {
    float *host_a, *host_b, *host_c;
    float *gpu_a, *gpu_b, *gpu_c;
    cudaMalloc((void**)&gpu_a, sizeof(float)*N);
    cudaMalloc((void**)&gpu_b, sizeof(float)*N);
    cudaMalloc((void**)&gpu_c, sizeof(float)*N);
    dim3 dimBlock(256);
    dim3 dimGrid(N/dimBlock.x);
    if (N % dimBlock.x != 0)
        dimGrid.x += 1;
    set_array<<<dimGrid,dimBlock>>>(gpu_b, 0.5f, N);
    set_array<<<dimGrid,dimBlock>>>(gpu_c, 0.5f, N);
    float scalar = 3.0f;
    STREAM_Triad<<<dimGrid,dimBlock>>>(gpu_b, gpu_c, gpu_a, scalar, N);
    cudaThreadSynchronize();
    cudaMemcpy(host_a, gpu_a, sizeof(float)*N, cudaMemcpyDeviceToHost);
    cudaFree(gpu_a);
    cudaFree(gpu_b);
    cudaFree(gpu_c);
} // end of main

__global__ void set_array(float *a, float value, int len) {
    int idx = threadIdx.x + blockIdx.x*blockDim.x;
    if (idx < len) a[idx] = value;
}

__global__ void STREAM_Triad(float *a, float *b, float *c,
                             float scalar, int len) {
    int idx = threadIdx.x + blockIdx.x*blockDim.x;
    if (idx < len) c[idx] = a[idx] + scalar*b[idx];
}

Listing 2.7.6: STREAM Triad done in CUDA C, not including the necessary host code to run it, which may be significant too [13]


const alpha = 3.0;
config const N = 2000000;
const space = [1..N] dmapped GPUDist(rank=1);
var A, B, C : [space] real;
B = 0.5;
C = 0.5;
forall (a,b,c) in (A,B,C) do
    a = b + alpha * c;

Listing 2.7.7: STREAM Triad done in Chapel and targeting GPUs; the whole code, ready to run [13]

const alpha = 3.0;
config const N = 2000000;
const space = [1..N] dmapped Block(boundingBox=[1..N]);
var A, B, C : [space] real;
B = 0.5;
C = 0.5;
forall (a,b,c) in (A,B,C) do
    a = b + alpha * c;

Listing 2.7.8: STREAM Triad done in Chapel and targeting a cluster, ready to run [13]

It is important to bring to the readers' attention that stating that some of the explored frameworks are not interesting in our context, rapid prototyping, does not mean they cannot be used or are not interesting in themselves. Actually, all of the explored frameworks that were not chosen as a prototyping platform are indeed interesting and innovative. Three to mention specifically as strong examples are OpenACC [23], Swift [12] and HPP [11]. Chapel [13] is the best language targeting portable, high-productivity HPC development that we saw during our survey.


Figure 2.1: Proposed work-flow diagram illustrating the first major step in rapid prototyping of parallel programs using Thrust


Figure 2.2: Proposed work-flow diagram illustrating the first major step in rapid prototyping of parallel programs using CopperHead, PyCUDA and MPI4py


Figure 2.3: Using an evolutionary development model, one can optimize or migrate to another parallel platform starting from a prototype


Chapter 3

CopperHead: An Embedded Data Parallel

Language


3.1 Primer

CopperHead is a source-to-source parallelizing compiler, a runtime, and a subset of the Python language embedded into Python itself [3, 4]. It is open source and is only available as a source distribution, to be compiled and installed manually after installing all required dependencies [3]. Once installed, using it is just like using the regular Python programming language (CopperHead is only compatible with Python 2.6 or later). There are, however, some constraints and restrictions on how code to be parallelized has to be written. In this chapter, we will thoroughly explore all aspects of CopperHead use. Two reasons behind dedicating a whole chapter to CopperHead are that it proved competent in terms of the performance of the binaries it produces compared to hand-written CUDA programs, and that it is highly abstract yet performs transformations that are interesting for future work in the direction of synthesizing optimized parallel code.

Originally, CopperHead relied on PyCUDA for its implementation; the latest versions, however, are a complete rewrite of the original without any reliance on PyCUDA (according to the project's current sole maintainer, Dr. Bryan Catanzaro, who also contributes to PyCUDA, in an e-mail message; the project was started with Dr. Michael Garland). Previously, CopperHead relied on "nested parallelism" [4], i.e. nesting parallel primitives within functions to model how each maps to a different level of the target parallel device architecture (e.g. the outermost function maps to blocks in the CUDA model, inner ones to threads, and deeper nesting is sequentialized within a thread). Nowadays it does not; modeling parallel programs is instead done in a fashion similar to CUDA, with multiple device functions composing a whole kernel [3], which is much easier since it follows a composition approach similar to imperative languages.

The first working version supported only one back-end, i.e. CUDA [4]. Nowadays it also supports two additional back-ends, besides Python's native interpreter as a target: OpenMP and Intel's TBB (Threading Building Blocks) [3].

In the following sections, we will briefly cover, at a high level, how CopperHead works internally, and will show example code and how to compile and run it. Then, we will discuss in more detail the restrictions and constraints placed on the programmer in order to use this concise framework. After that, performance charts comparing well-optimized, hand-crafted CUDA programs to CopperHead-produced ones are shown in the next chapter, Chapter 4. (In a separate e-mail message, Dr. Catanzaro stated clearly that the current version of CopperHead performs a lot better than the original version, which means the charts in the original 2010 paper show the least a programmer can expect from CopperHead.)


3.2 The internals: How it works [4, 3]

Exposing parallelism using CopperHead is done by using a limited set of parallel primitives stated in the CopperHead Prelude API documentation. Examples of such primitives are the normal Python map, reduce, sum, scan, etc. [3]. The programmer should detect any opportunity to use such primitives so that CopperHead can intercept them and elevate them from normal Python primitives to parallel ones on the target device(s) [4]. However, for CopperHead to come into play, a decorator (i.e. @cu) must decorate any function to be parallelized [4, 3]. It is important to note that functions whose definitions are nested within CopperHead functions need not be annotated with the decorator. Once such functions are detected by the CopperHead compiler, it searches for those primitives and converts them into native calls to parallel primitives on the target devices [4]. The target devices (also referred to as target platforms) are multicore CPUs, CUDA-capable GPUs, and TBB-compatible CPUs [3].

After the developer has designed the parallel program using said primitives and compiled it with the CopperHead parallelizing compiler, CopperHead generates a cache directory containing both the target source code and runnable binaries built from those source files [3, 4]. This is done because each back-end (i.e. target platform) has its own APIs: CopperHead relies mostly on the Thrust library to generate CUDA C compatible source from the original Python file, on OpenMP to generate multicore-compatible parallel code, and on Intel's TBB to generate multi-threaded code for compatible CPUs [3, 4]. The reason the results are cached is that compiling from source to source is the slowest step in the whole process. So the first run takes long to compile, but once cached, the generated binaries run competitively with hand-crafted, well-optimized code designed natively for the target platform [4]. Based on our experiments comparing the performance of the generated binaries, we suspect that the generated code intermixes Thrust function calls with CUDA asynchronous streams and the passing of device pointers to Thrust functions. So far, we could not confirm this speculation except through performance comparison with the native CUDA version of the benchmark (the lack of knowledge of the compiler internals makes confirming it nearly impossible; future work may target the anatomy of CopperHead as an inspirational source for further development in this direction).

Listing 3.2.1 below is an example of a CopperHead program that utilizes multiple target devices. As you may notice, programs developed using CopperHead interoperate with normal Python libraries [4]. However, in order to take advantage of CopperHead's automatic management of heterogeneous memory spaces, arrays should either be numpy arrays or be converted to copperhead.cuarray using the utility functions and/or classes CopperHead provides. [3, 4]

from copperhead import *
import numpy as np
import timeit

@cu
def ident(x):
    def ident_e(xi):
        return xi
    return map(ident_e, x)

iters = 1000
s = 10000000
t = np.float32
a = np.ndarray(shape=(s,), dtype=t)
b = cuarray(a)
p = runtime.places.gpu0

# Optional: Send data to execution place
b = force(b, p)

def test_ident():
    for x in xrange(iters):
        r = ident(b)
        # Materialize result. If you don't do this, you won't time the
        # actual execution but rather the asynchronous function calls
        force(r, p)

with p:
    time = timeit.timeit('test_ident()', \
        setup='from __main__ import test_ident', number=1)

bandwidth = (2.0 * 4.0 * s * float(iters))/time/1.0e9
print('Sustained bandwidth: %s GB/s' % bandwidth)

Listing 3.2.1: CopperHead example benchmark implementation, illustrating how places is used to specify the target device(s) [3]

Moving data back and forth to/from devices can either be managed by the runtime, which is preferable and is done lazily, or optionally be forced by the programmer. Also, kernels (i.e. CopperHead functions) are launched asynchronously: the program's main thread advances once it launches the function and does not wait or block for results to be returned by the kernel [3]. For this reason, an explicit access to the variable in which results are returned must be made so that the program blocks until the data are available [3, 4]. Forcing this behavior is also possible through CopperHead's force() function. This is important both for timing and to guarantee that correct results are accessed rather than junk data previously residing in memory.

If the above program in Listing 3.2.1 were stored in a file called benchmark.py, a user would run it exactly like a normal Python program from the command line:

$ python benchmark.py

Nothing else needs to be done in order to both compile and run the program. Remember that the CopperHead compiler is a source-to-source compiler, that is, a Python-source to target-platform/device-source compiler. So after invoking the above on the program, a directory called __pycache__ is created containing folders named after UUIDs (Universally Unique IDs). Each of these folders contains further folders, which in turn contain the sources generated by the compiler. They are not human readable enough to be further modified by hand, so the claims in the 2010 publication [4] are not exactly right anymore. It would be a great advantage if this were possible: after prototyping using CopperHead and then deciding to move on to native platforms, sources for the native platforms would be readily available, based on a correct implementation witness. This would simplify the task of a parallel program developer, who could start from a correct implementation and arrive at a more optimized version based on his/her knowledge of the problem at hand [4]. Unfortunately, this is not the case.

From Listing 3.2.2, produced after running the benchmark script above, it is clear which files are the GPU sources and binaries and which are the host code and binary files. In the following sections, we will explore the GPU-specific sources output by CopperHead and will leave out the other target programs, since they are similar except that they use the corresponding target frameworks/libraries to generate the parallel code.


mahfoudh@fractus ~/copperhead_stuff/copperhead/samples # tree -L 4 __pycache__/
__pycache__/
    benchmark.py
    ident
        39bb3db8b548318da0f93ad1d40260e9
            gpu.cu
            gpu.o
            info
        778c2bd158be4b5fb58aec19b81004f5
            codepy.temp.778c2bd158be4b5fb58aec19b81004f5.module.so
            cuinfo
            info
            module.cpp
            module.o

Listing 3.2.2: Sample cache directory structure after running the above benchmark

While the extra files (info, cuinfo) in both directories are CopperHead-specific files used to keep track of some library files and/or symbols, the other files are the host code generated by CopperHead (module.cpp) and the GPU CUDA code it drives (gpu.cu); neither is clear about how it maps back to the original Python code, though. As per the CopperHead 2010 paper [4], the generated code above is directly includable in target programs; however, our exploration in this direction proved difficult. It is also important to note that the code produced by CopperHead relies on CopperHead's header files to compile correctly. Once both the host driver code and the GPU code are compiled by CopperHead, they are both runnable, provided the needed CopperHead header files are on the path; luckily this is set up once CopperHead is installed on the system. A sample GPU source output for the above benchmark is shown in Listing 3.2.3, and the host code to launch it is shown in Listing 3.2.4.


#include <prelude/prelude.h>
#include <prelude/runtime/cunp.hpp>
#include <prelude/runtime/make_cuarray.hpp>
#include <prelude/runtime/make_sequence.hpp>
#include <prelude/runtime/tuple_utilities.hpp>
using namespace copperhead;
#include "prelude/primitives/map.h"

namespace _ident_9348976088365643762 {

template<typename a>
__device__ a _ident_e(a _xi) { typedef a T_xi; return _xi; }

template<typename a>
struct fn_ident_e {
    typedef a result_type;
    __device__ a operator()(a _xi) { typedef a T_xi;
        return _ident_e(_xi); } };

sp_cuarray _ident(sp_cuarray ary_x) {
    typedef sp_cuarray Tary_x;
    typedef sequence<cuda_tag, float> T_x;
    T_x _x = make_sequence<sequence<cuda_tag, float> >
        (ary_x, cuda_tag(), false);
    typedef transformed_sequence<fn_ident_e<float>,
        thrust::tuple<T_x> > Tresult;
    Tresult result = map1(fn_ident_e<float>(), _x);
    typedef sp_cuarray Tarycompresult;
    Tarycompresult arycompresult = phase_boundary(result);
    typedef sequence<cuda_tag, float> Tcompresult;
    return arycompresult;
}}

Listing 3.2.3: GPU src output from the CopperHead program above


#define BOOST_PYTHON_MAX_ARITY 10
#include <boost/python.hpp>
#include <prelude/runtime/cunp.hpp>
#include <prelude/runtime/cuarray.hpp>
using namespace copperhead;
#include <cuda.h>

namespace _ident_9348976088365643762 {
sp_cuarray _ident(sp_cuarray ary_x);
}

using namespace _ident_9348976088365643762;

BOOST_PYTHON_MODULE(module)
{
    boost::python::def("_ident", &_ident);
}

Listing 3.2.4: Host source output from the CopperHead program above, used to run the GPU program in Listing 3.2.3

3.2.1 Compiler Architecture Abstraction [4, 3]

The CopperHead compiler is composed of three major parts: the front end, the mid section, and the back-ends. The front end parses and transforms the program in preparation for generating a schedule, inferring the most generic types for functions in a Hindley-Milner style. The mid section analyzes the program's AST (Abstract Syntax Tree) and produces a schedule. A set of back-ends is responsible for generating sources that are directly compilable and executable on the target platforms, i.e. CUDA C for places.gpu0, OpenMP for places.openmp, and/or Intel TBB-compatible CPUs for places.tbb. There is one more place, not mentioned above but found in one of the sample codes: places.here, where code is executed natively on Python's interpreter just like any other sequential Python program [4, 3]. This is essential for early code debugging with any Python debugger before deploying to the target platforms. Between parsing the program (the front end) and producing the output source program (the back-end), the mid section of the compiler analyzes the program in order to produce a schedule, synchronization points, the extents of data structures, and which functions to run before or after others. This is done to obtain maximum performance while still producing correct results [4].

3.3 Restrictions, Constraints and Language Specifications [4]

According to the CopperHead paper [4], there are constraints and specifications that must be complied with by the developer when programming CopperHead functions, or any of their nested functions. This seems simple at first glance, but when we experimented with it, it was a source of many problems, especially when several steps are needed to build up the final value of a certain variable. We ended up numbering the same variable name several times, taking the previous values as read-only to construct the remaining parts of a complex calculation. This caused the code to lose readability, and productivity was reduced due to the increased lines of code compared to normal Python. The following is a list of the constraints and specifications set by the framework:

Conditionals: every branch of any conditional/execution path inside any CopperHead function has to return, i.e. end with the return keyword.

Immutable variables: once a variable is bound to a value it is immutable henceforth; for example, result = 1 * 2 binds (not assigns, as in regular Python) result to the value 2, and it cannot be changed afterwards in any way. It can, however, be used as a read-only value in expressions.

Functions have no effects: functions should not interact in an observable manner with the enclosing functions except through the data passed to them for processing.

Static typing: all types must be statically typed, and hence processed by the CopperHead compiler; no dynamic types are allowed inside CopperHead code. An example is x = np.arange(0, length, dtype=np.float32); note that the type is stated explicitly and assigned statically to all elements of the array.

Tuples: tuples are accessed using the binding form; indexing over tuples is not supported by CopperHead. Please refer to Listing 3.3.1 for clarification, assuming the example code lies inside a CopperHead procedure.


myTuple = (one, two, three)
...
one_var, two_var, three_var = myTuple
# Invalid tuple indexing follows
# one_var = myTuple[0]
# two_var = myTuple[1]
# three_var = myTuple[2]

Listing 3.3.1: valid and invalid examples of tuple binding/access inside a Copperhead procedure

It is important to mention that all valid Python programs are also valid CopperHead programs, provided they conform to the constraints above and the specification stated in the next section.

3.3.1 Language Specifications [4]

These specifications are completely based on what is stated in the CopperHead 2010 paper [4]; they still apply to newer versions of the framework. The symbols used in the language specification are:

E = expression
S = statement (lower-case letters denote identifiers)
F = function-valued expression
A = array-valued expression

The following is the language specification, using the symbols as stated above:

Expressions: E : x | (E1, ..., En) | A[E] | True | False | integer | floatnumber; that is, the accepted expressions in the CopperHead embedded language are identifiers, tuples of expressions, array accesses, the boolean True and False values, and integer or float literals.

Arithmetic operators allowed: all that are allowed in Python, i.e. E : E1 + E2 | E1 < E2 | ...

Flow control and logical operators: Python's conditional expressions are allowed, as well as keywords like not, and, or; that is, E : not E | E1 and E2 | E1 or E2 | E1 if Eb else E2

Function definitions: only functions and lambdas with positional arguments (i.e. arguments that do not have the "var=value" form, much like Java function arguments without the type declaration); i.e. E : F(E1, E2, ..., En) | lambda x1, ..., xn

List comprehensions: a subset is allowed, with map as the main primitive of CopperHead: E : map(F, A1, ..., An) | [E for x in A] | [E for x1, ..., xn in zip(A1, ..., An)]

Statements: the body of any CopperHead procedure is a suite (i.e. a sequence or nested sequences) of statements. Each statement must be a valid expression of the following form: S : return E | x1, ..., xn = E | if E: suite else: suite | def f(x1, ..., xn): suite. The second form is the tuple binding/unboxing we just saw in the constraints subsection.

Additional notes: CopperHead programs must remain valid programs regardless of statement re-ordering. Both re-ordering and the immutability of variables allow CopperHead to re-order and perform transformations that lead to more efficient code. The only orderings CopperHead guarantees are those imposed by data dependencies, which removes much of the worry about synchronization points from developers' minds. CopperHead performs a series of transformations to determine synchronization points, inserts as few of them as possible to maintain performance, and determines the execution schedule of the parallel primitives generated in the target code. It relies primarily on barrier synchronization, since it assumes SIMD (Single Instruction Multiple Data, i.e. data-parallel) target devices. [4]

3.3.2 How it determines Synchronization points [4]

In the CopperHead paper [4], the authors call "x[i] complete when its value has been determined and may be used as input to subsequent operations. The array x is complete when all its values are complete." It follows that incomplete means any output for which at least one element cannot be guaranteed to be complete.

The way to determine whether synchronization is needed between different parallel primitives is by classifying all primitives into two main classes; each primitive declares which class it belongs to. Parallel primitives that do induction over the domain of their inputs, i.e. permute, scatter and their variants, cannot guarantee that their output is complete until the entire operation has finished, so their output is either entirely complete or entirely incomplete; this class requires synchronization to obtain correct output. The second class is composed of primitives that operate by induction over the domain of their outputs, i.e. all other primitives such as map, gather, scan and their variants. This second class requires more analysis in order to determine whether synchronization is needed or not.

Primitives in the latter class, i.e. those doing induction over the domain of their outputs, are further classified into three cases according to the portion of each input yj that must be complete in order to compute the output element x[i] [4]:

1. Local: completing x[i] requires yj[i] to be complete. [4]

2. Global: completing x[i] requires yj to be entirely complete. [4]

3. None: x[i] does not depend on any element of yj. [4]

This is also a limitation of CopperHead as of the 2010 paper [4], since further semantic analysis would be needed to exploit knowledge in the spectrum between the local and global cases. [4]

3.3.3 Shape analysis [4]

Shape analysis is the process of determining the extents, i.e. sizes, and types of the data involved in the parallel computation and the storage needs of their intermediate values. This is necessary in order to help the back-end of the compiler, and ultimately the target device/platform, decide statically about memory allocation needs and, as a result, to generate more efficient and higher-performing target code.

This is achieved by requiring every parallel primitive to declare the "shape" of its input and of its output, by providing a shape function that maps input shape to output shape. These specifications are all stated as tuples, entirely internally by the CopperHead compiler and runtime, and require no input from the user other than using the primitives themselves while conforming to all the language specifications and constraints stated earlier. Also, not every shape analysis yields a result, and hence shape analysis is itself a limitation on CopperHead's ability to produce better, more efficient and higher-performing code.

CopperHead performs a series of transformations, among which are those done to determine synchronization points, in order to produce best-effort efficient code with the highest achievable performance.


3.4 Performance and charts [4]

Since CopperHead's implementation has changed a lot from the original published work in [4], we benchmarked an included Blackscholes sample against its counterpart CUDA C Runtime API implementation, to gauge the performance of CopperHead's output source code compared to a manually hand-crafted implementation. (We used the bundled sample because we could not complete our own implementation due to errors emitted by the CopperHead compiler that we had no reference to fix; this is in fact a limitation of the framework, which emits nonsensical error messages and stack traces for the simplest programmer mistakes, and these are normally not traceable back to the original code.) Both implementations are single-GPU enabled. The system used to test them has a GTX 440 with 1 GB of DDR3 global memory and will be described in the next chapter, Chapter 4. Samples ranged from 4 thousand to 32 million calculations, with sizes ranging from several kilobytes all the way to around 2 gigabytes in order to expose memory transfer latency. The huge sample sizes are for gauging how much acceleration CopperHead gains compared to a streamed version of the benchmark, i.e. one that uses asynchronous streams with transfers small enough to buffer just enough data to the GPU's streaming multiprocessors to keep them busy while another subset of the overall workload is transferred, overlapping computation with transfers and transfers with other transfers.


Chapter 4

Black-Scholes Benchmark: The Exercise


4.1 Introduction

Initially, the goal of this work was to port as many benchmarks from the PARSEC benchmark suite as possible, in an effort to recognize parallel programming patterns and document them to ease the task of parallel programmers and HPC specialists. As the original goal matured, we recalled that parallel primitives and patterns have already been catalogued, and that advanced technologies such as the evaluated frameworks, tool-chains, parallelizing compilers and auto-tuners already exist. So, instead of re-inventing the wheel, we decided to take advantage of them and propose the best way we could come up with to rapidly prototype parallel and HPC applications, making them more approachable for mainstream programmers. The first benchmark from the original proposal to be ported to the GPU architecture was Blackscholes. We did not realize it was an embarrassingly parallel benchmark until we were almost done porting it. As a result, we made it our test benchmark for evaluating the tools at hand, since it imposes no limitation of its own on any of them.

The goal, after that, was not to tackle a hard problem to port to the GPU; the port is merely an exercise to illustrate the issues to be considered when evaluating prototyping with one or more of the parallel programming frameworks at hand. Techniques for optimizing parallel programs on different devices have also been visited and revisited in Kirk et al. [14], Wen-mei et al. [24] and other publications. Another aim was to come up with a rapid-prototyping work-flow for parallel and HPC programs based on some of the frameworks examined during this exercise. Blackscholes was kept as our benchmark of choice because it is easy to port and parallelize (embarrassingly parallel) while keeping the focus on the task at hand (assessing the use of streams, and work distribution between two or more GPUs, two or more CPUs, or a combination of both). Furthermore, it will be easy to extend to multiple host machines (nodes) using, e.g., MPI or MPI for Python in order to assess the success or failure of the proposed work-flow. Even so, we only managed to benchmark on a single host and a single GPU, due to limitations of some of the frameworks, i.e. CopperHead supporting only one GPU per node. The PARSEC implementation of Blackscholes was chosen over all others for several reasons. One is that the exercise then contributes some GPU support to the currently prevalent CPU-world benchmark suite. Another is the clarity of the procedural, incremental approach it uses to calculate its final results, which aids understanding of the problem addressed by the benchmark and makes it easier to pinpoint its parallel parts. A final reason is that, at the end, the product of this exercise (the CUDA


C version of Blackscholes) can be contributed to and integrated seamlessly with the PARSEC suite and modified easily, since it uses the same input data format(s), produces very similar output, and has a modular architecture that is easy to modify and extend to multiple GPUs or even a GPU cluster. Only a single-GPU version was developed in CUDA C; extending it to multiple cores and multiple MPI processes using the chosen rapid-prototyping framework(s) after evaluation, as a practical example of their effectiveness, was left out since it is practically trivial.

4.1.1 The reason(s) behind porting this benchmark

As mentioned in the previous subsection, Blackscholes is embarrassingly parallel and allows us to focus on assessing streams and other aspects of GPU computing. There is theoretically no limit on scalability and there are no synchronization constraints, other than hardware availability, that the benchmark could impose on any specific device or technology under investigation [7, 8, 15, 14]. It can easily be made modular, for easy experimentation and parameter variation, to evaluate multiple CUDA hardware-specific features, limitations and inconsistencies among different device capabilities. This, indeed, allowed the development of an adaptive version that makes the most of the maximum shared memory available to a block, the threads' registers, and the number of streaming multiprocessors. Most importantly, we wanted to gauge the lines of code (LOCs) and the time taken to write it compared to the implementations in the framework(s) of choice (a productivity perspective, an early indication of performance scalability, and rough numbers). To get a preliminary sense of how things are similar or different between CUDA and the other framework(s), we needed the simplest and most flexible application that would give us room to play with different parameters and approaches to achieve the best performance and to exercise most of the facilities each technology provides. This is in preparation for rapid prototyping of yet-unknown applications, and to gain more thorough application-development experience so that the work-flow can be used in larger, more serious projects in the future.

4.1.2 Previous implementations/references

There are several previous implementations that aided our understanding of the calculations for different types of stocks. The following is the list from which we picked PARSEC's, for several reasons.


• nVidia implementation: it aided in understanding the CNDF (Cumulative Normal Distribution Function) used in Blackscholes European call and put price prediction for stocks/assets.

• PARSEC CPU implementation [7, 8]: aided the overall understanding of the computation, has a clearer CNDF implementation, and comes with a prepared, valid set of input values that eases testing the program on multiple data sizes. The input bundled with the application developed here comes entirely from PARSEC [7, 8].

• CopperHead preliminary implementation [3]: an implementation of the same application that is almost identical to nVidia's simplistic one, except that it is written in CopperHead. We elected not to base our port on it because it has no controlled means of feeding the PARSEC-provided values [7, 8] into the computation kernel, and the kernel is not exactly compatible with that input structure.

• There were a surprising number of other implementations, yet the goal was to benefit PARSEC (which lacks GPU benchmarks) from this exercise, and to assess different aspects of the frameworks evaluated for rapid prototyping by taking a more modular approach to implementing the application.

That being said, we ended up using the bundled CopperHead version for benchmarking, since we hit a dilemma we could not overcome, owing to the poor error reporting and documentation of the CopperHead implementation. Documentation of why things went wrong, or how to possibly fix them, was in no way conveyed to the end user, the programmer; the errors were not even traceable to any variable, function, or specific place in the original Python+CopperHead code. After spending all the time allocated for this exercise, we still needed some way to gauge CopperHead's performance against the other frameworks, so we picked the bundled sample to show some realistic numbers instead of leaving that column vacant.

4.1.3 Suitability for GPU/CPU acceleration

This specific application is one of the most suitable for parallelization on any platform; the following are some of the reasons why:

• It is an embarrassingly parallel, single-dimension problem (an array of input structs and an array of output floating-point numbers). Note that floating-point precision is ignored in this experiment: the use of float versus double is not taken into consideration when evaluating the frameworks, except for whether they support these types at all (as a feature only).

• Some other parallel algorithms and/or atomic operations depend on newer hardware; if the hardware available on our development machines does not support such modern features, the whole project could come to a halt. We have faced this in previous projects and so are cautious about it; Blackscholes does not raise this concern when evaluating our subjects.

• There is a variety of existing implementations, so if one attempt fails we can refer to another without losing much time and effort.

• It lets us focus on aspects of the frameworks rather than being distracted by problem-specific computational details.

4.1.4 Blackscholes operation abstraction: Overview of how the algorithm works

The overall operation of the application is very straightforward. The finer details are left out of this report, since they are covered and well documented in the source code; the following are the major abstract steps the program takes to perform its purpose (a minimal host-side sketch follows the list):

1. Read input from a file filled with input data in a specific format/layout [7, 8] (mostly numbers) and store a struct/tuple for each set of parameters.

2. Do work distribution and mapping to functions, device(s), blocks, and threads.

3. Transfer the input data to the heterogeneous memory spaces to prepare for parallel processing.

4. Do the computation on the heterogeneous devices using the applicable parallel frameworks, e.g. CUDA C on GPUs, OpenMP on multi-cores, MPI on multi-node clusters.

5. After computation is done on all heterogeneous devices, transfer results from the heterogeneous memory spaces back to host memory in preparation for post-processing and reporting.

6. Report results and timings of the computation(s) done.
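The sketch below shows steps 2 to 6 for a single GPU, in the synchronous (non-streamed) form. All names (OptionData, bs_kernel, run_blackscholes) are illustrative only, not the project's actual identifiers, and the kernel body is a placeholder; the pricing math itself is sketched under Section 4.1.5.

    // Minimal single-GPU host flow: allocate, copy in, launch, copy back, report.
    #include <cuda_runtime.h>
    #include <cstdio>

    struct OptionData { float S, X, r, v, T; char otype; };

    __global__ void bs_kernel(const OptionData* in, float* out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i].S;                  // placeholder computation
    }

    void run_blackscholes(const OptionData* h_in, float* h_out, int n) {
        OptionData* d_in = nullptr; float* d_out = nullptr;
        cudaMalloc(&d_in,  n * sizeof(OptionData));                              // step 3: device allocation
        cudaMalloc(&d_out, n * sizeof(float));
        cudaMemcpy(d_in, h_in, n * sizeof(OptionData), cudaMemcpyHostToDevice);  // step 3: host-to-device copy
        int block = 256, grid = (n + block - 1) / block;                         // step 2: work distribution
        bs_kernel<<<grid, block>>>(d_in, d_out, n);                              // step 4: computation
        cudaMemcpy(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);     // step 5: device-to-host copy
        printf("computed %d option prices\n", n);                                // step 6: report
        cudaFree(d_in); cudaFree(d_out);
    }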


4.1.5 Mathematical background[7, 8]

Blackscholes is a mathematical prediction model for asset/stock prices [8]. It is used to predict or approximate a stock's price when some of the many parameters of the asset/stock change [8]. Its importance stems from the fact that a great many calculations per unit of time must be completed before making a financial decision about the stocks in question. The benchmark was created because it models a recurring real-world problem, and it still needed to be parallelized on the GPU to benefit from that processing power instead of relying on more expensive, lower-performing CPUs.

Blackscholes computation relies on multiple formulas and parameters explained below:

S: current stock price

X: the strike price

r: the risk-free interest rate

v: the volatility of the stock

T: Time as a fraction of a single year e.g. 0.5 years.

otype: the option type; accepts two values, P (for put) or C (for call). This parameter determines which of the formulas below is used.

Two derivatives, whose values are computed as

$d_1 = \dfrac{\log(S/X) + (r + \frac{v^2}{2})T}{v\sqrt{T}}$  and  $d_2 = \dfrac{\log(S/X) + (r - \frac{v^2}{2})T}{v\sqrt{T}}$

Stock price formulae: according to the option type, one formula is chosen:

$price_{call} = S \times CND(d_1) - X \times e^{-rT} \times CND(d_2)$

$price_{put} = X \times e^{-rT} \times CND(-d_2) - S \times CND(-d_1)$

CND/CNDF: a function that calculates the Cumulative Normal Distribution for a given derivative, used in the stock price calculation; it satisfies $CND(-d) = 1 - CND(d)$.

In brief, the algorithm reads sets of the above parameters (i.e. S, X, r, v, T, otype) from an input file. The application then transfers the values read, either as an array of tuples or as a tuple of arrays. After that, it performs the computation according to the functions stated above, with the formula chosen based on otype. Finally, the results, a mix of put and call prices, are stored in an output array, transferred from the heterogeneous memory spaces back to the host, and written to an output file.
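The following is a hedged sketch of the pricing math above as CUDA device functions. The CND here uses the error function for brevity, whereas the PARSEC CNDF uses its own polynomial approximation, so this is illustrative only and not the benchmark's exact code; cnd and black_scholes_price are our own names.

    // CND(d) = 0.5 * erfc(-d / sqrt(2)); illustrative, not PARSEC's polynomial CNDF.
    __device__ float cnd(float d) {
        return 0.5f * erfcf(-d * 0.70710678f);
    }

    __device__ float black_scholes_price(float S, float X, float r,
                                          float v, float T, char otype) {
        float d1 = (logf(S / X) + (r + 0.5f * v * v) * T) / (v * sqrtf(T));
        float d2 = d1 - v * sqrtf(T);          // algebraically equal to the d2 formula above
        float discX = X * expf(-r * T);        // discounted strike price
        if (otype == 'C')
            return S * cnd(d1) - discX * cnd(d2);          // call price
        else
            return discX * cnd(-d2) - S * cnd(-d1);        // put price, uses CND(-d) = 1 - CND(d)
    }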

4.2 CUDA C Implementation (similar frameworks apply)

In this section, we go through all the activity needed to develop for the GPU. Other, similar frameworks such as PyCUDA [1, 2] and JCuda [22] use an almost identical work-flow (differing only in doing it in a slightly more object-oriented way). Other parallel processing frameworks, i.e. POSIX threads, MPI and OpenMP, are left out, since there are already plenty of resources, references and examples showing how to achieve parallelism with them.

4.2.1 Mapping the algorithm to GPU blocks and threads

Choosing an embarrassingly parallel problem is the reason this step, mapping the problem onto threads and blocks, is easy [15, 14, 18]. In the GPU compute model, i.e. SIMT (Single Instruction Multiple Threads, a data-parallel model like SIMD), communication and data dependencies between blocks of threads, between different devices, and between different kernels are the source of performance drops and many of the complications of parallel programs [15, 14].

The high-level diagram in Figure 4.1 illustrates the steps a CUDA C program must take in general to enable parallel processing on such devices, and in particular how Blackscholes does so. To keep it simple, the diagram assumes the input test data are already loaded in host memory.

Pinned memory allocates and locks the allocated host memory for the GPU (making it available exclusively to the GPU device) so that streams can be used. This allows overlapping computation with data transfers, and data transfers with other data transfers, i.e. achieving a degree of asynchronicity. Pinned memory is not pageable: that amount of memory is reserved for exclusive use by the GPU, so the GPU can issue requests to access that memory block directly, without the host having to issue the request explicitly. This is required to enable asynchronous memory transfers over asynchronous streams, which yield higher performance by overlapping data transfers with computation. The details of doing this are out of the scope of this work, so it is discussed only in the context of Thrust versus manual CUDA C development.
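As a minimal illustration of the pinned allocation itself (the buffer size and names are hypothetical, not taken from the benchmark):

    #include <cuda_runtime.h>
    #include <cstdio>

    int main() {
        const size_t bytes = size_t(64) << 20;   // 64 MiB, illustrative size
        float* h_buf = nullptr;
        // Page-locked (pinned) allocation: the buffer stays resident, so the GPU's
        // DMA engines can access it directly and cudaMemcpyAsync from/to it can
        // overlap with kernels; a pageable buffer would force synchronous staging.
        if (cudaMallocHost(&h_buf, bytes) != cudaSuccess) {
            printf("pinned allocation failed\n");
            return 1;
        }
        // ... use h_buf as the source/destination of asynchronous copies ...
        cudaFreeHost(h_buf);
        return 0;
    }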


We need a little glossary before explaining more about overlapping data transfers with computation:

• Stream: a series of operations executed in order as issued by the host code; operations from different streams may be interleaved and can execute concurrently [16]. Streams are analogous to work queues from which the consumer picks whenever resources are available on the GPU device.

• Default stream (also called the null stream): this one is different, since no operation issued in it will execute until all previously issued operations in all other streams have completed, and an operation issued in the default stream must complete before any operation from any other stream can execute. In other words, it is a synchronizing stream [16].

It is reported and demonstrated in [16] that re-ordering and mix-matching the issuance of kernel launches versus data transfers yields performance results that are largely inconsistent from one device to another. This inconsistency in stream behavior across devices is due to each device having more or fewer data-transfer engines. Older devices have only one engine for Host-to-Device and Device-to-Host memory transfers and another for kernel launches; as a result, data traveling in different directions cannot use asynchronous transfers to increase performance. Newer devices have two data-transfer engines, which let such transfers overlap, plus one kernel engine to overlap computation with data transfers. According to Mark Harris [16], launch versus transfer ordering gives consistent results on devices of compute capability 3.5 or later, since the Hyper-Q feature removes the inconsistency regardless of the order of launches versus transfers. So the use of streams, especially in libraries like Thrust, which is mostly based on the default stream, can exploit this feature without fear of inconsistent performance. This implies that Thrust, once it utilizes asynchronous streams, overlaps data transfer with computation, and synchronizes based on events, will be future-proof without needing device-specific tuning and optimization from the programmer's side. Streams are used in this exercise but are explored more thoroughly in [16, 18, 15]; please see Section 4.2.3 for more information. Also, since Thrust can, and may, use asynchronous streams in the future to improve the performance of its algorithms and primitives, they are left out of the Thrust version of this work on the assumption that Thrust will adopt them in future releases.
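The sketch below shows the overlap pattern just described, not the benchmark's actual code: the workload is split into chunks, and each chunk's host-to-device copy, kernel launch, and device-to-host copy are issued into its own stream. h_in and h_out are assumed to be pinned (allocated with cudaMallocHost) so the copies are truly asynchronous; OptionData and bs_kernel are the illustrative definitions from the Section 4.1.4 sketch.

    #include <cuda_runtime.h>

    void run_streamed(const OptionData* h_in, float* h_out, int n) {
        const int kStreams = 4;                       // illustrative stream count
        cudaStream_t streams[kStreams];
        for (int s = 0; s < kStreams; ++s) cudaStreamCreate(&streams[s]);

        OptionData* d_in = nullptr; float* d_out = nullptr;
        cudaMalloc(&d_in,  n * sizeof(OptionData));
        cudaMalloc(&d_out, n * sizeof(float));

        int chunk = (n + kStreams - 1) / kStreams;
        for (int s = 0; s < kStreams; ++s) {
            int off = s * chunk;
            if (off >= n) break;
            int cnt = (off + chunk <= n) ? chunk : (n - off);
            // Each stream's copy-in, kernel, and copy-out can overlap with other
            // streams' work on devices that have separate copy engines.
            cudaMemcpyAsync(d_in + off, h_in + off, cnt * sizeof(OptionData),
                            cudaMemcpyHostToDevice, streams[s]);
            int block = 256, grid = (cnt + block - 1) / block;
            bs_kernel<<<grid, block, 0, streams[s]>>>(d_in + off, d_out + off, cnt);
            cudaMemcpyAsync(h_out + off, d_out + off, cnt * sizeof(float),
                            cudaMemcpyDeviceToHost, streams[s]);
        }
        cudaDeviceSynchronize();                      // wait for all streams to finish
        for (int s = 0; s < kStreams; ++s) cudaStreamDestroy(streams[s]);
        cudaFree(d_in); cudaFree(d_out);
    }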


4.2.2 Trade-offs considered

During the time spent understanding and collecting information about the benchmark, several trade-offs came up. The first was that the constants in the CNDF were reduced to the precision provided by the PARSEC implementation, to stay compatible with their CPU version; as a result we had to accept a lower-precision calculation for the sake of compatibility. We had several candidate implementations to start from, but decided to pick the less elegant PARSEC CNDF function so that input versus output could be validated against PARSEC's CPU implementation. The PARSEC implementation [8, 7] is trickier to implement in the targeted rapid-prototyping frameworks of choice, covered earlier in Chapter 3, and it exposes the restrictions and constraints imposed by CopperHead. It is more straightforward to divide into modules when using CUDA, i.e. into several device functions that compose the overall computation and kernel. Moreover, it maps directly onto Thrust's transform function as-is, without a single retouch.

4.2.3 Synchronization and communication between streams

Implementing an embarrassingly parallel benchmark involves almost no synchronization except when using streams, which proved tricky in the practical exercise. Streams can, however, be synchronized selectively among themselves using events [16]. Moreover, threads neither create bank conflicts in shared memory, because strides give each thread an exclusive set of data to process in parallel, nor require any synchronization when accessing global memory to load data and write results. The latter is again due to the use of strides, focusing each block on an exclusive range and using memory-coalesced accesses, both to preserve performance and to avoid data races between threads that could occur from bounds miscalculations (data races are the hardest bugs to diagnose and fix, so we tried to avoid them at every step of the implementation).
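A hedged sketch of the selective, event-based synchronization between streams mentioned above follows; the stream and event names are hypothetical. Stream b waits only on the point recorded in stream a, without serializing any other stream.

    #include <cuda_runtime.h>

    void sync_two_streams_example(cudaStream_t a, cudaStream_t b) {
        cudaEvent_t done_a;
        cudaEventCreateWithFlags(&done_a, cudaEventDisableTiming);

        // ... work issued into stream a (copies / kernels) ...
        cudaEventRecord(done_a, a);         // mark the point stream b must wait for
        cudaStreamWaitEvent(b, done_a, 0);  // work issued into b after this waits for a's event only
        // ... work issued into stream b that depends on a's results ...

        cudaEventDestroy(done_a);
    }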

4.3 CUDA C version

We first ported the PARSEC CPU version [8] to the GPU using the CUDA C Runtime API, to see how much performance we could achieve with all the effort we put into this implementation. We also did this to check how much time, and how many lines of code, it takes to produce such a simple benchmark using said API. This is in preparation for comparing the productivity versus the performance achieved, and the understanding gained, from a prototype


done in our elected prototyping frameworks, i.e. CopperHead and Thrust. We do not expect to achieve higher performance than nVidia's implementation, but we will compare our implementations' performance to see how much of it can be achieved in an easier way than with the CUDA C Runtime API.

4.4 Copperhead Version

We tried to make our CopperHead version of the Blackscholes benchmark resemble the CUDA C one as closely as possible. We nevertheless failed to come up with a working version before hitting the time limit allocated for this exercise. This is mainly due to the lack of support and resources and the lack of meaningful error messages from the CopperHead compiler. The sparse documentation of CopperHead and of how to use it also makes it difficult to work out what went wrong; the errors and exceptions encountered took weeks of guesswork. As a result, we recommend against going the CopperHead route, as it may take too long to reach even a working example/prototype given how little support a programmer can get to make things work. So we decided to produce a Thrust version and abandon the Python path, since we had no working CopperHead prototype. We also think PyCUDA is not the best platform for an optimized version of the prototype, even if we had a working one, not to mention that a CopperHead prototype would be a throw-away prototype, and we want to save time like any other programmer.

4.5 Thrust Version

In this version, we use Thrust-only primitives to re-implement the benchmark and optimize only with that framework's primitives. We also reuse parts of the CUDA C implementation for reading input files and loading them into class objects/structs ready for processing, following the best practices advised by the Thrust developers and documentation. We kept the different versions of this benchmark as similar to each other as possible to ease optimization and comparison.
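As a hedged sketch of what "mapping onto thrust::transform" (mentioned in Section 4.2.2) looks like, the per-option computation becomes a functor applied over device vectors. OptionData and black_scholes_price are the illustrative definitions sketched earlier, not the benchmark's actual identifiers.

    #include <thrust/host_vector.h>
    #include <thrust/device_vector.h>
    #include <thrust/transform.h>

    struct BlackScholesOp {
        __device__ float operator()(const OptionData& o) const {
            return black_scholes_price(o.S, o.X, o.r, o.v, o.T, o.otype);
        }
    };

    void price_with_thrust(const thrust::host_vector<OptionData>& h_in,
                           thrust::host_vector<float>& h_out) {
        thrust::device_vector<OptionData> d_in = h_in;   // host-to-device copy
        thrust::device_vector<float> d_out(d_in.size());
        thrust::transform(d_in.begin(), d_in.end(),       // per-element map on the GPU
                          d_out.begin(), BlackScholesOp());
        h_out = d_out;                                    // device-to-host copy
    }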

4.6 Comparison

We set our baseline for the performance comparison to be the hand-crafted NVidia implementation of Blackscholes in CUDA C. We first compare its performance to that of our CUDA C version, then to the Thrust version of Blackscholes. We then summarize our findings to rate how successful the Thrust version was in terms of performance and of the time saved in coming up with a working program.

4.6.1 Performance and Productivity comparisons

Figures 4.2 and 4.3 below compare the performance of multiple versions of Blackscholes, ported using different frameworks but kept as close to one another as possible. The performance comparison is followed by a rough productivity comparison based on two measures, lines of code and the time taken to implement the benchmark in the target framework; please refer to Table 4.2. All benchmarks were run on an NVidia GTX 440 card with 1 GB of DDR3 buffer memory; its compute capability is 2.1, so the code was compiled and optimized accordingly.

[Chart: "Performance Per Implementation (lower is better)"; x-axis: number of stocks calculated (4k to 512k); y-axis: time (ms); series: CopperHead (converted to ms), thrust-gpu (ms), cuda (ms), cuda+streams+async (ms).]

Figure 4.2: Performance comparison of multiple implementations, each done in a target framework (smaller data sets).


[Chart: "Performance Per Implementation (lower is better)"; x-axis: number of stocks calculated (1M to 32M); y-axis: time (ms); series: CopperHead (converted to ms), thrust-gpu (ms), cuda (ms), cuda+streams+async (ms).]

Figure 4.3: Performance comparison of multiple implementations, each done in a target framework (larger data sets).

It can clearly be seen that CopperHead performed the worst of all implementations on the smaller sample sizes, up until the sample size reaches 1 million, at which point it becomes significantly faster than Thrust. This is where optimizing the Thrust version with CUDA C comes into play, making it again much faster than the CopperHead version. We expect that with even larger sample sizes only the CUDA C versions, specifically the one optimized with streams and asynchronous transfers, would hold up as high performers compared to the slower CopperHead and Thrust versions. It is unclear how Thrust manages data transfers, so we speculate from the chart that Thrust gets slower as data sizes grow because its managed data transfers are synchronous. We also noticed, after profiling the Thrust version with nvprof (nVidia's command-line profiler), a cudaEventCreate() call that takes around 500 ms to complete. Such profiling was possible for Thrust, whereas for CopperHead it is probably impossible to know, which adds to CopperHead's disadvantage. The charts speak for themselves, and the reader is encouraged to compare the different versions of the benchmark visually. It is important to note that our test bed, which had only 4 GB of RAM, could not run the streamed CUDA version on the largest data set, as it exceeded the host memory limits. However, when run on another host with 48 GB of RAM and a GTX 480, it performed


as fast as the first host did on a data set one fourth the size (i.e. it computed 32 million option prices in around 163 milliseconds, similar to the other host/GPU but on four times the data). In other words, it performed around four times as fast and did not crash.

To give a clearer comparison, since the charts cannot adequately reflect the CUDA C performance numbers, Table 4.1 lists the precise times taken to compute the same sample sizes in the different implementations. Note that the table includes an additional CUDA C version that was manually optimized with streams to obtain the best performance. Since the CUDA versions adapt to the device parameters, this is reflected in performance, especially for the streamed version: we tried both on another graphics card, and they ran significantly faster on a GTX 480 than on a GTX 440, which confirms that they indeed scale with the device parameters.

Sample Size   Human Readable   CopperHead     Thrust    CUDA C    CUDA C streams+Async
4096          4k               44.88587379    0.724384  0.691744  0.205888
16384         16k              45.41301727    1.3857    1.26854   0.347104
32768         32k              46.18692398    2.87078   1.91962   0.526144
65536         64k              46.37241364    5.06429   3.09114   0.976672
131072        128k             47.52731323    9.25776   5.46304   2.87744
262144        256k             51.02586746    18.0697   9.79046   3.65136
524288        512k             57.47961998    34.5461   17.9805   8.1735
1048576       1M               72.02792168    66.9648   34.3157   15.2245
2097152       2M               100.7580757    132.897   67.0258   30.7764
4194304       4M               102.6661396    270.528   132.312   63.7201
8388608       8M               239.0007973    530.885   264.016   160.989
16777216      16M              432.7657223    1105.58   526.818   355.873
33554432      32M              683.4483147    2247.18   1055.06   N/A

Table 4.1: Time (ms) taken to complete the calculations in each implementation.

It is obvious that the CopperHead version performed the worst, in both productivity and performance, on data sets of up to 1 million options. However, it was impressively faster than Thrust starting at a sample size of 2 million, and even faster than the heavily optimized adaptive CUDA C version starting at 4 million stocks. It remains to be seen how it performs against the streamed CUDA implementation on the largest data set, once that result is available.

The reason for CopperHead's poor productivity was the scarce documentation and support information on how to deal with the issues faced, and the cryptic error messages and long stack traces.


Measure         CopperHead                                   Thrust                                      CUDA C            CUDA C + streams
Lines of Code   322 (60 lines of sample code provided        560                                         1080              1248
                with CopperHead)
Time Taken      3-4 weeks (and not done)                     2.5 hrs (from scratch to completely         around a month    around 1.5 months (counting the
                                                             bug free)                                                     base CUDA C version)

Table 4.2: Lines of code required for a framework-specific implementation of the benchmark, and a rough estimate of how long it took to complete and debug it.

These reflect nothing of the programmer's code, so it is difficult to relate such errors to anything that makes sense to end users/developers. Nevertheless, CopperHead as a throw-away prototype gives the strongest indication of performance gains for larger data sets, while Thrust indicates better gains on smaller ones.

4.7 Failure case(s) and solution(s)

The only case in which the benchmark may fail is when the chosen sample is so large that the sample data plus the results do not fit in the host's main memory. Otherwise, even for a sample of, say, 2 GB like the 32-million-option set, the benchmark adapts: it computes one global-memory-full of data/results at a time, buffers the results back to the host, buffers another maximal set of samples to the device, and continues in a loop. In the streamed version of the CUDA implementation this process is itself streamed. That is, using adaptive calculations based on the device parameters, the benchmark sizes each batch as the maximum per-block shared memory multiplied by the number of multiprocessors, streams it to the device, and follows it with a kernel launch. This keeps the multiprocessors busy until the next set of data to process has already been transferred to the device, using the full bandwidth and throughput of the device at all times with minimal transfer delay.
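A hedged sketch of the adaptive sizing just described follows, under the assumption that one batch is sized as the maximum shared memory per block multiplied by the number of multiprocessors; the benchmark's exact policy may differ, and the function name is ours. Only standard CUDA runtime queries are used.

    #include <cuda_runtime.h>
    #include <cstddef>

    size_t elements_per_batch(size_t bytes_per_element) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);            // properties of device 0
        size_t batch_bytes = prop.sharedMemPerBlock * prop.multiProcessorCount;
        return batch_bytes / bytes_per_element;       // elements streamed per batch
    }

The host loop then transfers one such batch asynchronously, launches the kernel on it, and moves on to the next batch, as in the streamed sketch of Section 4.2.1.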

4.8 Summary

In this chapter, we went through a complete development cycle in each of the chosen frameworks to assess their effectiveness in parallelizing code for a specific target device, the GPU. We covered an abstraction of the overall operations performed by the parallel application, the reasons this benchmark was chosen, the mathematical background of the calculations involved, and a comparison of each version's performance, in terms of both running time and productivity.

4.8.1 What has been learned?

Porting the same benchmark using multiple frameworks revealed several lessons. Below is a list of the most significant ones learned from this exercise.

• Fewer dependencies mean a more modular and more scalable parallel application.

• Most bugs come from the index calculations used to "flock" threads (i.e. steer warps of threads towards or away from the data-array indices to be accessed) and blocks towards or away from the elements to compute, based on some attribute.

• Sorting may lessen the impact of divergence (a trade-off between the size of the data to sort and the reduction in divergence).

• Even the simplest kind of parallel application, e.g. Blackscholes, can take a significant amount of time, resources and effort to get working (including the number of lines of code to deal with, memory-management issues, and other language-specific details).

• It is hard to tell in advance whether parallelizing a fragment of code is worthwhile in terms of performance gains, especially when it is less trivial than Blackscholes, so rapid prototyping is indeed needed.

• Using a productivity language does not necessarily translate into a faster development cycle or more certain results. CopperHead, for example, is very easy to use, yet its limited documentation and rarely meaningful error messages made it impossible to port even the simplest benchmark.

• Using Thrust is easier than using the Python-based frameworks, but it produces much slower programs for parallel devices, especially on larger data sets; in other words, using an efficiency language does not always translate into better performance.

• Once a working model, i.e. Thrust in our case, is up and running as a prototype, it is much easier to start from it and produce an optimized version of the program in CUDA C. That said, many bugs are still encountered due to index calculations, bounds calculations, streaming of data, handling of data dependencies, and synchronization.

• A prototype not only gives confidence in, and feedback about, the program at hand; it is also a huge time saver and a way of profiling which parts of a program perform the worst and need optimizing (this applies to Thrust). It focuses optimization effort on the real bottlenecks instead of shooting in the dark based on speculation.

Regardless of how many frameworks and approaches are out there for generating efficient parallel code, manually crafting code using Thrust and then optimizing it leads to the best performance gains. However,


CopperHead, for example, showed that a very similar level of performance is obtainable through code-synthesis techniques, without all the hassle involved in using lower-level API(s). The sheer majority of parallelizing frameworks/tools still do not match manually crafted code; this is especially true from a productivity point of view, while the performance gains from some frameworks are competitive. The latter shows that compiler-based techniques can still contribute more than a proficient programmer is capable of, and that compiler-based optimizations can take over a great deal of the work involved and provide better performance than many developers can.

4.8.2 What was expected, and how do the results compare to expectations?

We held many expectations before conducting the experiments, and the majority of them were proved wrong by the results. We expected that using CopperHead to produce a parallel version of an application would be the easiest route compared to C/C++; we were plainly wrong. The number of constraints and the scarcity of documentation make it nearly impossible to get even the simplest parallel program to work. We expected CopperHead programs to be extensible to multiple GPUs on a single host, and then to more nodes using MPI for Python; we were partially wrong. CopperHead's use of places.gpu0 misled us into assuming there would be a places.gpu1 and beyond. We were, however, correct that PyCUDA could be used to detect the number of available GPUs to help CopperHead extend to multiple GPUs, should such support be added in future versions. We also expected that we could take the source code output by CopperHead as a starting point for our C/C++ code, but we were wrong. The original CopperHead paper [4] does not represent the current version; the implementation differs significantly, and the generated code is neither human-modifiable nor usable, based on our attempts to compile and run it. Perhaps it is usable, but our attempts to compile that code alone, just to plug it somehow into a C/C++ program, failed. One reason is that the host code is written entirely in Python, which means we would have to rewrite the host code in C/C++ to use the generated kernel(s). That may be acceptable when the effort of developing the kernels is significantly larger than developing the host code; nevertheless, stating it again, we could not even compile that kernel code on its own in the first place. We expected Thrust to be more difficult than PyCUDA and/or CopperHead; this was wrong. Thrust is much easier, more flexible, and customizable in a perfectly structured way. Furthermore, it provides much more than the other frameworks, from integration with CUDA C and the Runtime API, to profiling and debugging tool support, to its level of abstraction above the lower-level API details. We expected CopperHead runnables to


perform nearly equally to Thrust's, since CopperHead heavily uses Thrust constructs to generate its code; we were wrong, and Thrust turned out much slower. Our speculation is that CopperHead not only applies transformations to achieve higher performance on larger data sets, but also uses asynchronous streams to buffer just enough data to keep the target device busy until the next set is available (while taking care of dependencies), and then passes those pointers to Thrust functions. This provides a great deal of overlap between computation and/or memory transfers, eliminating delay.

In contrast with the above, where we expected things to work merely conservatively well, some things turned out better than our initial expectations. We expected the mapping from "host-like" functions to parallel computation to be straightforward using Thrust and CopperHead, and it was. We expected the development-speed gain of Thrust over CUDA C to let us produce a working prototype, one that could be evolved into an optimized version, within a week. That did happen, and the time spent was astonishingly short compared to producing a working prototype in the other frameworks, CUDA C included: it took only 2.5 hours to program it from scratch to a fully working, bug-free prototype. We expected that parts of the Thrust version could be kept intact and reused in the final optimized product; this worked perfectly, and most of the code stayed intact, except that the kernel/device functions introduced by the CUDA C version had to replace the Thrust function calls. While we could have kept the Thrust calls and only optimized the memory transfers, that would not have allowed us to compare a completely CUDA-implemented benchmark against the other versions. We expected that isolating bugs would be much easier once we had a working prototype, separating bugs related to index calculations, limits and thread flocking, and/or CUDA-specific bugs, from the host code. This worked well: once the Thrust version existed, a couple of bugs whose origins were unknown in the CUDA C version were fixed within an hour, because the prototype had proved the correctness of all the host code and any remaining errors had to be in the device code.

4.9 Conclusion

In this chapter, we showed, through practical use of the frameworks on an example benchmark, that:

1. A prototype is worth the time and effort, since it helps find and fix bugs.

2. A prototype is indeed optimizable afterwards and can be integrated into production-quality code, especially when using Thrust.


3. Some expectations cannot be predicted or proven in advance, even for the simplest forms of parallel programs, without a proof of concept (i.e. a prototype, in our study). Heuristics may help, and may be correct on some occasions, but they do not necessarily hit the goal every time, as we saw in the previous subsection.

We also saw that the use of a more productive language does not necessarily yield a productive work-flow and/or the performance level required of a parallel program. In addition, an optimization step, and hence an optimization work-flow, is indispensable after a working prototype has been achieved, to remove bottlenecks. We showed that the C/C++ frameworks, namely CUDA C and Thrust, are the best bet for any developer, even when the code is also required to run on multi-core CPUs (see the note below). During our research we learned that, by abstracting the device details, it is much easier to port to multiple devices using Thrust, and much easier to extend from a single GPU/node to multiple GPUs and multiple nodes using UVA and RDMA through CUDA-aware MPI calls, respectively; this is, of course, by using the GPUDirect umbrella framework provided by nVidia for its GPUs.

10We didn’t come up with a working example because it is straight forward to just change the Thrust GPU version of thebenchmark to use thrust::omp::device vector instead of using thrust::cuda::device vector or thrust::device vector. Previous versionsof Thrust allowed only one device per application by passing a compiler flag to compile the source code to a target framework whilecurrent versions of Thrust allows mix-matching several devices on a single application, thanks to its device-specific container types


Figure 4.1: Mapping Blackscholes to a CUDA GPU is a straightforward process, since it is an embarrassingly parallel problem.


Chapter 5

Conclusion, Findings and Lessons Learned


In conclusion, we learned a great deal from this exercise, using multiple frameworks and approaches to port a simple benchmark that is flexible enough to fit all frameworks without imposing a limitation of its own on any of them. In doing so, we were aiming to find the fastest possible way to prototype, and then to evolve the prototype into production-quality code through optimization and the use of complementary lower-level frameworks that provide finer control over many aspects of the application in question.

As a result, we contribute our findings as two prototyping work-flows, with a preference for starting with the Thrust version and then using CUDA C for optimization, and discouraging the other, Python-based, work-flow, along with some further contributions.

We also believe we achieved only partial success, since lifting developers away from the lower-level details of parallel and HPC processing is not completely achievable with currently available technologies. Research and effort in this direction appear throughout the literature, and we think this remains the most practical path towards achieving high-performance computing with the least programmer effort and resources.

5.1 Contributions and Findings

Our most apparent contributions are two prototyping work-flows and two optimization work-flows: one based on a throw-away prototype using Python-based frameworks, and one evolutionary prototype-to-optimized work-flow based on the C/C++ framework Thrust followed by the CUDA C Runtime API for the optimization step. We also contribute a survey of the available frameworks that target programmer productivity and abstraction from parallel device-specific details, and of why some may or may not work, based on the examples provided in their respective sources/documentation and on our experience as short-term practitioners in the field. We contributed a GPU version of PARSEC's Blackscholes benchmark that is easily portable to multiple GPUs. Finally, we provide a case study of CopperHead as a way to parallelize computation.

5.2 Future Work

We think more work is still needed in this direction, considering that some emerging techniques and technologies for achieving parallelism are on the verge of reaching the market, software vendors are still looking for solutions, and formal methods that could help achieve these goals are scarce; where formal tools do exist, they are mostly heavy artillery aimed at larger-scale HPC rather than being tuned towards small vendors that target mainstream programmers and end users.

So, we intend to continue our research and efforts to do the following:

1. A detailed look at Chapel [13] will be done, with possible formal-methods efforts being dedicated to this concise language.

2. Studying the not-yet-released OpenACC, since it bears the closest resemblance to Thrust in its simplicity, clarity and level of abstraction, but goes further: besides targeting heterogeneous devices (which Thrust does), it also targets massively parallel accelerators from different vendors rather than being nVidia-specific, as Thrust is right now.

3. More insight into the details and internals of Thrust, in an effort to understand it and propose new ways to make it produce more efficient runnables.

4. We intend to use the inspiration gained from the above to push the abstractions and programmer productivity further, and/or to come up with a way to formally verify and reason about the correctness of parallel programs, i.e. freedom from bugs and from violations of the memory models of such devices.

5. We believe that synthesizing parallel code, instead of letting programmers do the hard work, is a cleaner approach, since it not only provides concise constructs to express parallelism but also tends to produce better-performing binaries than library approaches; e.g. CopperHead produces much better-running binaries than Thrust does, especially if developers using the latter do not follow best practices or do not do enough research, profiling and manual work to achieve better performance. So, integrated parallelizing and auto-tuning compilers are one interesting direction towards achieving correct, or mostly correct, parallel code with less time and fewer resources.


References

[1] A. Klöckner, "PyCUDA." Website: http://mathema.tician.de/software/pycuda, Feb 2013.

[2] A. Klöckner, N. Pinto, Y. Lee, B. Catanzaro, P. Ivanov, and A. Fasih, "PyCUDA and PyOpenCL: A Scripting-Based Approach to GPU Run-Time Code Generation," Parallel Computing, vol. 38, no. 3, pp. 157–174, 2012.

[3] B. Catanzaro and M. Garland, "Copperhead: data parallel Python." Website: http://copperhead.github.io, Dec 2012.

[4] B. Catanzaro, M. Garland, and K. Keutzer, "Copperhead: Compiling an embedded data parallel language," in Proc. 16th ACM Symposium on Principles and Practice of Parallel Programming, PPoPP '11, (New York, NY, USA), pp. 47–56, ACM, 2011.

[5] L. Dalcin, "Tutorial - MPI for Python v1.3 documentation." Website: http://mpi4py.scipy.org/docs/usrman/tutorial.html, Feb 2013.

[6] J. Hoberock and N. Bell, "Thrust - parallel algorithms library." Website: http://thrust.github.com, Mar 2013.

[7] "PARSEC benchmark suite official website." Website: http://parsec.cs.princeton.edu, Sep 2012.

[8] C. Bienia, Benchmarking Modern Multiprocessors. PhD thesis, Princeton University, January 2011.

[9] M. Harris, "Six ways to SAXPY." Website: https://developer.nvidia.com/content/six-ways-saxpy, Dec 2012.

[10] Wikipedia, "OpenHMPP - hybrid multicore parallel programming." Website: http://en.wikipedia.org/wiki/OpenHMPP, Feb 2013.

[11] B. Gaster and L. Howes, "Can GPGPU programming be liberated from the data-parallel bottleneck?," Computer, vol. 45, pp. 42–52, August 2012.

[12] M. Wilde, M. Hategan, J. M. Wozniak, B. Clifford, D. S. Katz, and I. Foster, "The Swift parallel scripting language." Website: http://www.ci.uchicago.edu/swift/main/index.php, Feb 2013.

[13] A. Sidelnik, S. Maleki, B. Chamberlain, M. Garzaran, and D. Padua, "Performance portability with the Chapel language," in Parallel & Distributed Processing Symposium (IPDPS), 2012 IEEE 26th International, pp. 582–594, 2012.

[14] D. B. Kirk and W.-m. W. Hwu, Programming Massively Parallel Processors: A Hands-on Approach. San Francisco, CA, USA: Morgan Kaufmann Publishers, first ed., Feb 2010.

[15] M. Hall, "CS6235: Parallel programming for many-core architectures, course slides." Website: http://www.cs.utah.edu/~mhall/cs6235s12/, 2012.

[16] M. Harris, "How to overlap data transfers in CUDA C/C++." Website: https://developer.nvidia.com/content/how-overlap-data-transfers-cuda-cc, Dec 2012.

[17] V. Gordon and J. Bieman, "Rapid prototyping: lessons learned," IEEE Software, vol. 12, pp. 85–95, Jan. 1995.

[18] NVIDIA, "NVIDIA CUDA C Programming Guide." Website: http://developer.download.nvidia.com/compute/DevZone/docs/html/C/doc/CUDA_C_Programming_Guide.pdf, 2012.

[19] M. Harris, "How to optimize data transfers in CUDA C/C++." Website: https://developer.nvidia.com/content/how-optimize-data-transfers-cuda-cc, Dec 2012.

[20] J. Kraus, "An introduction to CUDA-aware MPI." Website: https://developer.nvidia.com/content/introduction-cuda-aware-mpi, Mar 2013.

[21] R. Farber, CUDA Application Design and Development. United States of America: Elsevier, Oct 2011.

[22] "Java bindings for CUDA." Website: http://www.jcuda.org, Feb 2013.

[23] "OpenACC." Website: http://www.openacc-standard.org, Mar 2013.

[24] J. Stratton, C. Rodrigues, I.-J. Sung, L.-W. Chang, N. Anssari, G. Liu, W. Hwu, and N. Obeid, "Algorithm and data optimization techniques for scaling to massively threaded systems," Computer, vol. 45, pp. 26–32, August 2012.

[25] S. Tzeng, B. Lloyd, and J. Owens, "A GPU task-parallel model with dependency resolution," Computer, vol. 45, pp. 34–41, August 2012.

[26] A. Tumeo, S. Secchi, and O. Villa, "Designing next-generation massively multithreaded architectures for irregular applications," Computer, vol. 45, pp. 53–61, August 2012.

[27] S. Sengupta, M. Harris, Y. Zhang, and J. D. Owens, "Scan primitives for GPU computing," in Proceedings of the 22nd ACM SIGGRAPH/EUROGRAPHICS Symposium on Graphics Hardware, GH '07, (Aire-la-Ville, Switzerland), pp. 97–106, Eurographics Association, 2007.

[28] Wikipedia, "Software prototyping." Website: http://en.wikipedia.org/wiki/Software_prototyping, Feb 2013.

[29] Wikipedia, "Metaprogramming." Website: http://en.wikipedia.org/wiki/Metaprogramming, Feb 2013.

[30] Wikipedia, "Closure (computer science)." Website: http://en.wikipedia.org/wiki/Closure_(computer_science), Feb 2013.


Appendices


APPENDIX A: Brain Map of the whole project report

Final Report

Title: Towards Rapid Prototyping of Parallel and HPC applications programming

NOTE: The original proposal is different, but the "changes story" below smooths the change by showing the maturation process of the project.

• Changes from proposal:
  - At the beginning I was interested in finding parallel patterns by parallelizing some PARSEC benchmarks, then discovering patterns and documenting them.
  - However, the interesting patterns are already known, and parallelizing compilers and frameworks exist that ensure correctness of programs and free the developer from tedious details.
  - Prototyping with them assures early understanding of the subtle details of the application at hand; transitioning from a prototype to a real, optimized implementation is then easy to achieve.
  - A workflow is proposed later.
  - Importance of rapid prototyping given the lack/absence of sufficient HPC and parallel-processing formal-methods aids.

• Frameworks evaluated:
  - CUDA C/C++: streams to increase overlap and speedup; single-GPU setup; multi-GPU setup.
  - jCuda.
  - PyCuda.
  - Copperhead: how it works (the internals); native executables based on frameworks (OpenMP, nvcc+thrust-cuda, TBB); thrust-related limitation.
  - mpi4py: multi-node GPU clusters.
  - Swift: heterogeneous?
  - The best is decided upon afterwards.

• Criteria-of-comparison discussion:
  - Why one feature is better than another for rapid prototyping.
  - Factors limiting productivity in the parallel and HPC world.
  - Statistics of performance and of time taken to develop a parallel/HPC application.
  - Actual comparison: big tabular feature-set comparison.
  - Why one is chosen over the other(s): jCuda vs PyCuda vs CUDA C/C++; PyCuda vs CopperHead (both Python based); Swift; others? why or why not.

• Proposed workflow from rapid prototyping to actual production code:
  - thrust; what Copperhead can or can't do, it can do perfectly.

• Practical evaluation:
  - CopperHead vs CUDA C (starting with CUDA C first):
    Blackscholes porting (easiest form of parallel application, embarrassingly parallel) using CUDA C: time taken; difficulty and bugs remaining/faced; scalability achieved.
    Blackscholes porting using CopperHead Prelude APIs: time taken; difficulty and bugs remaining/faced; scalability achieved.
    Budgeting resources: time allowed in real-world application development.
    Using CopperHead: constraints; Prelude APIs; data-structure interoperability with Python; managed heterogeneous memory spaces; limitations and the transition, and why it isn't a big winner.
  - CopperHead vs CUDA C (starting with a CopperHead prototype): a proposed additional evaluation after the above node is done with; an adventure after evaluation, developing a complex algorithm and then reporting findings as an application of prototyping before proceeding to actual production code.
    FAIL, for the following reasons (hence one more branch below this one, for thrust):
    1. Doesn't support multiple GPUs on a single host and hence isn't extensible.
    2. Produces target source code that is not exactly modifiable and/or readable by humans for further optimization or for transitioning to a more featured framework/device platform.
    3. It constitutes a "throwaway prototype" instead of our target evolutionary prototype.
  - "thrust" as a better abstraction instead of Copperhead:
    It has all Copperhead has; integrates seamlessly with all other CUDA APIs (1. most abstract: thrust; 2. second most abstract: CUDA C; 3. lowest API: CUDA Driver API); one version deployed everywhere using a compiler option; a lot of goodies (extensibility in an OO way; custom iterators, functors, etc.); totally interoperable with CUDA C.
  - thrust C++ APIs vs CUDA C (practical productivity comparison):
    1. Re-porting Blackscholes using thrust APIs.
    2. Optimizing using thrust APIs.
    3. Comparison with CUDA C optimized Blackscholes: productivity (LOC, time to implement and debug); performance comparison (very brief).


APPENDIX B: List of latest Running Benchmarks

In this Appendix, we list the latest (i.e. bug-free) versions of the running benchmarks. The remaining ones are kept only for reference and as history.

Paths for the final versions of the benchmarks, relative to the root directory of the project repository:

CUDA C: in the folder Benchmarks/blackscholes_cuda_adaptive2/.

CUDA C streamed: in the folder Benchmarks/blackscholes_cuda_adaptive_async2/.

Thrust: in the folder Benchmarks/blackscholes_thrust2.

CopperHead: in the file Benchmarks/blackscholes_copperhead/black_scholes_copperhead_sample.py.

CopperHead (failed implementation version): in the file Benchmarks/blackscholes_copperhead/blackscholes.py.

There are also a few other folders, including inputs and samples, not listed above. These hold, respectively, the input data sets, the samples used to understand how PyCUDA and CopperHead are used, our (failed) trials at moving CopperHead's output source code into an actual target C/C++ program, and some further exercises with the frameworks of choice.
