[Harvard CS264] 10b - cl.oquence: High-Level Language Abstractions for Low-Level Programming (Cyrus Omar, CMU)

High-Level Language Features for Low-Level Programming
Cyrus Omar, Computer Science Department, Carnegie Mellon University
http://www.cs.cmu.edu/~comar/

Uploaded by npinto on 28 Nov 2014.

http://cs264.org
http://bit.ly/gjQ3k7

TRANSCRIPT

Page 1:

High-Level Language Features for Low-Level Programming

Cyrus Omar, Computer Science Department, Carnegie Mellon University
http://www.cs.cmu.edu/~comar/

Page 2:

[Figure omitted: bar chart of "Reason given for use of programming language" vs "Number of respondents". Categories: Cross-platform compatibility, Features, Developer experience, Ease of use, Legacy, Performance, Favourite, Required, Only language known, Other.]

Figure 7: Reasons for Choice of Programming Language

[Figure omitted: bar chart of "Tool type" vs "Number of respondents". Categories: IDE, Version Control, Testing, Libraries/Packages, Build Tools, Bug/Change Tracking, Framework, Modelling.]

Figure 8: Types of tools used

they felt most IDEs imposed on their development activities'. However, four of the five projects which they studied targeted parallel supercomputer hardware platforms, which leads to different requirements amongst developers.

The lack of use of bug/change tracking tools and build tools may be explained by Morris [10], as these types of tools are not needed for prototype or throwaway code, which is commonplace in scientific software. Scientists often start programming with these types of code [10], and low usage of the aforementioned tools carries over when scientists move on to develop larger and more complex codes.

The survey also asked why the tools were used. A summary of the reasons given by the 46 respondents who answered this question and had indicated that tools or programs were used is shown in Figure 9. It should be noted that the reasons provided are a mixture of reasons for using a tool at all, and reasons for using a particular tool over another. Of special interest are the 12 respondents who described the use of version control as 'required' or 'mandatory' for large-scale, multi-developer, and/or distributed development.

[Figure omitted: bar chart of "Reason for use of tools" vs "Number of respondents (out of 46)". Categories: Improve Ease of Coding, Version Control is 'Required', Features, Project Organisation, Cost (or lack thereof) of Tool, Open Source, Tool is Easy to Use, Using Tool is 'Standard'.]

Figure 9: Reasons why tools are used

3.4 Development Teams and User Base Characteristics

[Figure omitted: bar chart of development team size (Single Person, Small Team (2-6 people), Large Team (7-12 people), Larger Team (more than 12 people)) vs "Number of respondents", with frequency ratings Never, Rarely, Sometimes, Often, Always.]

Figure 10: Development team sizes

As can be seen in Figure 10, most of the survey respondents develop software either alone or in a small team of developers. Few of the respondents often or always develop software in teams comprising seven or more members. Due to limitations in the survey software used, the rating for each category of development team size was independent of the others. For example, a single respondent could answer 'always' for all four categories of development team size, although conceptually this does not make sense.

From Figure 11, there is a slight tendency for the intended

user base size to be towards individual and small group user base sizes compared to larger user bases. There is also a very slight tendency towards user bases being comprised of either only users with programming experience or both users with and without programming experience, as shown in Figure 12.

3.5 Documentation

Figure 13 shows the number of respondents who indicated they produce certain types of documentation. The most common type of documentation produced by respondents was comments in the code, selected by 51 out of 60 respondents. At the other end of the scale, requirements documentation is the least commonly produced type of documentation, with only 18 respondents indicating that they commonly produce such artifacts. The comparative lack of documentation for requirements

ment group) or letting the users/stakeholders know how the software works (open source, scientific paper publication).

3.8 Non-functional requirements

The respondents were asked to rate a series of non-functional requirements on the following Likert scale:

1. very unimportant

2. unimportant

3. neither

4. important

5. very important

This scale was chosen so that the relative importance of non-functional requirements could be determined from respondents' answers. A straight ranking of non-functional requirements would only indicate how important respondents considered each non-functional requirement in comparison to others, but would not provide any information regarding how important a non-functional requirement was overall. The neutral response of 'neither' was included as some respondents may not consider a non-functional requirement or are unaware of it.

Non-functional requirements from the Software Requirements Specification Data Item described in United States Military-Standard-498 [1] were used and are as follows:

1. Functionality (the ability to perform all required functions)

2. Reliability (the ability to perform with correct, consistent results)

3. Maintainability (the ability to be easily corrected)

4. Availability (the ability to be accessed and operated when needed)

5. Flexibility (the ability to be easily adapted to changing requirements)

6. Portability (the ability to be easily modified for a new software/computing environment)

7. Reusability (the ability to be used in multiple applications)

8. Testability (the ability to be easily and thoroughly tested)

9. Usability (the ability to be easily learned and used)

To this list, two more non-functional requirements were added:

10. Traceability (the ability to link the knowledge used to create the application through to the code and the output)

11. Performance (the ability to run using a minimum of time and/or memory)

[Figure omitted: stacked bar chart of ratings (Very Unimportant, Unimportant, Neither, Important, Very Important) as a percentage of respondents, for Reliability, Functionality, Usability, Availability, Flexibility, Performance*, Portability, Testability, Maintainability, Traceability*, Reusability.]

Figure 18: Importance of non-functional requirements as rated by respondents

Table 1: Combined important and very important ratings for non-functional requirements

Ranking  Requirement      Combined Important and Very Important Ratings (%)
1        Reliability      100
2        Functionality     95
3        Maintainability   90
4        Availability      87
5        Performance*      79
6        Flexibility       77
7        Testability       75
8        Usability         63
9        Reusability       62
10       Traceability*     54
11       Portability       52

These two additional non-functional requirements were added based on the responses from the initial pilot survey identified in section 2. The descriptions of each non-functional requirement were provided in the survey.

Figure 18 shows the rated importance of the non-functional requirements as a percentage of total responses, ranked in order of very important ratings. Table 1 lists the non-functional requirements in descending order of combined important and very important ratings. All non-functional requirements were rated by 60 respondents, with the exception of traceability and performance (which are marked by a *), which were rated by 52 respondents.

Reliability was considered to be the most important non-functional requirement overall, with 83% of respondents rating it as very important, and the remainder all rating it as important. Functionality also rated very highly, with 65% rating it as very important and 30% rating it as important. These two results corroborate previous results from Kelly and Sanders [7], in which 'the singular importance of correctness' for scientific software was identified, and Carver et al. [4], where the most highly ranked project goal was correctness.
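The combined rating used to rank Table 1 is simply the sum of the 'important' and 'very important' response shares. A quick illustration, with raw counts reconstructed from the percentages quoted above (assuming the stated 60 respondents):

```python
def combined_rating(very_important, important, total):
    """Percentage of respondents rating a requirement important or very important."""
    return round(100 * (very_important + important) / total)

# Reliability: 83% of 60 respondents (50 people) very important, remaining 10 important.
# Functionality: 65% (39 people) very important, 30% (18 people) important.
print(combined_rating(50, 10, 60))  # 100, matching Table 1
print(combined_rating(39, 18, 60))  # 95, matching Table 1
```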

Portability received the highest number of unimportant ratings for any non-functional requirement (11), and the lowest combined proportion of important and very important

The Needs of Scientists and Engineers

[Nguyen-Hoan et al, 2010]

Page 3:

C, Fortran, CUDA, OpenCL

Fast:
• Control over memory allocation
• Control over data movement
• Access to hardware primitives
• Portability

Tedious:
• Type annotations, templates, pragmas
• Obtuse compilers, linkers, preprocessors
• No support for high-level abstractions

MATLAB, Python, R, Perl

Productive:
• Low syntactic overhead
• Read-eval-print loop (REPL)
• Flexible data structures and abstractions
• Nice development environments

Slow:
• Dynamic lookups and indirection abound
• Automatic memory management can cause problems

Scientists relieve the tension by:
• writing overall control flow and basic data analysis routines in a high-level language
• calling into a low-level language for performance-critical sections (can be annoying)

The State of Scientific Programming Today
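The division of labor described above can be sketched in a few lines. Here numpy's compiled elementwise kernels stand in for the hand-written low-level section; the snippet is illustrative, not from the talk:

```python
import numpy as np

def simulate_step(state, dt=0.1):
    # Performance-critical arithmetic is dispatched to numpy's compiled C
    # kernels -- the same role hand-written C/OpenCL plays in the slides.
    return state + dt * np.sin(state)

# High-level control flow stays in plain Python, easy to read and modify.
state = np.zeros(4, dtype=np.float32)
for _ in range(3):                 # outer loop in Python
    state = simulate_step(state)   # inner elementwise work runs in C
```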

Page 4:

C, Fortran, CUDA, OpenCL

Fast:
• Control over memory allocation
• Control over data movement
• Access to hardware primitives
• Portability

Tedious:
• Type annotations, templates, pragmas
• Obtuse compilers, linkers, preprocessors
• No support for high-level abstractions

MATLAB, Python, R, Perl

Productive:
• Low syntactic overhead
• Read-eval-print loop (REPL)
• Flexible data structures and abstractions
• Nice development environments

Slow:
• Dynamic lookups and indirection abound
• Automatic memory management can cause problems

Scientists relieve any remaining tension by:
• writing overall control flow and basic data analysis routines in a high-level language
• calling into cl.oquence for performance-critical sections (can be annoying)

The State of Scientific Programming Tomorrow

Page 5:

What is cl.oquence?

A low-level programming language that maps closely onto, and compiles down to, OpenCL.

What is OpenCL?

OpenCL is an emerging standard for low-level programming in heterogeneous computing environments. It is designed as a library that can be used from a variety of higher-level languages.

What is a heterogeneous computing environment?

A heterogeneous computing environment is an environment where many different compute devices and address spaces are available. Devices can include multi-core CPUs (using a variety of instruction sets), GPUs, hybrid-core processors like the Cell BE, and other specialized accelerators.

Why should I use cl.oquence?

• Same core type system (including pointers) and performance profile as OpenCL
• Usable from any host language that has OpenCL bindings
• Type inference and extension inference eliminate annotational burden
• Simplified syntax is a subset of Python, can use existing tools
• Structural polymorphism gives you generic programming by default
• New features:
  • Higher-order functions
  • Default arguments for functions
• Python as the preprocessor and module system
• Rich support for compile-time metaprogramming
• Write compiler extensions, new basic types as libraries; modular, clean design
• Light-weight and easy to integrate into any build process
• Packaged with special Python host bindings that eliminate even basic overhead when using from within Python
• Built on top of pyopencl and numpy
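The "type inference and structural polymorphism" bullets can be illustrated with a plain-Python analogy. This is a toy sketch, not cl.oquence's actual API: one generic definition, with a concrete variant recorded per argument-type signature on first use.

```python
compiled = {}  # type signature -> specialized variant (stand-in for compiled code)

def kernel(fn):
    """Decorator: specialize fn per argument-type signature, caching each variant."""
    def call(*args):
        sig = tuple(type(a).__name__ for a in args)
        if sig not in compiled:
            compiled[sig] = fn  # a real compiler would emit typed OpenCL here
        return compiled[sig](*args)
    return call

@kernel
def add(a, b):
    return a + b  # works for any argument types supporting +

add(1, 2)
add(1.5, 2.5)
print(sorted(compiled))  # two specializations: ('float','float') and ('int','int')
```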


Page 9:

OpenCL

// Parallel elementwise sum
__kernel void sum(__global float* a, __global float* b,
                  __global float* dest) {
    size_t gid = get_global_id(0); // Get thread index
    dest[gid] = a[gid] + b[gid];
}

__kernel void sum(__global int* a, __global int* b,
                  __global int* dest) {
    size_t gid = get_global_id(0);
    dest[gid] = a[gid] + b[gid];
}

__kernel void sum(__global short* a, __global int* b,
                  __global int* dest) {
    size_t gid = get_global_id(0);
    dest[gid] = a[gid] + b[gid];
}

__kernel void sum(__global float* a, __global double* b,
                  __global float* dest) {
    #pragma ...
    size_t gid = get_global_id(0);
    dest[gid] = a[gid] + b[gid];
}

...

// Parallel elementwise product
__kernel void prod(__global float* a, __global float* b,
                   __global float* dest) {
    size_t gid = get_global_id(0); // Get thread index
    dest[gid] = a[gid] * b[gid];
}

__kernel void prod(__global float* a, __global float* b,
                   __global float* dest) {
    size_t gid = get_global_id(0); // Get thread index
    dest[gid] = a[gid] + b[gid];
}

Page 10:

OpenCL

// Parallel elementwise sum
__kernel void sum_ff(__global float* a, __global float* b,
                     __global float* dest) {
    size_t gid = get_global_id(0); // Get thread index
    dest[gid] = a[gid] + b[gid];
}

__kernel void sum_ii(__global int* a, __global int* b,
                     __global int* dest) {
    size_t gid = get_global_id(0);
    dest[gid] = a[gid] + b[gid];
}

__kernel void sum(__global short* a, __global int* b,
                  __global int* dest) {
    size_t gid = get_global_id(0);
    dest[gid] = a[gid] + b[gid];
}

__kernel void sum(__global float* a, __global double* b,
                  __global float* dest) {
    #pragma ...
    size_t gid = get_global_id(0);
    dest[gid] = a[gid] + b[gid];
}

...

// Parallel elementwise product
__kernel void prod(__global float* a, __global float* b,
                   __global float* dest) {
    size_t gid = get_global_id(0); // Get thread index
    dest[gid] = a[gid] * b[gid];
}

__kernel void prod(__global float* a, __global float* b,
                   __global float* dest) {
    size_t gid = get_global_id(0); // Get thread index
    dest[gid] = a[gid] + b[gid];
}

Page 12:

OpenCL

// Parallel elementwise sum
__kernel void sum_ff(__global float* a, __global float* b,
                     __global float* dest) {
    size_t gid = get_global_id(0); // Get thread index
    dest[gid] = a[gid] + b[gid];
}

__kernel void sum_ii(__global int* a, __global int* b,
                     __global int* dest) {
    size_t gid = get_global_id(0);
    dest[gid] = a[gid] + b[gid];
}

__kernel void sum_fi(__global float* a, __global int* b,
                     __global float* dest) {
    size_t gid = get_global_id(0);
    dest[gid] = a[gid] + b[gid];
}

__kernel void sum_di(__global double* a, __global int* b,
                     __global double* dest) {
    #pragma OPENCL EXTENSION cl_khr_fp64 : enable
    size_t gid = get_global_id(0);
    dest[gid] = a[gid] + b[gid];
}

...

// Parallel elementwise product
__kernel void prod(__global float* a, __global float* b,
                   __global float* dest) {
    size_t gid = get_global_id(0); // Get thread index
    dest[gid] = a[gid] * b[gid];
}

__kernel void prod(__global float* a, __global float* b,
                   __global float* dest) {
    size_t gid = get_global_id(0); // Get thread index
    dest[gid] = a[gid] + b[gid];
}
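The suffixes on the kernels above (sum_ff, sum_ii, sum_fi, sum_di) follow an obvious name-mangling convention: one letter per input element type. A hypothetical helper in that spirit; the scheme is inferred from the slide's kernel names, not taken from cl.oquence's source:

```python
# Abbreviations inferred from the kernel names shown in the slides.
ABBREV = {"float": "f", "int": "i", "double": "d", "short": "s"}

def mangle(name, arg_types):
    """Build a monomorphic kernel name, e.g. ('float', 'int') -> 'sum_fi'."""
    return name + "_" + "".join(ABBREV[t] for t in arg_types)

print(mangle("sum", ["float", "float"]))  # sum_ff
print(mangle("sum", ["double", "int"]))   # sum_di
```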


Page 15:

OpenCL

// Parallel elementwise sum
__kernel void sum_ff(__global float* a, __global float* b,
                     __global float* dest) {
    size_t gid = get_global_id(0); // Get thread index
    dest[gid] = a[gid] + b[gid];
}

__kernel void sum_ii(__global int* a, __global int* b,
                     __global int* dest) {
    size_t gid = get_global_id(0);
    dest[gid] = a[gid] + b[gid];
}

__kernel void sum_fi(__global float* a, __global int* b,
                     __global float* dest) {
    size_t gid = get_global_id(0);
    dest[gid] = a[gid] + b[gid];
}

__kernel void sum_df(__global double* a, __global int* b,
                     __global double* dest) {
    #pragma OPENCL EXTENSION cl_khr_fp64 : enable
    size_t gid = get_global_id(0);
    dest[gid] = a[gid] + b[gid];
}

...

// Parallel elementwise product
__kernel void prod(__global float* a, __global float* b,
                   __global float* dest) {
    size_t gid = get_global_id(0); // Get thread index
    dest[gid] = a[gid] * b[gid];
}

__kernel void prod(__global float* a, __global float* b,
                   __global float* dest) {
    size_t gid = get_global_id(0); // Get thread index
    dest[gid] = a[gid] + b[gid];
}

...

// Parallel elementwise product
__kernel void prod_ff(__global float* a, __global float* b,
                      __global float* dest) {
    size_t gid = get_global_id(0); // Get thread index
    dest[gid] = a[gid] * b[gid];
}

__kernel void prod_ii(__global int* a, __global int* b,
                      __global int* dest) {
    size_t gid = get_global_id(0);
    dest[gid] = a[gid] + b[gid];
}

...


Page 17:

My photographs tell stories of loss, human struggle, and personal exploration within landscapes scarred by technology and over-use… [I] strive to metaphorically and poetically link laborious actions, idiosyncratic rituals and strangely crude machines into tales about our modern experience.

Robert ParkeHarrison

Page 18: [Harvard CS264] 10b - cl.oquence: High-Level Language Abstractions for Low-Level Programming (Cyrus Omar, CMU)

@cl.oquence.fn
def plus(a, b):
    '''Adds the two operands.'''
    return a + b

@cl.oquence.fn
def mul(a, b):
    '''Multiplies the two operands.'''
    return a * b

Page 19: [Harvard CS264] 10b - cl.oquence: High-Level Language Abstractions for Low-Level Programming (Cyrus Omar, CMU)

@cl.oquence.fn
def ew_op(a, b, dest, op):
    '''Parallel elementwise binary operation.'''
    gid = get_global_id(0)  # Get thread index
    dest[gid] = op(a[gid], b[gid])
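The generic kernel reads like ordinary Python. As a way to see exactly what it computes, here is a plain sequential Python analogue (a sketch only; cl.oquence compiles the real thing to OpenCL, this just mimics the semantics on the CPU):

```python
# Sequential stand-in for the generic kernel: the body runs once per
# index gid, and op is an ordinary function argument, so one definition
# covers every operator and element type.
def ew_op(a, b, dest, op):
    for gid in range(len(dest)):
        dest[gid] = op(a[gid], b[gid])

def plus(x, y):
    return x + y

def mul(x, y):
    return x * y

a = [1.0, 2.0, 3.0]
b = [4.0, 5.0, 6.0]
dest = [0.0] * 3
ew_op(a, b, dest, plus)
print(dest)  # [5.0, 7.0, 9.0]
ew_op(a, b, dest, mul)
print(dest)  # [4.0, 10.0, 18.0]
```

One higher-order function replaces the whole hand-written `sum_*`/`prod_*` family on the left.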

Page 23: [Harvard CS264] 10b - cl.oquence: High-Level Language Abstractions for Low-Level Programming (Cyrus Omar, CMU)

These two libraries express the same thing. The code will run in precisely the same amount of time.

Page 24: [Harvard CS264] 10b - cl.oquence: High-Level Language Abstractions for Low-Level Programming (Cyrus Omar, CMU)

Two invocation models
1. Standalone compilation to OpenCL

• Use any host language that has OpenCL bindings available
  • C
  • C++
  • Fortran
  • MATLAB
  • Java
  • .NET
  • Ruby
  • Python

# Programmatically specialize and assign types to
# any externally callable versions you need.

sum = ew_op.specialize(op=plus)
prod = ew_op.specialize(op=mul)

g_int_p = cl_int.global_ptr
g_float_p = cl_float.global_ptr

sum_ff = sum.compile(g_float_p, g_float_p, g_float_p)
sum_ii = sum.compile(g_int_p, g_int_p, g_int_p)
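The specialize step is essentially partial application: fixing `op` turns the one generic function into the monomorphic family that OpenCL forced us to write by hand. A framework-free sketch of that idea, with `functools.partial` standing in for `.specialize` (illustrative only, not cl.oquence's implementation):

```python
from functools import partial

def ew_op(a, b, dest, op):
    # generic elementwise kernel, interpreted sequentially here
    for gid in range(len(dest)):
        dest[gid] = op(a[gid], b[gid])

# fixing op yields the sum/prod variants without rewriting the loop
sum_op = partial(ew_op, op=lambda x, y: x + y)
prod_op = partial(ew_op, op=lambda x, y: x * y)

dest = [0] * 3
sum_op([1, 2, 3], [4, 5, 6], dest)
print(dest)  # [5, 7, 9]
prod_op([1, 2, 3], [4, 5, 6], dest)
print(dest)  # [4, 10, 18]
```

In cl.oquence the same move also drives type specialization: `.compile` fixes the argument types, playing the role the `_ff`/`_ii` suffixes played before.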

Page 25: [Harvard CS264] 10b - cl.oquence: High-Level Language Abstractions for Low-Level Programming (Cyrus Omar, CMU)

clqcc hello.clq

creates hello.cl:

__kernel void sum_ff(__global float* a, __global float* b,
                     __global float* dest) {
    size_t gid = get_global_id(0);
    dest[gid] = a[gid] + b[gid];
}

__kernel void sum_ii(__global int* a, __global int* b,
                     __global int* dest) {
    size_t gid = get_global_id(0);
    dest[gid] = a[gid] + b[gid];
}

Page 26: [Harvard CS264] 10b - cl.oquence: High-Level Language Abstractions for Low-Level Programming (Cyrus Omar, CMU)

# allocate two random arrays that we will be adding
a = numpy.random.rand(50000).astype(numpy.float32)
b = numpy.random.rand(50000).astype(numpy.float32)

# transfer data to device
ctx = cl.ctx = cl.Context.for_device(0, 0)
a_buf = ctx.to_device(a)
b_buf = ctx.to_device(b)
dest_buf = ctx.alloc(like=a)

# invoke function (automatically specialized as needed)
ew_op(a_buf, b_buf, dest_buf, plus,
      global_size=a.shape, local_size=(256,)).wait()

# get results
result = ctx.from_device(dest_buf)

# check results
print la.norm(result - (a + b))
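To make the host-side flow concrete, here is a CPU-only stand-in for the round trip above (pure Python sketch: "transfers" become list copies and the kernel is the elementwise sum, so the final norm check comes out exactly zero, mirroring what the device computes):

```python
import math
import random

a = [random.random() for _ in range(1000)]
b = [random.random() for _ in range(1000)]

a_buf = list(a)                    # to_device: copy host -> "device"
b_buf = list(b)
dest_buf = [0.0] * len(a)          # alloc(like=a): empty destination

for gid in range(len(dest_buf)):   # ew_op(..., plus): one "thread" per gid
    dest_buf[gid] = a_buf[gid] + b_buf[gid]

result = list(dest_buf)            # from_device: copy back to host

# same check as above: L2 norm of the residual
err = math.sqrt(sum((r - (x + y)) ** 2 for r, x, y in zip(result, a, b)))
print(err)  # 0.0
```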

Two invocation models
1. Standalone compilation to OpenCL
2. Integrated into a host language

• Python + pyopencl (w/extensions) + numpy

Page 27: [Harvard CS264] 10b - cl.oquence: High-Level Language Abstractions for Low-Level Programming (Cyrus Omar, CMU)

Four simple memory management functions
1. to_device: numpy array => new buffer
2. from_device: buffer => new numpy array
3. alloc: empty buffer
4. copy: copies between existing buffers or arrays

Buffers hold metadata (type, shape, order) so you don’t have to provide it.
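A tiny sketch of why metadata-carrying buffers help (the `Buffer` class and `to_device` below are hypothetical stand-ins, not cl.oquence's actual classes): if the buffer remembers its element type and shape, the runtime can pick the right specialization and launch size without being told again.

```python
class Buffer:
    """Device-buffer stand-in that carries its own metadata."""
    def __init__(self, data, dtype, shape):
        self.data = data
        self.dtype = dtype
        self.shape = shape

def to_device(array):
    # infer type and shape from the host array; the caller never repeats them
    return Buffer(list(array), type(array[0]).__name__, (len(array),))

buf = to_device([1.0, 2.0, 3.0])
print(buf.dtype, buf.shape)  # float (3,)
```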

Page 28: [Harvard CS264] 10b - cl.oquence: High-Level Language Abstractions for Low-Level Programming (Cyrus Omar, CMU)

Implicit queue associated with each context.

Page 29: [Harvard CS264] 10b - cl.oquence: High-Level Language Abstractions for Low-Level Programming (Cyrus Omar, CMU)

# invoke function (automatically specialized as needed)
ew_op(a_buf, b_buf, dest_buf, plus).wait()

# get results
result = ctx.from_device(dest_buf)

# check results
print la.norm(result - (a + b))

@cl.oquence.auto(lambda a, b, dest, op: a.shape, (256,))
@cl.oquence.fn
def ew_op(a, b, dest, op):
    '''Parallel elementwise binary operation.'''
    gid = get_global_id(0)  # Get thread index
    dest[gid] = op(a[gid], b[gid])

The auto annotation can allow you to hide the details of parallelization from the user.
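The shape of that decorator can be sketched without any OpenCL at all: a wrapper derives `global_size` from the arguments (the same `lambda a, b, dest, op: a.shape` idea), so callers never pass launch parameters. Hypothetical, framework-free version:

```python
def auto(global_size_fn, local_size):
    """Attach a launch-configuration rule to a kernel-like function."""
    def wrap(fn):
        def launched(*args):
            # compute the launch size from the arguments themselves
            return fn(*args, global_size=global_size_fn(*args),
                      local_size=local_size)
        return launched
    return wrap

@auto(lambda a, b, dest, op: (len(a),), (256,))
def ew_op(a, b, dest, op, global_size=None, local_size=None):
    for gid in range(global_size[0]):   # sequential stand-in for the launch
        dest[gid] = op(a[gid], b[gid])

dest = [0] * 3
ew_op([1, 2, 3], [4, 5, 6], dest, lambda x, y: x + y)  # no sizes at the call
print(dest)  # [5, 7, 9]
```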

Page 30: [Harvard CS264] 10b - cl.oquence: High-Level Language Abstractions for Low-Level Programming (Cyrus Omar, CMU)

# allocate two random arrays that we will be adding
a = numpy.random.rand(50000).astype(numpy.float32)
b = numpy.random.rand(50000).astype(numpy.float32)
c = numpy.empty_like(a)

# create an OpenCL context
ctx = cl.ctx = cl.Context.for_device(0, 0)

# invoke function (automatically specialized as needed)
ew_op(In(a), In(b), Out(c), plus).wait()

# check results
print la.norm(c - (a + b))

The In, Out and InOut constructs can help automate data movement when convenient.
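The mechanism behind such markers can be sketched in plain Python (hypothetical `In`/`Out` classes and `launch` helper, not cl.oquence's actual API): wrapping an argument records the copy direction, and the runtime moves data automatically around the call.

```python
class In:
    """Marks an argument to copy host -> device before the call."""
    def __init__(self, array):
        self.array = array

class Out:
    """Marks an argument to copy device -> host after the call."""
    def __init__(self, array):
        self.array = array

def launch(kernel, *args):
    # allocate/copy "device" storage according to each marker
    device = [list(a.array) if isinstance(a, In) else [0.0] * len(a.array)
              for a in args]
    kernel(*device)
    for arg, dev in zip(args, device):   # copy Out buffers back to the host
        if isinstance(arg, Out):
            arg.array[:] = dev

def sum_kernel(a, b, dest):
    for gid in range(len(dest)):
        dest[gid] = a[gid] + b[gid]

c = [0.0, 0.0]
launch(sum_kernel, In([1.0, 2.0]), In([3.0, 4.0]), Out(c))
print(c)  # [4.0, 6.0]
```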

Page 32: [Harvard CS264] 10b - cl.oquence: High-Level Language Abstractions for Low-Level Programming (Cyrus Omar, CMU)

OpenCL

@cl.oquence.fn
def ew_op(a, b, dest, op):
    '''Parallel elementwise binary operation.'''
    gid = get_global_id(0)  # Get thread index
    dest[gid] = op(a[gid], b[gid])

@cl.oquence.fn
def plus(a, b):
    '''Adds the two operands.'''
    return a + b

@cl.oquence.fn
def mul(a, b):
    '''Multiplies the two operands.'''
    return a * b

// Parallel elementwise sum
__kernel void sum_ff(__global float* a, __global float* b,
                     __global float* dest) {
    size_t gid = get_global_id(0);  // Get thread index
    dest[gid] = a[gid] + b[gid];
}

__kernel void sum_ii(__global int* a, __global int* b,
                     __global int* dest) {
    size_t gid = get_global_id(0);
    dest[gid] = a[gid] + b[gid];
}

__kernel void sum_fi(__global float* a, __global int* b,
                     __global float* dest) {
    size_t gid = get_global_id(0);
    dest[gid] = a[gid] + b[gid];
}

#pragma OPENCL EXTENSION cl_khr_fp64 : enable
__kernel void sum_df(__global double* a, __global float* b,
                     __global double* dest) {
    size_t gid = get_global_id(0);
    dest[gid] = a[gid] + b[gid];
}

...


How?

• cl.oquence code looks like Python, but it is not interpreted as Python!
• Same core type system as OpenCL (C99+)
• Type inference to eliminate type annotations (not dynamic lookups)
• Extension inference to eliminate pragmas
• Higher-order functions (inlined at compile time)
• Structural polymorphism
• All functions are generic by default
• You can call a function with any arguments that support the operations it uses.
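One way to picture "generic by default" with no dynamic lookups is monomorphization: a single generic source template is specialized once per concrete argument-type combination, at compile time. A rough Python sketch of that idea (this is not cl.oquence's actual implementation; the template, cache, and naming scheme are invented for illustration):

```python
# Specialize a generic elementwise kernel template for concrete types,
# memoizing so each (function, type) pair is compiled exactly once --
# a toy model of compile-time monomorphization.
TEMPLATE = """__kernel void {name}(__global {t}* a, __global {t}* b,
                     __global {t}* dest) {{
    size_t gid = get_global_id(0);
    dest[gid] = a[gid] {op} b[gid];
}}"""

_cache = {}

def specialize(base, op, t):
    """Generate (and memoize) one concrete kernel per element type."""
    key = (base, t)
    if key not in _cache:
        # Hypothetical naming scheme: sum + float -> sum_ff, etc.
        _cache[key] = TEMPLATE.format(name=base + "_" + t[0] * 2, op=op, t=t)
    return _cache[key]

print(specialize("sum", "+", "float").splitlines()[0])
```

The programmer writes the generic version once; the compiler, not the programmer, enumerates the type combinations that are actually used.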

Page 33: [Harvard CS264] 10b - cl.oquence: High-Level Language Abstractions for Low-Level Programming (Cyrus Omar, CMU)


Page 34: [Harvard CS264] 10b - cl.oquence: High-Level Language Abstractions for Low-Level Programming (Cyrus Omar, CMU)


Page 35: [Harvard CS264] 10b - cl.oquence: High-Level Language Abstractions for Low-Level Programming (Cyrus Omar, CMU)


With either invocation model, Python is now your preprocessor

• Functions can be programmatically generated from source or ASTs
• You're using Python's well-designed module system instead of the #include system (!)
• Use distutils, PyPI, and so on
• The syntax is a subset of Python
  • Same source-code highlighters
  • Use standard documentation generators
• You can write compiler and language extensions as libraries
• Bonus: default values for function arguments
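The first bullet, generating functions programmatically from source or ASTs, can be sketched with plain Python string building plus the standard `ast` module. The helper name is hypothetical, and the final `@cl.oquence.fn` compilation step is omitted:

```python
import ast
import textwrap

def make_ew_source(name, op):
    """Build the source of an elementwise kernel as an ordinary Python
    string; because cl.oquence syntax is a subset of Python, the result
    can be parsed and manipulated with the standard ast module."""
    return textwrap.dedent("""
        def {name}(a, b, dest):
            gid = get_global_id(0)
            dest[gid] = a[gid] {op} b[gid]
    """.format(name=name, op=op)).strip()

src = make_ew_source("ew_sum", "+")
tree = ast.parse(src)          # valid Python syntax, so ast.parse accepts it
print(tree.body[0].name)       # -> ew_sum
```

Because Python itself is the preprocessor, all of Python's tooling (modules, packaging, AST transforms) applies to kernel generation for free.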

Page 36: [Harvard CS264] 10b - cl.oquence: High-Level Language Abstractions for Low-Level Programming (Cyrus Omar, CMU)


Downsides (and open projects: email me!)
• No current support for graphical debuggers
  • You can, however, optionally include line numbers from the original file in comments
  • The mapping is close enough that debugging is not typically a problem, and the generated source code is formatted nicely
• Calling non-cl.oquence OpenCL libraries requires writing an explicit extern directive:
  • ext_import("library.cl")
  • ext_function = extern(cl.void, cl_int, ...)

Page 37: [Harvard CS264] 10b - cl.oquence: High-Level Language Abstractions for Low-Level Programming (Cyrus Omar, CMU)

Neurobiological Circuit Simulations

from cl.egans import Simulation
from ahh.cl.egans.spiking.models import ReducedLIF
from ahh.cl.egans.spiking.inputs import ExponentialSynapse

sim = Simulation(ctx,
    n_realizations=1,
    n_timesteps=10000,
    DT=0.1)

# Create 4000 LIF neurons
N_Exc = 3200
N_Inh = 800
N = N_Exc + N_Inh
neurons = ReducedLIF(sim, "LIF",
    count=N,
    tau=20.0,
    v_reset=0.0,
    v_thresh=10.0,
    abs_refractory_period=5.0)

# Create excitatory and inhibitory synapses
e_synapse = ExponentialSynapse(neurons, 'ge',
    tau=5.0,
    reversal=60.0)

...

sim.generate()
print sim.code

@cl.oquence.fn
def step_fn(timestep, realization_start):
    gid = get_global_id(0)
    gsize = get_global_size(0)
    first_idx_sim = realization_start * 4000
    last_idx_sim = min(first_idx_sim + 4000, 4000)
    for idx_sim in (first_idx_sim + gid, last_idx_sim, gsize):
        realization_num = idx_sim / 4000
        realization_first_idx_sim = realization_num * 4000
        realization_first_idx_div = (realization_num -
            realization_start) * 4000
        idx_realization = idx_sim - realization_first_idx_sim
        idx_division = idx_sim - first_idx_sim
        idx_model = idx_realization - 0
        idx_state = idx_model + (realization_num -
            realization_start) * 4000
        LIF_v = LIF_v_buffer[idx_state]
        # ...
        if v_new >= 10.0:
            LIF_v_buffer[idx_state] = 0.0
            target = LIF_ge_AtomicReceiver_out if idx_model < 3200 \
                else LIF_gi_AtomicReceiver_out
            neighbors_offset = neighbor_data[idx_realization]
            neighbor_size = neighbor_data[neighbors_offset]
            neighbors = neighbor_data + neighbors_offset + 1
            for i in (0, neighbor_size, 1):
                atom_add(target + realization_first_idx_div +
                    neighbors[i], 1)
        else:
            # ...

A modular simulation architecture that uses compile-time code generation to avoid the typical performance penalties.
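The pattern behind this, assembling the per-timestep function from independent model components at compile time so that modularity costs nothing at run time, can be sketched roughly as follows. The component API and names here are invented for illustration, not cl.egans's actual interface:

```python
# Compile-time composition: each model component contributes a source
# fragment, and the generated step function is one flat body with no
# run-time indirection between components.
class Component(object):
    def emit_update(self):
        raise NotImplementedError

class LeakyIntegrator(Component):
    def __init__(self, tau):
        self.tau = tau
    def emit_update(self):
        # dv/dt = -v / tau, forward Euler with timestep DT
        return "v = v - DT * v / %s" % self.tau

class ConstantInput(Component):
    def __init__(self, current):
        self.current = current
    def emit_update(self):
        return "v = v + DT * %s" % self.current

def generate_step(components):
    """Concatenate every component's update into one generated function."""
    body = "\n    ".join(c.emit_update() for c in components)
    return "def step_fn(v, DT):\n    %s\n    return v" % body

print(generate_step([LeakyIntegrator(20.0), ConstantInput(1.5)]))
```

A class-based design would pay a virtual-dispatch cost inside the inner loop every timestep; generating the loop body once, before compilation, keeps the abstraction without the overhead.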

Page 38: [Harvard CS264] 10b - cl.oquence: High-Level Language Abstractions for Low-Level Programming (Cyrus Omar, CMU)

C, Fortran, CUDA, OpenCL

Fast
• Control over memory allocation
• Control over data movement
• Access to hardware primitives
• Portability

Tedious
• Type annotations, templates, pragmas
• Obtuse compilers, linkers, preprocessors
• No support for high-level abstractions

MATLAB, Python, R, Perl

Productive
• Low syntactic overhead
• Read-eval-print loop (REPL)
• Flexible data structures and abstractions
• Nice development environments

Slow
• Dynamic lookups and indirection abound
• Automatic memory management can cause problems

Scientists relieve any remaining tension by:
• writing overall control flow and basic data analysis routines in a high-level language
• calling into cl.oquence for performance-critical sections (can be annoying)

The State of Scientific Programming Tomorrow

Page 39: [Harvard CS264] 10b - cl.oquence: High-Level Language Abstractions for Low-Level Programming (Cyrus Omar, CMU)

OpenCL: The Good Parts

Cyrus Omar
Computer Science Department
Carnegie Mellon University
http://www.cs.cmu.edu/~comar/

Current Status:
• Everything works, just need to clean out some cobwebs.
• It will be available at http://cl.oquence.org/ soon (May).
• If you want to use it today, email me ([email protected]).
• Join clq-announce on Google Groups for the release announcement.
• Paper will be in submission shortly.