[Harvard CS264] 10b - cl.oquence: High-Level Language Abstractions for Low-Level Programming (Cyrus Omar)
DESCRIPTION
http://cs264.org
http://bit.ly/gjQ3k7
TRANSCRIPT
High-Level Language Features for Low-Level Programming
Cyrus Omar
Computer Science Department
Carnegie Mellon University
http://www.cs.cmu.edu/~comar/
[Figure 7: Reasons for Choice of Programming Language. y-axis: reason given for use of programming language (cross-platform compatibility, features, developer experience, ease of use, legacy, performance, favourite, required, only language known, other); x-axis: number of respondents (0-60).]
[Figure 8: Types of tools used. Tool types: IDE, version control, testing, libraries/packages, build tools, bug/change tracking, framework, modelling; x-axis: number of respondents (0-50).]
they felt most IDEs imposed on their development activities'. However, four of the five projects which they studied targeted parallel supercomputer hardware platforms, which leads to different requirements amongst developers.
The lack of use of bug/change tracking tools and build tools may be explained by Morris [10], as these types of tools are not needed for prototype or throwaway code, which is commonplace in scientific software. Scientists often start programming with these types of code [10], and the low usage of the aforementioned tools carries over when scientists move on to develop larger and more complex codes.
The survey also asked why the tools were used. A summary of the reasons given by the 46 respondents who answered this question and had indicated that tools or programs were used is shown in Figure 9. It should be noted that the reasons provided are a mixture of reasons for using a tool at all, and reasons for using a particular tool over another. Of special interest are the 12 respondents who described the use of version control as 'required' or 'mandatory' for large-scale, multi-developer, and/or distributed development.
3.4 Development Teams and User Base Characteristics
[Figure 9: Reasons why tools are used (number of respondents, out of 46): improve ease of coding, version control is 'required', features, project organisation, cost (or lack thereof) of tool, open source, tool is easy to use, using tool is 'standard'.]

[Figure 10: Development team sizes. Categories: single person, small team (2-6 people), large team (7-12 people), larger team (more than 12 people), each rated never/rarely/sometimes/often/always; y-axis: number of respondents.]

As can be seen in Figure 10, most of the survey respondents develop software either alone or in a small team of developers. Few of the respondents often or always develop software in teams comprising seven or more members. Due to limitations in the survey software used, the rating for each category of development team size was independent of the others. For example, a single respondent could answer 'always' for all four categories of development team size, although conceptually this does not make sense.

From Figure 11, there is a slight tendency for intended user bases to be individuals and small groups rather than larger groups. There is also a very slight tendency towards user bases comprising either only users with programming experience or both users with and without programming experience, as shown in Figure 12.
3.5 Documentation

Figure 13 shows the number of respondents who indicated they produce certain types of documentation. The most common type of documentation produced by respondents was comments in the code, selected by 51 out of 60 respondents. At the other end of the scale, requirements documentation is the least commonly produced type of documentation, with only 18 respondents indicating that they commonly produce such artifacts. The comparative lack of documentation for requirements
ment group) or letting the users/stakeholders know how the software works (open source, scientific paper publication).
3.8 Non-functional requirements

The respondents were asked to rate a series of non-functional requirements on the following Likert scale:
1. very unimportant
2. unimportant
3. neither
4. important
5. very important
This scale was chosen so that the relative importance of non-functional requirements could be determined from respondents' answers. A straight ranking of non-functional requirements would only indicate how important respondents considered each non-functional requirement in comparison to others, but would not provide any information regarding how important a non-functional requirement was overall. The neutral response of 'neither' was included as some respondents may not consider a non-functional requirement or may be unaware of it.
Non-functional requirements from the Software Requirements Specification Data Item described in United States Military-Standard-498 [1] were used, and are as follows:

1. Functionality (the ability to perform all required functions)

2. Reliability (the ability to perform with correct, consistent results)

3. Maintainability (the ability to be easily corrected)

4. Availability (the ability to be accessed and operated when needed)

5. Flexibility (the ability to be easily adapted to changing requirements)

6. Portability (the ability to be easily modified for a new software/computing environment)

7. Reusability (the ability to be used in multiple applications)

8. Testability (the ability to be easily and thoroughly tested)

9. Usability (the ability to be easily learned and used)

To this list, two more non-functional requirements were added:

10. Traceability (the ability to link the knowledge used to create the application through to the code and the output)

11. Performance (the ability to run using a minimum of time and/or memory)
[Figure 18: Importance of non-functional requirements as rated by respondents. Stacked ratings (very unimportant, unimportant, neither, important, very important) as % of respondents, for reliability, functionality, usability, availability, flexibility, performance*, portability, testability, maintainability, traceability*, reusability.]
Table 1: Combined important and very important ratings for non-functional requirements

Ranking   Requirement       Combined Important and Very Important Ratings (%)
1         Reliability       100
2         Functionality      95
3         Maintainability    90
4         Availability       87
5         Performance*       79
6         Flexibility        77
7         Testability        75
8         Usability          63
9         Reusability        62
10        Traceability*      54
11        Portability        52
These two additional non-functional requirements were added based on the responses from the initial pilot survey identified in section 2. Descriptions of each non-functional requirement were provided in the survey.
Figure 18 shows the rated importance of the non-functional requirements as a percentage of total responses, ranked in order of very important ratings. Table 1 lists the non-functional requirements in descending order of combined important and very important ratings. All non-functional requirements were rated by 60 respondents, with the exception of traceability and performance (marked with a *), which were rated by 52 respondents.
Reliability was considered the most important non-functional requirement overall, with 83% of respondents rating it as very important, and the remainder all rating it as important. Functionality also rated very highly, with 65% rating it as very important and 30% rating it as important. These two results corroborate previous results from Kelly and Sanders [7], in which 'the singular importance of correctness' for scientific software was identified, and Carver et al. [4], where the most highly ranked project goal was correctness.
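The combined figures in Table 1 follow directly from the percentages reported above; a small arithmetic check (using only the reliability and functionality figures the text states):

```python
def combined_rating(important_pct, very_important_pct):
    # Combined 'important' + 'very important' share, as tabulated in Table 1.
    return important_pct + very_important_pct

# Reliability: 83% very important, the remaining 17% important -> 100%.
reliability = combined_rating(17, 83)
# Functionality: 30% important, 65% very important -> 95%.
functionality = combined_rating(30, 65)
```

This reproduces the top two rows of Table 1, with reliability ranked above functionality.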
Portability received the highest number of unimportant ratings of any non-functional requirement (11), and the lowest combined proportion of important and very important
The Needs of Scientists and Engineers
[Nguyen-Hoan et al, 2010]
C, Fortran, CUDA, OpenCL

Fast
• Control over memory allocation
• Control over data movement
• Access to hardware primitives
• Portability

Tedious
• Type annotations, templates, pragmas
• Obtuse compilers, linkers, preprocessors
• No support for high-level abstractions

MATLAB, Python, R, Perl

Productive
• Low syntactic overhead
• Read-eval-print loop (REPL)
• Flexible data structures and abstractions
• Nice development environments

Slow
• Dynamic lookups and indirection abound
• Automatic memory management can cause problems
Scientists relieve the tension by:
• writing overall control flow and basic data analysis routines in a high-level language
• calling into a low-level language for performance-critical sections (can be annoying)
The State of Scientific Programming Today
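As an illustrative (hypothetical) sketch of the split described on this slide: the outer control flow stays in the high-level language while the hot computation is delegated to compiled code. Here numpy's C routines stand in for the low-level section; the function name is mine, not from the talk.

```python
import numpy as np

def analyze(samples):
    # High-level control flow and data wrangling stay in Python...
    data = np.asarray(samples, dtype=np.float64)
    # ...while the performance-critical reduction runs in compiled C
    # inside numpy -- the "calling into a low-level language" step.
    return float(np.dot(data, data))
```

For example, analyze([1.0, 2.0]) computes 1.0*1.0 + 2.0*2.0 = 5.0.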
Scientists relieve any remaining tension by:
• writing overall control flow and basic data analysis routines in a high-level language
• calling into cl.oquence for performance-critical sections (can be annoying)
The State of Scientific Programming Tomorrow
What is cl.oquence?
A low-level programming language that maps closely onto, and compiles down to, OpenCL.
What is OpenCL?
OpenCL is an emerging standard for low-level programming in heterogeneous computing environments. It is designed as a library that can be used from a variety of higher-level languages.
What is a heterogeneous computing environment?
A heterogeneous computing environment is an environment where many different compute devices and address spaces are available. Devices can include multi-core CPUs (using a variety of instruction sets), GPUs, hybrid-core processors like the Cell BE and other specialized accelerators.
Why should I use cl.oquence?
• Same core type system (including pointers) and performance profile as OpenCL
• Usable from any host language that has OpenCL bindings
• Type inference and extension inference eliminate annotational burden
• Simplified syntax is a subset of Python; can use existing tools
• Structural polymorphism gives you generic programming by default
• New features:
  • Higher-order functions
  • Default arguments for functions
• Python as the preprocessor and module system
• Rich support for compile-time metaprogramming
• Write compiler extensions and new basic types as libraries; modular, clean design
• Light-weight and easy to integrate into any build process
• Packaged with special Python host bindings that eliminate even basic overhead when using from within Python
• Built on top of pyopencl and numpy
OpenCL

// Parallel elementwise sum
__kernel void sum_ff(__global float* a, __global float* b, __global float* dest) {
    size_t gid = get_global_id(0); // Get thread index
    dest[gid] = a[gid] + b[gid];
}

__kernel void sum_ii(__global int* a, __global int* b, __global int* dest) {
    size_t gid = get_global_id(0);
    dest[gid] = a[gid] + b[gid];
}

__kernel void sum_fi(__global float* a, __global int* b, __global float* dest) {
    size_t gid = get_global_id(0);
    dest[gid] = a[gid] + b[gid];
}

__kernel void sum_df(__global double* a, __global int* b, __global double* dest) {
    #pragma OPENCL EXTENSION cl_khr_fp64 : enable
    size_t gid = get_global_id(0);
    dest[gid] = a[gid] + b[gid];
}

...

// Parallel elementwise product
__kernel void prod_ff(__global float* a, __global float* b, __global float* dest) {
    size_t gid = get_global_id(0); // Get thread index
    dest[gid] = a[gid] * b[gid];
}

__kernel void prod_ii(__global int* a, __global int* b, __global int* dest) {
    size_t gid = get_global_id(0);
    dest[gid] = a[gid] + b[gid]; // Copy-paste bug: should be *
}

...
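Absent language support, one common stopgap is to stamp these monomorphic kernels out of a string template on the host. A minimal sketch of that approach (the template and helper names here are mine, not part of OpenCL or cl.oquence):

```python
# One template covers every monomorphic sum_* variant from the slide.
KERNEL_TEMPLATE = """__kernel void sum_{suffix}(__global {ta}* a, __global {tb}* b,
                     __global {tdest}* dest) {{
    size_t gid = get_global_id(0);
    dest[gid] = a[gid] + b[gid];
}}"""

def specialize(suffix, ta, tb, tdest):
    # Render one monomorphic kernel for a concrete choice of element types.
    return KERNEL_TEMPLATE.format(suffix=suffix, ta=ta, tb=tb, tdest=tdest)

# The four variants shown on the slide, stamped out mechanically:
sources = [
    specialize("ff", "float", "float", "float"),
    specialize("ii", "int", "int", "int"),
    specialize("fi", "float", "int", "float"),
    specialize("df", "double", "int", "double"),
]
```

This removes the copy-paste step (and with it the copy-paste bugs), but the combinatorial number of specializations remains; cl.oquence's type inference is aimed at exactly that residue.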
My photographs tell stories of loss, human struggle, and personal exploration within landscapes scarred by technology and over-use… [I] strive to metaphorically and poetically link laborious actions, idiosyncratic rituals and strangely crude machines into tales about our modern experience.
Robert ParkeHarrison
cl.oquence

@cl.oquence.fn
def ew_op(a, b, dest, op):
    '''Parallel elementwise binary operation.'''
    gid = get_global_id(0)  # Get thread index
    dest[gid] = op(a[gid], b[gid])

@cl.oquence.fn
def plus(a, b):
    '''Adds the two operands.'''
    return a + b

@cl.oquence.fn
def mul(a, b):
    '''Multiplies the two operands.'''
    return a * b
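Semantically, ew_op runs one work-item per element; ignoring the parallelism, its effect can be sketched in plain Python with the work-item index replaced by a loop variable (a reference sketch, not cl.oquence's implementation):

```python
def ew_op_reference(a, b, dest, op):
    # Sequential stand-in: work-item gid handles element gid.
    for gid in range(len(dest)):
        dest[gid] = op(a[gid], b[gid])

def plus(a, b):
    return a + b

dest = [0, 0, 0]
ew_op_reference([1, 2, 3], [10, 20, 30], dest, plus)
# dest == [11, 22, 33]
```

Passing mul instead of plus yields the elementwise product: one generic definition replaces the whole family of sum_*/prod_* kernels on the previous slide.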
OpenCL
@cl.oquence.fndef ew_op(a, b, dest, op): '''Parallel elementwise binary operation.''' gid = get_global_id(0) # Get thread index dest[gid] = op(a[gid], b[gid])
@cl.oquence.fndef plus(a, b): '''Adds the two operands.'''
return a + b
@cl.oquence.fndef mul(a, b): '''Multiplies the two operands.'''
return a * b
// Parallel elementwise sum__kernel void sum_ff(__global float* a, __global float* b, __global float* dest) { size_t gid = get_global_id(0); // Get thread index dest[gid] = a[gid] + b[gid];}
__kernel void sum_ii(__global int* a, __global int* b, __global int* dest) { size_t gid = get_global_id(0); dest[gid] = a[gid] + b[gid];}
__kernel void sum_fi(__global float* a, __global int* b, __global float* dest) { size_t gid = get_global_id(0); dest[gid] = a[gid] + b[gid];}
__kernel void sum_df(__global double* a, __global int* b, __global double* dest) { #pragma OPENCL EXTENSION cl_khr_fp64 : enable size_t gid = get_global_id(0); dest[gid] = a[gid] + b[gid];}
...
// Parallel elementwise product__kernel void prod(__global float* a, __global float* b, __global float* dest) { size_t gid = get_global_id(0); // Get thread index dest[gid] = a[gid] * b[gid];}
__kernel void prod(__global float* a, __global float* b, __global float* dest) { size_t gid = get_global_id(0); // Get thread index
...
// Parallel elementwise product__kernel void prod_ff(__global float* a, __global float* b, __global float* dest) { size_t gid = get_global_id(0); // Get thread index dest[gid] = a[gid] * b[gid];}
__kernel void prod_ii(__global int* a, __global int* b, __global int* dest) { size_t gid = get_global_id(0);
OpenCL
@cl.oquence.fndef ew_op(a, b, dest, op): '''Parallel elementwise binary operation.''' gid = get_global_id(0) # Get thread index dest[gid] = op(a[gid], b[gid])
@cl.oquence.fndef plus(a, b): '''Adds the two operands.'''
return a + b
@cl.oquence.fndef mul(a, b): '''Multiplies the two operands.'''
return a * b
These two libraries express the same thing. The code will run in precisely the same amount of time.
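Without cl.oquence, a kernel family like the one above is usually kept manageable by generating the OpenCL source from the host language with string templates. A minimal pure-Python sketch of that workaround (the template and the crude type-promotion rule here are illustrative, not from the talk):

```python
import itertools

# Hypothetical sketch: generate one OpenCL kernel per (op, type, type)
# combination, the common workaround for C's lack of generics.
KERNEL_TEMPLATE = """__kernel void {name}_{ta[0]}{tb[0]}(__global {ta}* a, __global {tb}* b, __global {tdest}* dest) {{
    size_t gid = get_global_id(0);
    dest[gid] = a[gid] {op} b[gid];
}}"""

def promote(ta, tb):
    # Crude promotion rule for the sketch: double > float > int.
    rank = {"int": 0, "float": 1, "double": 2}
    return ta if rank[ta] >= rank[tb] else tb

def generate_kernels(ops, types):
    source = []
    for (name, op), ta, tb in itertools.product(ops.items(), types, types):
        source.append(KERNEL_TEMPLATE.format(name=name, op=op, ta=ta, tb=tb,
                                             tdest=promote(ta, tb)))
    return "\n\n".join(source)

src = generate_kernels({"sum": "+", "prod": "*"}, ["int", "float", "double"])
# 2 ops x 3 x 3 argument types = 18 kernels from one template.
```

The template tames the typing burden but not the combinatorics: every new operation or element type still multiplies the amount of emitted code, which is exactly the blow-up cl.oquence avoids.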
Two invocation models

1. Standalone compilation to OpenCL

• Use any host language that has OpenCL bindings available:
  • C
  • C++
  • Fortran
  • MATLAB
  • Java
  • .NET
  • Ruby
  • Python
# Programmatically specialize and assign types to
# any externally callable versions you need.
sum = ew_op.specialize(op=plus)
prod = ew_op.specialize(op=mul)

g_int_p = cl_int.global_ptr
g_float_p = cl_float.global_ptr

sum_ff = sum.compile(g_float_p, g_float_p, g_float_p)
sum_ii = sum.compile(g_int_p, g_int_p, g_int_p)
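The specialize call behaves like partial application. As a rough plain-Python model (illustrative only; the actual cl.oquence implementation additionally type-checks and emits OpenCL):

```python
from functools import partial

# Rough pure-Python model of specialization-by-partial-application,
# operating on plain lists instead of device buffers.
def ew_op(a, b, dest, op):
    """Elementwise binary operation."""
    for gid in range(len(dest)):
        dest[gid] = op(a[gid], b[gid])

def plus(a, b):
    return a + b

def mul(a, b):
    return a * b

sum_op = partial(ew_op, op=plus)   # analogous to ew_op.specialize(op=plus)
prod_op = partial(ew_op, op=mul)   # analogous to ew_op.specialize(op=mul)

dest = [0, 0, 0]
sum_op([1, 2, 3], [10, 20, 30], dest)
# dest is now [11, 22, 33]
```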
clqcc hello.clq

creates hello.cl:

__kernel void sum_ff(__global float* a, __global float* b, __global float* dest) {
    size_t gid = get_global_id(0);
    dest[gid] = a[gid] + b[gid];
}

__kernel void sum_ii(__global int* a, __global int* b, __global int* dest) {
    size_t gid = get_global_id(0);
    dest[gid] = a[gid] + b[gid];
}
Two invocation models

1. Standalone compilation to OpenCL
2. Integrated into a host language

• Python + pyopencl (w/ extensions) + numpy
# allocate two random arrays that we will be adding
a = numpy.random.rand(50000).astype(numpy.float32)
b = numpy.random.rand(50000).astype(numpy.float32)

# transfer data to device
ctx = cl.ctx = cl.Context.for_device(0, 0)
a_buf = ctx.to_device(a)
b_buf = ctx.to_device(b)
dest_buf = ctx.alloc(like=a)

# invoke function (automatically specialized as needed)
ew_op(a_buf, b_buf, dest_buf, plus,
      global_size=a.shape, local_size=(256,)).wait()

# get results
result = ctx.from_device(dest_buf)

# check results
print la.norm(result - (a + b))
Four simple memory management functions:
1. to_device: numpy array => new buffer
2. from_device: buffer => new numpy array
3. alloc: empty buffer
4. copy: copies between existing buffers or arrays

Buffers hold metadata (type, shape, order) so you don't have to provide it.
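Everything such a buffer needs to record is already carried by a numpy array, which is presumably what lets these calls omit explicit size and type arguments. For example:

```python
import numpy

# The metadata a device buffer would record is all present on the array itself.
a = numpy.random.rand(50000).astype(numpy.float32)

nbytes = a.nbytes        # allocation size for the device buffer
dtype = a.dtype          # element type, used when specializing kernels
shape = a.shape          # needed to rebuild the array on readback
order = "C" if a.flags["C_CONTIGUOUS"] else "F"  # memory layout

# alloc(like=a) can be modeled as allocating an uninitialized twin:
dest = numpy.empty_like(a)
```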
Implicit queue associated with each context.
The same example again, but now the global and local sizes are omitted from the call:

# invoke function (automatically specialized as needed)
ew_op(a_buf, b_buf, dest_buf, plus).wait()
@cl.oquence.auto(lambda a, b, dest, op: a.shape, (256,))
@cl.oquence.fn
def ew_op(a, b, dest, op):
    '''Parallel elementwise binary operation.'''
    gid = get_global_id(0)  # Get thread index
    dest[gid] = op(a[gid], b[gid])

@cl.oquence.fn
def plus(a, b):
    '''Adds the two operands.'''
    return a + b

@cl.oquence.fn
def mul(a, b):
    '''Multiplies the two operands.'''
    return a * b
The auto annotation lets you hide the details of parallelization from the user.
# allocate two random arrays that we will be adding
a = numpy.random.rand(50000).astype(numpy.float32)
b = numpy.random.rand(50000).astype(numpy.float32)
c = numpy.empty_like(a)

# create an OpenCL context
ctx = cl.ctx = cl.Context.for_device(0, 0)

# invoke function (automatically specialized as needed)
ew_op(In(a), In(b), Out(c), plus).wait()

# check results
print la.norm(c - (a + b))
The In, Out, and InOut constructs can help automate data movement when convenient.
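One way such markers can be modeled: thin wrappers that tag an array with a transfer direction, which the invocation layer then interprets. A sketch under that assumption (not the actual pyopencl-extension implementation; `invoke` is a made-up stand-in for the launcher):

```python
import numpy

# Sketch of direction-tagging wrappers: a launcher copies In arguments
# to the device before the kernel runs and Out arguments back after it.
class In:
    direction = "to_device"
    def __init__(self, array):
        self.array = array

class Out:
    direction = "from_device"
    def __init__(self, array):
        self.array = array

class InOut(In, Out):
    direction = "both"

def invoke(fn, *args):
    # Stand-in for a launcher: unwrap tagged arrays and run fn on the host,
    # where a real launcher would schedule copies at the tagged points.
    raw = [x.array if isinstance(x, (In, Out)) else x for x in args]
    return fn(*raw)

a = numpy.arange(4, dtype=numpy.float32)
b = numpy.arange(4, dtype=numpy.float32)
c = numpy.empty_like(a)
invoke(lambda a, b, c: numpy.add(a, b, out=c), In(a), In(b), Out(c))
```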
How?
• cl.oquence.fn code looks like Python, but it isn't Python
  • Same core type system as OpenCL (C99+)
  • Type inference to eliminate type annotations (static, not dynamic lookups)
  • Extension inference to eliminate pragmas
  • Higher-order functions (inlined at compile time)
  • Structural polymorphism
    • All functions are generic by default
    • You can call a function with any arguments that support the operations it uses.
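The "looks like Python but isn't" trick presumably rests on Python's own ast module: a decorator can grab the function's source (e.g. with inspect.getsource), parse it, and compile the tree instead of ever executing the body as Python. A stripped-down sketch of that front end, operating on a source string:

```python
import ast

# A cl.oquence-style front end never runs this body as Python;
# it only parses it into a tree to type-check and translate.
SOURCE = '''
def ew_op(a, b, dest, op):
    gid = get_global_id(0)
    dest[gid] = op(a[gid], b[gid])
'''

tree = ast.parse(SOURCE)
fndef = tree.body[0]                              # the FunctionDef node
params = [arg.arg for arg in fndef.args.args]     # untyped parameters
# Every name the body mentions: the raw material for type inference.
names = sorted({n.id for n in ast.walk(fndef) if isinstance(n, ast.Name)})
```

Because the parameters carry no annotations, a backend is free to unify each one with whatever type the call site supplies, which is what makes every function generic by default.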
With either invocation model, Python is now your preprocessor

• Functions can be programmatically generated from source or ASTs
• You're using Python's well-designed module system instead of the #include system (!)
  • Use distutils, PyPI, and so on
• The syntax is a subset of Python
  • Same source code highlighters
  • Use standard documentation generators
• You can write compiler and language extensions as libraries
• Bonus: default values for function arguments
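The default-values bonus comes for free from reusing Python's function syntax: a compiler sitting behind a decorator can read defaults off the signature at decoration time. A plain-Python illustration (`saxpy` is a made-up example, not from the talk):

```python
import inspect

# Defaults written in ordinary Python syntax are visible to any
# decorator-based compiler before the function is ever called.
def saxpy(x, y, alpha=1.0):
    return [alpha * xi + yi for xi, yi in zip(x, y)]

sig = inspect.signature(saxpy)
defaults = {name: p.default for name, p in sig.parameters.items()
            if p.default is not inspect.Parameter.empty}
# defaults == {"alpha": 1.0}
```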
Downsides (e.g. open projects, email me!)

• No current support for graphical debuggers
  • You can, however, optionally include line numbers from the original file in comments
  • The mapping is close enough that debugging is not typically a problem, and the generated source code is formatted nicely
• Calling non-cl.oquence OpenCL libraries requires writing an explicit extern directive
  • ext_import("library.cl")
  • ext_function = extern(cl.void, cl_int, ...)
Neurobiological Circuit Simulations
from cl.egans import Simulation
from ahh.cl.egans.spiking.models import ReducedLIF
from ahh.cl.egans.spiking.inputs import ExponentialSynapse

sim = Simulation(ctx, n_realizations=1, n_timesteps=10000, DT=0.1)

# Create 4000 LIF neurons
N_Exc = 3200
N_Inh = 800
N = N_Exc + N_Inh
neurons = ReducedLIF(sim, "LIF", count=N, tau=20.0, v_reset=0.0,
                     v_thresh=10.0, abs_refractory_period=5.0)

# Create excitatory and inhibitory synapses
e_synapse = ExponentialSynapse(neurons, 'ge', tau=5.0, reversal=60.0)
...

sim.generate()
print sim.code

@cl.oquence.fn
def step_fn(timestep, realization_start):
    gid = get_global_id(0)
    gsize = get_global_size(0)
    first_idx_sim = realization_start * 4000
    last_idx_sim = min(first_idx_sim + 4000, 4000)
    for idx_sim in (first_idx_sim + gid, last_idx_sim, gsize):
        realization_num = idx_sim / 4000
        realization_first_idx_sim = realization_num * 4000
        realization_first_idx_div = (realization_num - realization_start) * 4000
        idx_realization = idx_sim - realization_first_idx_sim
        idx_division = idx_sim - first_idx_sim
        idx_model = idx_realization - 0
        idx_state = idx_model + (realization_num - realization_start) * 4000
        LIF_v = LIF_v_buffer[idx_state]
        # ...
        if v_new >= 10.0:
            LIF_v_buffer[idx_state] = 0.0
            target = LIF_ge_AtomicReceiver_out if idx_model < 3200 \
                else LIF_gi_AtomicReceiver_out
            neighbors_offset = neighbor_data[idx_realization]
            neighbor_size = neighbor_data[neighbors_offset]
            neighbors = neighbor_data + neighbors_offset + 1
            for i in (0, neighbor_size, 1):
                atom_add(target + realization_first_idx_div + neighbors[i], 1)
        else:
            # ...
A modular simulation architecture that uses compile-time code generation to avoid the typical performance penalties.
C, Fortran, CUDA, OpenCL

Fast
• Control over memory allocation
• Control over data movement
• Access to hardware primitives
• Portability

Tedious
• Type annotations, templates, pragmas
• Obtuse compilers, linkers, preprocessors
• No support for high-level abstractions

MATLAB, Python, R, Perl

Productive
• Low syntactic overhead
• Read-eval-print loop (REPL)
• Flexible data structures and abstractions
• Nice development environments

Slow
• Dynamic lookups and indirection abound
• Automatic memory management can cause problems
Scientists relieve any remaining tension by:
• writing overall control flow and basic data analysis routines in a high-level language
• calling into cl.oquence for performance-critical sections (can be annoying)
The State of Scientific Programming Tomorrow
OpenCL: The Good Parts
Cyrus OmarComputer Science DepartmentCarnegie Mellon Universityhttp://www.cs.cmu.edu/~comar/
Current Status:
• Everything works, just need to clean out some cobwebs.
• It will be available at http://cl.oquence.org/ soon (May).
• If you want to use it today, email me ([email protected]).
• Join clq-announce on Google Groups for release announcements.
• Paper will be in submission shortly.