[Harvard CS264] 10b - cl.oquence: High-Level Language Abstractions for Low-Level Programming (Cyrus Omar, CMU)

High-Level Language Features for Low-Level Programming
Cyrus Omar, Computer Science Department, Carnegie Mellon University
http://www.cs.cmu.edu/~comar/

Uploaded by npinto on 28 Nov 2014.

http://cs264.org
http://bit.ly/gjQ3k7

TRANSCRIPT

Page 1:

High-Level Language Features for Low-Level Programming

Cyrus Omar, Computer Science Department, Carnegie Mellon University
http://www.cs.cmu.edu/~comar/

Page 2:

[Figure omitted: bar chart of "Reason given for use of programming language" vs "Number of respondents". Categories: Cross-platform compatibility, Features, Developer experience, Ease of use, Legacy, Performance, Favourite, Required, Only language known, Other.]

Figure 7: Reasons for Choice of Programming Language

[Figure omitted: bar chart of "Tool type" vs "Number of respondents". Categories: IDE, Version Control, Testing, Libraries/Packages, Build Tools, Bug/Change Tracking, Framework, Modelling.]

Figure 8: Types of tools used

they felt most IDEs imposed on their development activities'. However, four of the five projects which they studied targeted parallel supercomputer hardware platforms, which leads to different requirements amongst developers.

The lack of use of bug/change tracking tools and build tools may be explained by Morris [10], as these types of tools are not needed for prototype or throwaway code, which is commonplace in scientific software. Scientists often start programming with these types of code [10], and low usage of the aforementioned tools carries over when scientists move on to develop larger and more complex codes.

The survey also asked why the tools were used. A summary of the reasons given by the 46 respondents who answered this question and had indicated that tools or programs were used is shown in Figure 9. It should be noted that the reasons provided are a mixture of reasons for using a tool at all, and reasons for using a particular tool over another. Of special interest are the 12 respondents who described the use of version control as 'required' or 'mandatory' for large-scale, multi-developer, and/or distributed development.

[Figure omitted: bar chart of "Reason for use of tools" vs "Number of respondents (out of 46)". Categories: Improve Ease of Coding, Version Control is 'Required', Features, Project Organisation, Cost (or lack thereof) of Tool, Open Source, Tool is Easy to Use, Using Tool is 'Standard'.]

Figure 9: Reasons why tools are used

3.4 Development Teams and User Base Characteristics

[Figure omitted: bar chart of development team size (Single Person, Small Team (2-6 people), Large Team (7-12 people), Larger Team (more than 12 people)) vs "Number of respondents", with frequency ratings Never, Rarely, Sometimes, Often, Always.]

Figure 10: Development team sizes

As can be seen in Figure 10, most of the survey respondents develop software either alone or in a small team of developers. Few of the respondents often or always develop software in teams comprising seven or more members. Due to limitations in the survey software used, the rating for each category of development team size was independent of the others. For example, a single respondent could answer 'always' for all four categories of development team size, although conceptually this does not make sense.

From Figure 11, there is a slight tendency for the intended

user base size to be towards individual and small group user base sizes compared to larger user bases. There is also a very slight tendency towards user bases being comprised of either only users with programming experience or both users with and without programming experience, as shown in Figure 12.

3.5 Documentation

Figure 13 shows the number of respondents who indicated they produce certain types of documentation. The most common type of documentation produced by respondents was comments in the code, selected by 51 out of 60 respondents. At the other end of the scale, requirements documentation is the least commonly produced type of documentation, with only 18 respondents indicating that they commonly produce such artifacts. The comparative lack of documentation for requirements

ment group) or letting the users/stakeholders know how the software works (open source, scientific paper publication).

3.8 Non-functional requirements

The respondents were asked to rate a series of non-functional requirements on the following Likert scale:

1. very unimportant

2. unimportant

3. neither

4. important

5. very important

This scale was chosen so that the relative importance of non-functional requirements could be determined from respondents' answers. A straight ranking of non-functional requirements would only indicate how important respondents considered each non-functional requirement in comparison to others, but would not provide any information regarding how important a non-functional requirement was overall. The neutral response of 'neither' was included as some respondents may not consider a non-functional requirement or are unaware of it.

Non-functional requirements from the Software Requirements Specification Data Item described in United States Military-Standard-498 [1] were used and are as follows:

1. Functionality (the ability to perform all required functions)

2. Reliability (the ability to perform with correct, consistent results)

3. Maintainability (the ability to be easily corrected)

4. Availability (the ability to be accessed and operated when needed)

5. Flexibility (the ability to be easily adapted to changing requirements)

6. Portability (the ability to be easily modified for a new software/computing environment)

7. Reusability (the ability to be used in multiple applications)

8. Testability (the ability to be easily and thoroughly tested)

9. Usability (the ability to be easily learned and used)

To this list, two more non-functional requirements were added:

10. Traceability (the ability to link the knowledge used to create the application through to the code and the output)

11. Performance (the ability to run using a minimum of time and/or memory)

[Figure omitted: stacked bar chart of ratings (Very Unimportant, Unimportant, Neither, Important, Very Important) as a percentage of respondents, for Reliability, Functionality, Usability, Availability, Flexibility, Performance*, Portability, Testability, Maintainability, Traceability*, Reusability.]

Figure 18: Importance of non-functional requirements as rated by respondents

Table 1: Combined important and very important ratings for non-functional requirements

Ranking  Requirement      Combined Important and Very Important Ratings (%)
1        Reliability      100
2        Functionality     95
3        Maintainability   90
4        Availability      87
5        Performance*      79
6        Flexibility       77
7        Testability       75
8        Usability         63
9        Reusability       62
10       Traceability*     54
11       Portability       52

These two additional non-functional requirements were added based on the responses from the initial pilot survey identified in section 2. The descriptions of each non-functional requirement were provided in the survey.

Figure 18 shows the rated importance of the non-functional requirements as a percentage of total responses, ranked in order of very important ratings. Table 1 lists the non-functional requirements in descending order of combined important and very important ratings. All non-functional requirements were rated by 60 respondents, with the exception of traceability and performance (which are marked by a *), which were rated by 52 respondents.

Reliability was considered to be the most important non-functional requirement overall, with 83% of respondents rating it as very important, and the remainder all rating it as important. Functionality also rated very highly, with 65% rating it as very important and 30% rating it as important. These two results corroborate previous results from Kelly and Sanders [7], in which 'the singular importance of correctness' for scientific software was identified, and Carver et al. [4], where the most highly ranked project goal was correctness.
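The combined rating used to rank Table 1 is simply the sum of the 'important' and 'very important' response shares. A quick illustration, with raw counts reconstructed from the percentages quoted above (assuming the stated 60 respondents):

```python
def combined_rating(very_important, important, total):
    """Percentage of respondents rating a requirement important or very important."""
    return round(100 * (very_important + important) / total)

# Reliability: 83% of 60 respondents (50 people) very important, remaining 10 important.
# Functionality: 65% (39 people) very important, 30% (18 people) important.
print(combined_rating(50, 10, 60))  # 100, matching Table 1
print(combined_rating(39, 18, 60))  # 95, matching Table 1
```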

Portability received the highest number of unimportant ratings for any non-functional requirement (11), and the lowest combined proportion of important and very important

The Needs of Scientists and Engineers

[Nguyen-Hoan et al, 2010]

Page 3:

C, Fortran, CUDA, OpenCL

Fast:
• Control over memory allocation
• Control over data movement
• Access to hardware primitives
• Portability

Tedious:
• Type annotations, templates, pragmas
• Obtuse compilers, linkers, preprocessors
• No support for high-level abstractions

MATLAB, Python, R, Perl

Productive:
• Low syntactic overhead
• Read-eval-print loop (REPL)
• Flexible data structures and abstractions
• Nice development environments

Slow:
• Dynamic lookups and indirection abound
• Automatic memory management can cause problems

Scientists relieve the tension by:
• writing overall control flow and basic data analysis routines in a high-level language
• calling into a low-level language for performance-critical sections (can be annoying)

The State of Scientific Programming Today
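The division of labor described above can be sketched in a few lines. Here numpy's compiled elementwise kernels stand in for the hand-written low-level section; the snippet is illustrative, not from the talk:

```python
import numpy as np

def simulate_step(state, dt=0.1):
    # Performance-critical arithmetic is dispatched to numpy's compiled C
    # kernels -- the same role hand-written C/OpenCL plays in the slides.
    return state + dt * np.sin(state)

# High-level control flow stays in plain Python, easy to read and modify.
state = np.zeros(4, dtype=np.float32)
for _ in range(3):                 # outer loop in Python
    state = simulate_step(state)   # inner elementwise work runs in C
```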

Page 4:

C, Fortran, CUDA, OpenCL

Fast:
• Control over memory allocation
• Control over data movement
• Access to hardware primitives
• Portability

Tedious:
• Type annotations, templates, pragmas
• Obtuse compilers, linkers, preprocessors
• No support for high-level abstractions

MATLAB, Python, R, Perl

Productive:
• Low syntactic overhead
• Read-eval-print loop (REPL)
• Flexible data structures and abstractions
• Nice development environments

Slow:
• Dynamic lookups and indirection abound
• Automatic memory management can cause problems

Scientists relieve any remaining tension by:
• writing overall control flow and basic data analysis routines in a high-level language
• calling into cl.oquence for performance-critical sections (can be annoying)

The State of Scientific Programming Tomorrow

Page 5:

What is cl.oquence?

A low-level programming language that maps closely onto, and compiles down to, OpenCL.

What is OpenCL?

OpenCL is an emerging standard for low-level programming in heterogeneous computing environments. It is designed as a library that can be used from a variety of higher-level languages.

What is a heterogeneous computing environment?

A heterogeneous computing environment is an environment where many different compute devices and address spaces are available. Devices can include multi-core CPUs (using a variety of instruction sets), GPUs, hybrid-core processors like the Cell BE, and other specialized accelerators.

Why should I use cl.oquence?

• Same core type system (including pointers) and performance profile as OpenCL
• Usable from any host language that has OpenCL bindings
• Type inference and extension inference eliminate annotational burden
• Simplified syntax is a subset of Python, can use existing tools
• Structural polymorphism gives you generic programming by default
• New features:
  • Higher-order functions
  • Default arguments for functions
• Python as the preprocessor and module system
• Rich support for compile-time metaprogramming
• Write compiler extensions, new basic types as libraries; modular, clean design
• Light-weight and easy to integrate into any build process
• Packaged with special Python host bindings that eliminate even basic overhead when using from within Python
• Built on top of pyopencl and numpy
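The "type inference and structural polymorphism" bullets can be illustrated with a plain-Python analogy. This is a toy sketch, not cl.oquence's actual API: one generic definition, with a concrete variant recorded per argument-type signature on first use.

```python
compiled = {}  # type signature -> specialized variant (stand-in for compiled code)

def kernel(fn):
    """Decorator: specialize fn per argument-type signature, caching each variant."""
    def call(*args):
        sig = tuple(type(a).__name__ for a in args)
        if sig not in compiled:
            compiled[sig] = fn  # a real compiler would emit typed OpenCL here
        return compiled[sig](*args)
    return call

@kernel
def add(a, b):
    return a + b  # works for any argument types supporting +

add(1, 2)
add(1.5, 2.5)
print(sorted(compiled))  # two specializations: ('float','float') and ('int','int')
```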


Page 9:

OpenCL

// Parallel elementwise sum
__kernel void sum(__global float* a, __global float* b,
                  __global float* dest) {
    size_t gid = get_global_id(0); // Get thread index
    dest[gid] = a[gid] + b[gid];
}

__kernel void sum(__global int* a, __global int* b,
                  __global int* dest) {
    size_t gid = get_global_id(0);
    dest[gid] = a[gid] + b[gid];
}

__kernel void sum(__global short* a, __global int* b,
                  __global int* dest) {
    size_t gid = get_global_id(0);
    dest[gid] = a[gid] + b[gid];
}

__kernel void sum(__global float* a, __global double* b,
                  __global float* dest) {
    #pragma ...
    size_t gid = get_global_id(0);
    dest[gid] = a[gid] + b[gid];
}

...

// Parallel elementwise product
__kernel void prod(__global float* a, __global float* b,
                   __global float* dest) {
    size_t gid = get_global_id(0); // Get thread index
    dest[gid] = a[gid] * b[gid];
}

__kernel void prod(__global float* a, __global float* b,
                   __global float* dest) {
    size_t gid = get_global_id(0); // Get thread index
    dest[gid] = a[gid] + b[gid];
}

Page 10:

OpenCL

// Parallel elementwise sum
__kernel void sum_ff(__global float* a, __global float* b,
                     __global float* dest) {
    size_t gid = get_global_id(0); // Get thread index
    dest[gid] = a[gid] + b[gid];
}

__kernel void sum_ii(__global int* a, __global int* b,
                     __global int* dest) {
    size_t gid = get_global_id(0);
    dest[gid] = a[gid] + b[gid];
}

__kernel void sum(__global short* a, __global int* b,
                  __global int* dest) {
    size_t gid = get_global_id(0);
    dest[gid] = a[gid] + b[gid];
}

__kernel void sum(__global float* a, __global double* b,
                  __global float* dest) {
    #pragma ...
    size_t gid = get_global_id(0);
    dest[gid] = a[gid] + b[gid];
}

...

// Parallel elementwise product
__kernel void prod(__global float* a, __global float* b,
                   __global float* dest) {
    size_t gid = get_global_id(0); // Get thread index
    dest[gid] = a[gid] * b[gid];
}

__kernel void prod(__global float* a, __global float* b,
                   __global float* dest) {
    size_t gid = get_global_id(0); // Get thread index
    dest[gid] = a[gid] + b[gid];
}

Page 12:

OpenCL

// Parallel elementwise sum
__kernel void sum_ff(__global float* a, __global float* b,
                     __global float* dest) {
    size_t gid = get_global_id(0); // Get thread index
    dest[gid] = a[gid] + b[gid];
}

__kernel void sum_ii(__global int* a, __global int* b,
                     __global int* dest) {
    size_t gid = get_global_id(0);
    dest[gid] = a[gid] + b[gid];
}

__kernel void sum_fi(__global float* a, __global int* b,
                     __global float* dest) {
    size_t gid = get_global_id(0);
    dest[gid] = a[gid] + b[gid];
}

__kernel void sum_di(__global double* a, __global int* b,
                     __global double* dest) {
    #pragma OPENCL EXTENSION cl_khr_fp64 : enable
    size_t gid = get_global_id(0);
    dest[gid] = a[gid] + b[gid];
}

...

// Parallel elementwise product
__kernel void prod(__global float* a, __global float* b,
                   __global float* dest) {
    size_t gid = get_global_id(0); // Get thread index
    dest[gid] = a[gid] * b[gid];
}

__kernel void prod(__global float* a, __global float* b,
                   __global float* dest) {
    size_t gid = get_global_id(0); // Get thread index
    dest[gid] = a[gid] + b[gid];
}
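The suffixes on the kernels above (sum_ff, sum_ii, sum_fi, sum_di) follow an obvious name-mangling convention: one letter per input element type. A hypothetical helper in that spirit; the scheme is inferred from the slide's kernel names, not taken from cl.oquence's source:

```python
# Abbreviations inferred from the kernel names shown in the slides.
ABBREV = {"float": "f", "int": "i", "double": "d", "short": "s"}

def mangle(name, arg_types):
    """Build a monomorphic kernel name, e.g. ('float', 'int') -> 'sum_fi'."""
    return name + "_" + "".join(ABBREV[t] for t in arg_types)

print(mangle("sum", ["float", "float"]))  # sum_ff
print(mangle("sum", ["double", "int"]))   # sum_di
```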


Page 15:

OpenCL

// Parallel elementwise sum
__kernel void sum_ff(__global float* a, __global float* b,
                     __global float* dest) {
    size_t gid = get_global_id(0); // Get thread index
    dest[gid] = a[gid] + b[gid];
}

__kernel void sum_ii(__global int* a, __global int* b,
                     __global int* dest) {
    size_t gid = get_global_id(0);
    dest[gid] = a[gid] + b[gid];
}

__kernel void sum_fi(__global float* a, __global int* b,
                     __global float* dest) {
    size_t gid = get_global_id(0);
    dest[gid] = a[gid] + b[gid];
}

__kernel void sum_df(__global double* a, __global int* b,
                     __global double* dest) {
    #pragma OPENCL EXTENSION cl_khr_fp64 : enable
    size_t gid = get_global_id(0);
    dest[gid] = a[gid] + b[gid];
}

...

// Parallel elementwise product
__kernel void prod(__global float* a, __global float* b,
                   __global float* dest) {
    size_t gid = get_global_id(0); // Get thread index
    dest[gid] = a[gid] * b[gid];
}

__kernel void prod(__global float* a, __global float* b,
                   __global float* dest) {
    size_t gid = get_global_id(0); // Get thread index
    dest[gid] = a[gid] + b[gid];
}

...

// Parallel elementwise product
__kernel void prod_ff(__global float* a, __global float* b,
                      __global float* dest) {
    size_t gid = get_global_id(0); // Get thread index
    dest[gid] = a[gid] * b[gid];
}

__kernel void prod_ii(__global int* a, __global int* b,
                      __global int* dest) {
    size_t gid = get_global_id(0);
    dest[gid] = a[gid] + b[gid];
}

...


Page 17:

My photographs tell stories of loss, human struggle, and personal exploration within landscapes scarred by technology and over-use… [I] strive to metaphorically and poetically link laborious actions, idiosyncratic rituals and strangely crude machines into tales about our modern experience.

Robert ParkeHarrison

Page 18: [Harvard CS264] 10b - cl.oquence: High-Level Language Abstractions for Low-Level Programming (Cyrus Omar, CMU)

@cl.oquence.fn
def plus(a, b):
    '''Adds the two operands.'''
    return a + b

@cl.oquence.fn
def mul(a, b):
    '''Multiplies the two operands.'''
    return a * b

Page 19: [Harvard CS264] 10b - cl.oquence: High-Level Language Abstractions for Low-Level Programming (Cyrus Omar, CMU)

@cl.oquence.fn
def ew_op(a, b, dest, op):
    '''Parallel elementwise binary operation.'''
    gid = get_global_id(0)  # Get thread index
    dest[gid] = op(a[gid], b[gid])
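The generic kernel reads like ordinary Python. As a way to see exactly what it computes, here is a plain sequential Python analogue (a sketch only; cl.oquence compiles the real thing to OpenCL, this just mimics the semantics on the CPU):

```python
# Sequential stand-in for the generic kernel: the body runs once per
# index gid, and op is an ordinary function argument, so one definition
# covers every operator and element type.
def ew_op(a, b, dest, op):
    for gid in range(len(dest)):
        dest[gid] = op(a[gid], b[gid])

def plus(x, y):
    return x + y

def mul(x, y):
    return x * y

a = [1.0, 2.0, 3.0]
b = [4.0, 5.0, 6.0]
dest = [0.0] * 3
ew_op(a, b, dest, plus)
print(dest)  # [5.0, 7.0, 9.0]
ew_op(a, b, dest, mul)
print(dest)  # [4.0, 10.0, 18.0]
```

One higher-order function replaces the whole hand-written `sum_*`/`prod_*` family on the left.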

Page 23: [Harvard CS264] 10b - cl.oquence: High-Level Language Abstractions for Low-Level Programming (Cyrus Omar, CMU)

These two libraries express the same thing. The code will run in precisely the same amount of time.

Page 24: [Harvard CS264] 10b - cl.oquence: High-Level Language Abstractions for Low-Level Programming (Cyrus Omar, CMU)

Two invocation models
1. Standalone compilation to OpenCL

• Use any host language that has OpenCL bindings available
  • C
  • C++
  • Fortran
  • MATLAB
  • Java
  • .NET
  • Ruby
  • Python

# Programmatically specialize and assign types to
# any externally callable versions you need.

sum = ew_op.specialize(op=plus)
prod = ew_op.specialize(op=mul)

g_int_p = cl_int.global_ptr
g_float_p = cl_float.global_ptr

sum_ff = sum.compile(g_float_p, g_float_p, g_float_p)
sum_ii = sum.compile(g_int_p, g_int_p, g_int_p)
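The specialize step is essentially partial application: fixing `op` turns the one generic function into the monomorphic family that OpenCL forced us to write by hand. A framework-free sketch of that idea, with `functools.partial` standing in for `.specialize` (illustrative only, not cl.oquence's implementation):

```python
from functools import partial

def ew_op(a, b, dest, op):
    # generic elementwise kernel, interpreted sequentially here
    for gid in range(len(dest)):
        dest[gid] = op(a[gid], b[gid])

# fixing op yields the sum/prod variants without rewriting the loop
sum_op = partial(ew_op, op=lambda x, y: x + y)
prod_op = partial(ew_op, op=lambda x, y: x * y)

dest = [0] * 3
sum_op([1, 2, 3], [4, 5, 6], dest)
print(dest)  # [5, 7, 9]
prod_op([1, 2, 3], [4, 5, 6], dest)
print(dest)  # [4, 10, 18]
```

In cl.oquence the same move also drives type specialization: `.compile` fixes the argument types, playing the role the `_ff`/`_ii` suffixes played before.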

Page 25: [Harvard CS264] 10b - cl.oquence: High-Level Language Abstractions for Low-Level Programming (Cyrus Omar, CMU)

clqcc hello.clq

creates hello.cl:

__kernel void sum_ff(__global float* a, __global float* b,
                     __global float* dest) {
    size_t gid = get_global_id(0);
    dest[gid] = a[gid] + b[gid];
}

__kernel void sum_ii(__global int* a, __global int* b,
                     __global int* dest) {
    size_t gid = get_global_id(0);
    dest[gid] = a[gid] + b[gid];
}

Page 26: [Harvard CS264] 10b - cl.oquence: High-Level Language Abstractions for Low-Level Programming (Cyrus Omar, CMU)

# allocate two random arrays that we will be adding
a = numpy.random.rand(50000).astype(numpy.float32)
b = numpy.random.rand(50000).astype(numpy.float32)

# transfer data to device
ctx = cl.ctx = cl.Context.for_device(0, 0)
a_buf = ctx.to_device(a)
b_buf = ctx.to_device(b)
dest_buf = ctx.alloc(like=a)

# invoke function (automatically specialized as needed)
ew_op(a_buf, b_buf, dest_buf, plus,
      global_size=a.shape, local_size=(256,)).wait()

# get results
result = ctx.from_device(dest_buf)

# check results
print la.norm(result - (a + b))
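To make the host-side flow concrete, here is a CPU-only stand-in for the round trip above (pure Python sketch: "transfers" become list copies and the kernel is the elementwise sum, so the final norm check comes out exactly zero, mirroring what the device computes):

```python
import math
import random

a = [random.random() for _ in range(1000)]
b = [random.random() for _ in range(1000)]

a_buf = list(a)                    # to_device: copy host -> "device"
b_buf = list(b)
dest_buf = [0.0] * len(a)          # alloc(like=a): empty destination

for gid in range(len(dest_buf)):   # ew_op(..., plus): one "thread" per gid
    dest_buf[gid] = a_buf[gid] + b_buf[gid]

result = list(dest_buf)            # from_device: copy back to host

# same check as above: L2 norm of the residual
err = math.sqrt(sum((r - (x + y)) ** 2 for r, x, y in zip(result, a, b)))
print(err)  # 0.0
```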

Two invocation models
1. Standalone compilation to OpenCL
2. Integrated into a host language

• Python + pyopencl (w/extensions) + numpy

Page 27: [Harvard CS264] 10b - cl.oquence: High-Level Language Abstractions for Low-Level Programming (Cyrus Omar, CMU)

Four simple memory management functions
1. to_device: numpy array => new buffer
2. from_device: buffer => new numpy array
3. alloc: empty buffer
4. copy: copies between existing buffers or arrays

Buffers hold metadata (type, shape, order) so you don’t have to provide it.
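A tiny sketch of why metadata-carrying buffers help (the `Buffer` class and `to_device` below are hypothetical stand-ins, not cl.oquence's actual classes): if the buffer remembers its element type and shape, the runtime can pick the right specialization and launch size without being told again.

```python
class Buffer:
    """Device-buffer stand-in that carries its own metadata."""
    def __init__(self, data, dtype, shape):
        self.data = data
        self.dtype = dtype
        self.shape = shape

def to_device(array):
    # infer type and shape from the host array; the caller never repeats them
    return Buffer(list(array), type(array[0]).__name__, (len(array),))

buf = to_device([1.0, 2.0, 3.0])
print(buf.dtype, buf.shape)  # float (3,)
```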

Page 28: [Harvard CS264] 10b - cl.oquence: High-Level Language Abstractions for Low-Level Programming (Cyrus Omar, CMU)

Implicit queue associated with each context.

Page 29: [Harvard CS264] 10b - cl.oquence: High-Level Language Abstractions for Low-Level Programming (Cyrus Omar, CMU)

# invoke function (automatically specialized as needed)
ew_op(a_buf, b_buf, dest_buf, plus).wait()

# get results
result = ctx.from_device(dest_buf)

# check results
print la.norm(result - (a + b))

@cl.oquence.auto(lambda a, b, dest, op: a.shape, (256,))
@cl.oquence.fn
def ew_op(a, b, dest, op):
    '''Parallel elementwise binary operation.'''
    gid = get_global_id(0)  # Get thread index
    dest[gid] = op(a[gid], b[gid])

The auto annotation can allow you to hide the details of parallelization from the user.
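The shape of that decorator can be sketched without any OpenCL at all: a wrapper derives `global_size` from the arguments (the same `lambda a, b, dest, op: a.shape` idea), so callers never pass launch parameters. Hypothetical, framework-free version:

```python
def auto(global_size_fn, local_size):
    """Attach a launch-configuration rule to a kernel-like function."""
    def wrap(fn):
        def launched(*args):
            # compute the launch size from the arguments themselves
            return fn(*args, global_size=global_size_fn(*args),
                      local_size=local_size)
        return launched
    return wrap

@auto(lambda a, b, dest, op: (len(a),), (256,))
def ew_op(a, b, dest, op, global_size=None, local_size=None):
    for gid in range(global_size[0]):   # sequential stand-in for the launch
        dest[gid] = op(a[gid], b[gid])

dest = [0] * 3
ew_op([1, 2, 3], [4, 5, 6], dest, lambda x, y: x + y)  # no sizes at the call
print(dest)  # [5, 7, 9]
```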

Page 30: [Harvard CS264] 10b - cl.oquence: High-Level Language Abstractions for Low-Level Programming (Cyrus Omar, CMU)

# allocate two random arrays that we will be adding
a = numpy.random.rand(50000).astype(numpy.float32)
b = numpy.random.rand(50000).astype(numpy.float32)
c = numpy.empty_like(a)

# create an OpenCL context
ctx = cl.ctx = cl.Context.for_device(0, 0)

# invoke function (automatically specialized as needed)
ew_op(In(a), In(b), Out(c), plus).wait()

# check results
print la.norm(c - (a + b))

The In, Out and InOut constructs can help automate data movement when convenient.
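The mechanism behind such markers can be sketched in plain Python (hypothetical `In`/`Out` classes and `launch` helper, not cl.oquence's actual API): wrapping an argument records the copy direction, and the runtime moves data automatically around the call.

```python
class In:
    """Marks an argument to copy host -> device before the call."""
    def __init__(self, array):
        self.array = array

class Out:
    """Marks an argument to copy device -> host after the call."""
    def __init__(self, array):
        self.array = array

def launch(kernel, *args):
    # allocate/copy "device" storage according to each marker
    device = [list(a.array) if isinstance(a, In) else [0.0] * len(a.array)
              for a in args]
    kernel(*device)
    for arg, dev in zip(args, device):   # copy Out buffers back to the host
        if isinstance(arg, Out):
            arg.array[:] = dev

def sum_kernel(a, b, dest):
    for gid in range(len(dest)):
        dest[gid] = a[gid] + b[gid]

c = [0.0, 0.0]
launch(sum_kernel, In([1.0, 2.0]), In([3.0, 4.0]), Out(c))
print(c)  # [4.0, 6.0]
```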

Page 32: [Harvard CS264] 10b - cl.oquence: High-Level Language Abstractions for Low-Level Programming (Cyrus Omar, CMU)

OpenCL

@cl.oquence.fn
def ew_op(a, b, dest, op):
    '''Parallel elementwise binary operation.'''
    gid = get_global_id(0)  # Get thread index
    dest[gid] = op(a[gid], b[gid])

@cl.oquence.fn
def plus(a, b):
    '''Adds the two operands.'''
    return a + b

@cl.oquence.fn
def mul(a, b):
    '''Multiplies the two operands.'''
    return a * b

// Parallel elementwise sum
__kernel void sum_ff(__global float* a, __global float* b,
                     __global float* dest) {
    size_t gid = get_global_id(0);  // Get thread index
    dest[gid] = a[gid] + b[gid];
}

__kernel void sum_ii(__global int* a, __global int* b,
                     __global int* dest) {
    size_t gid = get_global_id(0);
    dest[gid] = a[gid] + b[gid];
}

__kernel void sum_fi(__global float* a, __global int* b,
                     __global float* dest) {
    size_t gid = get_global_id(0);
    dest[gid] = a[gid] + b[gid];
}

#pragma OPENCL EXTENSION cl_khr_fp64 : enable
__kernel void sum_df(__global double* a, __global float* b,
                     __global double* dest) {
    size_t gid = get_global_id(0);
    dest[gid] = a[gid] + b[gid];
}

...


How?

• cl.oquence code looks like Python, but it is not interpreted as Python!
• Same core type system as OpenCL (C99+)
• Type inference to eliminate type annotations (not dynamic lookups)
• Extension inference to eliminate pragmas
• Higher-order functions (inlined at compile time)
• Structural polymorphism
• All functions are generic by default
• You can call a function with any arguments that support the operations it uses.
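One way to picture "generic by default" with no dynamic lookups is monomorphization: a single generic source template is specialized once per concrete argument-type combination, at compile time. A rough Python sketch of that idea (this is not cl.oquence's actual implementation; the template, cache, and naming scheme are invented for illustration):

```python
# Specialize a generic elementwise kernel template for concrete types,
# memoizing so each (function, type) pair is compiled exactly once --
# a toy model of compile-time monomorphization.
TEMPLATE = """__kernel void {name}(__global {t}* a, __global {t}* b,
                     __global {t}* dest) {{
    size_t gid = get_global_id(0);
    dest[gid] = a[gid] {op} b[gid];
}}"""

_cache = {}

def specialize(base, op, t):
    """Generate (and memoize) one concrete kernel per element type."""
    key = (base, t)
    if key not in _cache:
        # Hypothetical naming scheme: sum + float -> sum_ff, etc.
        _cache[key] = TEMPLATE.format(name=base + "_" + t[0] * 2, op=op, t=t)
    return _cache[key]

print(specialize("sum", "+", "float").splitlines()[0])
```

The programmer writes the generic version once; the compiler, not the programmer, enumerates the type combinations that are actually used.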

Page 33: [Harvard CS264] 10b - cl.oquence: High-Level Language Abstractions for Low-Level Programming (Cyrus Omar, CMU)


Page 34: [Harvard CS264] 10b - cl.oquence: High-Level Language Abstractions for Low-Level Programming (Cyrus Omar, CMU)


Page 35: [Harvard CS264] 10b - cl.oquence: High-Level Language Abstractions for Low-Level Programming (Cyrus Omar, CMU)


With either invocation model, Python is now your preprocessor

• Functions can be programmatically generated from source or ASTs
• You're using Python's well-designed module system instead of the #include system (!)
• Use distutils, PyPI, and so on
• The syntax is a subset of Python
  • Same source-code highlighters
  • Use standard documentation generators
• You can write compiler and language extensions as libraries
• Bonus: default values for function arguments
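The first bullet, generating functions programmatically from source or ASTs, can be sketched with plain Python string building plus the standard `ast` module. The helper name is hypothetical, and the final `@cl.oquence.fn` compilation step is omitted:

```python
import ast
import textwrap

def make_ew_source(name, op):
    """Build the source of an elementwise kernel as an ordinary Python
    string; because cl.oquence syntax is a subset of Python, the result
    can be parsed and manipulated with the standard ast module."""
    return textwrap.dedent("""
        def {name}(a, b, dest):
            gid = get_global_id(0)
            dest[gid] = a[gid] {op} b[gid]
    """.format(name=name, op=op)).strip()

src = make_ew_source("ew_sum", "+")
tree = ast.parse(src)          # valid Python syntax, so ast.parse accepts it
print(tree.body[0].name)       # -> ew_sum
```

Because Python itself is the preprocessor, all of Python's tooling (modules, packaging, AST transforms) applies to kernel generation for free.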

Page 36: [Harvard CS264] 10b - cl.oquence: High-Level Language Abstractions for Low-Level Programming (Cyrus Omar, CMU)


Downsides (and open projects: email me!)
• No current support for graphical debuggers
  • You can, however, optionally include line numbers from the original file in comments
  • The mapping is close enough that debugging is not typically a problem, and the generated source code is formatted nicely
• Calling non-cl.oquence OpenCL libraries requires writing an explicit extern directive:
  • ext_import("library.cl")
  • ext_function = extern(cl.void, cl_int, ...)

Page 37: [Harvard CS264] 10b - cl.oquence: High-Level Language Abstractions for Low-Level Programming (Cyrus Omar, CMU)

Neurobiological Circuit Simulations

from cl.egans import Simulation
from ahh.cl.egans.spiking.models import ReducedLIF
from ahh.cl.egans.spiking.inputs import ExponentialSynapse

sim = Simulation(ctx,
    n_realizations=1,
    n_timesteps=10000,
    DT=0.1)

# Create 4000 LIF neurons
N_Exc = 3200
N_Inh = 800
N = N_Exc + N_Inh
neurons = ReducedLIF(sim, "LIF",
    count=N,
    tau=20.0,
    v_reset=0.0,
    v_thresh=10.0,
    abs_refractory_period=5.0)

# Create excitatory and inhibitory synapses
e_synapse = ExponentialSynapse(neurons, 'ge',
    tau=5.0,
    reversal=60.0)

...

sim.generate()
print sim.code

@cl.oquence.fn
def step_fn(timestep, realization_start):
    gid = get_global_id(0)
    gsize = get_global_size(0)
    first_idx_sim = realization_start * 4000
    last_idx_sim = min(first_idx_sim + 4000, 4000)
    for idx_sim in (first_idx_sim + gid, last_idx_sim, gsize):
        realization_num = idx_sim / 4000
        realization_first_idx_sim = realization_num * 4000
        realization_first_idx_div = (realization_num -
            realization_start) * 4000
        idx_realization = idx_sim - realization_first_idx_sim
        idx_division = idx_sim - first_idx_sim
        idx_model = idx_realization - 0
        idx_state = idx_model + (realization_num -
            realization_start) * 4000
        LIF_v = LIF_v_buffer[idx_state]
        # ...
        if v_new >= 10.0:
            LIF_v_buffer[idx_state] = 0.0
            target = LIF_ge_AtomicReceiver_out if idx_model < 3200 \
                else LIF_gi_AtomicReceiver_out
            neighbors_offset = neighbor_data[idx_realization]
            neighbor_size = neighbor_data[neighbors_offset]
            neighbors = neighbor_data + neighbors_offset + 1
            for i in (0, neighbor_size, 1):
                atom_add(target + realization_first_idx_div +
                    neighbors[i], 1)
        else:
            # ...

A modular simulation architecture that uses compile-time code generation to avoid the typical performance penalties.
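The pattern behind this, assembling the per-timestep function from independent model components at compile time so that modularity costs nothing at run time, can be sketched roughly as follows. The component API and names here are invented for illustration, not cl.egans's actual interface:

```python
# Compile-time composition: each model component contributes a source
# fragment, and the generated step function is one flat body with no
# run-time indirection between components.
class Component(object):
    def emit_update(self):
        raise NotImplementedError

class LeakyIntegrator(Component):
    def __init__(self, tau):
        self.tau = tau
    def emit_update(self):
        # dv/dt = -v / tau, forward Euler with timestep DT
        return "v = v - DT * v / %s" % self.tau

class ConstantInput(Component):
    def __init__(self, current):
        self.current = current
    def emit_update(self):
        return "v = v + DT * %s" % self.current

def generate_step(components):
    """Concatenate every component's update into one generated function."""
    body = "\n    ".join(c.emit_update() for c in components)
    return "def step_fn(v, DT):\n    %s\n    return v" % body

print(generate_step([LeakyIntegrator(20.0), ConstantInput(1.5)]))
```

A class-based design would pay a virtual-dispatch cost inside the inner loop every timestep; generating the loop body once, before compilation, keeps the abstraction without the overhead.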

Page 38: [Harvard CS264] 10b - cl.oquence: High-Level Language Abstractions for Low-Level Programming (Cyrus Omar, CMU)

C, Fortran, CUDA, OpenCL

Fast
• Control over memory allocation
• Control over data movement
• Access to hardware primitives
• Portability

Tedious
• Type annotations, templates, pragmas
• Obtuse compilers, linkers, preprocessors
• No support for high-level abstractions

MATLAB, Python, R, Perl

Productive
• Low syntactic overhead
• Read-eval-print loop (REPL)
• Flexible data structures and abstractions
• Nice development environments

Slow
• Dynamic lookups and indirection abound
• Automatic memory management can cause problems

Scientists relieve any remaining tension by:
• writing overall control flow and basic data analysis routines in a high-level language
• calling into cl.oquence for performance-critical sections (can be annoying)

The State of Scientific Programming Tomorrow

Page 39: [Harvard CS264] 10b - cl.oquence: High-Level Language Abstractions for Low-Level Programming (Cyrus Omar, CMU)

OpenCL: The Good Parts

Cyrus Omar
Computer Science Department
Carnegie Mellon University
http://www.cs.cmu.edu/~comar/

Current Status:
• Everything works, just need to clean out some cobwebs.
• It will be available at http://cl.oquence.org/ soon (May).
• If you want to use it today, email me ([email protected]).
• Join clq-announce on Google Groups for the release announcement.
• Paper will be in submission shortly.