Designing Architecture-aware Library using Boost.Proto
DESCRIPTION
HAMM and Meeting C++ talk on our model of architecture description for C++ template metaprogramming.
TRANSCRIPT
Designing Architecture-aware Library using Boost.Proto
Joel Falcou, Mathias Gaunard, Pierre Esterie, Eric Jourdanneau
LRI - University Paris Sud XI - CNRS
13/03/2012
Context
In Scientific Computing ...
- there is Scientific
  - Applications are domain driven
  - Users ≠ Developers
  - Users are reluctant to changes
- there is Computing
  - Computing requires performance ...
  - ... which implies architecture-specific tuning
  - ... which requires expertise
  - ... which may or may not be available
The Problem
People using computers to do science want to do science first.
2 of 34
The Problem and how to solve it
The Facts
- The "Library to bind them all" doesn't exist (or we should have it already)
- New architectures' performance is underused due to their inherent complexity
- Few people are both experts in parallel programming and SomeOtherField
The Ends
- Find a way to shield users from parallel architecture details
- Find a way to shield developers from parallel architecture details
- Isolate software from hardware evolution
The Means
- Generic Programming
- Template Meta-Programming
- Embedded Domain Specific Languages
3 of 34
Talk Layout
Introduction
NT2
Expression Templates v2.0
Supporting GPUs
Conclusion
4 of 34
What's NT2?
A Scientific Computing Library
- Provide a simple, Matlab-like interface for users
- Provide high-performance computing entities and primitives
- Easily extendable
A Research Platform
- Simple framework to add new optimization schemes
- Test bench for EDSL development methodologies
- Test bench for Generic Programming in real-life projects
5 of 34
The NT2 API
Principles
- table<T,S> is a simple, multidimensional array object that exactly mimics Matlab array behavior and functionality
- 300+ functions usable directly either on table or on any scalar values, as in Matlab
How does it work?
- Take a .m file, copy it to a .cpp file
- Add #include <nt2/nt2.hpp> and make cosmetic changes
- Compile the file and link with libnt2.a
6 of 34
Matlab you said?
R = I(:,:,1);
G = I(:,:,2);
B = I(:,:,3);

Y = min(abs(0.299.*R + 0.587.*G + 0.114.*B), 235);
U = min(abs(-0.169.*R - 0.331.*G + 0.5.*B), 240);
V = min(abs(0.5.*R - 0.419.*G - 0.081.*B), 240);
7 of 34
Now with NT2
table<double> R = I(_,_,1);
table<double> G = I(_,_,2);
table<double> B = I(_,_,3);
table<double> Y, U, V;

Y = min(abs(0.299*R + 0.587*G + 0.114*B), 235);
U = min(abs(-0.169*R - 0.331*G + 0.5*B), 240);
V = min(abs(0.5*R - 0.419*G - 0.081*B), 240);
8 of 34
Now with NT2
table<float, settings(shallow_, of_size_<N,M>)> R = I(_,_,1);
table<float, settings(shallow_, of_size_<N,M>)> G = I(_,_,2);
table<float, settings(shallow_, of_size_<N,M>)> B = I(_,_,3);
table<float, of_size_<N,M>> Y, U, V;

Y = min(abs(0.299*R + 0.587*G + 0.114*B), 235);
U = min(abs(-0.169*R - 0.331*G + 0.5*B), 240);
V = min(abs(0.5*R - 0.419*G - 0.081*B), 240);
8 of 34
Some real life applications
9 of 34
NT2 before Boost.Proto
EDSL Core
- Code base was 11 KLOC
- 8 KLOC was dedicated to the various Expression Template glue
- 3 KLOC of actual smart stuff
Architectural support
- Altivec and SSE2 extension support
- Some vague pthread support
- Efforts were stagnant: adding a simple feature required multiple KLOC of code to change
10 of 34
Talk Layout
Introduction
NT2
Expression Templates v2.0
Supporting GPUs
Conclusion
11 of 34
Embedded Domain Specific Languages
What's an EDSL?
- DSL = Domain Specific Language
- Declarative language, easy to use, fitting the domain
- EDSL = DSL within a general-purpose language
EDSL in C++
- Relies on operator overload abuse (Expression Templates)
- Carries semantic information around code fragments
- Generic implementations become self-aware of optimizations
12 of 34
Expression Templates
matrix x(h,w), a(h,w), b(h,w);
x = cos(a) + (b*a);

expr< assign, expr<matrix&>
            , expr< plus, expr<cos, expr<matrix&>>
                  , expr<multiplies, expr<matrix&>, expr<matrix&>>>>(x, a, b);

[Figure: the meta-AST — an assign node rooting x on one side and plus(cos(a), multiplies(b,a)) on the other]

#pragma omp parallel for
for(int j=0; j<h; ++j)
{
  for(int i=0; i<w; ++i)
  {
    x(j,i) = cos(a(j,i)) + ( b(j,i) * a(j,i) );
  }
}

Arbitrary Transforms applied on the meta-AST
13 of 34
Boost.Proto
What's Boost.Proto?
- An EDSL for defining EDSLs in C++
- Generalizes Expression Templates
- Easy way to define and test EDSLs
- EDSL = some Grammar Rules + some Semantics
Boost.Proto Benefits
- Fast development process
- Compiler people are at ease: Grammar + Semantics + Code Generation process
- Easily extendable through Transforms
- Easy to handle: MPL and Fusion compatible
14 of 34
Why Boost.Proto?
Benefits
- Scalability: keeps EDSL glue code size low
- Extensibility: simplifies the design of new optimizations
- Debugging: allows for meaningful error reporting
Remaining Challenges
- Provide a generic "compilation" process
- Select the best function definition with respect to the hardware
- Allow for non-intrusive definition of new passes
15 of 34
State of the art EDSL
Usual approaches and results
- Eigen: supports SIMD arithmetic, thread-only algorithms
- uBLAS: parallelism comes from a potential back-end
What's the problem?
- Hardware considerations come after the ET code
- Retro-fitting parallelism is hard
- Need for a hardware-aware EDSL
- Back to Generative Programming
16 of 34
Principles of Generative Programming
[Figure: a Domain-Specific Application Description is fed to a Generative Component — a Translator assembling Parametric Sub-components — which produces the Concrete Application]
17 of 34
Generative Programming v2
18 of 34
Our Approach
Principles
- Segment EDSL evaluation into phases
- Each phase uses Proto transforms to advance code generation
- Hardware specifications are active elements in function overloading
- Use Generalized Tag Dispatching (GTD) to select the proper function implementation
Advantages
- Hardware can influence part of or the whole evaluation process
- Proto transform syntax is easy to use
- GTD increases opportunities for optimization
19 of 34
Generalized Tag Dispatching
Principles
- Tag dispatching uses type tags to overload functions
- Classical example: standard Iterators
- We augment the system with:
  - a tag computed from the function identifier
  - a tag computed from the hardware configuration
namespace tag { struct plus_ : elementwise_<plus_> {}; }

functor<tag::plus_> callee;
callee( some_a, some_b );
20 of 34
Generalized Tag Dispatching
auto functor<Tag,Site>::operator()(A&&... args)
{
  dispatch< hierarchy_of<Site>::type
          , hierarchy_of<Tag>::type( hierarchy_of<A>::type... )
          > callee;
  return callee( args... );
}
20 of 34
Generalized Tag Dispatching
template<class A0>
struct dispatch< tag::sse_
               , tag::plus_( simd_< real_< float_<A0> > >
                           , simd_< real_< float_<A0> > > )
               >
{
  A0 operator()(A0&& a, A0&& b)
  {
    return _mm_add_ps( a, b );
  }
};
21 of 34
Generalized Tag Dispatching
template<class S, class R, class L, class N, class A0>
struct dispatch< openmp_<S>, reduction_<R,L,N>(ast_<A0>) >
{
  typename A0::value_type operator()(A0&& ast)
  {
    auto result = functor<N,S>(as_<typename A0::value_type>());
    #pragma omp parallel
    {
      auto local = functor<N,S>(as_<typename A0::value_type>());
      #pragma omp for nowait
      for(int i=0; i<size(ast); ++i)
        functor<R,S>()( local, ast(i) );

      #pragma omp critical
      functor<L,S>()( result, local );
    }

    return result;
  }
};
21 of 34
Multipass EDSL Code Generation
Principles
- Don't walk the AST directly
- Capture where hardware introduces extension points
- Use GTD to select the proper phase implementation
Structure
- Optimize: AST to AST transforms
- Schedule: AST to Forest of Code Generators transforms
- Run: Code Generator to Code transforms
22 of 34
Multipass EDSL Code Generation
Optimize
- AST pattern matching using Proto grammars
- Replace large sequences of operations by equivalent, faster kernels
- E.g.: a+b*c to gemm or fma, x = inv(a)*b to solve(a,x,b)
Schedule
- Segment the AST into loop-compatible loop functors
- Use AST node hierarchy (and GTD) for splitting
Run
- Generate code for each AST in the scheduled forest
- Potentially generate runtime calls for runtime optimization
- Classical EDSL code generation happens here on the correct hardware
23 of 34
The new NT2 Structure
State of the code base
- Expression Template handling: 200-300 LOC
- GTD handling: 1 KLOC
- Actual features and function implementations: 10 KLOC
Evolution process
- Adding a new site: 100-500 LOC
- Adding a function: 10-100 LOC
- Time for complete new architecture support: 1 to 3 weeks
24 of 34
Some results
Support for OpenMP
- 1 build system script
- 4 new overloads for the run phase
- 1 new allocator to prevent false sharing
Support for Next Gen SIMD
- AVX: 2 weeks of work, functional
- AVX2: prototype ready, works in simulator
- LRBni: prototype ready, works in simulator
25 of 34
Talk Layout
Introduction
NT2
Expression Templates v2.0
Supporting GPUs
Conclusion
26 of 34
OpenCL in action
Principle
- Standard for manycore/multicore systems
- Uses runtime kernels and JIT compilation
Example
const char* OpenCLSource[] = {
  "__kernel void VectorAdd(__global int* c, __global int* a, __global int* b, unsigned int nb)",
  "{",
  "  // Index of the elements to add \n",
  "  unsigned int n = get_global_id(0);",
  "  // Sum the n'th element of vectors a and b and store in c \n",
  "  c[n] = 0;",
  "  for (int i=0; i<nb; i++) c[n] += (a[n]+b[n]);",
  "}"
};
27 of 34
From Proto to OpenCL
The Schedule pass
- Split the AST as usual on function hierarchy
- Replace each AST by a functor generating the equivalent OpenCL
- Compute a static hash of the tree to remember already-compiled expressions
The Run pass
- Stream data at the beginning of each kernel
- Handle generic calls and retrieval of compiled kernels
- Fire asynchronous kernel calls over the forest
28 of 34
Support for OpenCL
NT2 Example
int main()
{
  int nbpoints = 512000;
  table<float, settings( _1D )> xx(ofSize(nbpoints));
  table<float, settings( _1D )> yy(ofSize(nbpoints));

  for (int i=1; i<=nbpoints; i++)
  {
    xx(i) = -nbpoints/2 + i;
    yy(i) = 0.0;
  }
  yy = 100*sin(8*3.14*xx/512000);

  std::cout << xx(1) << " " << yy(1);
}
29 of 34
Support for OpenCL
NT2 Example
int main()
{
  int nbpoints = 512000;
  table<float, settings( _1D, nt2::gpu )> xx(ofSize(nbpoints));
  table<float, settings( _1D, nt2::gpu )> yy(ofSize(nbpoints));

  for (int i=1; i<=nbpoints; i++)
  {
    xx(i) = -nbpoints/2 + i;
    yy(i) = 0.0;
  }
  yy = 100*sin(8*3.14*xx/512000);

  std::cout << xx(1) << " " << yy(1);
}
29 of 34
Preliminary results for OpenCL
Generating some signals
for (int i=1; i<=nbpoints; i++)
{
  xx(i) = -nbpoints/2 + i;
  yy(i) = 0.0;
}

yy = 100*sin(8*3.14*xx/512000);
yy = yy + 0.0001*xx;

for (int ii=1; ii<512000; ii+=15000)
{
  addGaussian(xx, yy, (float)ii, 1000);
  yy = 1 - yy;
}

yy = 1 - yy;
yy = yy*yy;
yy = yy*cos(xx);
30 of 34
Preliminary results for OpenCL
Generating some signals
template<typename A>
void addGaussian(A &xx, A &yy, float centre, float largeur)
{
  yy = yy + ( 100000/(val(largeur)*sqrt(2*3.1415)) )
          * exp( -(xx-val(centre))*(xx-val(centre))
               / (2*val(largeur)*val(largeur)) );
}
30 of 34
Preliminary results for OpenCL
Generating some signals
CPU 1 core          520.2 ms
CPU 4 cores         141.3 ms
CPU 4 cores+SIMD     36.1 ms
GPU First Run       878.2 ms
GPU Other Runs       13.1 ms
GPU Manual OCL       12.9 ms
- Compares favorably to ViennaCL and Accelereyes
30 of 34
Talk Layout
Introduction
NT2
Expression Templates v2.0
Supporting GPUs
Conclusion
31 of 34
Let's round this up!
Parallel Computing for Scientists
- Parallel programming tools are required
- EDSLs are an efficient way to design such tools
- But hardware has to stay in the loop
Our claims
- Software Libraries built as Generic and Generative components can solve a large chunk of parallelism-related problems while being easy to use.
- Like regular languages, EDSLs need information about the hardware system
- Integrating hardware descriptions as Generic components increases tools' portability and re-targetability
32 of 34
Current and Future Works
Recent activity
- A public release of NT2 is planned "soon"
- A Matlab-to-NT2 compiler has been designed
- Prototype for single-source CUDA support in NT2
- PhD started on MPI support in NT2
What we'll be cooking in the future
- std::future is the future
- Experimental research on NT2 for embedded systems
- Toward a global generic approach to parallelism
33 of 34
Thanks for your attention