Designing Architecture-aware Library using Boost.Proto
DESCRIPTION
HAMM and Meeting C++ talk on our model of architecture description for C++ template metaprogramming.
TRANSCRIPT
Designing Architecture-aware Library using Boost.Proto
Joel Falcou, Mathias Gaunard, Pierre Esterie, Eric Jourdanneau
LRI - University Paris Sud XI - CNRS
13/03/2012
Context
In Scientific Computing ...
- there is Scientific
  - Applications are domain driven
  - Users ≠ Developers
  - Users are reluctant to changes
- there is Computing
  - Computing requires performance ...
  - ... which implies architecture-specific tuning
  - ... which requires expertise
  - ... which may or may not be available
The Problem
People using computers to do science want to do science first.
2 of 34
The Problem and how to solve it
The Facts
- The "Library to bind them all" doesn't exist (or we should have it already)
- New architectures' performance is underused due to their inherent complexity
- Few people are both experts in parallel programming and SomeOtherField
The Ends
- Find a way to shield users from parallel architecture details
- Find a way to shield developers from parallel architecture details
- Isolate software from hardware evolution
The Means
- Generic Programming
- Template Meta-Programming
- Embedded Domain Specific Languages
3 of 34
Talk Layout
Introduction
NT2
Expression Templates v2.0
Supporting GPUs
Conclusion
4 of 34
What's NT2?
A Scientific Computing Library
- Provide a simple, Matlab-like interface for users
- Provide high-performance computing entities and primitives
- Easily extendable
A Research Platform
- Simple framework to add new optimization schemes
- Test bench for EDSL development methodologies
- Test bench for Generic Programming in real-life projects
5 of 34
The NT2 API
Principles
- table<T,S> is a simple, multidimensional array object that exactly mimics Matlab array behavior and functionality
- 300+ functions usable directly either on table or on any scalar values, as in Matlab
How does it work?
- Take a .m file, copy it to a .cpp file
- Add #include <nt2/nt2.hpp> and make cosmetic changes
- Compile the file and link with libnt2.a
6 of 34
Matlab you said?
R = I(:,:,1);
G = I(:,:,2);
B = I(:,:,3);

Y = min(abs(0.299.*R + 0.587.*G + 0.114.*B), 235);
U = min(abs(-0.169.*R - 0.331.*G + 0.5.*B), 240);
V = min(abs(0.5.*R - 0.419.*G - 0.081.*B), 240);
7 of 34
Now with NT2
table<double> R = I(_,_,1);
table<double> G = I(_,_,2);
table<double> B = I(_,_,3);
table<double> Y, U, V;

Y = min(abs(0.299*R + 0.587*G + 0.114*B), 235);
U = min(abs(-0.169*R - 0.331*G + 0.5*B), 240);
V = min(abs(0.5*R - 0.419*G - 0.081*B), 240);
8 of 34
Now with NT2
table<float, settings(shallow_, of_size_<N,M>)> R = I(_,_,1);
table<float, settings(shallow_, of_size_<N,M>)> G = I(_,_,2);
table<float, settings(shallow_, of_size_<N,M>)> B = I(_,_,3);
table<float, of_size_<N,M>> Y, U, V;

Y = min(abs(0.299*R + 0.587*G + 0.114*B), 235);
U = min(abs(-0.169*R - 0.331*G + 0.5*B), 240);
V = min(abs(0.5*R - 0.419*G - 0.081*B), 240);
8 of 34
Some real life applications
9 of 34
NT2 before Boost.Proto
EDSL Core
- Code base was 11 KLOC
- 8 KLOC was dedicated to the various Expression Template glue
- 3 KLOC of actual smart stuff
Architectural support
- Altivec and SSE2 extension support
- Some vague pthread support
- Efforts were stagnant: adding a simple feature required multiple KLOC of code to change
10 of 34
Talk Layout
Introduction
NT2
Expression Templates v2.0
Supporting GPUs
Conclusion
11 of 34
Embedded Domain Specific Languages
What's an EDSL?
- DSL = Domain Specific Language
- Declarative language, easy to use, fitting the domain
- EDSL = DSL within a general-purpose language
EDSL in C++
- Relies on operator overload abuse (Expression Templates)
- Carries semantic information around code fragments
- Generic implementations become self-aware of optimizations
12 of 34
Expression Templates
matrix x(h,w), a(h,w), b(h,w);
x = cos(a) + (b*a);

expr< assign, expr<matrix&>
            , expr< plus, expr<cos, expr<matrix&>>
                  , expr<multiplies, expr<matrix&>, expr<matrix&>>>>(x, a, b);

[Figure: the meta-AST — an assign node rooting x on one side and plus(cos(a), multiplies(b,a)) on the other]

#pragma omp parallel for
for(int j=0; j<h; ++j)
{
  for(int i=0; i<w; ++i)
  {
    x(j,i) = cos(a(j,i)) + ( b(j,i) * a(j,i) );
  }
}

Arbitrary Transforms applied on the meta-AST
13 of 34
Boost.Proto
What's Boost.Proto?
- An EDSL for defining EDSLs in C++
- Generalizes Expression Templates
- Easy way to define and test EDSLs
- EDSL = some Grammar Rules + some Semantics
Boost.Proto Benefits
- Fast development process
- Compiler people are at ease: Grammar + Semantics + Code Generation process
- Easily extendable through Transforms
- Easy to handle: MPL and Fusion compatible
14 of 34
Why Boost.Proto?
Benefits
- Scalability: keeps EDSL glue code size low
- Extensibility: simplifies the design of new optimizations
- Debugging: allows for meaningful error reporting
Remaining Challenges
- Provide a generic "compilation" process
- Select the best function definition with respect to the hardware
- Allow for non-intrusive definition of new passes
15 of 34
State of the art EDSL
Usual approaches and results
- Eigen: supports SIMD arithmetic, thread-only algorithms
- uBLAS: parallelism comes from a potential back-end
What's the problem?
- Hardware considerations come after the ET code
- Retro-fitting parallelism is hard
- Need for a hardware-aware EDSL
- Back to Generative Programming
16 of 34
Principles of Generative Programming
[Figure: a Domain-Specific Application Description is fed to a Generative Component — a Translator assembling Parametric Sub-components — which produces the Concrete Application]
17 of 34
Generative Programming v2
18 of 34
Our Approach
Principles
- Segment EDSL evaluation into phases
- Each phase uses Proto transforms to advance code generation
- Hardware specifications are active elements in function overloading
- Use Generalized Tag Dispatching (GTD) to select the proper function implementation
Advantages
- Hardware can influence part of or the whole evaluation process
- Proto transform syntax is easy to use
- GTD increases opportunities for optimization
19 of 34
Generalized Tag Dispatching
Principles
- Tag dispatching uses type tags to overload functions
- Classical example: standard Iterators
- We augment the system with:
  - a tag computed from the function identifier
  - a tag computed from the hardware configuration
namespace tag { struct plus_ : elementwise_<plus_> {}; }

functor<tag::plus_> callee;
callee( some_a, some_b );
20 of 34
Generalized Tag Dispatching
auto functor<Tag,Site>::operator()(A&&... args)
{
  dispatch< hierarchy_of<Site>::type
          , hierarchy_of<Tag>::type( hierarchy_of<A>::type... )
          > callee;
  return callee( args... );
}
20 of 34
Generalized Tag Dispatching
template<class A0>
struct dispatch< tag::sse_
               , tag::plus_( simd_< real_< float_<A0> > >
                           , simd_< real_< float_<A0> > > )
               >
{
  A0 operator()(A0&& a, A0&& b)
  {
    return _mm_add_ps( a, b );
  }
};
21 of 34
Generalized Tag Dispatching
template<class S, class R, class L, class N, class A0>
struct dispatch< openmp_<S>, reduction_<R,L,N>(ast_<A0>) >
{
  typename A0::value_type operator()(A0&& ast)
  {
    auto result = functor<N,S>(as_<typename A0::value_type>());
    #pragma omp parallel
    {
      auto local = functor<N,S>(as_<typename A0::value_type>());
      #pragma omp for nowait
      for(int i=0; i<size(ast); ++i)
        functor<R,S>()( local, ast(i) );

      #pragma omp critical
      functor<L,S>()( result, local );
    }

    return result;
  }
};
21 of 34
Multipass EDSL Code Generation
Principles
- Don't walk the AST directly
- Capture where hardware introduces extension points
- Use GTD to select the proper phase implementation
Structure
- Optimize: AST to AST transforms
- Schedule: AST to Forest of Code Generators transforms
- Run: Code Generator to Code transforms
22 of 34
Multipass EDSL Code Generation
Optimize
- AST pattern matching using Proto grammars
- Replace large sequences of operations by equivalent, faster kernels
- E.g.: a+b*c to gemm or fma, x = inv(a)*b to solve(a,x,b)
Schedule
- Segment the AST into loop-compatible loop functors
- Use AST node hierarchy (and GTD) for splitting
Run
- Generate code for each AST in the scheduled forest
- Potentially generate runtime calls for runtime optimization
- Classical EDSL code generation happens here on the correct hardware
23 of 34
The new NT2 Structure
State of the code base
- Expression Template handling: 200-300 LOC
- GTD handling: 1 KLOC
- Actual features and function implementations: 10 KLOC
Evolution process
- Adding a new site: 100-500 LOC
- Adding a function: 10-100 LOC
- Time for complete new architecture support: 1 to 3 weeks
24 of 34
Some results
Support for OpenMP
- 1 build system script
- 4 new overloads for the run phase
- 1 new allocator to prevent false sharing
Support for Next Gen SIMD
- AVX: 2 weeks of work, functional
- AVX2: prototype ready, works in simulator
- LRBni: prototype ready, works in simulator
25 of 34
Talk Layout
Introduction
NT2
Expression Templates v2.0
Supporting GPUs
Conclusion
26 of 34
OpenCL in action
Principle
- Standard for manycore/multicore systems
- Uses runtime kernels and JIT compilation
Example
const char* OpenCLSource[] = {
  "__kernel void VectorAdd(__global int* c, __global int* a, __global int* b, unsigned int nb)",
  "{",
  "  // Index of the elements to add \n",
  "  unsigned int n = get_global_id(0);",
  "  // Sum the n'th element of vectors a and b and store in c \n",
  "  c[n] = 0;",
  "  for (int i=0; i<nb; i++) c[n] += (a[n]+b[n]);",
  "}"
};
27 of 34
From Proto to OpenCL
The Schedule pass
- Split the AST as usual on function hierarchy
- Replace each AST by a functor generating the equivalent OpenCL
- Compute a static hash of the tree to remember already-compiled expressions
The Run pass
- Stream data at the beginning of each kernel
- Handle generic calls and retrieval of compiled kernels
- Fire asynchronous kernel calls over the forest
28 of 34
Support for OpenCL
NT2 Example
int main()
{
  int nbpoints = 512000;
  table<float, settings( _1D )> xx(ofSize(nbpoints));
  table<float, settings( _1D )> yy(ofSize(nbpoints));

  for (int i=1; i<=nbpoints; i++)
  {
    xx(i) = -nbpoints/2 + i;
    yy(i) = 0.0;
  }
  yy = 100*sin(8*3.14*xx/512000);

  std::cout << xx(1) << " " << yy(1);
}
29 of 34
Support for OpenCL
NT2 Example
int main()
{
  int nbpoints = 512000;
  table<float, settings( _1D, nt2::gpu )> xx(ofSize(nbpoints));
  table<float, settings( _1D, nt2::gpu )> yy(ofSize(nbpoints));

  for (int i=1; i<=nbpoints; i++)
  {
    xx(i) = -nbpoints/2 + i;
    yy(i) = 0.0;
  }
  yy = 100*sin(8*3.14*xx/512000);

  std::cout << xx(1) << " " << yy(1);
}
29 of 34
Preliminary results for OpenCL
Generating some signals
for (int i=1; i<=nbpoints; i++)
{
  xx(i) = -nbpoints/2 + i;
  yy(i) = 0.0;
}

yy = 100*sin(8*3.14*xx/512000);
yy = yy + 0.0001*xx;

for (int ii=1; ii<512000; ii+=15000)
{
  addGaussian(xx, yy, (float)ii, 1000);
  yy = 1 - yy;
}

yy = 1 - yy;
yy = yy*yy;
yy = yy*cos(xx);
30 of 34
Preliminary results for OpenCL
Generating some signals
template<typename A>
void addGaussian(A &xx, A &yy, float centre, float largeur)
{
  yy = yy + ( 100000/(val(largeur)*sqrt(2*3.1415)) )
          * exp( -(xx-val(centre))*(xx-val(centre))
               / (2*val(largeur)*val(largeur)) );
}
30 of 34
Preliminary results for OpenCL
Generating some signals
CPU 1 core          520.2 ms
CPU 4 cores         141.3 ms
CPU 4 cores+SIMD     36.1 ms
GPU First Run       878.2 ms
GPU Other Runs       13.1 ms
GPU Manual OCL       12.9 ms
- Compares favorably to ViennaCL and Accelereyes
30 of 34
Talk Layout
Introduction
NT2
Expression Templates v2.0
Supporting GPUs
Conclusion
31 of 34
Let's round this up!
Parallel Computing for Scientists
- Parallel programming tools are required
- EDSLs are an efficient way to design such tools
- But hardware has to stay in the loop
Our claims
- Software Libraries built as Generic and Generative components can solve a large chunk of parallelism-related problems while being easy to use.
- Like regular languages, EDSLs need information about the hardware system
- Integrating hardware descriptions as Generic components increases tools' portability and re-targetability
32 of 34
Current and Future Works
Recent activity
- A public release of NT2 is planned "soon"
- A Matlab-to-NT2 compiler has been designed
- Prototype for single-source CUDA support in NT2
- PhD started on MPI support in NT2
What we'll be cooking in the future
- std::future is the future
- Experimental research on NT2 for embedded systems
- Toward a global generic approach to parallelism
33 of 34
Thanks for your attention