supporting diverse parallel models in the trilinos library · 2012. 3. 8. · 18 managed by...

Supporting Diverse Parallel Models in the Trilinos Library

Chris Baker Computational Engineering and Energy Studies

Oak Ridge National Laboratory, USA

MS 42: Parallel Programming Models, Algorithms and Frameworks for Scalable Manycore Systems SIAM Parallel Processing 2012 February 15-17, Savannah, GA

Managed by UT-Battelle for the U.S. Department of Energy. SIAM PP12: Supporting Diverse Parallel Models in the Trilinos Library

Collaborators

• Oak Ridge National Laboratory
  – Ross Bartlett

• Sandia National Laboratories
  – Mike Heroux
  – Mark Hoemmen
  – Alan Williams
  – Carter Edwards

• École Polytechnique Fédérale de Lausanne
  – Radu Popescu


Dominant Scientific Library Paradigm

• Library provides a specific capability.
  – Apps can grab the data in order to expand functionality.

• In an MPI-only scenario, expansion comes via domain-specific serial kernels coded by domain specialists.
  – i.e., not doing any shared-memory programming

• With a single memory pool, data is easily shared between library and app.

• With a single target architecture, compilation is relatively simple.
  – Use any language for which you have a compiler.
  – Mechanisms exist for mixed language capability.


Enter the Hybrid Parallel Environment

• The path to exascale apparently requires addressing many-core.
  – LANL RoadRunner: Cell BE and multi-core CPUs
  – Tianhe 1A: NVIDIA GPUs and multi-core CPUs
  – “K” Computer: simply consists of nodes of 8-core CPUs
  – TACC Stampede: dual octo-core CPUs and Intel Knights Corner
  – OLCF Titan/Cray XK6: one NVIDIA GPU per 12-core AMD CPU

• Ditch the assumptions of the previous slide/paradigm.
  – We must investigate other parallel programming models.
  – We must revisit the app/library relationship.
  – We may need to consider other programming languages.
  – Portability is more challenging than it has been recently.


Numerous Considerations

• Parallel Programming Model
  – MPI-only is the status quo for a large number of codes.
    • Well-defined message passing API is an optimization target for vendors.
    • Users write serial, portable code.
  – MPI-plus is where many codes are going.
    • e.g., MPI+OpenMP, MPI+CUDA, MPI+directives
    • Explicit two-level shared/distributed hybrid.

• Programming Language
  – Programmer productivity is rooted in languages and APIs.
  – C++, Fortran, OpenCL, CUDA offer different levels of expressiveness.

• Library Extension
  – “Grab the data and run” extension model requires addressing parallelism.
  – Intrusive modification to a living library is untenable.


Challenges

• MPI-only is not enough
  – Need to port: it doesn’t work for accelerators.
  – Inefficient: it misses a lot of shared-memory benefits.

• MPI-plus can entail significant work
  – We want to minimize the number of code bases.
  – We want to minimize the effort to add a new code base.

• Language issues
  – Many APIs require a particular language.
  – Developers resent being told what language to use.

• Lib/User interface issues
  – Extending the library should not introduce serial bottlenecks.
  – Shouldn’t require users to be shared-memory API experts.


Some approaches in Stage 2 Trilinos

• Templated C++ code
  – Templating data allows more efficient use of cache and bandwidth.
  – Templating data expands capability (e.g., integer limit, complex).

• Generic shared memory parallel node
  – Kokkos provides a shared memory parallel node API.
  – Interface to numerous APIs via template metaprogramming layer.

• Hybrid programming model
  – Hybrid programming skeletons to support most common patterns.
  – Expose models for high-productivity, performance-portable apps.

• Non-intrusive modification of structures and algorithms
  – Expose the SMP node to apps; enable node-optimized kernels.


Kokkos and Tpetra Packages

• Kokkos is an API for shared-memory parallel nodes.
  – Provides parallel_for and parallel_reduce skeletons
  – Memory model addresses challenge of accelerator memory
  – Provides reference linear algebra kernels
  – Currently supports multiple shared-memory APIs:
    • ThreadPool Interface (TPI, a Trilinos pthreads package)
    • Intel Threading Building Blocks (TBB)
    • NVIDIA CUDA-capable GPUs (via Thrust)
    • OpenMP (new; implemented by Radu Popescu/EPFL)

• Tpetra is a distributed linear algebra library.
  – Heavily exploits templated C++
  – Employs hybrid (distributed + shared) parallelism via Kokkos


Programming Heterogeneous Clusters

• Kokkos handles shared memory.
• Tpetra handles communication between nodes.
  – How do we handle heterogeneous multi-core architectures?

• Multiple disjoint memories ⇒ distributed memory
  – We have significant tools built around this model.

• One MPI process per shared-memory pool.
  – Have to be even more careful with communication than before.

• A lot can be done with a two-level hybrid model.
• Templated classes differentiate node types.
• Emulate MPI: identify common patterns, provide skeletons.


Tpetra Hybrid Parallelism

• The typical Tpetra computational kernel concerns:
  1) member data structures
  2) calls to the Kokkos Node API for shared-memory programming
  3) calls to a communicator for message passing

e.g., Tpetra::Vector::norm1()

    // (1) internal class data
    Scalar *x;
    int N;

    // (2) call the Kokkos Node API (a reduction, since a value is returned)
    DotOp<Scalar> op(x);
    lcl = node.parallel_reduce( 0, N, op );

    // (3) call the Comm
    gbl = comm.reduceAll( lcl, SUM );

• Extending library functionality can be done via external input at these three junctions.
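The three junctions can be illustrated with a minimal serial sketch. Everything here is a stand-in written for illustration and not the Trilinos API: `SerialNode`, `SingleProcessComm`, `AbsSumOp` and the free function `norm1` are hypothetical names, the node runs the reduction in a plain loop, and the communicator assumes a single process so the global sum equals the local one.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical stand-in for a shared-memory node: runs the reduction serially.
struct SerialNode {
  template <class Op>
  typename Op::ReductionType parallel_reduce(int beg, int end, Op op) const {
    typename Op::ReductionType acc = op.identity();
    for (int i = beg; i < end; ++i) acc = op.reduce(acc, op.generate(i));
    return acc;
  }
};

// Hypothetical stand-in for a communicator that spans a single process.
struct SingleProcessComm {
  template <class T>
  T reduceAllSum(T local) const { return local; }  // nothing to combine
};

// Kernel (assumed name): sum of absolute values of the local entries.
template <class Scalar>
struct AbsSumOp {
  const Scalar *x;
  typedef Scalar ReductionType;
  Scalar identity() const { return Scalar(0); }
  Scalar generate(int i) const { return std::abs(x[i]); }
  Scalar reduce(Scalar a, Scalar b) const { return a + b; }
};

template <class Scalar>
Scalar norm1(const std::vector<Scalar> &localData,
             const SerialNode &node, const SingleProcessComm &comm) {
  // (1) member data: here simply the local entries
  AbsSumOp<Scalar> op = {localData.data()};
  // (2) shared-memory step: reduce over the local entries on the node
  const Scalar lcl =
      node.parallel_reduce(0, static_cast<int>(localData.size()), op);
  // (3) distributed step: combine the per-process partial sums
  return comm.reduceAllSum(lcl);
}
```

Swapping the kernel, the node, or the communicator changes one of the three steps without touching the other two, which is the extension point the slide refers to.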


Tpetra Vector Methods

• Set of stand-alone non-member methods, e.g.:
  – unary_transform<UOP>(Vector &v, UOP op)
  – binary_transform<BOP>(Vector &v1, const Vector &v2, BOP op)
  – reduce<G>(const Vector &v1, const Vector &v2, G op_glob)

• The kernel level provides maximal expressiveness, but coarser levels bring convenience.

    // single-prec dot() with double-prec accumulator via custom kernel
    result = reduce( *x, *y, myDotProductKernel<float,double>() );

    // Or a composite adaptor and standard functors
    result = reduce( *x, *y, reductionGlob<ZeroOp<double>>(
                                 std::multiplies<float>(),
                                 std::plus<double>()) );

    // Or using inline functors via C++11 lambda functions
    result = reduce( *x, *y, reductionGlob<ZeroOp<double>>(
                                 [](float x, float y) { return x*y; },
                                 [](double a, double b) { return a+b; }) );

    // Or using a convenience macro to generate all of that
    result = REDUCE2( x, y, x*y, ZeroOp<float>, std::plus<double>() );
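To make the "single-precision product, double-precision accumulator" idea concrete, here is a minimal serial sketch on `std::vector`. The function `reduce_sketch` and its parameter order are assumptions made for illustration; it is not the Tpetra `reduce` and it ignores both threading and MPI.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical serial stand-in for the slide's reduce(): pairOp produces one
// value per entry pair, reduceOp folds those values into an accumulator whose
// type Acc may be wider than the input type In.
template <class Acc, class In, class PairOp, class ReduceOp>
Acc reduce_sketch(const std::vector<In> &x, const std::vector<In> &y,
                  Acc identity, PairOp pairOp, ReduceOp reduceOp) {
  Acc acc = identity;
  for (std::size_t i = 0; i < x.size() && i < y.size(); ++i)
    acc = reduceOp(acc, static_cast<Acc>(pairOp(x[i], y[i])));
  return acc;
}
```

With this sketch, standard functors and lambdas are interchangeable, for example `reduce_sketch(x, y, 0.0, std::multiplies<float>(), std::plus<double>())` or the same call with two lambdas.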


Easy Parallel Algorithm Development

    for (k=0; k<numIters; ++k) {
      A->apply( *p, *Ap );                                      // Ap = A*p
      S pAp = REDUCE2( p, Ap, p*Ap, ZeroOp<S>, plus<S>() );     // p'*Ap
      const S alpha = rr / pAp;                                 // alpha = r'*r/p'*Ap
      BINARY_TRANSFORM( x, p, x + alpha*p );                    // x = x + alpha*p
      S rrold = rr;
      rr = BINARY_PRETRANSFORM_REDUCE( r, Ap,                   // fused kernels
                                       r - alpha*Ap,            //   r - alpha*Ap
                                       r*r, ZeroOp<S>, plus<S>() ); // sum r'*r
      const S beta = rr / rrold;                                // beta = r'*r/old(r'*r)
      BINARY_TRANSFORM( p, r, r + beta*p );                     // p = r + beta*p
    }

• Inline templated hybrid-parallel conjugate gradient.
  – Fun game: Find the MPI or threading!
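For readers who want to run the same iteration without Trilinos, the loop can be written out serially. This is a sketch under stated assumptions: a small dense symmetric positive definite matrix, a zero initial guess, and plain loops in place of the `REDUCE2` and `BINARY_TRANSFORM` macros. The names `cg_sketch`, `matvec` and `dot` are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

typedef std::vector<double> Vec;
typedef std::vector<Vec> DenseMatrix;

static double dot(const Vec &a, const Vec &b) {
  double s = 0.0;
  for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
  return s;
}

static Vec matvec(const DenseMatrix &A, const Vec &p) {
  Vec Ap(A.size(), 0.0);
  for (std::size_t i = 0; i < A.size(); ++i) Ap[i] = dot(A[i], p);
  return Ap;
}

// Unpreconditioned CG for a symmetric positive definite A, starting from x = 0.
Vec cg_sketch(const DenseMatrix &A, const Vec &b, int numIters, double tol) {
  Vec x(b.size(), 0.0), r = b, p = b;
  double rr = dot(r, r);
  for (int k = 0; k < numIters && std::sqrt(rr) > tol; ++k) {
    const Vec Ap = matvec(A, p);                       // Ap = A*p
    const double alpha = rr / dot(p, Ap);              // alpha = r'*r / p'*Ap
    for (std::size_t i = 0; i < x.size(); ++i) x[i] += alpha * p[i];
    const double rrold = rr;
    for (std::size_t i = 0; i < r.size(); ++i) r[i] -= alpha * Ap[i];
    rr = dot(r, r);                                    // new r'*r
    const double beta = rr / rrold;
    for (std::size_t i = 0; i < p.size(); ++i) p[i] = r[i] + beta * p[i];
  }
  return x;
}
```

Each inner loop here corresponds to one transform or reduction in the slide's version, where the same line would also hide the threading and the MPI reduction.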


Example: Recursive Multi-Prec. CG

    for (k=0; k<numIters; ++k) {
      A->apply(*p,*Ap);                                         // Ap = A*p
      T pAp = REDUCE2( p, Ap, p*Ap, ZeroOp<T>, plus<T>());      // p'*Ap
      const T alpha = zr / pAp;
      BINARY_TRANSFORM( x, p, x + alpha*p );                    // x = x + alpha*p
      BINARY_TRANSFORM( rold, r, r );                           // rold = r
      T rr = BINARY_PRETRANSFORM_REDUCE( r, Ap,                 // fused:
                                         r - alpha*Ap,          //   r - alpha*Ap
                                         r*r, ZeroOp<T>, plus<T>() ); // sum r'*r

      recursiveFPCG<TS::next,LO,GO,Node>(out,db_T2);            // recurse

      auto plusTT = make_pair_op<T,T>(plus<T>());
      pair<T,T> both = REDUCE3( z, r, rold,                     // fused:
                                make_pair( z*r, z*rold ),       //   z'*r, z'*r_old
                                ZeroPTT, plusTT );
      const T beta = (both.first - both.second) / zr;
      zr = both.first;
      BINARY_TRANSFORM( p, z, z + beta*p );                     // p = z + beta*p
    }
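The recursion on `TS::next` can be pictured as walking a compile-time list of scalar types, one solver instantiation per precision. The sketch below shows only that type-level mechanism; it does no numerical work. `TypeStack`, `EndOfStack`, `ScalarName` and `RecursiveSolveSketch` are hypothetical names, and built-in floating-point types stand in for `qd_real` and `dd_real`.

```cpp
#include <cassert>
#include <string>

struct EndOfStack {};

// A compile-time list of scalar types: `type` is this level, `next` the rest.
template <class T, class Next = EndOfStack>
struct TypeStack {
  typedef T type;
  typedef Next next;
};

template <class T> struct ScalarName;
template <> struct ScalarName<float>       { static const char *get() { return "float"; } };
template <> struct ScalarName<double>      { static const char *get() { return "double"; } };
template <> struct ScalarName<long double> { static const char *get() { return "long double"; } };

// General level: "solve" in precision T, delegating to the next level down.
template <class TS>
struct RecursiveSolveSketch {
  static std::string run() {
    typedef typename TS::type T;
    return std::string(ScalarName<T>::get()) + " > " +
           RecursiveSolveSketch<typename TS::next>::run();
  }
};

// Lowest level: no further recursion.
template <class T>
struct RecursiveSolveSketch<TypeStack<T, EndOfStack> > {
  static std::string run() { return ScalarName<T>::get(); }
};

typedef TypeStack<long double, TypeStack<double, TypeStack<float> > > ThreeLevels;
```

Calling `RecursiveSolveSketch<ThreeLevels>::run()` instantiates one level per type in the list, which is the same shape as one `recursiveFPCG` instantiation per precision.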


Example: Simple CG

• Problem dimension 5M
• 500 iterations
• Double precision arithmetic
• MPI + TBB parallel node
• #threads = #mpi x #tbb

• Invocation like: mpirun -np 4 ./driver.exe --machine-file=tbb4.xml

[Figure: runtime (log sec) versus total number of threads (1, 2, 4, 8, 16); series: MPI 1, MPI 2, MPI 4, MPI 8, MPI 16.]


Example: Simple CG

• Problem dimension 512K
• 125 iterations
• Quad-double precision
• MPI + TBB parallel node
• #threads = #mpi x #tbb

• Same codebase, simply instantiated on qd_real instead of double.

[Figure: runtime (log sec) versus total number of threads (1, 2, 4, 8, 16); series: MPI 1, MPI 2, MPI 4, MPI 8, MPI 16.]


Example: Recursive Multi-Prec. CG

    TBBNode initializing with numThreads == 2
    TBBNode initializing with numThreads == 2
    Running test with Node==Kokkos::TBBNode on rank 0/2
    Beginning recursiveFPCG<qd_real>
    Beginning recursiveFPCG<dd_real>
    |res|/|res_0|: 1.269903e-14
    |res|/|res_0|: 3.196573e-24
    |res|/|res_0|: 6.208795e-35
    Convergence detected!
    Leaving recursiveFPCG<dd_real> after 2 iterations.
    |res|/|res_0|: 2.704682e-32
    Beginning recursiveFPCG<dd_real>
    |res|/|res_0|: 4.531185e-09
    |res|/|res_0|: 6.341084e-20
    |res|/|res_0|: 8.326745e-31
    Convergence detected!
    Leaving recursiveFPCG<dd_real> after 2 iterations.
    |res|/|res_0|: 3.661388e-58
    Leaving recursiveFPCG<qd_real> after 2 iterations.


Example: Recursive Multi-Prec. CG

• Problem: Oberwolfach/gyro
• N=17K, nnz=1M
• qd_real / dd_real / double
• MPI + TBB parallel node
• #threads = #mpi x #tbb
• Solved to over 60 digits
• Around 99.9% of time spent in double precision computation.
• Single codebase.

[Figure: two panels, qd_real and dd_real; horizontal axis ticks 4, 8, 16; series in each panel: MPI 1, MPI 2, MPI 4, MPI 8, MPI 16.]


Problems With Generic Kernels

• Generic kernels are not always successful:
  – e.g., CRS mat-vec on GPUs is sub-optimal

• Different kernels may need different data structures.
• We want vendors and researchers to be able to substitute kernels into our library.
• Solution #1 treats the kernel as a first-class object.
  – It is also a template parameter, potentially informing the structure of the local data.

• Solution #2 allows a class to be “specialized” to a particular platform, non-intrusively.


Kernel-Agnostic Sparse Matrix

    class CrsMatrix<Scalar,Ord,Node,Matvec> {
      Comm comm;
      typename Matvec::rebind<Scalar>::type lclMatVecOp;
      typename Matvec::matrix<Scalar,Ord,Node>::type lclMatrix;
    };

    CrsMatrix::fillComplete() {
      // ... use comm to communicate non-local entries
      lclMatrix.fill( ... );
      lclMatVecOp.submitEntries( lclMatrix );
    }

    CrsMatrix::multiply(Vector x, Vector y) {
      // ... use comm to perform exchange on x
      Kokkos::Vector lclx = x.getLocalVector();
      Kokkos::Vector lcly = y.getLocalVector();
      lclMatVecOp.apply(lclx, lcly);
      // ... use comm to perform exchange on y
    }
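The slide's class is pseudocode, so a compilable miniature may help show the idea of a kernel that is a template parameter and owns its storage format. This is a sketch only: `CrsKernel` and `LocalMatrixSketch` are hypothetical names, there is no communicator, and the default kernel is a plain serial CRS mat-vec.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical local kernel: owns CRS storage and the matching mat-vec.
template <class Scalar>
struct CrsKernel {
  std::vector<std::size_t> rowPtr;
  std::vector<std::size_t> colInd;
  std::vector<Scalar> values;

  CrsKernel() : rowPtr(1, std::size_t(0)) {}

  // Append one row; cols and vals are assumed to have the same length.
  void addRow(const std::vector<std::size_t> &cols,
              const std::vector<Scalar> &vals) {
    colInd.insert(colInd.end(), cols.begin(), cols.end());
    values.insert(values.end(), vals.begin(), vals.end());
    rowPtr.push_back(colInd.size());
  }

  void apply(const std::vector<Scalar> &x, std::vector<Scalar> &y) const {
    const std::size_t numRows = rowPtr.size() - 1;
    y.assign(numRows, Scalar(0));
    for (std::size_t i = 0; i < numRows; ++i)
      for (std::size_t k = rowPtr[i]; k < rowPtr[i + 1]; ++k)
        y[i] += values[k] * x[colInd[k]];
  }
};

// The matrix class never inspects the storage; it only forwards to the kernel,
// so a different Kernel type can bring its own data layout.
template <class Scalar, class Kernel = CrsKernel<Scalar> >
class LocalMatrixSketch {
  Kernel kernel_;
public:
  void addRow(const std::vector<std::size_t> &cols,
              const std::vector<Scalar> &vals) { kernel_.addRow(cols, vals); }
  void multiply(const std::vector<Scalar> &x, std::vector<Scalar> &y) const {
    kernel_.apply(x, y);
  }
};
```

A vendor kernel would be supplied as the second template argument, provided it offers the same `addRow` and `apply` members assumed here.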


Specializations for Fine-Tuning

• Metaprogramming-based generic node is not perfect.
  – Some APIs are not amenable to this approach (e.g., OpenCL).
  – We don’t want to have to expose every kernel as we did for mat-vec.

• You could hack up the library with #ifdefs.
  – This is the main benefit of FOSS.
  – But once you touch it, you own it. And upgrades are hard.

• Template specializations provide a non-intrusive means for augmenting/modifying library capability.

    template <>
    class Tpetra::Vector<double,int,int,OpenCLNode> {
      // manual implementation for double/int under OpenCL
    };

    template <>
    class Tpetra::Vector<float,int,int,OpenCLNode> {
      // manual implementation for float/int under OpenCL
    };
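The mechanism is ordinary C++ full specialization, which can be shown in a few lines without any Trilinos types. In this sketch `VectorSketch`, `GenericNodeTag` and `TunedNodeTag` are hypothetical names; the "library" is the primary template and the "user" code is the specialization, which could live in a separate header that the library never sees.

```cpp
#include <cassert>
#include <string>

struct GenericNodeTag {};
struct TunedNodeTag {};  // hypothetical tag for a platform needing manual code

// Library side: the generic implementation, valid for any Scalar and Node.
template <class Scalar, class Node>
struct VectorSketch {
  static std::string implementation() { return "generic"; }
};

// User or vendor side: replaces the implementation for one exact combination
// without editing the library source.
template <>
struct VectorSketch<double, TunedNodeTag> {
  static std::string implementation() { return "hand-tuned double"; }
};
```

Only the exact combination `double` with `TunedNodeTag` picks up the manual version; every other instantiation still uses the generic template.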


Conclusion

• C++ templates and metaprogramming are being used in Trilinos to define a programming model that:
  – provides support for research into efficient solvers
  – allows user-authored serial code to be executed in hybrid parallel on heterogeneous platforms
  – provides non-intrusive modification/extension of the library by users, researchers and vendors.

•  The goal is to optimize programmer efficiency without significant performance sacrifices.

•  This is largely an experimental capability, deployed in only parts of the library.


Appendix


Tpetra Operator Methods

• Tpetra Reduction/Transformation Interface provides convenience methods/macros for applying user Kokkos kernels to Tpetra Vectors/MultiVectors.

    RCP< Tpetra::Map<LO,GO,Node> > domMap, rngMap, rowMap, colMap;
    RCP< Tpetra::Import<LO,GO,Node> > importer = ...;
    RCP< Tpetra::Export<LO,GO,Node> > exporter = ...;
    MyKernel<T,LO> kern(...);
    RCP< Tpetra::Operator<T,LO,GO,Node> > op;
    op = Tpetra::RTI::kernelOp<T>(kern,domMap,rngMap,importer,exporter);
    op->apply(x, y);

• Also wrappers for applying general functors.
  – e.g.: simple diagonal operator using a C++11 lambda function

    RCP< Tpetra::Map<LO,GO,Node> > map;
    RCP< Tpetra::Operator<T,LO,GO,Node> > op;
    op = Tpetra::RTI::binaryOp<T>( [](T, T x) {return 2.0 * x;} , map );
    op->apply(x, y);


Tool: Tpetra HybridPlatform

• Encapsulate main in a templated class method:

    template <class Node>
    class myMainRoutine {
    public:
      static void run(ParameterList &runParams,
                      const RCP<const Comm<int> > &comm,
                      const RCP<Node> &node) {
        // do something interesting
      }
    };

• HybridPlatform maps the communicator rank to the Node type, instantiates a node and the user routine:

    int main(...) {
      Comm<int> comm = ...
      ParameterList machine_file = ...
      // instantiate appropriate node and myMainRoutine
      Tpetra::HybridPlatform platform( comm , machine_file );
      platform.runUserCode< myMainRoutine >();
      return 0;
    }
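The key trick is turning a runtime choice (a node name read from a file) into a compile-time instantiation of the user's template. A minimal sketch of that dispatch follows; it is not the HybridPlatform implementation. `SerialNodeSketch`, `ThreadedNodeSketch`, `MyMainRoutineSketch` and the free function `runUserCode` are hypothetical, and the user routine returns a string only so the result can be checked.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Two hypothetical node types the platform knows about.
struct SerialNodeSketch   { static const char *name() { return "serial"; } };
struct ThreadedNodeSketch { static const char *name() { return "threaded"; } };

// User code, written once and templated on the node type.
template <class Node>
struct MyMainRoutineSketch {
  static std::string run(int rank) {
    return std::string(Node::name()) + " on rank " + std::to_string(rank);
  }
};

// Maps a runtime node name to the matching instantiation of the user template.
// Every supported node type is instantiated at compile time; exactly one
// branch runs on a given rank.
template <template <class> class UserCode>
std::string runUserCode(const std::string &nodeType, int rank) {
  if (nodeType == "serial")   return UserCode<SerialNodeSketch>::run(rank);
  if (nodeType == "threaded") return UserCode<ThreadedNodeSketch>::run(rank);
  throw std::invalid_argument("unknown node type: " + nodeType);
}
```

The template template parameter is what lets the caller pass `MyMainRoutineSketch` itself, uninstantiated, in the same way the slide passes `myMainRoutine` to `runUserCode`.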


HybridPlatform Machine File

    <ParameterList>
      <ParameterList name="%2=0">
        <Parameter name="NodeType" type="string" value="Kokkos::ThrustGPUNode"/>
        <Parameter name="Verbose" type="int" value="1"/>
        <Parameter name="Device Number" type="int" value="0"/>
        <Parameter name="Node Weight" type="int" value="4"/>
      </ParameterList>
      <ParameterList name="%2=1">
        <Parameter name="NodeType" type="string" value="Kokkos::TPINode"/>
        <Parameter name="Verbose" type="int" value="1"/>
        <Parameter name="Num Threads" type="int" value="15"/>
        <Parameter name="Node Weight" type="int" value="15"/>
      </ParameterList>
    </ParameterList>

[Diagram: hostname0 runs rank 0 (ThrustGPUNode) and rank 1 (TPINode); hostname1 runs rank 2 (ThrustGPUNode) and rank 3 (TPINode); and so on.]

Section name syntax:

    round-robin assignment    %M=N
    interval assignment       [M,N]
    explicit assignment       =N
    default                   default
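The four section-name forms can be read as predicates on the MPI rank. The sketch below encodes one plausible reading and is not taken from the HybridPlatform source: `%M=N` is assumed to mean `rank % M == N`, `[M,N]` an inclusive interval, `=N` an exact match, and `default` a catch-all. The function name `ruleMatchesRank` is hypothetical, and the order in which competing sections are tried is left out.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Returns true if the machine-file section name `rule` selects `rank`,
// under the assumed meanings described above.
bool ruleMatchesRank(const std::string &rule, int rank) {
  int m = 0, n = 0;
  char tail = '\0';  // catches trailing characters after a recognized pattern
  if (std::sscanf(rule.c_str(), "%%%d=%d%c", &m, &n, &tail) == 2)  // %M=N
    return m > 0 && rank % m == n;
  if (std::sscanf(rule.c_str(), "[%d,%d]%c", &m, &n, &tail) == 2)  // [M,N]
    return m <= rank && rank <= n;
  if (std::sscanf(rule.c_str(), "=%d%c", &n, &tail) == 1)          // =N
    return rank == n;
  return rule == "default";
}
```

Under this reading, the machine file above sends even ranks to the GPU node section (`%2=0`) and odd ranks to the TPI node section (`%2=1`), matching the diagram.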


Refresher: Kokkos Parallel Constructs

• Parallel for: execute loop iterations in parallel
• User-defined struct (work-data pair) contains:
  – the necessary data and execute(int iter)

• Parallel reduce: reduce implicit set of elements in parallel via user-specified associative binary operation
  – typedef ReductionType
  – ReductionType identity()
  – ReductionType generate(int i)
  – ReductionType reduce(ReductionType a, ReductionType b)

• Template meta-programming fuses generic loop skeleton with user data and kernel specifications.

    Node::parallel_for   <WDP>(int beg, int end, WDP args);
    Node::parallel_reduce<WDP>(int beg, int end, WDP args);


Kokkos parallel_for example

• Consider simple vector axpy: $y = \alpha x + y$

    template <class Scalar>
    struct AxpyOp {
      Scalar alpha;
      const Scalar *x;
      Scalar *y;
      inline void execute(int i) { y[i] += alpha * x[i]; }
    };

    AxpyOp<double> daxpy( ... );
    Node::parallel_for(0,N,daxpy);
    AxpyOp<complex<float> > caxpy( ... );
    Node::parallel_for(0,N,caxpy);


Kokkos parallel_reduce example

• Consider real-valued vector inner product: $\alpha = x^{T} y$

    template <class Scalar>
    struct DotOp {
      const Scalar *x, *y;
      typedef Scalar ReductionType;
      Scalar identity()                 { return 0;         }
      Scalar generate(int i)            { return x[i]*y[i]; }
      Scalar reduce(Scalar a, Scalar b) { return a+b;       }
    };

    DotOp<float> fdot( ... );
    float f = Node::parallel_reduce(0,N,fdot);
    DotOp<qd_real> qddot( ... );
    qd_real q = Node::parallel_reduce(0,N,qddot);
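The two kernels from these slides can be exercised against a stand-in node to see why the reduction needs an identity and an associative `reduce`. In this sketch `ChunkedSerialNode` is a hypothetical name and not a Kokkos node: it splits the range into fixed-size chunks, as a threaded backend might hand them to workers, but processes them one after another. `AxpyOp` and `DotOp` are copied from the slides.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical stand-in node: chunked like a threaded backend, but serial.
struct ChunkedSerialNode {
  template <class WDP>
  static void parallel_for(int beg, int end, WDP wd) {
    for (int i = beg; i < end; ++i) wd.execute(i);
  }

  template <class WDP>
  static typename WDP::ReductionType parallel_reduce(int beg, int end, WDP wd) {
    const int chunk = 4;  // arbitrary chunk size for illustration
    typename WDP::ReductionType total = wd.identity();
    for (int lo = beg; lo < end; lo += chunk) {
      // Each chunk starts from the identity, as an independent worker would.
      typename WDP::ReductionType partial = wd.identity();
      const int hi = std::min(lo + chunk, end);
      for (int i = lo; i < hi; ++i) partial = wd.reduce(partial, wd.generate(i));
      // Combining partial results is only valid if reduce is associative.
      total = wd.reduce(total, partial);
    }
    return total;
  }
};

template <class Scalar>
struct AxpyOp {
  Scalar alpha;
  const Scalar *x;
  Scalar *y;
  inline void execute(int i) { y[i] += alpha * x[i]; }
};

template <class Scalar>
struct DotOp {
  const Scalar *x, *y;
  typedef Scalar ReductionType;
  Scalar identity()                 { return 0;         }
  Scalar generate(int i)            { return x[i]*y[i]; }
  Scalar reduce(Scalar a, Scalar b) { return a+b;       }
};
```

Because the kernels only expose `execute`, `identity`, `generate` and `reduce`, the same structs could be handed to any node that implements the two skeletons.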


Some Ugly Details

• Host compiler: implicit instantiation handles coupling
  – important to use inline/static whenever possible

• Device compiler (nvcc): need explicit instantiation
  1. put explicit instantiations in .cu file:

    #include "Kokkos_ThrustGPUNode.cuh"   // Node routines, in CUDA
    #include "TestOps.hpp"                // Kernels, in C
    template void Kokkos::ThrustGPUNode::parallel_for<InitOp<int> >(int, int, InitOp<int>);

  2. compile via nvcc:

    prompt> nvcc -c -o libkernels_cuda.a exp_inst_cuda_kernels.cu

  – nvcc supports templates and template meta-programming :)
  – OpenCL does not (yet?) :(
