Transcript
Page 1: Kokkos implementation of Albany: a performance-portable ... sc14.supercomputing.org/.../post244s2-file2.pdf

Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy's National Nuclear Security Administration under contract DE-AC04-94AL85000.

 

Fully Implicit Solution Algorithm:
•  Finite Element Assembly
   •  ~50% CPU time
•  Linear Solve: ILU + CG
   •  ~50% CPU time
•  Problem Size, up to:
   •  1.12B Unknowns
   •  16384 Cores on Hopper

Introduction

Sandia National Laboratories
Irina Demeshko, H. Carter Edwards, Michael A. Heroux, Eric T. Phipps, Andrew G. Salinger, Roger P. Pawlowski
Sandia National Laboratories, New Mexico, 87185

Kokkos implementation of Albany: a performance-portable finite element application  

SAND2014-18787C  

Kokkos  programming  model  

Implementation algorithm:

Performance  Results*  

Ice Sheet Simulation Code (FELIX)

Problem: Modern HPC applications, Finite Element Assembly codes in particular, need to run on many different platforms, and performance portability has become a critical issue: parallel code needs to execute correctly and perform well despite variation in the architecture, operating system, and software libraries.
Approach: the Kokkos programming model from Trilinos, a C++ library that provides performance portability across diverse devices with different memory models.
Funding: ASC Exascale Codesign Project

Computed  Surface  Velocity  [m/yr]  for  Greenland  Ice  Sheet  

We are developing a performance-portable implementation of an Ice Sheet simulation code*, built with the Albany [3] application development environment.

Ø  Provides portability across manycore devices (multicore CPU, NVIDIA GPU, Intel Xeon Phi; potential: AMD Fusion)
Ø  Abstract data layout for non-trivial data structures
Ø  Uses a library approach:
    Ø  Maximize the amount of user (application/library) code that can be compiled without modification and run on these architectures
    Ø  Minimize the amount of architecture-specific knowledge that a user is required to have
Ø  Performant: portable user code performs as well as architecture-specific code

Main abstractions (illustrated in the sketch below):
Ø  Kokkos executes computational kernels in fine-grain data parallel within an Execution space.
Ø  Computational kernels operate on multidimensional arrays (Kokkos::View) residing in Memory spaces.
Ø  Kokkos provides these multidimensional arrays with polymorphic data layout, similar to the Boost.MultiArray flexible storage ordering.

Kokkos [1] is a C++ library from Trilinos [2] that provides scientific and engineering codes with an intuitive, manycore performance-portable programming model.
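As a rough illustration of these abstractions (not taken from the poster, and using present-day Kokkos lambda syntax rather than the functor style shown elsewhere here), the following minimal sketch allocates a Kokkos::View whose layout is selected for the default device and dispatches a data-parallel kernel into its execution space:

#include <Kokkos_Core.hpp>

int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);
  {
    const int N = 1000, M = 8;

    // Multidimensional array (Kokkos::View) whose data layout is chosen for the
    // default device (LayoutRight on CPUs, LayoutLeft on NVIDIA GPUs).
    Kokkos::View<double**> values("values", N, M);

    // Fine-grain data-parallel kernel dispatched into the default Execution space.
    Kokkos::parallel_for(N, KOKKOS_LAMBDA(const int i) {
      for (int j = 0; j < M; ++j)
        values(i, j) = 0.5 * i + j;
    });
    Kokkos::fence();
  }
  Kokkos::finalize();
  return 0;
}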

Project objectives:
•  Develop a robust, scalable, unstructured-grid finite element code for land ice dynamics.
•  Provide sea level rise prediction.
•  Run on new-architecture machines (hybrid systems).

*Funding Source: PISCEES SciDAC Application (BER & ASCR)

Authors: A. Salinger, I. Kalashnikova, M. Perego, R. Tuminaro [SNL], S. Price [LANL]

1)  Replace array allocations with Kokkos::Views (in Host space):
        Kokkos::View<ScalarType****, Device> jacobian("Jacobian", worksetNumCells, numQPs, numDims, numDims);

2)  Replace array access with Kokkos::Views:
        B[k][j] = jacobian(i0, i1, k, j);
        Kokkos::deep_copy(jacobian, host_jacobian);
3)  Replace functions with Functors, run in parallel on Host
4)  Call Kokkos::parallel_for<…>(…) over the number of elements in the workset
5)  Set Device to 'Cuda', 'OpenMP' or 'Threads' and run on the specified Device:
        typedef Kokkos::Cuda Device;  or
        typedef Kokkos::OpenMP Device;  or
        typedef Kokkos::Threads Device;
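A minimal sketch of steps 1, 2 and 5 under assumed names (the helper init_jacobian and the use of a host mirror are illustrative, not Albany's actual code):

#include <Kokkos_Core.hpp>

// Step 5: choose the Device; Kokkos::DefaultExecutionSpace resolves to e.g.
// Kokkos::Cuda, Kokkos::OpenMP or Kokkos::Threads depending on the build.
typedef Kokkos::DefaultExecutionSpace Device;
typedef Kokkos::View<double****, Device> JacobianView;

void init_jacobian(int worksetNumCells, int numQPs, int numDims) {
  // Step 1: allocate through a Kokkos::View instead of a raw array.
  JacobianView jacobian("Jacobian", worksetNumCells, numQPs, numDims, numDims);

  // Step 2: fill a host-side mirror with ordinary loops ...
  JacobianView::HostMirror host_jacobian = Kokkos::create_mirror_view(jacobian);
  // ... (host-side initialization of host_jacobian goes here) ...

  // ... and move the data to the Device in one call.
  Kokkos::deep_copy(jacobian, host_jacobian);
}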

Example: Compute Jacobian

Initial code:

for (int cell = 0; cell < worksetNumCells; cell++) {
  for (int qp = 0; qp < numQPs; qp++) {
    for (int row = 0; row < numDims; row++) {
      for (int col = 0; col < numDims; col++) {
        for (int node = 0; node < numNodes; node++) {
          jacobian(cell, qp, row, col) += coordVec(cell, node, row) * basisGrads(node, qp, col);
        } // node
      } // col
    } // row
  } // qp
} // cell

Kokkos Implementation:

// Parallel dispatch: one functor invocation per cell in the workset.
Kokkos::parallel_for(worksetNumCells,
    compute_jacobian<ScalarType, Device, numQPs, numDims, numNodes>(basisGrads, jacobian, coordVec));

template <typename ScalarType, class DeviceType, int numQPs_, int numDims_, int numNodes_>
class compute_jacobian {
  Array3       basisGrads_;
  Array4       jacobian_;
  Array3_const coordVec_;
public:
  typedef DeviceType device_type;

  compute_jacobian(Array3 &basisGrads, Array4 &jacobian, Array3 &coordVec)
    : basisGrads_(basisGrads), jacobian_(jacobian), coordVec_(coordVec) {}

  // Each call computes the Jacobian contributions for a single cell.
  KOKKOS_INLINE_FUNCTION
  void operator()(const std::size_t cell) const {
    for (int qp = 0; qp < numQPs_; qp++) {
      for (int row = 0; row < numDims_; row++) {
        for (int col = 0; col < numDims_; col++) {
          for (int node = 0; node < numNodes_; node++) {
            jacobian_(cell, qp, row, col) += coordVec_(cell, node, row) * basisGrads_(node, qp, col);
          } // node
        } // col
      } // row
    } // qp
  }
};

•  "Fused" means that all the Kokkos kernels are fused together, whereas in the "Separated" version there are several independent kernels.
•  ** "NVIDIA K20 GPU" and "Initial Implementation" results were obtained on Shannon; "Intel MIC" results were obtained on Compton.
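For illustration only (the actual Albany evaluator kernels are more involved, and the names "a" and "b" are made up), "Separated" versus "Fused" can be sketched as independent parallel_for launches versus a single combined launch:

#include <Kokkos_Core.hpp>

void separated_vs_fused(const int numCells) {
  Kokkos::View<double*> a("a", numCells);
  Kokkos::View<double*> b("b", numCells);

  // "Separated": each kernel is launched as an independent parallel_for.
  Kokkos::parallel_for(numCells, KOKKOS_LAMBDA(const int cell) { a(cell) = 1.0 * cell; });
  Kokkos::parallel_for(numCells, KOKKOS_LAMBDA(const int cell) { b(cell) = 2.0 * a(cell); });

  // "Fused": the same work combined into a single launch per cell,
  // avoiding the extra kernel-launch overhead and the re-read of a().
  Kokkos::parallel_for(numCells, KOKKOS_LAMBDA(const int cell) {
    const double ac = 1.0 * cell;
    a(cell) = ac;
    b(cell) = 2.0 * ac;
  });
}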

References:
1)  Edwards, H. Carter, Trott, Christian R., and Sunderland, Daniel. Kokkos: Enabling manycore performance portability through polymorphic memory access patterns. Journal of Parallel and Distributed Computing, 2014.
2)  Willenbring, James M., and Heroux, Michael Allen. Trilinos Users Guide. August 01, 2003.
3)  Salinger, Andrew G. Component-based Scientific Application Development. December 01, 2012.

Graph of Finite Element Assembly Kernels

Evaluation Environment:
Compton (Intel MIC cluster): 42 nodes:
•  Two 8-core Sandy Bridge Xeon E5-2670 @ 2.6GHz (HT activated) per node,
•  24GB (3×8GB) memory per node,
•  Two pre-production KNC (Intel MIC) coprocessors per node (57 cores each).
Shannon (NVIDIA GPU cluster): 32 nodes:
•  Two 8-core Sandy Bridge Xeon E5-2670 @ 2.6GHz (HT deactivated) per node,
•  128GB DDR3 memory per node,
•  2x NVIDIA K20x per node.

 

Finite Element Assembly flow:
Loop over the number of worksets:
•  Copy the solution vector to the Device
•  Device: Kokkos::parallel_for over the # of elements in the workset
•  Copy the residual vector to the Host
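A hypothetical sketch of this flow (ResidualFunctor, the view names, and the placeholder per-cell computation are illustrative, not Albany's actual API): each workset's solution is copied to the Device, a parallel_for runs over the workset's elements, and the residual is copied back.

#include <Kokkos_Core.hpp>

// Hypothetical per-element kernel; stands in for Albany's residual evaluators.
struct ResidualFunctor {
  Kokkos::View<const double*> solution;
  Kokkos::View<double*>       residual;

  KOKKOS_INLINE_FUNCTION
  void operator()(const int cell) const {
    residual(cell) = 2.0 * solution(cell);  // placeholder element computation
  }
};

void assemble_residual(const int numWorksets, const int worksetNumCells) {
  Kokkos::View<double*> solution_dev("solution", worksetNumCells);
  Kokkos::View<double*> residual_dev("residual", worksetNumCells);
  Kokkos::View<double*>::HostMirror solution_host = Kokkos::create_mirror_view(solution_dev);
  Kokkos::View<double*>::HostMirror residual_host = Kokkos::create_mirror_view(residual_dev);

  // Loop over the number of worksets.
  for (int ws = 0; ws < numWorksets; ++ws) {
    // ... (fill solution_host for this workset) ...
    // Copy the solution vector to the Device.
    Kokkos::deep_copy(solution_dev, solution_host);

    // Device: Kokkos::parallel_for over the # of elements in the workset.
    Kokkos::parallel_for(worksetNumCells, ResidualFunctor{solution_dev, residual_dev});

    // Copy the residual vector back to the Host.
    Kokkos::deep_copy(residual_host, residual_dev);
  }
}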

[Plot: Finite Element Assembly kernel time / # of elements (sec) vs. # of elements, log scale; series: NVIDIA K20, Intel Xeon Phi, Intel Xeon]

Albany MiniDriver test code

Albany FELIX code

Performance evaluation results for the Kokkos implementation of the Albany/FELIX code: weak scalability comparison for Serial, CUDA and Intel Xeon Phi implementations. a)-b) Performance results for Residual computations. c) Performance results for Jacobian computations. d) Total execution time for Finite Element Assembly in the FELIX code.

[Plots a)-d): time (sec) vs. # of elements per Workset; panels titled Residual, Jacobian, and Residual+Jacobian; series: CUDA, Intel Xeon Phi, Serial]
