kokkos: the c++ performance portability programming...
TRANSCRIPT
![Page 1: Kokkos: The C++ Performance Portability Programming Modelon-demand.gputechconf.com/gtc/2017/presentation/s7344-christian … · Christian Trott (crtrott@sandia.gov), Carter Edwards](https://reader034.vdocuments.us/reader034/viewer/2022042123/5e9d9fe6537793769f5c1f4f/html5/thumbnails/1.jpg)
Sandia National Laboratories is a multimission laboratory managed and operated by National Technology and Engineering Solutions of Sandia, LLC., a wholly owned subsidiary of Honeywell International, Inc., for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-NA-0003525.
Kokkos:TheC++PerformancePortabilityProgrammingModel
Christian Trott ([email protected]), CarterEdwardsD.Sunderland, N.Ellingwood,D.Ibanez,S.Hammond, S.Rajamanickam,K.Kim,M.Deveci,M.Hoemmen,G.
Center forComputingResearch,SandiaNationalLaboratories,NMSAND2017-4935C
![Page 2: Kokkos: The C++ Performance Portability Programming Modelon-demand.gputechconf.com/gtc/2017/presentation/s7344-christian … · Christian Trott (crtrott@sandia.gov), Carter Edwards](https://reader034.vdocuments.us/reader034/viewer/2022042123/5e9d9fe6537793769f5c1f4f/html5/thumbnails/2.jpg)
NewProgrammingModels
§ HPCisataCrossroads§ DiversifyingHardwareArchitectures§ MoreparallelismnecessitatesparadigmshiftfromMPI-only
§ NeedforNewProgrammingModels§ PerformancePortability:OpenMP 4.5,OpenACC,Kokkos,RAJA,SyCL,
C++20?,…§ ResilienceandLoadBalancing:Legion,HPX,UPC++,...
§ Vendordecouplingdrivesexternaldevelopment
2
![Page 3: Kokkos: The C++ Performance Portability Programming Modelon-demand.gputechconf.com/gtc/2017/presentation/s7344-christian … · Christian Trott (crtrott@sandia.gov), Carter Edwards](https://reader034.vdocuments.us/reader034/viewer/2022042123/5e9d9fe6537793769f5c1f4f/html5/thumbnails/3.jpg)
NewProgrammingModels
§ HPCisataCrossroads§ DiversifyingHardwareArchitectures§ MoreparallelismnecessitatesparadigmshiftfromMPI-only
§ NeedforNewProgrammingModels§ PerformancePortability:OpenMP 4.5,OpenACC,Kokkos,RAJA,SyCL,
C++20?,…§ ResilienceandLoadBalancing:Legion,HPX,UPC++,...
§ Vendordecouplingdrivesexternaldevelopment
3
WhatisKokkos?Whatisnew?
Whyshouldyoutrustus?
![Page 4: Kokkos: The C++ Performance Portability Programming Modelon-demand.gputechconf.com/gtc/2017/presentation/s7344-christian … · Christian Trott (crtrott@sandia.gov), Carter Edwards](https://reader034.vdocuments.us/reader034/viewer/2022042123/5e9d9fe6537793769f5c1f4f/html5/thumbnails/4.jpg)
Kokkos:Performance,PortabilityandProductivity
DDR#
HBM#
DDR#
HBM#
DDR#DDR#
DDR#
HBM#HBM#
Kokkos#
LAMMPS# Sierra# Albany#Trilinos#
https://github.com/kokkos
![Page 5: Kokkos: The C++ Performance Portability Programming Modelon-demand.gputechconf.com/gtc/2017/presentation/s7344-christian … · Christian Trott (crtrott@sandia.gov), Carter Edwards](https://reader034.vdocuments.us/reader034/viewer/2022042123/5e9d9fe6537793769f5c1f4f/html5/thumbnails/5.jpg)
PerformancePortabilitythroughAbstraction
Kokkos
Execution Spaces (“Where”)
Execution Patterns (“What”)
Execution Policies (“How”)
- N-Level- Support Heterogeneous Execution
- parallel_for/reduce/scan, task spawn- Enable nesting
- Range, Team, Task-Dag- Dynamic / Static Scheduling- Support non-persistent scratch-pads
Memory Spaces (“Where”)
Memory Layouts (“How”)
Memory Traits
- Multiple-Levels- Logical Space (think UVM vs explicit)
- Architecture dependent index-maps- Also needed for subviews
- Access Intent: Stream, Random, …- Access Behavior: Atomic- Enables special load paths: i.e. texture
Parallel ExecutionData Structures
Separating of Concerns for Future Systems…
![Page 6: Kokkos: The C++ Performance Portability Programming Modelon-demand.gputechconf.com/gtc/2017/presentation/s7344-christian … · Christian Trott (crtrott@sandia.gov), Carter Edwards](https://reader034.vdocuments.us/reader034/viewer/2022042123/5e9d9fe6537793769f5c1f4f/html5/thumbnails/6.jpg)
CapabilityMatrix
6
Implementation Technique
Parallel Loops
Parallel R
eduction
Tightly N
ested L
oops
Non-tightly
Nested L
oops
Task Parallelism
Data A
llocations
Data Transfers
Advanced D
ata A
bstractions
Kokkos C++ Abstraction X X X X X X X XOpenMP Directives X X X X X X X -OpenACC Directives X X X X - X X -CUDA Extension (X) - (X) X - X X -OpenCL Extension (X) - (X) X - X X -C++AMP Extension X - X - - X X -Raja C++ Abstraction X X X (X) - - - -TBB C++ Abstraction X X X X X X - -C++17 Language X - - - (X) X (X) (X)Fortran2008 Language X - - - - X (X) -
![Page 7: Kokkos: The C++ Performance Portability Programming Modelon-demand.gputechconf.com/gtc/2017/presentation/s7344-christian … · Christian Trott (crtrott@sandia.gov), Carter Edwards](https://reader034.vdocuments.us/reader034/viewer/2022042123/5e9d9fe6537793769f5c1f4f/html5/thumbnails/7.jpg)
Example:Conjugent GradientSolver
§ SimpleIterativeLinearSolver§ ForexampleusedinMiniFE§ Usesonlythreemathoperations:
§ Vectoraddition(AXPBY)§ Dotproduct(DOT)§ SparseMatrixVectormultiply(SPMV)
§ DatamanagementwithKokkosViews:
7
View<double*,HostSpace,MemoryTraits<Unmanaged> >h_x(x_in, nrows);
View<double*> x("x",nrows);deep_copy(x,h_x);
![Page 8: Kokkos: The C++ Performance Portability Programming Modelon-demand.gputechconf.com/gtc/2017/presentation/s7344-christian … · Christian Trott (crtrott@sandia.gov), Carter Edwards](https://reader034.vdocuments.us/reader034/viewer/2022042123/5e9d9fe6537793769f5c1f4f/html5/thumbnails/8.jpg)
CGSolve:TheAXPBY
8
void axpby(int n, View<double*> z, double alpha, View<const double*> x,double beta, View<const double*> y) {
parallel_for("AXpBY", n, KOKKOS_LAMBDA ( const int& i) {z(i) = alpha*x(i) + beta*y(i);
});}
§ Simpledataparallelloop:Kokkos::parallel_for§ Easytoexpressinmostprogrammingmodels§ Bandwidthbound§ SerialImplementation:
§ KokkosImplementation:
void axpby(int n, double* z, double alpha, const double* x, double beta, const double* y) {
for(int i=0; i<n; i++)z[i] = alpha*x[i] + beta*y[i];
}
![Page 9: Kokkos: The C++ Performance Portability Programming Modelon-demand.gputechconf.com/gtc/2017/presentation/s7344-christian … · Christian Trott (crtrott@sandia.gov), Carter Edwards](https://reader034.vdocuments.us/reader034/viewer/2022042123/5e9d9fe6537793769f5c1f4f/html5/thumbnails/9.jpg)
CGSolve:TheAXPBY
9
void axpby(int n, View<double*> z, double alpha, View<const double*> x,double beta, View<const double*> y) {
parallel_for("AXpBY", n, KOKKOS_LAMBDA ( const int& i) {z(i) = alpha*x(i) + beta*y(i);
});}
§ Simpledataparallelloop:Kokkos::parallel_for§ Easytoexpressinmostprogrammingmodels§ Bandwidthbound§ SerialImplementation:
§ KokkosImplementation:
void axpby(int n, double* z, double alpha, const double* x, double beta, const double* y) {
for(int i=0; i<n; i++)z[i] = alpha*x[i] + beta*y[i];
}
Parallel Pattern: for loop
![Page 10: Kokkos: The C++ Performance Portability Programming Modelon-demand.gputechconf.com/gtc/2017/presentation/s7344-christian … · Christian Trott (crtrott@sandia.gov), Carter Edwards](https://reader034.vdocuments.us/reader034/viewer/2022042123/5e9d9fe6537793769f5c1f4f/html5/thumbnails/10.jpg)
CGSolve:TheAXPBY
10
void axpby(int n, View<double*> z, double alpha, View<const double*> x,double beta, View<const double*> y) {
parallel_for("AXpBY", n, KOKKOS_LAMBDA ( const int& i) {z(i) = alpha*x(i) + beta*y(i);
});}
§ Simpledataparallelloop:Kokkos::parallel_for§ Easytoexpressinmostprogrammingmodels§ Bandwidthbound§ SerialImplementation:
§ KokkosImplementation:
void axpby(int n, double* z, double alpha, const double* x, double beta, const double* y) {
for(int i=0; i<n; i++)z[i] = alpha*x[i] + beta*y[i];
}
String Label: Profiling/Debugging
![Page 11: Kokkos: The C++ Performance Portability Programming Modelon-demand.gputechconf.com/gtc/2017/presentation/s7344-christian … · Christian Trott (crtrott@sandia.gov), Carter Edwards](https://reader034.vdocuments.us/reader034/viewer/2022042123/5e9d9fe6537793769f5c1f4f/html5/thumbnails/11.jpg)
CGSolve:TheAXPBY
11
void axpby(int n, View<double*> z, double alpha, View<const double*> x,double beta, View<const double*> y) {
parallel_for("AXpBY", n, KOKKOS_LAMBDA ( const int& i) {z(i) = alpha*x(i) + beta*y(i);
});}
§ Simpledataparallelloop:Kokkos::parallel_for§ Easytoexpressinmostprogrammingmodels§ Bandwidthbound§ SerialImplementation:
§ KokkosImplementation:
void axpby(int n, double* z, double alpha, const double* x, double beta, const double* y) {
for(int i=0; i<n; i++)z[i] = alpha*x[i] + beta*y[i];
}
Execution Policy: do n iterations
![Page 12: Kokkos: The C++ Performance Portability Programming Modelon-demand.gputechconf.com/gtc/2017/presentation/s7344-christian … · Christian Trott (crtrott@sandia.gov), Carter Edwards](https://reader034.vdocuments.us/reader034/viewer/2022042123/5e9d9fe6537793769f5c1f4f/html5/thumbnails/12.jpg)
CGSolve:TheAXPBY
12
void axpby(int n, View<double*> z, double alpha, View<const double*> x,double beta, View<const double*> y) {
parallel_for("AXpBY", n, KOKKOS_LAMBDA ( const int& i) {z(i) = alpha*x(i) + beta*y(i);
});}
§ Simpledataparallelloop:Kokkos::parallel_for§ Easytoexpressinmostprogrammingmodels§ Bandwidthbound§ SerialImplementation:
§ KokkosImplementation:
void axpby(int n, double* z, double alpha, const double* x, double beta, const double* y) {
for(int i=0; i<n; i++)z[i] = alpha*x[i] + beta*y[i];
}
Iteration handle: integer index
![Page 13: Kokkos: The C++ Performance Portability Programming Modelon-demand.gputechconf.com/gtc/2017/presentation/s7344-christian … · Christian Trott (crtrott@sandia.gov), Carter Edwards](https://reader034.vdocuments.us/reader034/viewer/2022042123/5e9d9fe6537793769f5c1f4f/html5/thumbnails/13.jpg)
CGSolve:TheAXPBY
13
void axpby(int n, View<double*> z, double alpha, View<const double*> x,double beta, View<const double*> y) {
parallel_for("AXpBY", n, KOKKOS_LAMBDA ( const int& i) {z(i) = alpha*x(i) + beta*y(i);
});}
§ Simpledataparallelloop:Kokkos::parallel_for§ Easytoexpressinmostprogrammingmodels§ Bandwidthbound§ SerialImplementation:
§ KokkosImplementation:
void axpby(int n, double* z, double alpha, const double* x, double beta, const double* y) {
for(int i=0; i<n; i++)z[i] = alpha*x[i] + beta*y[i];
}
Loop Body
![Page 14: Kokkos: The C++ Performance Portability Programming Modelon-demand.gputechconf.com/gtc/2017/presentation/s7344-christian … · Christian Trott (crtrott@sandia.gov), Carter Edwards](https://reader034.vdocuments.us/reader034/viewer/2022042123/5e9d9fe6537793769f5c1f4f/html5/thumbnails/14.jpg)
CGSolve:TheAXPBY
14
void axpby(int n, View<double*> z, double alpha, View<const double*> x,double beta, View<const double*> y) {
parallel_for("AXpBY", n, KOKKOS_LAMBDA ( const int& i) {z(i) = alpha*x(i) + beta*y(i);
});}
§ Simpledataparallelloop:Kokkos::parallel_for§ Easytoexpressinmostprogrammingmodels§ Bandwidthbound§ SerialImplementation:
§ KokkosImplementation:
void axpby(int n, double* z, double alpha, const double* x, double beta, const double* y) {
for(int i=0; i<n; i++)z[i] = alpha*x[i] + beta*y[i];
}
![Page 15: Kokkos: The C++ Performance Portability Programming Modelon-demand.gputechconf.com/gtc/2017/presentation/s7344-christian … · Christian Trott (crtrott@sandia.gov), Carter Edwards](https://reader034.vdocuments.us/reader034/viewer/2022042123/5e9d9fe6537793769f5c1f4f/html5/thumbnails/15.jpg)
CGSolve:TheDotProduct
15
double dot(int n, View<const double*> x, View<const double*> y) {double x_dot_y = 0.0;parallel_reduce("Dot",n, KOKKOS_LAMBDA (const int& i,double& sum) {
sum += x[i]*y[i];}, x_dot_y);return x_dot_y;
}
§ Simpledataparallelloopwithreduction:Kokkos::parallel_reduce§ NontrivialinCUDAduetolackofbuilt-inreductionsupport§ Bandwidthbound§ SerialImplementation:
§ KokkosImplementation:
double dot(int n, const double* x, const double* y) {double sum = 0.0;for(int i=0; i<n; i++)
sum += x[i]*y[i];return sum;
}
![Page 16: Kokkos: The C++ Performance Portability Programming Modelon-demand.gputechconf.com/gtc/2017/presentation/s7344-christian … · Christian Trott (crtrott@sandia.gov), Carter Edwards](https://reader034.vdocuments.us/reader034/viewer/2022042123/5e9d9fe6537793769f5c1f4f/html5/thumbnails/16.jpg)
CGSolve:TheDotProduct
16
double dot(int n, View<const double*> x, View<const double*> y) {double x_dot_y = 0.0;parallel_reduce("Dot",n, KOKKOS_LAMBDA (const int& i,double& sum) {
sum += x[i]*y[i];}, x_dot_y);return x_dot_y;
}
§ Simpledataparallelloopwithreduction:Kokkos::parallel_reduce§ NontrivialinCUDAduetolackofbuilt-inreductionsupport§ Bandwidthbound§ SerialImplementation:
§ KokkosImplementation:
double dot(int n, const double* x, const double* y) {double sum = 0.0;for(int i=0; i<n; i++)
sum += x[i]*y[i];return sum;
} Parallel Pattern: loop with reduction
![Page 17: Kokkos: The C++ Performance Portability Programming Modelon-demand.gputechconf.com/gtc/2017/presentation/s7344-christian … · Christian Trott (crtrott@sandia.gov), Carter Edwards](https://reader034.vdocuments.us/reader034/viewer/2022042123/5e9d9fe6537793769f5c1f4f/html5/thumbnails/17.jpg)
CGSolve:TheDotProduct
17
double dot(int n, View<const double*> x, View<const double*> y) {double x_dot_y = 0.0;parallel_reduce("Dot",n, KOKKOS_LAMBDA (const int& i,double& sum) {
sum += x[i]*y[i];}, x_dot_y);return x_dot_y;
}
§ Simpledataparallelloopwithreduction:Kokkos::parallel_reduce§ NontrivialinCUDAduetolackofbuilt-inreductionsupport§ Bandwidthbound§ SerialImplementation:
§ KokkosImplementation:
double dot(int n, const double* x, const double* y) {double sum = 0.0;for(int i=0; i<n; i++)
sum += x[i]*y[i];return sum;
} Iteration Index + Thread-Local Red. Varible
![Page 18: Kokkos: The C++ Performance Portability Programming Modelon-demand.gputechconf.com/gtc/2017/presentation/s7344-christian … · Christian Trott (crtrott@sandia.gov), Carter Edwards](https://reader034.vdocuments.us/reader034/viewer/2022042123/5e9d9fe6537793769f5c1f4f/html5/thumbnails/18.jpg)
CGSolve:TheDotProduct
18
double dot(int n, View<const double*> x, View<const double*> y) {double x_dot_y = 0.0;parallel_reduce("Dot",n, KOKKOS_LAMBDA (const int& i,double& sum) {
sum += x[i]*y[i];}, x_dot_y);return x_dot_y;
}
§ Simpledataparallelloopwithreduction:Kokkos::parallel_reduce§ NontrivialinCUDAduetolackofbuilt-inreductionsupport§ Bandwidthbound§ SerialImplementation:
§ KokkosImplementation:
double dot(int n, const double* x, const double* y) {double sum = 0.0;for(int i=0; i<n; i++)
sum += x[i]*y[i];return sum;
}
![Page 19: Kokkos: The C++ Performance Portability Programming Modelon-demand.gputechconf.com/gtc/2017/presentation/s7344-christian … · Christian Trott (crtrott@sandia.gov), Carter Edwards](https://reader034.vdocuments.us/reader034/viewer/2022042123/5e9d9fe6537793769f5c1f4f/html5/thumbnails/19.jpg)
CGSolve:TheSPMV
§ Loopoverrows§ Dotproductofmatrixrowwithavector§ ExampleofNon-Tightlynestedloops§ Randomaccessonthevector(TexturefetchonGPUs)
19
void SPMV(int nrows, const int* A_row_offsets, const int* A_cols, const double* A_vals, double* y, const double* x) {
for(int row=0; row<nrows; ++row) {double sum = 0.0;int row_start=A_row_offsets[row];int row_end=A_row_offsets[row+1];for(int i=row_start; i<row_end; ++i) {sum += A_vals[i]*x[A_cols[i]];
}y[row] = sum;
}}
![Page 20: Kokkos: The C++ Performance Portability Programming Modelon-demand.gputechconf.com/gtc/2017/presentation/s7344-christian … · Christian Trott (crtrott@sandia.gov), Carter Edwards](https://reader034.vdocuments.us/reader034/viewer/2022042123/5e9d9fe6537793769f5c1f4f/html5/thumbnails/20.jpg)
CGSolve:TheSPMV
§ Loopoverrows§ Dotproductofmatrixrowwithavector§ ExampleofNon-Tightlynestedloops§ Randomaccessonthevector(TexturefetchonGPUs)
20
void SPMV(int nrows, const int* A_row_offsets, const int* A_cols, const double* A_vals, double* y, const double* x) {
for(int row=0; row<nrows; ++row) {double sum = 0.0;int row_start=A_row_offsets[row];int row_end=A_row_offsets[row+1];for(int i=row_start; i<row_end; ++i) {sum += A_vals[i]*x[A_cols[i]];
}y[row] = sum;
}}
Outer loop over matrix rows
![Page 21: Kokkos: The C++ Performance Portability Programming Modelon-demand.gputechconf.com/gtc/2017/presentation/s7344-christian … · Christian Trott (crtrott@sandia.gov), Carter Edwards](https://reader034.vdocuments.us/reader034/viewer/2022042123/5e9d9fe6537793769f5c1f4f/html5/thumbnails/21.jpg)
CGSolve:TheSPMV
§ Loopoverrows§ Dotproductofmatrixrowwithavector§ ExampleofNon-Tightlynestedloops§ Randomaccessonthevector(TexturefetchonGPUs)
21
void SPMV(int nrows, const int* A_row_offsets, const int* A_cols, const double* A_vals, double* y, const double* x) {
for(int row=0; row<nrows; ++row) {double sum = 0.0;int row_start=A_row_offsets[row];int row_end=A_row_offsets[row+1];for(int i=row_start; i<row_end; ++i) {sum += A_vals[i]*x[A_cols[i]];
}y[row] = sum;
}}
Inner dot product row x vector
![Page 22: Kokkos: The C++ Performance Portability Programming Modelon-demand.gputechconf.com/gtc/2017/presentation/s7344-christian … · Christian Trott (crtrott@sandia.gov), Carter Edwards](https://reader034.vdocuments.us/reader034/viewer/2022042123/5e9d9fe6537793769f5c1f4f/html5/thumbnails/22.jpg)
CGSolve:TheSPMV
§ Loopoverrows§ Dotproductofmatrixrowwithavector§ ExampleofNon-Tightlynestedloops§ Randomaccessonthevector(TexturefetchonGPUs)
22
void SPMV(int nrows, const int* A_row_offsets, const int* A_cols, const double* A_vals, double* y, const double* x) {
for(int row=0; row<nrows; ++row) {double sum = 0.0;int row_start=A_row_offsets[row];int row_end=A_row_offsets[row+1];for(int i=row_start; i<row_end; ++i) {sum += A_vals[i]*x[A_cols[i]];
}y[row] = sum;
}}
![Page 23: Kokkos: The C++ Performance Portability Programming Modelon-demand.gputechconf.com/gtc/2017/presentation/s7344-christian … · Christian Trott (crtrott@sandia.gov), Carter Edwards](https://reader034.vdocuments.us/reader034/viewer/2022042123/5e9d9fe6537793769f5c1f4f/html5/thumbnails/23.jpg)
CGSolve:TheSPMV
23
void SPMV(int nrows, View<const int*> A_row_offsets, View<const int*> A_cols, View<const double*> A_vals, View<double*> y,View<const double*, MemoryTraits< RandomAccess>> x) {
#ifdef KOKKOS_ENABLE_CUDAint rows_per_team = 64;int team_size = 64;
#elseint rows_per_team = 512;int team_size = 1;
#endif
parallel_for("SPMV:Hierarchy", TeamPolicy< Schedule< Static > >((nrows+rows_per_team-1)/rows_per_team,team_size,8),
KOKKOS_LAMBDA (const TeamPolicy<>::member_type& team) {
const int first_row = team.league_rank()*rows_per_team;const int last_row = first_row+rows_per_team<nrows? first_row+rows_per_team : nrows;
parallel_for(TeamThreadRange(team,first_row,last_row),[&] (const int row) {const int row_start=A_row_offsets[row];const int row_length=A_row_offsets[row+1]-row_start;
double y_row;parallel_reduce(ThreadVectorRange(team,row_length),[=] (const int i,double& sum) {sum += A_vals(i+row_start)*x(A_cols(i+row_start));
} , y_row);y(row) = y_row;
});});
}
![Page 24: Kokkos: The C++ Performance Portability Programming Modelon-demand.gputechconf.com/gtc/2017/presentation/s7344-christian … · Christian Trott (crtrott@sandia.gov), Carter Edwards](https://reader034.vdocuments.us/reader034/viewer/2022042123/5e9d9fe6537793769f5c1f4f/html5/thumbnails/24.jpg)
CGSolve:TheSPMV
24
void SPMV(int nrows, View<const int*> A_row_offsets, View<const int*> A_cols, View<const double*> A_vals, View<double*> y,View<const double*, MemoryTraits< RandomAccess>> x) {
#ifdef KOKKOS_ENABLE_CUDAint rows_per_team = 64;int team_size = 64;
#elseint rows_per_team = 512;int team_size = 1;
#endif
parallel_for("SPMV:Hierarchy", TeamPolicy< Schedule< Static > >((nrows+rows_per_team-1)/rows_per_team,team_size,8),
KOKKOS_LAMBDA (const TeamPolicy<>::member_type& team) {
const int first_row = team.league_rank()*rows_per_team;const int last_row = first_row+rows_per_team<nrows? first_row+rows_per_team : nrows;
parallel_for(TeamThreadRange(team,first_row,last_row),[&] (const int row) {const int row_start=A_row_offsets[row];const int row_length=A_row_offsets[row+1]-row_start;
double y_row;parallel_reduce(ThreadVectorRange(team,row_length),[=] (const int i,double& sum) {sum += A_vals(i+row_start)*x(A_cols(i+row_start));
} , y_row);y(row) = y_row;
});});
}Team Parallelism over Row Worksets
![Page 25: Kokkos: The C++ Performance Portability Programming Modelon-demand.gputechconf.com/gtc/2017/presentation/s7344-christian … · Christian Trott (crtrott@sandia.gov), Carter Edwards](https://reader034.vdocuments.us/reader034/viewer/2022042123/5e9d9fe6537793769f5c1f4f/html5/thumbnails/25.jpg)
CGSolve:TheSPMV
25
void SPMV(int nrows, View<const int*> A_row_offsets, View<const int*> A_cols, View<const double*> A_vals, View<double*> y,View<const double*, MemoryTraits< RandomAccess>> x) {
#ifdef KOKKOS_ENABLE_CUDAint rows_per_team = 64;int team_size = 64;
#elseint rows_per_team = 512;int team_size = 1;
#endif
parallel_for("SPMV:Hierarchy", TeamPolicy< Schedule< Static > >((nrows+rows_per_team-1)/rows_per_team,team_size,8),
KOKKOS_LAMBDA (const TeamPolicy<>::member_type& team) {
const int first_row = team.league_rank()*rows_per_team;const int last_row = first_row+rows_per_team<nrows? first_row+rows_per_team : nrows;
parallel_for(TeamThreadRange(team,first_row,last_row),[&] (const int row) {const int row_start=A_row_offsets[row];const int row_length=A_row_offsets[row+1]-row_start;
double y_row;parallel_reduce(ThreadVectorRange(team,row_length),[=] (const int i,double& sum) {sum += A_vals(i+row_start)*x(A_cols(i+row_start));
} , y_row);y(row) = y_row;
});});
}Distribute rows in workset over team-threads
![Page 26: Kokkos: The C++ Performance Portability Programming Modelon-demand.gputechconf.com/gtc/2017/presentation/s7344-christian … · Christian Trott (crtrott@sandia.gov), Carter Edwards](https://reader034.vdocuments.us/reader034/viewer/2022042123/5e9d9fe6537793769f5c1f4f/html5/thumbnails/26.jpg)
CGSolve:TheSPMV
26
void SPMV(int nrows, View<const int*> A_row_offsets, View<const int*> A_cols, View<const double*> A_vals, View<double*> y,View<const double*, MemoryTraits< RandomAccess>> x) {
#ifdef KOKKOS_ENABLE_CUDAint rows_per_team = 64;int team_size = 64;
#elseint rows_per_team = 512;int team_size = 1;
#endif
parallel_for("SPMV:Hierarchy", TeamPolicy< Schedule< Static > >((nrows+rows_per_team-1)/rows_per_team,team_size,8),
KOKKOS_LAMBDA (const TeamPolicy<>::member_type& team) {
const int first_row = team.league_rank()*rows_per_team;const int last_row = first_row+rows_per_team<nrows? first_row+rows_per_team : nrows;
parallel_for(TeamThreadRange(team,first_row,last_row),[&] (const int row) {const int row_start=A_row_offsets[row];const int row_length=A_row_offsets[row+1]-row_start;
double y_row;parallel_reduce(ThreadVectorRange(team,row_length),[=] (const int i,double& sum) {sum += A_vals(i+row_start)*x(A_cols(i+row_start));
} , y_row);y(row) = y_row;
});});
} Row x Vector dot product
![Page 27: Kokkos: The C++ Performance Portability Programming Modelon-demand.gputechconf.com/gtc/2017/presentation/s7344-christian … · Christian Trott (crtrott@sandia.gov), Carter Edwards](https://reader034.vdocuments.us/reader034/viewer/2022042123/5e9d9fe6537793769f5c1f4f/html5/thumbnails/27.jpg)
CGSolve:TheSPMV
27
void SPMV(int nrows, View<const int*> A_row_offsets, View<const int*> A_cols, View<const double*> A_vals, View<double*> y,View<const double*, MemoryTraits< RandomAccess>> x) {
#ifdef KOKKOS_ENABLE_CUDAint rows_per_team = 64;int team_size = 64;
#elseint rows_per_team = 512;int team_size = 1;
#endif
parallel_for("SPMV:Hierarchy", TeamPolicy< Schedule< Static > >((nrows+rows_per_team-1)/rows_per_team,team_size,8),
KOKKOS_LAMBDA (const TeamPolicy<>::member_type& team) {
const int first_row = team.league_rank()*rows_per_team;const int last_row = first_row+rows_per_team<nrows? first_row+rows_per_team : nrows;
parallel_for(TeamThreadRange(team,first_row,last_row),[&] (const int row) {const int row_start=A_row_offsets[row];const int row_length=A_row_offsets[row+1]-row_start;
double y_row;parallel_reduce(ThreadVectorRange(team,row_length),[=] (const int i,double& sum) {sum += A_vals(i+row_start)*x(A_cols(i+row_start));
} , y_row);y(row) = y_row;
});});
}
![Page 28: Kokkos: The C++ Performance Portability Programming Modelon-demand.gputechconf.com/gtc/2017/presentation/s7344-christian … · Christian Trott (crtrott@sandia.gov), Carter Edwards](https://reader034.vdocuments.us/reader034/viewer/2022042123/5e9d9fe6537793769f5c1f4f/html5/thumbnails/28.jpg)
CGSolve:Performance
28
0102030405060708090
AXPBY-100 AXPBY-200 DOT-100 DOT-200 SPMV-100 SPMV-200
Perfo
rmance[G
flop/s]
NVIDIAP100/IBMPower8
OpenACC CUDA Kokkos OpenMP
0
10
20
30
40
50
60
AXPBY-100 AXPBY-200 DOT-100 DOT-200 SPMV-100 SPMV-200
Performance[G
flop/s]
IntelKNL7250
OpenACC Kokkos OpenMP TBB(Flat) TBB(Hierarchical)
§ ComparisonwithotherProgrammingModels
§ Straightforwardimplementationofkernels
§ OpenMP4.5isimmatureatthispoint
§ Twoproblemsizes:100x100x100and200x200x200elements
![Page 29: Kokkos: The C++ Performance Portability Programming Modelon-demand.gputechconf.com/gtc/2017/presentation/s7344-christian … · Christian Trott (crtrott@sandia.gov), Carter Edwards](https://reader034.vdocuments.us/reader034/viewer/2022042123/5e9d9fe6537793769f5c1f4f/html5/thumbnails/29.jpg)
CustomReductionsWithLambdas
§ AddedCapabilitytodoCustomReductionswithLambdas§ Providebuilt-inreducersforcommonoperations
§ Add,Min,Max,Prod,MinLoc,MaxLoc,MinMaxLoc,And,Or,Xor,
§ Userscanimplementtheirownreducers§ ExampleMaxreduction:
29
double resultparallel_reduce(N,
KOKKOS_LAMBDA(const int& i, double& lmax) {
if(lmax < a(i)) lmax = a(i);},Max<double>(result));
0
100
200
300
400
500
600
K80 P100 KNL
Ban
dwid
th i
n G
B/s
Reduction Performance
Sum Max MaxLoc
![Page 30: Kokkos: The C++ Performance Portability Programming Modelon-demand.gputechconf.com/gtc/2017/presentation/s7344-christian … · Christian Trott (crtrott@sandia.gov), Carter Edwards](https://reader034.vdocuments.us/reader034/viewer/2022042123/5e9d9fe6537793769f5c1f4f/html5/thumbnails/30.jpg)
NewFeatures:MDRangePolicy§ Manypeopleperformstructuredgridcalculations
§ Sandia’scodesarepredominantlyunstructuredthough
§ MDRangePolicy introducedfortightlynestedloops§ Usecase correspondingtoOpenMPcollapseclause
§ Optionallysetiterationorderandtiling:
30
void launch (int N0, int N1, [ARGS]) {parallel_for(MDRangePolicy<Rank<3>>({0,0,0},{N0,N1,N2}),
KOKKOS_LAMBDA (int i0, int i1) {/*...*/});
}
MDRangePolicy<Rank<3,Iterate::Right,Iterate::Left>> ({0,0,0},{N0,N1,N2},{T0,T1,T2})
0
100
200
300
400
500
600
KNL P100
Gflo
p/s
Albany Kernel
Naïve Best Raw MDRange
![Page 31: Kokkos: The C++ Performance Portability Programming Modelon-demand.gputechconf.com/gtc/2017/presentation/s7344-christian … · Christian Trott (crtrott@sandia.gov), Carter Edwards](https://reader034.vdocuments.us/reader034/viewer/2022042123/5e9d9fe6537793769f5c1f4f/html5/thumbnails/31.jpg)
NewFeatures:TaskGraphs§ Taskdags areanimportantclassofalgorithmicstructures§ Usedforalgorithmswithcomplexdatadependencies
§ Forexampletriangularsolve
§ Kokkostaskingison-node§ Futurebased,notexplicitdatacentric(asforexampleLegion)
§ Tasksreturnfutures§ Taskscandependonfutures
§ Respawnoftaskspossible§ Taskscanspawnothertasks§ Taskscanhavedataparallelloops
§ I.e.ataskcanutilizemultiplethreadslikethehyperthreadsonacoreoraCUDAblock
31Carter Edwards S7253 “Task Data Parallelism” , Right after this talk.
![Page 32: Kokkos: The C++ Performance Portability Programming Modelon-demand.gputechconf.com/gtc/2017/presentation/s7344-christian … · Christian Trott (crtrott@sandia.gov), Carter Edwards](https://reader034.vdocuments.us/reader034/viewer/2022042123/5e9d9fe6537793769f5c1f4f/html5/thumbnails/32.jpg)
NewFeatures:PascalSupport§ PascalGPUsProvideasetofnewCapabilities
§ Muchbettermemorysubsystem§ NVLink (nextslide)§ Hardwaresupportfordoubleprecisionatomicadd
§ Generallysignificantspeedup3-5xoverK80forSandiaApps
32
0
20
40
60
80
100
120
140
160
4000 32000 256000 2048000
Mill
ion
Ato
mst
eps
per
Sec
ond
Number of Atoms
LammpsTersoff Potential
K40
K80
P100
K40 Atomic
K80 Atomic
P100 Atomic
0.001
0.01
0.1
1
10
100
1000
1 8 64 512 4096
Ban
dwid
th i
n G
B/s
Update Array Size
Atomic Bandwidth
![Page 33: Kokkos: The C++ Performance Portability Programming Modelon-demand.gputechconf.com/gtc/2017/presentation/s7344-christian … · Christian Trott (crtrott@sandia.gov), Carter Edwards](https://reader034.vdocuments.us/reader034/viewer/2022042123/5e9d9fe6537793769f5c1f4f/html5/thumbnails/33.jpg)
NewFeatures:HBMSupport§ NewarchitectureswithHBM:IntelKNL,NVIDIAP100§ Generallythreetypesofallocations:
§ PagepinnedinDDR§ PagepinnedinHBM§ Pagemigratable byOSorhardwarecaching
§ Kokkossupportsallthreeonbotharchitectures§ ForCuda backend:CudaHostPinnedSpace,CudaSpace andCudaUVMSpace§ E.g.:Kokkos::View<double*,CudaUVMSpace>a(”A”,N);
§ P100BandwidthwithandwithoutdataReuse
33
1
10
100
1000
0.125 0.25 0.5 1 2 4 8 16 32 64
Ban
dwid
th G
B/s
Problem Size [GB]
HostPinned 1
HostPinned 32
UVM 1
UVM 32
![Page 34: Kokkos: The C++ Performance Portability Programming Modelon-demand.gputechconf.com/gtc/2017/presentation/s7344-christian … · Christian Trott (crtrott@sandia.gov), Carter Edwards](https://reader034.vdocuments.us/reader034/viewer/2022042123/5e9d9fe6537793769f5c1f4f/html5/thumbnails/34.jpg)
UpcomingFeatures
§ SupportforOpenMP4.5+TargetBackend§ Experimentallyavailableongithub§ CUDAwillstaypreferredbackend§ MaybesupportforFPGAsinthefuture?§ HelpmaturingOpenMP4.5+compilers
§ SupportforAMDROCm Backend§ Experimentallyavailable ongithub§ MainlydevelopedbyAMD§ SupportforAPUsanddiscreteGPUs§ Expectmaturationfall2017
34
![Page 35: Kokkos: The C++ Performance Portability Programming Modelon-demand.gputechconf.com/gtc/2017/presentation/s7344-christian … · Christian Trott (crtrott@sandia.gov), Carter Edwards](https://reader034.vdocuments.us/reader034/viewer/2022042123/5e9d9fe6537793769f5c1f4f/html5/thumbnails/35.jpg)
BeyondCapabilities
§ UsingKokkosisinvasive,youcan’tjustswapitout§ Significantpartofdatastructuresneedtobetakenover§ Functionmarkingseverywhere
§ ItisnotenoughtohaveinitialCapabilities§ Robustnesscomesfromutilizationsandexperience
§ Differenttypesofapplicationandcodingstyleswillexposedifferentcornercases
§ Applicationsneedlibraries§ InteractionwithTPLssuchasBLASmustwork§ Manylibrarycapabilitiesmustbereimplemented intheprogramming
model
§ Applicationsneedtools§ ProfilingandDebuggingcapabilitiesarerequiredforseriouswork
35
![Page 36: Kokkos: The C++ Performance Portability Programming Modelon-demand.gputechconf.com/gtc/2017/presentation/s7344-christian … · Christian Trott (crtrott@sandia.gov), Carter Edwards](https://reader034.vdocuments.us/reader034/viewer/2022042123/5e9d9fe6537793769f5c1f4f/html5/thumbnails/36.jpg)
Timeline
36
Initial Kokkos: Linear Algebra for Trilinos
Restart of Kokkos: Scope now Programming Model
Mantevo MiniApps: Compare Kokkos to other Models
LAMMPS: Demonstrate Legacy App Transition
Trilinos: Move Tpetra over to use Kokkos Views
Multiple Apps start exploring (Albany, Uintah, …)
Sandia Multiday Tutorial (~80 attendees)
Sandia Decision to prefer Kokkos over other models
Github Release of Kokkos 2.0
Kokkos-Kernels and Kokkos-Tools Release
DOE Exascale Computing Project starts
2008
2011
2013
2012
2014
2016
2015
2017
![Page 37: Kokkos: The C++ Performance Portability Programming Modelon-demand.gputechconf.com/gtc/2017/presentation/s7344-christian … · Christian Trott (crtrott@sandia.gov), Carter Edwards](https://reader034.vdocuments.us/reader034/viewer/2022042123/5e9d9fe6537793769f5c1f4f/html5/thumbnails/37.jpg)
17"
0"2000"4000"6000"8000"10000"12000"14000"
FOM+(Z
/s)+
LULESH+Figure+of+Merit+Results+(Problem+60)+
HSW"1x16" HSW"1x32" P8"1x40"XL" KNC"1x224" ARM64"1x8" NV"K40"Higher"is"
Better"
Results"by"Dennis"Dinge,"Christian"Trott"and"Si"Hammond"
InitialDemonstrations(2012-2015)
37
§ DemonstrateFeasibilityofPerformancePortability§ DevelopmentofanumberofMiniApps fromdifferentsciencedomains
§ DemonstrateLowPerformanceLossversusNativeModels§ MiniApps areimplementedinvariousprogrammingmodels
§ DOETriLab Collaboration§ ShowKokkos worksfor
otherlabsapp§ Notethisishistoricaldata:
Improvementswerefound,RAJAimplementedsimilaroptimizationetc.
![Page 38: Kokkos: The C++ Performance Portability Programming Modelon-demand.gputechconf.com/gtc/2017/presentation/s7344-christian … · Christian Trott (crtrott@sandia.gov), Carter Edwards](https://reader034.vdocuments.us/reader034/viewer/2022042123/5e9d9fe6537793769f5c1f4f/html5/thumbnails/38.jpg)
TrainingtheUser-Base§ TypicalLegacyApplicationDeveloper
§ ScienceBackground§ MostlySerialCoding(MPIappsusuallyhavecommunicationlayerfew
peopletouch)§ Littlehardwarebackground,littleparallelprogrammingexperience
§ NotsufficienttoteachProgrammingModelSyntax§ Needtraininginparallelprogrammingtechniques§ Teachfundamentalhardwareknowledge(howdoesCPU,MICandGPU
differ,andwhatdoesitmeanformycode)§ Needtraininginperformanceprofiling
§ RegularKokkos Tutorials§ ~200slides,9hands-onexercisestoteachparallelprogramming
techniques,performanceconsiderationsandKokkos§ NowdedicatedECPKokkossupportproject:developonlinesupport
community§ ~200HPCdevelopers(mostlyfromDOElabs)hadKokkostrainingsofar38
![Page 39: Kokkos: The C++ Performance Portability Programming Modelon-demand.gputechconf.com/gtc/2017/presentation/s7344-christian … · Christian Trott (crtrott@sandia.gov), Carter Edwards](https://reader034.vdocuments.us/reader034/viewer/2022042123/5e9d9fe6537793769f5c1f4f/html5/thumbnails/39.jpg)
KeepingApplicationsHappy§ Neverunderestimatedevelopersabilitytofindnewcorner
cases!!§ HavingaProgrammingModeldeployedinMiniApps orasinglebigappis
verydifferentfromhavinghalfadozenmulti-millionlinecodecustomers.§ 538Issuesin24months§ 28%aresmallenhancements§ 18%biggerfeaturerequests§ 24%arebugs:oftencornercases
390
100
200
300
400
500
600
Issues
Issues since 2015
Enhancements
Feature Request
Bug
Compiler Issue
Questions
Other
§ Example:Subviews§ Initiallydatatypeneededtomatch
includingcompiletimedimensions§ Allowcompile/runtimeconversion§ AllowLayoutconversionifpossible§ Automaticallyfindbestlayout§ Addsubviewpatterns
![Page 40: Kokkos: The C++ Performance Portability Programming Modelon-demand.gputechconf.com/gtc/2017/presentation/s7344-christian … · Christian Trott (crtrott@sandia.gov), Carter Edwards](https://reader034.vdocuments.us/reader034/viewer/2022042123/5e9d9fe6537793769f5c1f4f/html5/thumbnails/40.jpg)
TestingandSoftwareQuality
Develop
ReleaseVersion
Compilers GCC(4.8-6.3),Clang (3.6-4.0),Intel(15.0-18.0),IBM(13.1.5,14.0),PGI(17.3),NVIDIA(7.0-8.0)
Hardware IntelHaswell,IntelKNL,ARMv8,IBMPower8,NVIDIAK80,NVIDIAP100
Backends OpenMP,Pthreads,Serial,CUDA
WarningsasErrors
FeatureDevelopment/Tests
Review
Tests
Master
Integration
Trilinos, LAMMPS,…Multiconfig integrationtest
Nightly,UnitTests
NewFeaturesaredeveloped onforks,andbranches.Limitednumberofdeveloperscanpush todevelopbranch.Pullrequestsarereviewed/tested.
Eachmergeintomasterisminor release.Extensiveintegrationtestsuiteensuresbackwardcompatibility,andcatchingofunit-testcoveragegaps.
![Page 41: Kokkos: The C++ Performance Portability Programming Modelon-demand.gputechconf.com/gtc/2017/presentation/s7344-christian … · Christian Trott (crtrott@sandia.gov), Carter Edwards](https://reader034.vdocuments.us/reader034/viewer/2022042123/5e9d9fe6537793769f5c1f4f/html5/thumbnails/41.jpg)
BuildinganEcoSystem
41
Algorithms(Random, Sort)
Containers(Map, CrsGraph, Mem Pool)
Kokkos(Parallel Execution, Data Allocation, Data Transfer)
Kokkos – Kernels(Sparse/Dense BLAS, Graph Kernels, Tensor
Kernels)
Kokk
os–
Tool
s(K
okko
saw
are
Pro
filin
g an
d D
ebug
ging
Too
ls)
Trilinos(Linear Solvers, Load Balancing, Discretization, Distributed Linear
Algebra)
Kokk
os–
Supp
ort C
omm
unity
(App
licat
ion
Sup
port,
Dev
elop
er T
rain
ing)
ApplicationsMiniApps
std::thread OpenMP CUDA ROCm
![Page 42: Kokkos: The C++ Performance Portability Programming Modelon-demand.gputechconf.com/gtc/2017/presentation/s7344-christian … · Christian Trott (crtrott@sandia.gov), Carter Edwards](https://reader034.vdocuments.us/reader034/viewer/2022042123/5e9d9fe6537793769f5c1f4f/html5/thumbnails/42.jpg)
KokkosTools
42
§ Utilities§ KernelFilter: Enable/DisableProfilingforaselectionofKernels
§ KernelInspection§ KernelLogger:Runtimeinformationaboutentering/leavingKernelsand
Regions§ KernelTimer:Postprocessing informationaboutKernelandRegionTimes
§ MemoryAnalysis§ MemoryHighWater: MaximumMemoryFootprintoverwholerun§ MemoryUsage:PerMemorySpaceUtilizationTimeline§ MemoryEvents:PerMemorySpaceAllocationandDeallocations
§ ThirdPartyConnector§ VTune Connector:MarkKernelsasFramesinsideofVtune§ VTune FocusedConnector:MarkKernelsasFrames+start/stopprofiling
https://github.com/kokkos/kokkos-tools
![Page 43: Kokkos: The C++ Performance Portability Programming Modelon-demand.gputechconf.com/gtc/2017/presentation/s7344-christian … · Christian Trott (crtrott@sandia.gov), Carter Edwards](https://reader034.vdocuments.us/reader034/viewer/2022042123/5e9d9fe6537793769f5c1f4f/html5/thumbnails/43.jpg)
Kokkos-Tools:ExampleMemUsage§ Toolsareloadedatruntime
§ Profileactualreleasebuildsofapplications§ Setvia:exportKOKKOS_PROFILE_LIBRARY=[PATH_TO_PROFILING_LIB]
§ Outputdependsontool§ Oftenperprocessfile
§ MemoryUsage providesperMemorySpace utilizationtimelines§ TimestartswithKokkos::initialize§ HOSTNAME-PROCESSID-CudaUVM.memspace_usage
43
# Space CudaUVM# Time(s) Size(MB) HighWater(MB) HighWater-Process(MB)0.317260 38.1 38.1 81.80.377285 0.0 38.1 158.10.384785 38.1 38.1 158.10.441988 0.0 38.1 158.1
![Page 44: Kokkos: The C++ Performance Portability Programming Modelon-demand.gputechconf.com/gtc/2017/presentation/s7344-christian … · Christian Trott (crtrott@sandia.gov), Carter Edwards](https://reader034.vdocuments.us/reader034/viewer/2022042123/5e9d9fe6537793769f5c1f4f/html5/thumbnails/44.jpg)
Kokkos-Kernels
44
§ ProvideBLAS(1,2,3);Sparse;GraphandTensorKernels§ NorequireddependenciesotherthanKokkos§ Localkernels(noMPI)§ HooksinTPLssuchasMKLorcuBLAS/cuSparsewhereapplicable§ Providekernelsforalllevelsofhierarchicalparallelism:
§ GlobalKernels:useallexecutionresourcesavailable§ TeamLevelKernels:useasubsetofthreadsforexecution§ ThreadLevelKernels:utilizevectorization insidethekernel§ SerialKernels:provideelementalfunctions (OpenMP declareSIMD)
§ Workstartedbasedoncustomerpriorities;expectmulti-yeareffortforbroadcoverage
§ People:ManydevelopersfromTrilinos contribute§ Consolidatenodelevelreusablekernelspreviouslydistributed overmultiple
packages
![Page 45: Kokkos: The C++ Performance Portability Programming Modelon-demand.gputechconf.com/gtc/2017/presentation/s7344-christian … · Christian Trott (crtrott@sandia.gov), Carter Edwards](https://reader034.vdocuments.us/reader034/viewer/2022042123/5e9d9fe6537793769f5c1f4f/html5/thumbnails/45.jpg)
Kokkos-Kernels:DenseBlasExample
45
0
5E+10
1E+11
1.5E+11
2E+11
2.5E+11
3 5 10 15
Per
form
ance
GFl
op/s
Matrix Block Size
Batched DGEMM 32k blocks
MKL @ KNL KK @ KNL
CUBLAS @ P100 KK @ P100
0
5E+10
1E+11
1.5E+11
2E+11
3 5 10 15
Per
form
ance
GFl
op/s
Matrix Block Size
Batched TRSM 32k blocks
MKL @ KNL KK @ KNL
CUBLAS @ P100 KK @ P100
§ Batchedsmallmatricesusinganinterleavedmemorylayout§ Matrixsizesbasedoncommonphysicsproblems:3,5,10,15§ 32ksmallmatrices§ Vendorlibrariesgetbetterformoreandlargermatrices
![Page 46: Kokkos: The C++ Performance Portability Programming Modelon-demand.gputechconf.com/gtc/2017/presentation/s7344-christian … · Christian Trott (crtrott@sandia.gov), Carter Edwards](https://reader034.vdocuments.us/reader034/viewer/2022042123/5e9d9fe6537793769f5c1f4f/html5/thumbnails/46.jpg)
KokkosUsersSpread
46
§ Usersfromadozenmajorinstitutions§ Morethantwodozenapplications/libraries
§ Includingmanymulti-million-lineprojects
UsersUSERS
Backen
dOptim
izatio
ns
![Page 47: Kokkos: The C++ Performance Portability Programming Modelon-demand.gputechconf.com/gtc/2017/presentation/s7344-christian … · Christian Trott (crtrott@sandia.gov), Carter Edwards](https://reader034.vdocuments.us/reader034/viewer/2022042123/5e9d9fe6537793769f5c1f4f/html5/thumbnails/47.jpg)
FurtherMaterial§ https://github.com/kokkos Kokkos Github Organization
§ Kokkos: Corelibrary,Containers,Algorithms§ Kokkos-Kernels: SparseandDenseBLAS,Graph,Tensor(underdevelopment)§ Kokkos-Tools: ProfilingandDebugging§ Kokkos-MiniApps:MiniApp repositoryandlinks§ Kokkos-Tutorials: ExtensiveTutorialswithHands-OnExercises
§ https://cs.sandia.gov Publications(searchfor’Kokkos’)§ ManyPresentationsonKokkos anditsuseinlibrariesandapps
§ TalksatthisGTC:§ CarterEdwardsS7253“TaskDataParallelism” ,Today10:00,211B§ Ramanan Sankaran,S7561“HighPres.ReactingFlows”,Today1:30,212B§ PierreKestener,S7166“HighRes.FluidDynamics”,Today14:30,212B§ MichaelCarilli,S7148“LiquidRocketSimulations”,Today16:00,212B§ Panel,S7564“AcceleratorProgrammingEcosystems”,Tuesday16:00,Ball3§ TrainingLab,L7107“Kokkos,Manycore PP”,Wednesday16:00,LL21E 47
![Page 48: Kokkos: The C++ Performance Portability Programming Modelon-demand.gputechconf.com/gtc/2017/presentation/s7344-christian … · Christian Trott (crtrott@sandia.gov), Carter Edwards](https://reader034.vdocuments.us/reader034/viewer/2022042123/5e9d9fe6537793769f5c1f4f/html5/thumbnails/48.jpg)
http://www.github.com/kokkos