LLNL-PRES-737128
This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under contract DE-AC52-07NA27344. Lawrence Livermore National Security, LLC
SW4Lite: Performance Portability using RAJA
Ramesh Pankajakshan, Bjorn Sjogreen
§ Proxy for SW4 (Seismic Waves, 4th order)
§ Elastic wave equation using FD
§ Explicit 4th order accurate time marching
§ Dominant compute kernel is 125 point stencil evaluation (74%)
  - Supergrid: 15%
§ C++ code derived from Fortran
§ Code versions
  - Baseline OMP 3.0
  - CUDA
  - RAJA
SW4Lite
§ Platform specific memory allocator
  - malloc
  - cudaMallocManaged
  - hbw_malloc
§ Memory marshalling
§ RAJA::forallN for compute kernels
RAJA port
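The allocator bullet above can be illustrated with a minimal sketch of a compile-time allocator switch. This is not the SW4Lite source: the function names `alloc_doubles` and `free_doubles` and the build flags `SW4_USE_CUDA` and `SW4_USE_HBW` are hypothetical, chosen only for this example. With neither flag defined, the sketch falls back to plain `malloc`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

#if defined(SW4_USE_CUDA)   // hypothetical build flag for GPU builds
#include <cuda_runtime.h>
#elif defined(SW4_USE_HBW)  // hypothetical build flag for KNL builds
#include <hbwmalloc.h>
#endif

// Allocate n doubles from the memory space selected at compile time.
// Returns nullptr if the allocation fails.
inline double* alloc_doubles(std::size_t n) {
  assert(n > 0);
#if defined(SW4_USE_CUDA)
  void* p = nullptr;
  // Managed (unified) memory is reachable from both host and device.
  if (cudaMallocManaged(&p, n * sizeof(double)) != cudaSuccess) return nullptr;
  return static_cast<double*>(p);
#elif defined(SW4_USE_HBW)
  // High-bandwidth memory through the memkind hbwmalloc interface.
  return static_cast<double*>(hbw_malloc(n * sizeof(double)));
#else
  return static_cast<double*>(std::malloc(n * sizeof(double)));
#endif
}

// Release memory obtained from alloc_doubles with the matching call.
inline void free_doubles(double* p) {
#if defined(SW4_USE_CUDA)
  cudaFree(p);
#elif defined(SW4_USE_HBW)
  hbw_free(p);
#else
  std::free(p);
#endif
}
```

Keeping the choice behind one pair of functions means the compute kernels never need to know which memory space the arrays live in.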
Initial Performance: LOH2, 15 Million Grid Points
                  CTS-1              ATS-1                Coral EA
                  Xeon E5 2695 v4    KNL in Quad cache    Minsky node
Reference         146                93                   17
RAJA              290                220                  23
RAJA/Reference    ~2                 ~2.4                 ~1.4
#pragma omp parallel for
for (k..
  for (j..
#pragma ivdep
#pragma simd
    for (i.. { STENCIL_EVAL }
Triple nested loop kernel (74%)
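A minimal, compilable sketch of the loop structure on this slide follows. It is not the SW4 stencil: `stencil125` is a placeholder that simply sums the 5x5x5 neighbourhood (125 points), and the names `idx`, `stencil125`, and `apply_baseline` are assumptions made for this example. The Intel-specific `ivdep` pragma is left out so the sketch stays portable; without OpenMP enabled, the remaining pragmas are ignored and the loop runs serially.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Flat index with i fastest, so the innermost loop has stride-one access.
inline std::size_t idx(int i, int j, int k, int ni, int nj) {
  return static_cast<std::size_t>(i) +
         static_cast<std::size_t>(ni) *
             (static_cast<std::size_t>(j) +
              static_cast<std::size_t>(nj) * static_cast<std::size_t>(k));
}

// Placeholder for STENCIL_EVAL: unweighted sum over the 5x5x5 neighbourhood.
inline double stencil125(const std::vector<double>& u, int i, int j, int k,
                         int ni, int nj) {
  double s = 0.0;
  for (int dk = -2; dk <= 2; ++dk)
    for (int dj = -2; dj <= 2; ++dj)
      for (int di = -2; di <= 2; ++di)
        s += u[idx(i + di, j + dj, k + dk, ni, nj)];
  return s;
}

// Baseline triple loop: threads over k, vectorization hint on i.
// Only interior points (two cells away from each boundary) are updated.
inline void apply_baseline(const std::vector<double>& u,
                           std::vector<double>& out, int ni, int nj, int nk) {
  assert(u.size() == static_cast<std::size_t>(ni) * nj * nk);
  assert(out.size() == u.size());
#pragma omp parallel for
  for (int k = 2; k < nk - 2; ++k)
    for (int j = 2; j < nj - 2; ++j) {
#pragma omp simd
      for (int i = 2; i < ni - 2; ++i)
        out[idx(i, j, k, ni, nj)] = stencil125(u, i, j, k, ni, nj);
    }
}
```

On a grid filled with ones, every interior output point equals 125, which gives a quick sanity check of the neighbourhood size.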
forallN<EXEC_POLICY, int, int, int>(
    RangeSegment(), RangeSegment(), RangeSegment(),
    [=] RAJA_DEVICE (int k, int j, int i) { STENCILEVAL });
Where EXEC_POLICY is:
typedef RAJA::NestedPolicy<RAJA::ExecList<RAJA::omp_parallel_for_exec,
    RAJA::omp_parallel_for_exec, RAJA::simd_exec>> EXEC_POLICY;
Triple loop RAJA implementation
forallN<EXEC_POLICY, int, int>(
    RangeSegment(), RangeSegment(),
    [=] RAJA_DEVICE (int k, int j) {
#pragma ivdep
#pragma simd
      for (int i = istart; i < iend; i++) {
        STENCILEVAL
      }
    });
Where EXEC_POLICY is:
typedef RAJA::NestedPolicy<RAJA::ExecList<RAJA::omp_parallel_for_exec,
    RAJA::omp_parallel_for_exec>> EXEC_POLICY;
Performant triple loop RAJA implementation
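The difference between the two triple-loop variants can be sketched without RAJA itself. In the example below, `forall3` and `forall2` are hypothetical stand-ins for a nested-loop abstraction (they are not the RAJA API), and a simple scaling operation stands in for the stencil. Variant A calls the lambda once per grid point; variant B calls it once per (k, j) row and keeps the innermost loop as a plain `for` inside the lambda, which is the split shown on the slide.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-ins for a nested-loop abstraction (not the RAJA API).
template <typename Body>
void forall3(int nk, int nj, int ni, Body body) {
#pragma omp parallel for
  for (int k = 0; k < nk; ++k)
    for (int j = 0; j < nj; ++j)
      for (int i = 0; i < ni; ++i) body(k, j, i);  // one call per point
}

template <typename Body>
void forall2(int nk, int nj, Body body) {
#pragma omp parallel for
  for (int k = 0; k < nk; ++k)
    for (int j = 0; j < nj; ++j) body(k, j);  // one call per (k, j) row
}

// Variant A: all three loops go through the abstraction.
inline std::vector<double> scale_per_point(const std::vector<double>& u,
                                           int nk, int nj, int ni, double a) {
  assert(u.size() == static_cast<std::size_t>(nk) * nj * ni);
  std::vector<double> out(u.size());
  const double* in = u.data();
  double* o = out.data();
  forall3(nk, nj, ni, [=](int k, int j, int i) {
    const int p = i + ni * (j + nj * k);
    o[p] = a * in[p];
  });
  return out;
}

// Variant B: outer two loops abstracted, innermost loop written by hand
// inside the lambda so the compiler sees an ordinary stride-one loop.
inline std::vector<double> scale_split(const std::vector<double>& u,
                                       int nk, int nj, int ni, double a) {
  assert(u.size() == static_cast<std::size_t>(nk) * nj * ni);
  std::vector<double> out(u.size());
  const double* in = u.data();
  double* o = out.data();
  forall2(nk, nj, [=](int k, int j) {
    const int row = ni * (j + nj * k);
#pragma omp simd
    for (int i = 0; i < ni; ++i) o[row + i] = a * in[row + i];
  });
  return out;
}
```

Both variants compute the same result; the sketch only shows where the lambda boundary sits relative to the vectorized loop.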
for (c…
#pragma omp parallel for
  for (k..
    for (j..
#pragma ivdep
#pragma simd
      for (i..
Quad nested loop kernel (15%)
forallN<EXEC_POLICY, int, int, int>(
    RangeSegment(), RangeSegment(), RangeSegment(),
    [=] RAJA_DEVICE (int k, int j, int i) { for (c… STENCILEVAL });
Where EXEC_POLICY is:
typedef RAJA::NestedPolicy<RAJA::ExecList<RAJA::omp_parallel_for_exec,
    RAJA::omp_parallel_for_exec, RAJA::simd_exec>> EXEC_POLICY;
Quad loop RAJA implementation
for (int c…
  forallN<EXEC_POLICY, int, int>(
      RangeSegment(), RangeSegment(),
      [=] RAJA_DEVICE (int k, int j) {
#pragma ivdep
#pragma simd
        for (int i = istart; i < iend; i++) {
          STENCILEVAL
        }
      });
Where EXEC_POLICY is:
typedef RAJA::NestedPolicy<RAJA::ExecList<RAJA::omp_parallel_for_exec,
    RAJA::omp_parallel_for_exec>> EXEC_POLICY;
Performant quad loop RAJA implementation
Post Split Performance
                  CTS-1              ATS-1                Coral EA
                  Xeon E5 2695 v4    KNL in Quad cache    Minsky node
Reference         146                93                   17
RAJA              144                98                   78
RAJA/Reference    ~1                 ~1                   ~4.6
forallN<EXEC_POLICY, int, int, int, int>(
    RangeSegment(), RangeSegment(), RangeSegment(), RangeSegment(),
    [=] RAJA_DEVICE (int k, int j, int i, int c) { STENCILEVAL });
Where EXEC_POLICY is:
typedef RAJA::NestedPolicy<RAJA::ExecList<RAJA::omp_parallel_for_exec,
    RAJA::omp_parallel_for_exec, RAJA::simd_exec, RAJA::seq_exec>,
    Permute<PERM_LIJK..>> EXEC_POLICY;
Quad loop RAJA implementation
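The effect of a loop permutation in the policy can be sketched with a small stand-in. The code below is not the RAJA API: `forall4_permuted`, `visit_sequence`, and `Index4` are hypothetical names, and the loop order is passed at run time rather than encoded in a type. The sketch assumes that a permutation such as the one on the slide makes the loop over the fourth lambda argument (c) the outermost one while the lambda signature stays (k, j, i, c).

```cpp
#include <array>
#include <cassert>
#include <vector>

using Index4 = std::array<int, 4>;

// Hypothetical 4-deep loop nest with a selectable loop order (not RAJA).
// extents[a] is the trip count of lambda argument a.
// order lists argument positions from the outermost to the innermost loop.
template <typename Body>
void forall4_permuted(const Index4& extents, const Index4& order, Body body) {
  // Check that `order` is a permutation of {0, 1, 2, 3}.
  Index4 seen = {0, 0, 0, 0};
  for (int a : order) {
    assert(a >= 0 && a < 4);
    ++seen[a];
  }
  for (int s : seen) assert(s == 1);

  Index4 v = {0, 0, 0, 0};  // v[a] is the current value of argument a
  const int a0 = order[0], a1 = order[1], a2 = order[2], a3 = order[3];
  for (v[a0] = 0; v[a0] < extents[a0]; ++v[a0])
    for (v[a1] = 0; v[a1] < extents[a1]; ++v[a1])
      for (v[a2] = 0; v[a2] < extents[a2]; ++v[a2])
        for (v[a3] = 0; v[a3] < extents[a3]; ++v[a3])
          body(v[0], v[1], v[2], v[3]);  // arguments keep their original order
}

// Record the (k, j, i, c) tuples in the order the loop nest visits them.
inline std::vector<Index4> visit_sequence(const Index4& extents,
                                          const Index4& order) {
  std::vector<Index4> seq;
  forall4_permuted(extents, order, [&](int k, int j, int i, int c) {
    seq.push_back(Index4{{k, j, i, c}});
  });
  return seq;
}
```

With `order = {3, 0, 1, 2}` the component index c changes slowest and i changes fastest, which reproduces the loop order of the hand-written quad loop while the kernel body is still written once as a single lambda.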
Inline and Permutation Performance
                  CTS-1              ATS-1                Coral EA
                  Xeon E5 2695 v4    KNL in Quad cache    Minsky node
Reference         146                93                   17
RAJA              149                96                   22
RAJA/Reference    ~1                 ~1                   ~1.3
§ Prefetch on Coral
§ Stride one access
§ Thread block configuration
§ Vectorization on the CTS-1 and ATS-1 machines
§ Cache and flat mode performed the same
§ What is acceptable portability?
Observations
            BC Comm    Scheme    Supergrid    Forcing    Total
CTS-1       4 (3)      113       23           8          149
ATS-1       23 (25)    49        10           3          95
Coral EA    8 (37)     6         1            5          21
Component Runtimes
§ Loop abstractions viable path to performance portability
§ Additional abstractions required to match GPU performance
§ Optimizations lagging behind for lambdas
§ Abstract memory model
Conclusion
Questions & Comments
            BC Comm    BC Phys    Scheme    Supergrid    Forcing    Total
CTS-1       3.85       1.58       120.33    22.50        7.71       156.01
ATS-1       23.51      2.19       49.17     10.43        3.40       95.44
Coral EA    7.65       0.69       5.68      1.52         4.95       20.77
Component Runtimes
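To read the component table as shares of the total, a small helper is enough. The sketch below is illustrative only: the function names `percent_of_total` and `unattributed` are made up for this example, and the numbers fed to them are taken directly from the table above.

```cpp
#include <cassert>
#include <cmath>
#include <numeric>
#include <vector>

// Share of `total` taken by `part`, as a percentage rounded to an integer.
inline int percent_of_total(double part, double total) {
  assert(total > 0.0);
  return static_cast<int>(std::lround(100.0 * part / total));
}

// Portion of `total` that is not covered by the listed components.
inline double unattributed(const std::vector<double>& components,
                           double total) {
  return total - std::accumulate(components.begin(), components.end(), 0.0);
}
```

For example, `percent_of_total(120.33, 156.01)` gives 77 for the Scheme column on CTS-1 and `percent_of_total(5.68, 20.77)` gives 27 on Coral EA, while the BC Comm share is 25 on ATS-1 and 37 on Coral EA. The listed ATS-1 components sum to 88.70 against a total of 95.44, so `unattributed` returns about 6.74 that the table does not break down.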
Final Performance
                  CTS-1              ATS-1                Coral EA
                  Xeon E5 2695 v4    KNL in Quad cache    Minsky node
Reference         146                93                   17
RAJA              144                98                   23
RAJA/Reference    ~1                 ~1                   ~1.4
forallN<EXEC_POLICY, int, int, int, int>(
    RangeSegment(), RangeSegment(), RangeSegment(), RangeSegment(),
    [=] RAJA_DEVICE (int k, int j, int i, int c) { STENCILEVAL });
Where EXEC_POLICY is:
typedef RAJA::NestedPolicy<RAJA::ExecList<RAJA::omp_parallel_for_exec,
    RAJA::omp_parallel_for_exec, RAJA::simd_exec,
    >> EXEC_POLICY;
Quad loop RAJA implementation