sw4lite: performance portability using raja...llnl-pres-737128 2 §proxy for sw4( seismic waves...

20
LLNL-PRES-737128 This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under contract DE-AC52-07NA27344. Lawrence Livermore National Security, LLC SW4Lite: Performance Portability using RAJA Ramesh Pankajakshan Bjorn Sjogreen

Upload: others

Post on 10-Jun-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: SW4Lite: Performance Portability using RAJA...LLNL-PRES-737128 2 §Proxy for SW4( Seismic Waves 4thorder) §Elastic wave equation using FD §Explicit 4thorder accurate time marching

LLNL-PRES-737128This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under contract DE-AC52-07NA27344. Lawrence Livermore National Security, LLC

SW4Lite:PerformancePortabilityusingRAJA

RameshPankajakshanBjornSjogreen

Page 2: SW4Lite: Performance Portability using RAJA...LLNL-PRES-737128 2 §Proxy for SW4( Seismic Waves 4thorder) §Elastic wave equation using FD §Explicit 4thorder accurate time marching

LLNL-PRES-7371282

§ ProxyforSW4(SeismicWaves4th order)

§ ElasticwaveequationusingFD

§ Explicit4th orderaccuratetimemarching

§ Dominantcomputekernelis125pointstencilevaluation(74%)— Supergrid 15%

§ C++codederivedfromFortran

§ Codeversions— BaselineOMP3.0— CUDA— RAJA

SW4Lite

Page 3: SW4Lite: Performance Portability using RAJA...LLNL-PRES-737128 2 §Proxy for SW4( Seismic Waves 4thorder) §Elastic wave equation using FD §Explicit 4thorder accurate time marching

LLNL-PRES-7371283

§ Platformspecificmemoryallocator— Malloc— CudaMallocManaged— Hbwmalloc

§ Memorymarshalling

§ RAJA::forallN forcomputekernels

RAJAport

Page 4: SW4Lite: Performance Portability using RAJA...LLNL-PRES-737128 2 §Proxy for SW4( Seismic Waves 4thorder) §Elastic wave equation using FD §Explicit 4thorder accurate time marching

LLNL-PRES-7371284

InitialPerformanceLOH215MillionGridPoints

CTS-1Xeon E5 2695 v4

ATS-1KNL in Quad

cache

Coral EAMinsky node

Reference 146 93 17RAJA 290 220 23

RAJA/Reference ~2 ~2.4 ~1.4

Page 5: SW4Lite: Performance Portability using RAJA...LLNL-PRES-737128 2 §Proxy for SW4( Seismic Waves 4thorder) §Elastic wave equation using FD §Explicit 4thorder accurate time marching

LLNL-PRES-7371285

#pragmaomp parallelfor

for(k..for(j..#pragmaivdep#pragmasimd

for(i..{STENCIL_EVAL}

Triplenestedloopkernel(74%)

Page 6: SW4Lite: Performance Portability using RAJA...LLNL-PRES-737128 2 §Proxy for SW4( Seismic Waves 4thorder) §Elastic wave equation using FD §Explicit 4thorder accurate time marching

LLNL-PRES-7371286

forallN<EXEC_POLICY,int,int,int>(RangeSegment(),RangeSegment(),RangeSegment(),[=]RAJA_DEVICE(int k,int j,int i){STENCILEVAL});

WhereEXEC_POLICYis:typedef RAJA::NestedPolicy<RAJA::ExecList<RAJA::omp_parallel_for_exec,

RAJA::omp_parallel_for_exec,RAJA::simd_exec >>EXEC_POLICY;

TripleloopRAJAimplementation

Page 7: SW4Lite: Performance Portability using RAJA...LLNL-PRES-737128 2 §Proxy for SW4( Seismic Waves 4thorder) §Elastic wave equation using FD §Explicit 4thorder accurate time marching

LLNL-PRES-7371287

forallN<EXEC_POLICY,int,int>(RangeSegment(),RangeSegment(),[=]RAJA_DEVICE(int k,int j){#pragmaivdep#pragmasimdfor(int i=istart;i<iend;i++){STENCILEVAL

}});

WhereEXEC_POLICYis:typedef RAJA::NestedPolicy<RAJA::ExecList<RAJA::omp_parallel_for_exec,

RAJA::omp_parallel_for_exec >>EXEC_POLICY;

PerformantTripleloopRAJAimplementation

Page 8: SW4Lite: Performance Portability using RAJA...LLNL-PRES-737128 2 §Proxy for SW4( Seismic Waves 4thorder) §Elastic wave equation using FD §Explicit 4thorder accurate time marching

LLNL-PRES-7371288

for(c…#pragmaomp parallelforfor(k..for(j..#pragmaivdep#pragmasimd

for(i..

Quadnestedloopkernel(15%)

Page 9: SW4Lite: Performance Portability using RAJA...LLNL-PRES-737128 2 §Proxy for SW4( Seismic Waves 4thorder) §Elastic wave equation using FD §Explicit 4thorder accurate time marching

LLNL-PRES-7371289

forallN<EXEC_POLICY,int,int,int>(RangeSegment(),RangeSegment(),RangeSegment(),[=]RAJA_DEVICE(int k,int j,int i){for(c…STENCILEVAL});

WhereEXEC_POLICYis:typedef RAJA::NestedPolicy<RAJA::ExecList<RAJA::omp_parallel_for_exec,

RAJA::omp_parallel_for_exec,RAJA::simd_exec >>EXEC_POLICY;

QuadloopRAJAimplementation

Page 10: SW4Lite: Performance Portability using RAJA...LLNL-PRES-737128 2 §Proxy for SW4( Seismic Waves 4thorder) §Elastic wave equation using FD §Explicit 4thorder accurate time marching

LLNL-PRES-73712810

for(int c…

forallN<EXEC_POLICY,int,int>(RangeSegment(),RangeSegment(),[=]RAJA_DEVICE(int k,int j){#pragmaivdep#pragmasimdfor(int i=istart;i<iend;i++){STENCILEVAL

}});

WhereEXEC_POLICYis:typedef RAJA::NestedPolicy<RAJA::ExecList<RAJA::omp_parallel_for_exec,

RAJA::omp_parallel_for_exec >>EXEC_POLICY;

PerformantQuadloopRAJAimplementation

Page 11: SW4Lite: Performance Portability using RAJA...LLNL-PRES-737128 2 §Proxy for SW4( Seismic Waves 4thorder) §Elastic wave equation using FD §Explicit 4thorder accurate time marching

LLNL-PRES-73712811

PostSplitPerformance

CTS-1Xeon E5 2695 v4

ATS-1KNL in Quad

cache

Coral EAMinsky node

Reference 146 93 17RAJA 144 98 78

RAJA/Reference ~1 ~1 ~4.6

Page 12: SW4Lite: Performance Portability using RAJA...LLNL-PRES-737128 2 §Proxy for SW4( Seismic Waves 4thorder) §Elastic wave equation using FD §Explicit 4thorder accurate time marching

LLNL-PRES-73712812

forallN<EXEC_POLICY,int,int,int,int>(RangeSegment(),RangeSegment(),RangeSegment(),RangeSegment(),[=]RAJA_DEVICE(int k,int j,int i,int c){STENCILEVAL});

WhereEXEC_POLICYis:typedef RAJA::NestedPolicy<RAJA::ExecList<RAJA::omp_parallel_for_exec,

RAJA::omp_parallel_for_exec,RAJA::simd_exec,RAJA::seq_exec>,Permute<PERM_LIJK..>>EXEC_POLICY;

QuadloopRAJAimplementation

Page 13: SW4Lite: Performance Portability using RAJA...LLNL-PRES-737128 2 §Proxy for SW4( Seismic Waves 4thorder) §Elastic wave equation using FD §Explicit 4thorder accurate time marching

LLNL-PRES-73712813

InlineandPermutationPerformance

CTS-1Xeon E5 2695 v4

ATS-1KNL in Quad

cache

Coral EAMinsky node

Reference 146 93 17RAJA 149 96 22

RAJA/Reference ~1 ~1 ~1.3

Page 14: SW4Lite: Performance Portability using RAJA...LLNL-PRES-737128 2 §Proxy for SW4( Seismic Waves 4thorder) §Elastic wave equation using FD §Explicit 4thorder accurate time marching

LLNL-PRES-73712814

§ Prefetch onCoral

§ Strideoneaccess

§ Threadblockconfiguration

§ VectorizationontheCTS-1andATS-1machines

§ Cacheandflatmodeperformedthesame

§ Whatisacceptableportability?

Observations

Page 15: SW4Lite: Performance Portability using RAJA...LLNL-PRES-737128 2 §Proxy for SW4( Seismic Waves 4thorder) §Elastic wave equation using FD §Explicit 4thorder accurate time marching

LLNL-PRES-73712815

BC Comm Scheme Supergrid Forcing Total

CTS-1 4 (3) 113 23 8 149

ATS-1 23 (25) 49 10 3 95

Coral EA 8 (37) 6 1 5 21

ComponentRuntimes

Page 16: SW4Lite: Performance Portability using RAJA...LLNL-PRES-737128 2 §Proxy for SW4( Seismic Waves 4thorder) §Elastic wave equation using FD §Explicit 4thorder accurate time marching

LLNL-PRES-73712816

§ Loopabstractionsviablepathtoperformanceportability

§ AdditionalabstractionsrequiredtomatchGPUperformance

§ Optimizationslaggingbehindforlambdas

§ Abstractmemorymodel

Conclusion

Page 17: SW4Lite: Performance Portability using RAJA...LLNL-PRES-737128 2 §Proxy for SW4( Seismic Waves 4thorder) §Elastic wave equation using FD §Explicit 4thorder accurate time marching

LLNL-PRES-73712817

Questions&Comments

Page 18: SW4Lite: Performance Portability using RAJA...LLNL-PRES-737128 2 §Proxy for SW4( Seismic Waves 4thorder) §Elastic wave equation using FD §Explicit 4thorder accurate time marching

LLNL-PRES-73712818

BC Comm BC Phys Scheme Supergrid Forcing Total

CTS-1 3.85 1.58 120.33 22.50 7.71 156.01ATS-1 23.51 2.19 49.17 10.43 3.40 95.44Coral EA 7.65 0.69 5.68 1.52 4.95 20.77

ComponentRuntimes

Page 19: SW4Lite: Performance Portability using RAJA...LLNL-PRES-737128 2 §Proxy for SW4( Seismic Waves 4thorder) §Elastic wave equation using FD §Explicit 4thorder accurate time marching

LLNL-PRES-73712819

FinalPerformance

CTS-1Xeon E5 2695 v4

ATS-1KNL in Quad

cache

Coral EAMinsky node

Reference 146 93 17RAJA 144 98 23

RAJA/Reference ~1 ~1 ~1.4

Page 20: SW4Lite: Performance Portability using RAJA...LLNL-PRES-737128 2 §Proxy for SW4( Seismic Waves 4thorder) §Elastic wave equation using FD §Explicit 4thorder accurate time marching

LLNL-PRES-73712820

forallN<EXEC_POLICY,int,int,int,int>(RangeSegment(),RangeSegment(),RangeSegment(),RangeSegment(),[=]RAJA_DEVICE(int k,int j,int I,int c){STENCILEVAL});

WhereEXEC_POLICYis:typedef RAJA::NestedPolicy<RAJA::ExecList<RAJA::omp_parallel_for_exec,

RAJA::omp_parallel_for_exec,RAJA::simd_exec,

>>EXEC_POLICY;

QuadloopRAJAimplementation