Page 1: Measuring Synchronisation and Scheduling Overheads in OpenMP J. Mark Bull EPCC University of Edinburgh, UK email: m.bull@epcc.ed.ac.uk

Measuring Synchronisation and Scheduling Overheads in OpenMP

J. Mark Bull

EPCC

University of Edinburgh, UK

email: m.bull@epcc.ed.ac.uk

Overview

Motivation

Experimental method

Results and analysis
– Synchronisation
– Loop scheduling

Conclusions and future work

Motivation

Compare OpenMP implementations on different systems.

Highlight inefficiencies.

Investigate performance implications of semantically equivalent directives.

Allow estimation of synchronisation/scheduling overheads in applications.

Experimental method

Basic idea is to compare same code executed with and without directives.

Overhead computed as (mean) difference in execution time.

e.g. for DO directive, compare:

    !$OMP PARALLEL
    do j=1,innerreps
    !$OMP DO
       do i=1,numthreads
          call delay(dlength)
       end do
    end do
    !$OMP END PARALLEL

to

    do j=1,innerreps
       call delay(dlength)
    end do
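As a minimal sketch of the overhead calculation described above (not the benchmark's own code), the following Python function subtracts the mean reference time from the mean time measured with the directive. Dividing by `innerreps` to obtain a cost per directive execution is an assumption, as are the function name and the sample timings, which are invented for illustration.

```python
from statistics import mean


def overhead_per_directive(test_times, ref_times, innerreps):
    """Mean overhead of one directive execution, in the units of the inputs.

    test_times: repeated timings of the loop containing the directive.
    ref_times:  repeated timings of the same loop without the directive.
    innerreps:  number of directive executions per timed loop (assumed divisor).
    """
    if innerreps <= 0:
        raise ValueError("innerreps must be positive")
    return (mean(test_times) - mean(ref_times)) / innerreps


# Hypothetical timings in microseconds, for illustration only.
test_times = [1250.0, 1262.0, 1244.0]
ref_times = [1000.0, 1004.0, 996.0]
overhead = overhead_per_directive(test_times, ref_times, innerreps=100)
print(f"overhead per directive: {overhead:.2f} microseconds")
```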

Experimental method (cont.)

Similar technique can be used for PARALLEL (with and without REDUCTION clause), PARALLEL DO, BARRIER and SINGLE directives.

For mutual exclusion (CRITICAL, ATOMIC, lock/unlock) use a similar method, comparing

    !$OMP PARALLEL
    do j=1,innerreps/nthreads
    !$OMP CRITICAL
       call delay(dlength)
    !$OMP END CRITICAL
    end do
    !$OMP END PARALLEL

to the same reference time.

Experimental method (cont.)

Can use same method as for DO directive to investigate loop scheduling overheads.

For loop scheduling options, overhead depends on
– number of threads
– number of iterations per thread
– execution time of loop body
– chunk size

Large parameter space: fix the first three and look at varying chunk size.
– 4 threads
– 1024 iterations per thread
– 100 clock cycles to execute loop body
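To make the role of the chunk size concrete, here is a small Python sketch of how a block cyclic (static with chunk size) schedule hands out iterations: chunks are dealt to threads in round-robin order. The function name is hypothetical and indices are zero-based, unlike the Fortran loops above. It only counts chunks and does not model any timing.

```python
def block_cyclic_chunks(n_iters, n_threads, chunk):
    """Return, for each thread, its list of half-open (start, end) ranges."""
    if chunk < 1:
        raise ValueError("chunk must be positive")
    per_thread = [[] for _ in range(n_threads)]
    for k, start in enumerate(range(0, n_iters, chunk)):
        # Chunk k goes to thread k mod n_threads (round-robin assignment).
        per_thread[k % n_threads].append((start, min(start + chunk, n_iters)))
    return per_thread


N_THREADS = 4
N_ITERS = N_THREADS * 1024  # 1024 iterations per thread, as fixed above

for chunk in (1, 16, 1024):
    plan = block_cyclic_chunks(N_ITERS, N_THREADS, chunk)
    print(f"chunk size {chunk}: chunks per thread = {[len(c) for c in plan]}")
```

With a chunk size of 1024 each thread receives a single chunk, which is the same assignment as a block schedule. With a chunk size of 1 each thread has 1024 separate chunks to step through, so any per-chunk cost is paid far more often.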

Timing

Need to take care with timing routines:
– second differences of 32-bit floating point values (e.g. etime) lose too much precision.
– need microsecond accuracy (Fortran 90 system_clock isn't good enough on some systems).
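The precision problem with 32-bit timer values can be illustrated with a short Python sketch. It rounds two timestamps to 32-bit floats and subtracts them. The timer reading of 1000 seconds and the 5 microsecond interval are assumptions chosen for illustration, not measurements from the systems tested.

```python
import struct


def to_float32(x):
    """Round a Python float (64-bit) to the nearest 32-bit float value."""
    return struct.unpack("f", struct.pack("f", x))[0]


start = 1000.0       # hypothetical timer reading, in seconds
stop = start + 5e-6  # the same timer 5 microseconds later

diff64 = stop - start                          # difference in 64-bit arithmetic
diff32 = to_float32(stop) - to_float32(start)  # difference of 32-bit readings

print(f"64-bit difference: {diff64:.3e} s")
print(f"32-bit difference: {diff32:.3e} s")
```

Near 1000 the spacing between adjacent 32-bit floats is about 61 microseconds, so both readings round to the same value and the interval is lost entirely.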

For statistical stability, repeat each measurement 50 times per run, and for 20 runs of the executable.
– observe significant variation between runs which is absent within a given run.

Reject runs with large standard deviations, or with large numbers of outliers.
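One possible way to apply the rejection rule is sketched below in Python. The slides do not state the thresholds, so the relative standard deviation limit, the three-sigma outlier definition, the outlier limit, and all function names are assumptions made for illustration.

```python
from statistics import mean, stdev


def summarise_run(samples, outlier_sigma=3.0):
    """Return (mean, standard deviation, outlier count) for one run."""
    m = mean(samples)
    s = stdev(samples)
    outliers = sum(1 for x in samples if abs(x - m) > outlier_sigma * s)
    return m, s, outliers


def accept_run(samples, max_rel_sd=0.05, max_outliers=2):
    """Accept a run only if its spread and outlier count are both small."""
    m, s, n_out = summarise_run(samples)
    return s <= max_rel_sd * m and n_out <= max_outliers


def mean_over_accepted_runs(runs):
    """Average the per-run means of accepted runs; None if all are rejected."""
    kept = [mean(r) for r in runs if accept_run(r)]
    return mean(kept) if kept else None


# Hypothetical timings (microseconds) for two short runs.
steady_run = [10.0, 10.1, 9.9, 10.0, 10.05, 9.95]
noisy_run = [10.0, 14.0, 8.0, 12.0, 9.0, 11.0]
print(accept_run(steady_run), accept_run(noisy_run))
print(mean_over_accepted_runs([steady_run, noisy_run]))
```

In practice each run would hold the 50 repeated measurements described above, and the thresholds would need tuning for the system being measured.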

Systems tested

Benchmark codes have been run on:

Sun HPC 3500, eight 400 MHz UltraSparcII processors, KAI guidef90 preprocessor, Solaris f90 compiler.

SGI Origin 2000, forty 195 MHz MIPS R10000 processors (access to 8 processors only), MIPSpro f90 compiler.

Compaq Alpha server, four 525 MHz EV5/6 processors, Digital f90 compiler

[Figure (slide 9): results for Sun HPC 3500]

[Figure (slide 10): results for SGI Origin 2000]

[Figure (slide 11): results for Compaq Alpha server]

[Figure (slide 12): results for Sun HPC 3500]

[Figure (slide 13): results for SGI Origin 2000]

[Figure (slide 14): results for Compaq Alpha server]

[Figure (slide 15): results for Sun HPC 3500]

[Figure (slide 16): results for SGI Origin 2000]

[Figure (slide 17): results for Compaq Alpha server]

Observations

PARALLEL directive uses 2 barriers
– is this strictly necessary?
– PARALLEL DO costs twice as much as DO

REDUCTION clause scales badly
– should use a fan-in method?

SINGLE should not cost more than BARRIER

Mutual exclusion scales badly on Origin 2000

CRITICAL directive very expensive on Compaq
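As an illustration of what a fan-in approach to REDUCTION could look like, the Python sketch below combines per-thread partial results pairwise in rounds, so the number of combining rounds grows with the logarithm of the thread count instead of linearly. This is one reading of "fan-in", written serially for clarity. It is not how any of the tested implementations work, and the function name is hypothetical.

```python
import operator


def fan_in_reduce(partials, op):
    """Combine partial results pairwise; return (result, number of rounds).

    partials must be non-empty. Within a round the pairs are independent,
    so on a real machine they could be combined concurrently.
    """
    vals = list(partials)
    rounds = 0
    stride = 1
    while stride < len(vals):
        for i in range(0, len(vals) - stride, 2 * stride):
            vals[i] = op(vals[i], vals[i + stride])
        stride *= 2
        rounds += 1
    return vals[0], rounds


# Eight hypothetical per-thread partial sums.
total, rounds = fan_in_reduce([1, 2, 3, 4, 5, 6, 7, 8], operator.add)
print(f"total = {total}, rounds = {rounds}")
```

Combining the eight partial values one at a time would take seven dependent steps, whereas the pairwise scheme finishes in three rounds.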

Observations (cont.)

Small chunk sizes very expensive
– compiler should generate code statically for block cyclic schedule.

DYNAMIC much more expensive than STATIC, especially on Origin 2000

On Origin 2000 and Compaq, block cyclic is more expensive than block, even with one chunk per thread.

Conclusions and future work

Set of benchmarks to measure synchronisation and scheduling costs in OpenMP.

Show significant differences between systems.

Show some potential areas for optimisation.

Would like to run on more (and larger) systems.