Profiling your application with Intel VTune at NERSC

Upload: reed-wiley

Post on 14-Dec-2015



VTune background and availability

• Focus: On-node performance analysis
– Sampling and trace-based profiling
– Performance counter integration
– Memory bandwidth analysis
– On-node parallelism: vectorization and threading

• Pre-defined analysis experiments
• GUI and command-line interface (good for headless collection and later analysis)
• NERSC availability (as the vtune module)
– Edison (Dual 12-core Ivy Bridge)
– Babbage (Dual 8-core Sandy Bridge + Dual Xeon Phi)


Running VTune on Edison I

• Use the Cray cc or ftn wrappers for the Intel compilers
• Suggested compiler flags:
– -g : enable debugging symbols
– -O2 : use production-realistic optimization levels (not -O0)
• To use VTune on Edison, you have to:
– Run within a CCM job (batch or interactive)
– Use dynamic linking if profiling OpenMP code (-dynamic)
– Use a working directory on a Lustre $SCRATCH filesystem


edison09:BGW > ftn -dynamic -g -O2 -xAVX -openmp bgw.f90 -o bgw.x
edison09:BGW > mkdir $SCRATCH/vtune-runs
edison09:BGW > cp bgw.x $SCRATCH/vtune-runs/
edison09:BGW > cd $SCRATCH/vtune-runs/
edison09:vtune-runs > qsub -I -q ccm_int -l mppwidth=24
wait ...
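The $SCRATCH requirement can be checked before submitting a job. A minimal sketch, assuming a hypothetical helper named in_scratch (not part of VTune or the NERSC environment) and that $SCRATCH is set in your login environment. It compares path prefixes only and does not resolve symlinks:

```shell
# Hypothetical helper: succeeds if directory $1 is $2 itself or lies under $2.
in_scratch() {
    case "$1" in
        "$2"|"$2"/*) return 0 ;;
        *)           return 1 ;;
    esac
}

# Warn (without aborting) if the current directory is not under $SCRATCH.
if [ -n "${SCRATCH:-}" ] && ! in_scratch "$PWD" "$SCRATCH"; then
    echo "warning: $PWD is not under \$SCRATCH; VTune on Edison needs a Lustre \$SCRATCH working directory" >&2
fi
```

Running this from $SCRATCH/vtune-runs prints nothing; running it from a home directory prints the warning.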


Running VTune on Edison II

• Once you’re in a CCM job (either interactive or batch script)
– cd to your submission directory
– Launch VTune to profile your code on a compute node with aprun


CCM Start success, 1 of 1 responses
nid02433:~ > cd $PBS_O_WORKDIR
edison09:vtune-runs > module load vtune
nid02433:vtune-runs > aprun -n 1 amplxe-cl -collect experiment_name -r result_dir -- ./bgw.x

• amplxe-cl is the VTune CLI
– -collect : specifies the collection experiment to run
– -r : specifies an output directory to save results

• Set OMP_NUM_THREADS and the associated aprun options (-d, -S, -cc depth, -cc numa_node) as needed

• Results can be analyzed by launching amplxe-gui and navigating to the result directory (preferably in NX)
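For batch rather than interactive collection, the same steps can go into a job script. The following is a sketch under several assumptions: the batch queue name ccm_queue is assumed (the slides only show the interactive ccm_int queue), the 30-minute walltime is arbitrary, and 12 OpenMP threads bound to one NUMA node is just one possible thread layout. experiment_name and result_dir are the placeholders used on the slide. The script is written to a file so it can be inspected before submission:

```shell
# Write a sketch of a CCM batch script; submit it with: qsub vtune_ccm.pbs
cat > vtune_ccm.pbs <<'EOF'
#!/bin/bash
#PBS -q ccm_queue
#PBS -l mppwidth=24
#PBS -l walltime=00:30:00

# Start from the submission directory (must be on $SCRATCH).
cd "$PBS_O_WORKDIR"
module load vtune

# One process with 12 OpenMP threads, kept on a single NUMA node.
export OMP_NUM_THREADS=12
aprun -n 1 -d "$OMP_NUM_THREADS" -cc numa_node \
    amplxe-cl -collect experiment_name -r result_dir -- ./bgw.x
EOF
```

Adjust -d and the -cc setting to match the thread count and placement you want to profile.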


Experiments: General exploration

• Available on Edison and Babbage (SNB + Xeon Phi)

• Detailed characterization of relevant performance metrics throughout your application
– Default: low-level detail aggregated into summary metrics
– Mouse-over for explanation of their significance
– Can be used to characterize locality issues, poor vectorization, etc.

• Multiple “viewpoints” available:
– Direct access to hardware event counters
– Spin / sync overhead for OpenMP threaded regions


nid02433:vtune-runs > aprun -n 1 amplxe-cl -collect general-exploration -r ge_results -- ./bgw.x


Experiments: General exploration


A whole lot of summary metrics!


Experiments: General exploration


Filter by process and thread ID

Show loops as well as functions


Experiments: General exploration


Change viewpoint to get to hardware counters, hotspot analysis, and more


Experiments: Memory bandwidth

• Available on Edison and Babbage (Xeon Phi only)
– Caveat: avoid Babbage SNB for now (node will lock up)

• Gives DRAM read / write traffic as a function of time during program execution

• Useful to first calibrate with a well-understood code on the same platform (e.g. STREAM)

• Can help determine whether your code is at least partially (effectively) BW bound


nid02433:vtune-runs > aprun -n 1 amplxe-cl -collect bandwidth -r bw_results -- ./bgw.x
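The calibration advice amounts to running the same experiment on a reference code and on your application, then comparing the two results. A sketch that prints (rather than runs) the two commands, so it can be tried on a login node. Here ./stream.x is a hypothetical STREAM binary built on the same platform, bw_command is a hypothetical helper, and the result directory names are illustrative:

```shell
# Hypothetical helper: print a bandwidth-collection command for executable $1,
# with a result directory derived from the executable name (bgw.x -> bw_bgw).
bw_command() {
    name=$(basename "$1" .x)
    echo "aprun -n 1 amplxe-cl -collect bandwidth -r bw_${name} -- $1"
}

# Reference code first, then the application being profiled.
for exe in ./stream.x ./bgw.x; do
    bw_command "$exe"
done
```

Inside a CCM job, the printed lines can be run as-is; comparing the STREAM result against the application result gives a sense of how close the application gets to the bandwidth the platform can sustain.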


Experiments: Bandwidth


Average BW listed by CPU package


Experiments: Bandwidth


Click and drag to zoom in for more detail

Read, write, and aggregate BW time series

Peak BW

OpenMP regions


More resources

• At NERSC
– On our debugging and profiling tools pages: http://www.nersc.gov/users/software/debugging-and-profiling/vtune/
– More details on how to run your analysis on both the Edison compute nodes and the Babbage Xeon Phis
– Pointers to materials from previous NERSC trainings

• At Intel
– Main documentation for 2015 version: https://software.intel.com/en-us/node/529213
– Detailed descriptions of the various experiment types
– Pointers to tutorials on specific topics or platforms
