
The Parallel Research Kernels, a tool for parallel systems investigations - Part I

Rob Van der Wijngaart Intel Labs https://github.com/ParRes/Kernels

*Parallel system = hardware system + network stack + OS + parallel programming environment (ProgEnv: programming model + API + compiler + runtime)

AMS Seminar Series, NASA Ames Research Center, October 13, 2016

Motivation

Observations:

•  Performance of a full app is a mixture of multiple effects/interactions: hard to apply learnings to other apps

•  Hard to obtain useful data for a full app on a simulator (1 s × 1M slowdown ≈ 11.6 days)

•  Can’t predict which apps (or languages, or ProgEnvs) will be important in 10 years

•  But: Can predict which fundamental parallel constructs/patterns will matter

Proposal: provide something simpler

•  Generic parallel-specific app patterns, i.e. parallel kernels

•  Each kernel is dominated by only one pattern

Agenda •  Motivation

•  Limitations

•  Philosophy

•  Context

•  Usage model

•  Reference implementations

•  PRKs you should care about

•  PRK you may care about

•  Example results

Agenda •  Motivation

•  Limitations

•  Philosophy

•  Context

•  Usage model

•  Reference implementations

•  PRKs you should care about

•  PRK you may care about

•  Example results

Limitations •  Focused (mostly) on features stressed by the parallel parts of an application; emphasizes parallel overhead, so may exaggerate parallelization impact

•  Not designed for full application performance projections

•  Single data structure, one or two hot loops: small data layout/alignment details may dominate performance

•  Not designed to measure robustness: fault tolerance, I/O performance

•  Not designed to measure ProgEnv productivity, due to kernel simplicity

•  Not designed to measure ProgEnv expressiveness; that battle had been fought … we thought

Agenda •  Motivation

•  Limitations

•  Philosophy

•  Context

•  Usage model

•  Reference implementations

•  PRKs you should care about

•  PRK you may care about

•  Example results

Philosophy •  Broad range of important patterns found in real parallel applications

•  Reasonably self-contained for all of HPC

•  Paper-and-pencil specifications

•  Simple, understood by non-domain scientists (not algorithms, but patterns!)

•  Each kernel does some real work (data transformation). Corollaries:

−  Uniform performance metric = work/time

−  Work can be tested for correctness

•  Compact reference codes O(1-3 pages): easy porting to new ProgEnv

•  Performance expectations (simplified performance models)

Agenda •  Motivation

•  Limitations

•  Philosophy

•  Context

•  Usage model

•  Reference implementations

•  PRKs you should care about

•  PRK you may care about

•  Example results

Context PRK are nothing new; PRK are different

Legend: ✓ = yes, ~ = meh, ― = no, ? = dunno

                                NPB   CLOMP(I)   EPCC   HPCC   SPEC MPI   SPEC OMP   PRK
arbitrary scale                  ―       ✓        ✓      ✓        ―          ―        ✓
verification                     ✓       ~        ―      ✓        ✓          ✓        ✓
many runtimes                    ―       ―        ―      ~        ―          ―        ✓
pattern coverage                 ~       ―        ―      ―        ?          ?        ✓
compact                          ―       ✓        ✓      ~        ―          ―        ✓
work metric                      ✓       ~        ―      ✓        ✓          ✓        ✓
performance model/expectation    ―       ―        ―      ―        ―          ―        ✓

PRK are like the English language: steal stuff from wherever you can and make it your own

Agenda •  Motivation

•  Limitations

•  Philosophy

•  Context

•  Usage model

•  Reference implementations

•  PRKs you should care about

•  PRK you may care about

•  Example results

Usage model •  Analyze app, map patterns to kernels, study performance of kernels

•  If system does well on all relevant kernels, move to mini-app or actual application (method of elimination)

•  Example parallel application analysis:

1.  read lists of data

2.  do local sort into buckets

3.  send one bucket each to all other nodes

4.  merge incoming buckets

•  Useful PRKs: step 1: Nstream; step 2: Sparse, Random, or Refcount; step 3: Transpose; step 4: Nstream

Agenda •  Motivation

•  Limitations

•  Philosophy

•  Context

•  Usage model

•  Reference implementations

•  PRKs you should care about

•  PRK you may care about

•  Example results

Reference implementations •  Portable:

−  plain C/Fortran serial reference implementations, no excessive tuning

−  no assembly/intrinsics/libraries (except MKL’s DGEMM, optional)

•  Multiple parallel versions:

−  “Traditional”: OpenMP, MPI: one- and two-sided + hybrid (OpenMP, MPI3 SHM), CAF, AMPI, FG-MPI

−  Disruptive: Charm++, Grappa, UPC, OpenSHMEM, Legion, HPX, OCR, Chapel, HClib, …

−  Oddball: Julia, Python

•  Parameterized: problem size, #iterations, algorithmic choices

•  No input files; all initialization data synthesized

•  Automatic verification test: robust, nonintrusive, inexpensive

−  keeps users honest

−  facilitates porting/debugging

CAF=Fortran with co-arrays, OCR=Open Community Runtime, AMPI=Adaptive MPI, FG-MPI=Fine-grain MPI

Agenda •  Motivation

•  Limitations

•  Philosophy

•  Context

•  Usage model

•  Reference implementations

•  PRKs you should care about

•  PRK you may care about

•  Example results

PRKs you should care about Because they:

•  Exhibit a range of granularities

•  Feature drastically different communication patterns

•  Proxy very important patterns in HPC

•  Contain both data parallel and non-data-parallel patterns

Dense matrix transposition (transpose) Operation: A += (B++)^T; A and B distributed identically, whole columns, column-major storage format

Proxy for: global data redistribution (cf. FFT)

[Figure: matrix distributed by whole columns over ranks 0–3; the transpose is a global exchange of blocks between ranks (global transpose) followed by a local transpose within each rank]

Granularity from very coarse to very fine, especially with strong scaling
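
The core of one transpose iteration can be sketched in a few lines of serial C. This is an illustration only, not the PRK reference code; the function name, the order parameter, and the flat column-major indexing are assumptions.

    #include <stddef.h>

    /* Sketch of one transpose iteration: A += B^T, then B is incremented so
       successive iterations carry fresh data (the "B++" in the spec).
       Illustrative only; the reference code adds tiling, timing, rank
       distribution, and verification. Column-major indexing assumed. */
    void transpose_iteration(int order, double *A, double *B) {
      for (int j = 0; j < order; j++) {
        for (int i = 0; i < order; i++) {
          A[i + (size_t)order * j] += B[j + (size_t)order * i]; /* A += B^T */
          B[j + (size_t)order * i] += 1.0;                      /* B++ */
        }
      }
    }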

Point-to-point synchronization (synch_p2p) Operation: A(i,j) = A(i-1,j) + A(i,j-1) - A(i-1,j-1)

A(0,0) = -A(m-1,n-1) [to couple successive sweeps over the grid]

Proxy for: pipelined solution of problem with non-trivial 2-way dependencies

[Figure: m x n grid (i and j directions) partitioned among Threads 0–3; each sweep proceeds as a pipeline across the threads]

Granularity very fine, no data parallelism
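
A serial sketch of one sweep, for illustration; the flat-indexing macro and function name are assumptions, and the actual PRK code pipelines this sweep across threads/ranks with point-to-point synchronization.

    #include <stddef.h>

    /* Serial sketch of one synch_p2p sweep over an m x n grid, plus the
       corner coupling between sweeps. Illustrative only. */
    #define A(i, j) grid[(i) + (size_t)m * (j)]

    void p2p_sweep(int m, int n, double *grid) {
      for (int j = 1; j < n; j++)
        for (int i = 1; i < m; i++)
          A(i, j) = A(i - 1, j) + A(i, j - 1) - A(i - 1, j - 1);
      A(0, 0) = -A(m - 1, n - 1); /* couple successive sweeps over the grid */
    }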

Data parallel stencil (stencil) Operation: For all points in 2D grid, compute a += S(b++), where S is a stencil operation (box or star-shaped), a and b are scalar grid variables (2D arrays)

Proxy for: multi-dimensional array operations with spatial locality

[Figure: star-shaped and box-shaped stencil footprints]

Granularity medium, reuse factor depends on radius of stencil (tunable)
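
A serial sketch of the star-stencil case, for illustration; the weight layout (one weight per offset along each arm), the row-major indexing, and the names are assumptions, and the PRK reference also covers box stencils, tiling, and verification.

    #include <stddef.h>

    /* Serial sketch of one stencil iteration with a star-shaped stencil of
       given radius on an n x n grid: a += S(b), then b is incremented ("b++"). */
    void star_stencil(int n, int radius, const double *w, double *a, double *b) {
      for (int j = radius; j < n - radius; j++) {
        for (int i = radius; i < n - radius; i++) {
          double s = 0.0;
          for (int r = -radius; r <= radius; r++) {
            s += w[radius + r] * b[(i + r) + (size_t)n * j];   /* horizontal arm */
            if (r != 0)
              s += w[radius + r] * b[i + (size_t)n * (j + r)]; /* vertical arm */
          }
          a[i + (size_t)n * j] += s;
        }
      }
      for (size_t k = 0; k < (size_t)n * n; k++)
        b[k] += 1.0; /* b++ between iterations */
    }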

Reference counting (refcount) Operation: Update pair(s) of reference “counters” (c1,c2) in tandem

Independent: (c1 c2)′ = (c1 c2)++

Coupled: (c1 c2)′ = R(α) (c1 c2)

Proxy for: mutual exclusion, high and low contention, simple and compound

[Figure: counter vector C = (1,0) rotated over angle α to C′]

•  Counters can be integer (independent only) or floating point

•  Mutex can be atomic, lock, or none

•  Counter updates can be overlapped with independent work (tunable)

•  Counters can be privatized (uncontended locks)
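
A sketch of the independent, lock-guarded variant using OpenMP, for illustration; the names are assumptions, and the PRK kernel also offers atomic, unguarded, and privatized variants, coupled rotation updates R(α), and interleaved private work between updates.

    #include <omp.h>

    /* Both counters are updated in tandem under one shared OpenMP lock
       (high contention). Illustrative only. */
    static double c1 = 0.0, c2 = 0.0;
    static omp_lock_t counter_lock;

    void refcount_updates(long iterations) {
      omp_init_lock(&counter_lock);
      #pragma omp parallel for
      for (long it = 0; it < iterations; it++) {
        omp_set_lock(&counter_lock);
        c1 += 1.0; /* (c1 c2)' = (c1 c2)++ */
        c2 += 1.0;
        omp_unset_lock(&counter_lock);
      }
      omp_destroy_lock(&counter_lock);
    }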

Agenda •  Motivation

•  Limitations

•  Philosophy

•  Context

•  Usage model

•  Reference implementations

•  PRKs you should care about

•  PRK you may care about

•  Example results

PRKs you may care about Because they:

•  Test some additional important synchronization constructs

•  Provide information about serial performance and compiler smarts

The list:

•  DGEMM (MKL or hand-coded): top flops

•  Nstream: top memory bandwidth (see the sketch after this list)

•  Synch_global: global synchronization (OpenMP barrier/MPI_Allgather)

•  Sparse: sparse matrix-vector multiply: memory latency

•  Random: HPCC Random Access, fixed verification + small problem sizes: latency

•  Reduce: vector reduction

•  Branch: inner loop conditionals (vectorization), PC jumps, instruction cache
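
For reference, the Nstream bullet above amounts to a STREAM-triad-like loop; a minimal serial sketch follows, where the exact update form (a += b + scalar*c) is an assumption and may differ from the PRK reference.

    #include <stddef.h>

    /* Minimal serial sketch of the Nstream bandwidth kernel: a STREAM-triad-
       like update streaming three large vectors through memory, so the
       reported rate is bandwidth-bound. */
    void nstream(size_t length, double scalar, double *a,
                 const double *b, const double *c) {
      for (size_t i = 0; i < length; i++)
        a[i] += b[i] + scalar * c[i];
    }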

Agenda •  Motivation

•  Limitations

•  Philosophy

•  Context

•  Usage model

•  Reference implementations

•  PRKs you should care about

•  PRK you may care about

•  Example results

Results The following results were obtained on the NERSC Cray XC30 (Edison)

•  two 12-core Intel® Xeon® E5-2695 processors per node

•  Aries interconnect in a Dragonfly topology.

•  Intel 15.0.1.133 C/C++ compiler for all codes, except Cray Compiler Environment (CCE) 8.4.0.219 for Cray UPC and GCC 4.9.2 for Grappa; the Berkeley UPC compiler 2.20.2 was used with the same Intel C/C++ compiler. System library versions: Cray MPT (MPI and SHMEM) 7.2.1, uGNI 6.0, and DMAPP 7.0.1

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit http://www.intel.com/performance

MPI+X based models win (X=OpenMP/MPI3)

[Figure: Transpose, strong scaled (49152x49152; Charm++: 47104x47104); aggregate performance in MB/s]

[Figure: Stencil, radius=4, strong scaled (49152x49152; Charm++: 47104x47104); normalized performance (Mflops/#nodes)/Mflops_single_node_MPI1]

[Figure: Synch_p2p, strong scaled (49152x49152; Charm++: 47104x47104); aggregate performance in Mflops]

Results The following results were obtained on a Xeon workstation

•  two 18-core Intel® Xeon ® CPU E5-2699 processors per node

•  Intel 17.0.0.098 C/C++ compiler with OpenMP enabled

•  All 18 cores used on exactly one processor

•  KMP_AFFINITY=granularity=fine,proclist=[{0,36},{1,37},{2,38},{3,39},{4,40},{5,41},{6,42},{7,43},{8,44},{9,45},{10,46},{11,47},{12,48},{13,49},{14,50},{15,51},{16,52},{17,53},{18,54},{19,55},{20,56},{21,57},{22,58},{23,59},{24,60},{25,61},{26,62},{27,63},{28,64},{29,65},{30,66},{31,67},{32,68},{33,69},{34,70},{35,71}],explicit (i.e. 1 thread/core)


Refcount results, shared counters

[Figure: MCUP/s (0–8) vs. private work per thread (0–8192) on HSW 2x18 cores, 18 threads, for int-atomic, fp-atomic, dependent, fp-lock, and int-lock counter variants]

Summary •  PRK can be used to compare different aspects of parallel programming environments

•  Growing set of reference implementations available: https://github.com/ParRes/Kernels

•  Join the PRK community to contribute or review implementations!