Tobias Fuchs
LMU Munich, MNM Team
www.mnm-team.org
Expressing and Exploiting
Multi-Dimensional Locality
in DASH
SPPEXA Symposium 2016
Background
DASH
• Vision: “C++ standard template library for HPC”.
• Provides n-dim array abstraction for stencil- and dense matrix
operations.
• Realization of the PGAS (partitioned global address space)
programming model.
PGAS and Locality
• Combine distributed memory into virtual global memory space.
• Strong sense of data ownership:
private, shared local, shared global
int p = 42;            // private
dash::Array<T> a;
a.local[4] = p;        // shared local: write to this unit's own part of a
p = a[40];             // shared global: read via global index, possibly remote
PGAS and Locality
• Locality (access distance to data) is the predominant factor for efficiency.
L = (local accesses) / (total accesses)
• The access pattern on data depends on the implementation of the algorithm.
• The complexity of maintaining locality increases exponentially with the number
of data dimensions.
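As a minimal sketch of the ratio L, assume a 1-D array of n elements split into contiguous blocks over a number of units, and a three-point stencil in which the owner of element i reads a[i-1], a[i] and a[i+1]. The function names `owner` and `stencil_locality` and the blocked distribution are assumptions made for this illustration; they are not DASH API.

```cpp
#include <cstddef>

// Owner of global index i when n elements are split into contiguous
// blocks of size ceil(n / units) (assumed blocked distribution).
std::size_t owner(std::size_t i, std::size_t n, std::size_t units) {
  std::size_t block = (n + units - 1) / units;
  return i / block;
}

// L = (local accesses) / (total accesses) for a 3-point stencil in which
// the owner of interior element i reads a[i-1], a[i] and a[i+1].
double stencil_locality(std::size_t n, std::size_t units) {
  std::size_t local = 0, total = 0;
  for (std::size_t i = 1; i + 1 < n; ++i) {
    std::size_t me = owner(i, n, units);
    for (std::size_t j = i - 1; j <= i + 1; ++j) {
      ++total;
      if (owner(j, n, units) == me) ++local;
    }
  }
  return total == 0 ? 1.0 : static_cast<double>(local) / total;
}
```

With 16 elements on 4 units, only the accesses that cross one of the three block boundaries are remote, so 36 of 42 accesses are local. In more dimensions every block has more boundaries, which is why the choice of distribution matters more there.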
Objective and Approach
Objective
Portable efficiency by automatic deduction of optimal data distribution.
Approach
1. Identify distribution properties that allow well-defined specification of
any data distribution.
2. Let algorithms specify soft / hard constraints on distribution properties.
3. Derive optimal distribution for a given set of constraints.
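Steps 2 and 3 can be sketched at run time as a filter-and-score selection: candidates that violate a hard constraint are rejected, and among the rest the one fulfilling the most soft constraints wins. The bit-flag encoding, the `Candidate` struct and the `derive` function are hypothetical names for this illustration; the talk does not state how DASH implements this step.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical property flags; the names follow properties used in this
// talk, the encoding as bit flags is an assumption of this sketch.
enum Property : std::uint32_t {
  BALANCED = 1u << 0,
  REGULAR  = 1u << 1,
  NEIGHBOR = 1u << 2,
  BLOCKED  = 1u << 3
};

struct Candidate {
  const char*   name;        // label of a candidate distribution
  std::uint32_t properties;  // properties the candidate fulfills
};

// Number of set bits in x.
int count_bits(std::uint32_t x) {
  int n = 0;
  for (; x != 0; x &= x - 1) ++n;
  return n;
}

// Returns the index of the candidate that satisfies all hard constraints
// and the most soft constraints, or -1 if no candidate is admissible.
// On a tie, the first candidate wins.
int derive(const std::vector<Candidate>& candidates,
           std::uint32_t hard, std::uint32_t soft) {
  int best = -1, best_score = -1;
  for (std::size_t i = 0; i < candidates.size(); ++i) {
    std::uint32_t p = candidates[i].properties;
    if ((p & hard) != hard) continue;   // a hard constraint is violated
    int score = count_bits(p & soft);   // number of soft constraints met
    if (score > best_score) {
      best = static_cast<int>(i);
      best_score = score;
    }
  }
  return best;
}
```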
Automatic deduction of optimal data distribution
Distribution Properties
Property Categories
Mappings in data distribution can be categorized by their stages:
Partitioning   Decomposing the index domain into blocks
Mapping        Assigning blocks to units
Layout         Storage order of block elements in units’ local memory
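The three stages can be illustrated with a 1-D block-cyclic distribution, where each stage is one small function. The struct `BlockCyclic1D` and its member names are assumptions of this sketch and do not reflect the DASH pattern interface.

```cpp
#include <cstddef>

// Illustrative 1-D block-cyclic distribution; struct and member names are
// assumptions of this sketch, not DASH API.
struct BlockCyclic1D {
  std::size_t block_size;  // elements per block
  std::size_t num_units;   // units in the team

  // Partitioning: global index -> block index.
  std::size_t block_of(std::size_t g) const { return g / block_size; }

  // Mapping: block index -> owning unit (round-robin).
  std::size_t unit_of(std::size_t block) const { return block % num_units; }

  // Layout: global index -> offset in the owner's local memory, with the
  // owner's blocks stored one after another and elements in index order.
  std::size_t local_offset(std::size_t g) const {
    std::size_t local_block = block_of(g) / num_units;
    return local_block * block_size + g % block_size;
  }
};
```

For example, with a block size of 4 and 3 units, global index 17 falls into block 4, which is mapped to unit 1 and stored at local offset 5 there.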
Example: Morton Order Distribution
Category       Properties
Partitioning   balanced, regular, rectangular
Mapping        balanced, minimal, neighbor
Layout         blocked, linear, canonical
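For readers unfamiliar with Morton order (also called Z-order): the Morton index of a 2-D coordinate is obtained by interleaving the bits of its components. The sketch below computes it for block coordinates; ordering blocks by this index tends to keep spatially adjacent blocks close together. The function name `morton2d` is an assumption, and the talk does not specify how DASH computes the order internally.

```cpp
#include <cstdint>

// Interleaves the 16 bits of x and y: bit i of x goes to bit 2i and
// bit i of y goes to bit 2i+1 of the result (Z-order / Morton index).
std::uint32_t morton2d(std::uint16_t x, std::uint16_t y) {
  std::uint32_t z = 0;
  for (int i = 0; i < 16; ++i) {
    z |= static_cast<std::uint32_t>((x >> i) & 1u) << (2 * i);
    z |= static_cast<std::uint32_t>((y >> i) & 1u) << (2 * i + 1);
  }
  return z;
}
```

The four blocks (0,0), (1,0), (0,1), (1,1) receive the indices 0, 1, 2, 3, which gives the characteristic Z shape that repeats at every scale.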
Use Cases
Automatic Deduction of Optimal Data Distribution
“Find a data distribution that fulfills a set of properties.”
// Deduces pattern type, initializes pattern instance:
auto pattern =
  make_pattern<
    partitioning_properties<     // --+
      balanced, regular >,       //   |  compile-time deduction
    mapping_properties<          //   |  via C++11 generic
      neighbor >,                //   |  template metaprogramming
    layout_properties<           //   |
      blocked, row_major >       // --+
  >(
    Size<2>(10000,10000),        // --+  run-time deduction
    Team<2>(24,24));             // --+
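The compile-time half of this deduction can be sketched with standard C++11 type traits: a property tag passed as a template argument selects a pattern type before the program runs. All type names below (`blocked_layout`, `TiledPattern`, `deduce_pattern`, and so on) are hypothetical stand-ins for illustration, not the types DASH uses.

```cpp
#include <type_traits>

// Hypothetical property tags and pattern types, for illustration only.
struct blocked_layout   {};
struct canonical_layout {};

struct TiledPattern     {};  // would store each block contiguously
struct CanonicalPattern {};  // would store elements in global index order

// Compile-time deduction: map a layout property tag to a pattern type.
template <typename LayoutTag>
struct deduce_pattern {
  typedef typename std::conditional<
            std::is_same<LayoutTag, blocked_layout>::value,
            TiledPattern,
            CanonicalPattern
          >::type type;
};

// Checked by the compiler; no run-time cost.
static_assert(
  std::is_same<deduce_pattern<blocked_layout>::type, TiledPattern>::value,
  "blocked layout selects the tiled pattern");
```

Because the selection happens in the type system, an unsupported combination of properties could be rejected with a compile error instead of failing at run time.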
Automatic Deduction of Optimal Data Distribution
“Find a data distribution that is optimal for a given algorithm.”
// Deduce pattern from algorithm constraints:
auto pattern = dash::make_pattern< dash::summa_pattern_constraints >(
                 Size<2>(10000,10000),
                 Team<2>(24,24));

dash::Matrix<double, 2> matrix_a(pattern);
dash::Matrix<double, 2> matrix_b(pattern);
dash::Matrix<double, 2> matrix_c(pattern);

dash::summa(matrix_a, matrix_b, matrix_c);
Automatic Deduction of Optimal Algorithm
“Find algorithm variant that is optimal for a given data distribution.”
// Specify how data is distributed in global memory:
auto pattern = dash::TilePattern<2>(10000,10000, TILED(100,100));

dash::Matrix<double, 2> matrix_a(pattern);
dash::Matrix<double, 2> matrix_b(pattern);
dash::Matrix<double, 2> matrix_c(pattern);

// Selects matrix product algorithm variant that is optimal for the given
// pattern:
dash::multiply(matrix_a, matrix_b, matrix_c);
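One common C++ technique for selecting an algorithm variant from a pattern type is tag dispatch, sketched below. The pattern types, tags and the returned strings are hypothetical placeholders; the talk does not state which mechanism dash::multiply uses.

```cpp
#include <string>

// Hypothetical layout tags and pattern types carrying them.
struct tiled_tag  {};
struct linear_tag {};

struct TilePatternSketch   { typedef tiled_tag  layout; };
struct LinearPatternSketch { typedef linear_tag layout; };

// Variant for tiled data: would multiply block by block.
inline std::string multiply_impl(tiled_tag)  { return "blockwise"; }
// Variant for linearly stored data.
inline std::string multiply_impl(linear_tag) { return "elementwise"; }

// Front end: overload resolution picks the variant from the pattern's
// layout tag at compile time.
template <typename Pattern>
std::string multiply(const Pattern&) {
  return multiply_impl(typename Pattern::layout());
}
```

Calling `multiply(TilePatternSketch())` resolves to the blockwise variant, while the caller's code stays the same for every pattern type.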
Automatic Deduction of Optimal Algorithm
“Find data distribution for the most efficient algorithm variant.”
// Use constraints of most efficient algorithm, usually SUMMA for DGEMM:
auto pattern = dash::make_pattern< dash::multiply_pattern_constraints >(
                 Size<2>(10000,10000),
                 Team<2>(24,24));

dash::Matrix<double, 2> matrix_a(pattern);
dash::Matrix<double, 2> matrix_b(pattern);
dash::Matrix<double, 2> matrix_c(pattern);

// Calls dash::summa
dash::multiply(matrix_a, matrix_b, matrix_c);
Evaluation: DGEMM
MKL multithreaded vs. DASH MPI (GFLOP/s)
DASH: automatic distribution of matrix elements to MPI processes,
each using serial MKL for block matrix multiplication (SUMMA).
MKL: OpenMP threads, matrix initialization in master thread.
MKL multithreaded vs. DASH MPI (Speedup)
DASH: High locality due to optimal data distribution,
massive communication overhead (MPI, no shared windows).
MKL: Low locality (first touch issues), no communication.
DASH beats MKL for bigger N and higher degrees of parallelism.
Speedup = DASH_GFLOPS / MKL_GFLOPS
Evaluation: SGEMM
MKL multithreaded vs. DASH MPI (GFLOP/s)
DASH: automatic distribution of matrix elements to MPI processes,
each using serial MKL for block matrix multiplication (SUMMA).
MKL: OpenMP threads, matrix initialization in master thread.
MKL multithreaded vs. DASH MPI (Speedup)
DASH: High locality due to optimal data distribution,
massive communication overhead (MPI, no shared windows).
MKL: Low locality (first touch issues), no communication.
DASH beats MKL for bigger N and higher degrees of parallelism.
Speedup = DASH_GFLOPS / MKL_GFLOPS
Summary
• The optimal distribution of n-dimensional data depends on an unmanageable
multitude of factors (topology, access pattern, data flow, …).
• We defined a universal classification of distribution properties.
• The property system allows automatic deduction of optimal data distributions
and algorithm variants at compile time and run time.
Works with any C++11 compiler (tested: Intel 14.0+, gcc 4.7+, clang).
• Work in progress: optimal data distribution for data flows.
Tobias Fuchs
www.mnm-team.org/~fuchst
DASH Project
www.dash-project.org
Visit for upcoming release