TRANSCRIPT
ASSIST: A Feedback-Directed Optimization
source to source transformation tool for
HPC applications
William Jalby, Y. Lebras, Andres S. Charif-Rubial
UVSQ/ECR
11th Parallel Tools Workshop – 11/09/2017
ASSIST
Outline
1. Introduction: motivation, goals
2. ASSIST
• Requirements
• Implementation & Design
• Available Transformations
3. Examples and Experimental Results
• ASSIST PGO versus Intel PGO
• Other Transformations Applied to Real Applications
4. Conclusion
I - INTRODUCTION
Motivations
Combining source-level knowledge with static/dynamic performance analyses is very attractive for accurate performance diagnostics
Source code vs. actually executed code
Better understanding of memory-related issues (dependencies, array accesses)
Feedback-Directed Optimization (FDO) / Profile-Guided Optimization (PGO) is a well-known optimization approach used by compilers, but…
Lack of information about what is really done
Limited in the performance information used (loop trip count, branch behavior)
Limited in transformation power
Cannot be configured by the user
Goals
Basic idea: MAQAO is pretty good at diagnosing performance problems; we need to go further and fix performance issues.
ASSIST, an “auto-tuning” framework: for us, auto-tuning essentially means fully automated
Exploiting MAQAO’s metrics & knowledge
Detecting & exploiting information from the source code
Transformation-driven framework: ideally, detect whether a transformation is beneficial or not
Full control over transformations
Help developers maintain their code
Ensure portability
Ease code refactoring (e.g. change data types across a program)
Provide users with a means to supply extra information that cannot be encoded in the program (i.e. programming-language limitations)
II - ASSIST
Implementation & Design
Optimization Process
Requirements
Compiler infrastructure requirements
Allows manipulating the Abstract Syntax Tree (AST)
Performs source-to-source transformations
Handles the Fortran, C and C++ languages
The Rose Compiler
Meets all these criteria
Robust for these languages
No equivalent existed when we started
Implementation & Design
ASSIST: Automatic Source-to-Source assISTant
Support the following input languages
• Fortran 77, 90, 95, 2003 / C / C++03
Readable output
• Special effort on indentation and spaces
Easy to use with a simple user interface
• Annotations
• Configuration file
Target audience
• Users with the ability to modify/annotate the code
• Application developers
Integrated as a MAQAO Module
• Take advantage of the interconnection between the core (binary manipulation and analysis layers) and the modules
• Use the modules’ output to perform transformation(s)
• Extend MAQAO to source code manipulation
Available Transformations
Three types of transformations
User Interface
Annotations – Source code annotation
Configuration file – Describes, line by line, which transformation to perform on which statement
AST Modifier
• Unroll
• Full unroll
• Interchange
• Strip mining
• Tiling
• Loop/Function Specialization
Directive(s) insertion
• Loop count (involving dynamic analyses)
Mix of both
• Block Vectorization
Transformations
Specialization
Transformation of type : AST Modifier
Specializing integer parameters provides the compiler with optimization opportunities
• Constant propagation
• Partial Dead Code Elimination
• Loop unrolling, tiling, block vectorization, etc
Single values or ranges can be defined
Two distinct cases
• Loop specialization
• Function specialization
Transformations
Loop Specialization Example
• Set bounds
• Conservative: keep a generic version
Transformations
Function Specialization
• Partial Dead Code Elimination
• More information to perform another transformation
Transformations
Loop count
Loop-oriented transformation of type: directive insertion
Knowing the loop trip count enables the compiler to perform optimizations
• The compiler cannot always determine the loop trip count at compile time => it may refuse to vectorize
• Most of the time it simplifies
The control flow (fewer loop versions)
The choice of vectorization/unrolling
Requires dynamic feedback
Performed by VPROF (a MAQAO module)
Returns the number of iterations of loops (min, max & average)
Limitation
• Loop bounds are dataset dependent
Example
Dynamic feedback example
Original loop
Extract of VPROF’s output
Exploiting the feedback
Return a file with corresponding directives
maqao s2s \
-vprof_xp=/home/ylebras/vprof_dir/vprof.csv \
-bin=/home/ylebras/NBP3.3.1/NPB3.3.1-SER/bin/is.B.x
for (i=0; i < NUM_KEYS; i++)
key_buff_ptr[key_buff_ptr2[i]]++;
#pragma loop_count min=134217728, max=134217728, avg=134217728
for (i=0; i < NUM_KEYS; i++) {
key_buff_ptr[key_buff_ptr2[i]]++;
}
Transformations
Block Vectorization
Loop-oriented transformation of type: directive insertion & AST modifier
Performing a loop decomposition increases the vectorization ratio
Increasing the vectorization ratio by:
• Forcing vectorization (“SIMD” directive)
• Avoiding dynamic or static loop peeling (use of the UNALIGNED pragma)
If the loop bound is not known at compile time
• The loop is specialized by checking the modulo of a given input
[Figure: a loop not vectorized by the compiler; target: AVX2, double-precision body]
[Figure: the same loop after decomposition into a vectorized block and a residual loop; the original loop was not vectorized by the compiler]
Example
Example of the block vectorization performed in AVBP (target architecture : Skylake)
Original loop
Extract of CQA’s output
In this case, “nproduct” is often called with the value “3”
Exploiting the CQA feedback
Example
Example of the block vectorization performed in AVBP (target architecture : Skylake using AVX2)
Step 1 – Specialize the loop
Step 2 – Apply block vectorization
Keep a generic version of the code
Results
CQA report before and after block vectorization
Before: the loop is partially vectorized (33% of SSE/AVX instructions are used in vector mode); only 50% of the vector length is used. 33% of SSE/AVX loads and 33% of SSE/AVX stores are used in vector mode.
After: the loop is vectorized (all SSE/AVX instructions are used in vector mode), but on 75% of the vector length.
Transformations
Configuration file sample
• File: source file path
• Arch: architectures to support
• Target a loop by its line number or by a label attached to the loop
A way to annotate an application without adding directives in the source code
III – Experimental Results
Results
Test cases
NPB-3.3.1-SER (Fortran77/C) https://www.nas.nasa.gov/publications/npb.html
• NAS Parallel Benchmarks
Applications
AVBP (Fortran95) http://www.cerfacs.fr/avbp7x/
• A parallel CFD code that solves the three-dimensional compressible Navier-Stokes equations on unstructured and hybrid grids
Yales2 (Fortran2003) https://www.coria-cfd.fr/index.php/YALES2
• YALES2 aims at solving two-phase combustion problems, from primary atomization to pollutant prediction, on massive complex meshes
Warp3D (Fortran77) http://www.warp3d.net/
• A research code for the solution of large-scale, 3-D solid models subjected to static and dynamic loads
ABINIT (Fortran90) https://www.abinit.org
• ABINIT is a software suite to calculate the optical, mechanical, vibrational, and other observable properties of materials
Results
Experimental setup
Compiled with icc 17.0.4
Intel Skylake (Intel® Xeon® Platinum 8170 CPU @ 2.10 GHz)
Multiple (around 30) executions to be statistically meaningful and avoid outliers
PGO performance comparison
Original version
ICC’s PGO
ASSIST’s PGO-like mode (loop count transformation)
Results of other transformations
Block Vectorization
Specialization
Results on NAS
Speedups of ICC’s PGO versus the loop count transformation, compared to the original version
Number of loops processed with the loop count transformation:
BT.B 34
CG.B 11
DC.B 5
EP.B 2
FT.B 6
IS.B 14
LU.B 49
MG.B 18
SP.B 79
UA.B 80
Results are not significant for some benchmarks: many loop bounds have been hard-coded.
Results on AVBP, Yales2 & Warp3D
Speedups of ICC’s PGO versus the loop count transformation, compared to the original version
Number of loops processed with the loop count transformation:
1D_COFFE 122
3D_Cylinder 162
SIMPLE 158
NASA 149
test_68 57
Results on AVBP(model = SIMPLE)
Speedup by function before and after applying function/loop specialization and block vectorization
Results on AVBP(model = SIMPLE)
Execution time by function before and after applying function/loop specialization and block vectorization
Results on ABINIT(Ti-256)
Speedup with function specialization + tiling, versus specialization only, versus ICC’s PGO, compared to the original version
                    Time (sec)   Speedup
Original version       1.14       1.00
icc's PGO              1.14       1.00
ASSIST Spe             1.10       1.04
ASSIST Spe+Tiling      0.65       1.75
IV - Conclusion
Conclusion
A framework performing selective source-to-source transformations/optimizations guided by static/dynamic performance analyses.
An open-source FDO tool
• Harnessing static and dynamic analyses from MAQAO
• Defining transformations on a per-architecture basis, either automatically or by the user
• Transformations done directly or via pragmas
Encouraging results
• Using the loop count transformation alone is already competitive with Intel’s PGO
• Block vectorization only needs a static analysis of the binary and provides significant speedups when the compiler fails to vectorize efficiently
• Automatic specialization improves both maintainability and performance
Future work
Enhance our FDO tool
• Keep working on function/loop specialization, both annotation-driven and automatic, using feedback from MAQAO tools
• Use more data from dynamic feedback (hardware counters, static analyses)
• Enable the tool to launch MAQAO modules (auto-tuning mode) based on the detected opportunities
Unified view of source- and binary-level analyses
• Help application developers understand the gap between how the code should run and how it actually performs
Continue to work with our application-developer partners on code-maintainability features
Keep adding other transformations based on MAQAO’s research work to detect more optimization opportunities
• Use multiple datasets as input
• Detect values for specialization
• …
Thanks
Any questions?
Requirements
Find a compiler infrastructure allowing source-to-source transformations and handling the Fortran, C and C++ languages
         License  C  C++  Fortran  Source-to-source  Documentation  Weakness
GNU      OSI      ✓  ✓    ✓        ~                 ~              GPL license; misses information in the AST
Cetus    GPL      ✓  x    x        ✓                 ✓              Handles only C
Par4All  MIT      ✓  x    ✓        ✓                                Only for parallelism
LLVM     BSD      ✓  ✓    ~        ~                 ~              No Fortran when we started (now a first version of Flang)
Rose     BSD      ✓  ✓    ✓        ✓                 ✓              EDG license for C/C++
Orio     BSD      ~  x    x        ~                 x              Only a subset of C / to other languages
✓ requirement met
~ theoretically possible / weak
x requirement not met
Transformations
Unroll
• Unrolls the body of a loop by a factor N
• Reduces the instructions that control the loop
• Reduces branch penalties
• Helps the compiler to vectorize
Transformations
Full Unroll
• The loop is replaced by its fully unrolled body
• Same advantages as above
• Removes the loop overhead
Transformations
Interchange
• Better access pattern to array elements
• Moving from column-major to row-major order, or the inverse
Transformations
Strip Mine
• Reorganizes a loop to iterate over blocks of data sized to fit in the cache
Transformations
Tiling / Blocking
• Strip mining applied to two or more dimensions
Transformations
Generic Block Vectorization
• If the loop bound is not known
The loop is specialized by checking the modulo of a given input
Results
AVBP (SIMPLE): block vectorization versus function or loop specialization, execution time and speedup (compared to the original version)
                 Original   Function spec.       Loop spec.           Block vectorization (best case)
                 time(s)    Speedup  time(s)     Speedup  time(s)     Speedup  time(s)
grad_4obj         3.862      1.62     2.38        1.55     2.49        2.04     1.89
scatter_o_add     3.78       0.85     4.44        1.21     3.13        0.97     3.88
scatter_add       4.164      1.00     4.16        0.99     4.22        1.38     3.01
scatter_o_sub     2.63       0.98     2.69        1.00     2.62        1.21     2.17
gather_o_cpy     16.324      0.81    20.12        1.04    15.68        1.28    12.76
balance_cor       0.492      1.00     0.49        1.00     0.49        1.24     0.39
central           0.86       1.35     0.64        1.59     0.54        1.85     0.46
central_nv        0.945      1.60     0.59        1.21     0.78        2.65     0.36
mass_product      2.238      1.02     2.84        1.27     2.69        2.58     1.49
laxwe             2.278      0.79     2.23        0.83     1.80        1.51     0.88