TRANSCRIPT
ASSIST: A Feedback-Directed Optimization
source to source transformation tool for
HPC applications
William Jalby, Y. Lebras, Andres S. Charif-Rubial
UVSQ/ECR
11th Parallel Tools Workshop – 11/09/2017
ASSIST
Outline
1. Introduction: motivation, goals
2. ASSIST
• Requirements
• Implementation & Design
• Available Transformations
3. Examples and Experimental Results
• ASSIST PGO versus Intel PGO
• Other Transformations Applied to Real Applications
4. Conclusion
I - INTRODUCTION
Motivations
Combining source-level knowledge with static/dynamic performance analyses is very attractive for accurate performance diagnostics
Source code vs. actually executed code
Better understanding of memory-related issues (dependencies, array accesses)
Feedback-Directed Optimization (FDO) / Profile-Guided Optimization (PGO) is a well-known optimization approach used by compilers, but…
Lack of information about what is really done
Limited in the performance information used (loop trip count, branch behavior)
Limited in transformation power
Cannot be configured by the user
Goals
Basic idea: MAQAO is pretty good at diagnosing performance problems; we need to go further and fix performance issues.
ASSIST, an “auto-tuning” framework: for us, auto-tuning essentially means fully automated
Exploiting MAQAO’s metrics & knowledge
Detecting & exploiting information from the source code
Transformation-driven framework: ideally, detect whether a transformation is beneficial or not
Full control over transformations
Help developers maintain their code
Ensure portability
Ease code refactoring (e.g. change data types across a program)
Provide users with a means to supply extra information that cannot be encoded in the program (i.e. programming-language limitations)
II - ASSIST
Implementation & Design
Optimization Process
Requirements
Compiler infrastructure requirements
Allows manipulating the Abstract Syntax Tree (AST)
Performs source-to-source transformations
Handles the Fortran, C and C++ languages
The Rose Compiler
Meets all these criteria
Robust for these languages
No equivalent existed when we started
Implementation & Design
ASSIST: Automatic Source-to-Source assISTant
Support the following input languages
• Fortran 77, 90, 95, 2003 / C / C++03
Readable output
• Special effort on indentation and spaces
Easy to use with a simple user interface
• Annotations
• Configuration file
Target audience
• Users with the ability to modify/annotate the code
• Application developers
Integrated as a MAQAO Module
• Take advantage of the interconnection between the core (binary manipulation and analysis layers) and the modules
• Use the modules’ output to perform transformation(s)
• Extend MAQAO to source code manipulation
Available Transformations
Three types of transformations
User Interface
Annotations – Source code annotation
Configuration file – Describes, line by line, which transformation to perform on which statement
AST Modifier
• Unroll
• Full unroll
• Interchange
• Strip mining
• Tiling
• Loop/Function Specialization
Directive(s) insertion
• Loop count (involving dynamic analyses)
Mix of both
• Block Vectorization
Transformations
Specialization
Transformation of type : AST Modifier
Specializing integer parameters provides the compiler with optimization opportunities
• Constant propagation
• Partial Dead Code Elimination
• Loop unrolling, tiling, block vectorization, etc
Single values or ranges can be defined
Two distinct cases
• Loop specialization
• Function specialization
Transformations
Loop Specialization Example
• Set bounds
• Conservative: keep a generic version
Transformations
Function Specialization
• Partial Dead Code Elimination
• More information to perform another transformation
Transformations
Loop count
Loop-oriented transformation of type: directive insertion
Knowing the loop trip count enables the compiler to perform optimizations
• The compiler cannot always determine the loop trip count at compile time => it may refuse to vectorize
• Most of the time it simplifies
The control flow (fewer loop versions)
The choice of vectorization/unrolling
Requires dynamic feedback
Performed by VPROF (a MAQAO module)
Returns the number of iterations of loops (min, max & average)
Limitation
• Loop bounds are dataset dependent
Example
Dynamic feedback example
Original loop
Extract of VPROF’s output
Exploiting the feedback
Return a file with corresponding directives
maqao s2s \
-vprof_xp=/home/ylebras/vprof_dir/vprof.csv \
-bin=/home/ylebras/NBP3.3.1/NPB3.3.1-SER/bin/is.B.x
for (i=0; i < NUM_KEYS; i++)
key_buff_ptr[key_buff_ptr2[i]]++;
#pragma loop_count min=134217728, max=134217728, avg=134217728
for (i=0; i < NUM_KEYS; i++) {
key_buff_ptr[key_buff_ptr2[i]]++;
}
Transformations
Block Vectorization
Loop-oriented transformation of type: directive insertion & AST modifier
Performing a loop decomposition increases the vectorization ratio
Increasing the vectorization ratio by:
• Forcing vectorization (“SIMD” directive)
• Avoiding dynamic or static loop peeling (use of the UNALIGNED pragma)
If the loop bound is not known at compile time
• The loop is specialized by checking the modulo of a given input
[Figure: a loop not vectorized by the compiler; target: AVX2, double-precision body]
[Figure: the same loop after decomposition into a vectorized block and a residual loop; the original loop was not vectorized by the compiler]
Example
Example of the block vectorization performed in AVBP (target architecture : Skylake)
Original loop
Extract of CQA’s output
In this case, “nproduct” is often called with the value “3”
Exploiting the CQA feedback
Example
Example of the block vectorization performed in AVBP (target architecture : Skylake using AVX2)
Step 1 – Specialize the loop
Step 2 – Apply block vectorization
Keep a generic version of the code
Results
CQA report before and after block vectorization
Before: the loop is partially vectorized (33% of SSE/AVX instructions are used in vector mode); only 50% of the vector length is used. 33% of SSE/AVX loads and 33% of SSE/AVX stores are used in vector mode.
After: the loop is vectorized (all SSE/AVX instructions are used in vector mode), but on 75% of the vector length.
Transformations
Configuration file sample
• File: source file path
• Arch: architectures to support
• Target a loop by its line number or by a label attached to the loop
A way to annotate an application without adding directives in the source code
III – Experimental Results
Results
Test cases
NPB-3.3.1-SER (Fortran77/C) https://www.nas.nasa.gov/publications/npb.html
• NAS Parallel Benchmarks
Applications
AVBP (Fortran95) http://www.cerfacs.fr/avbp7x/
• A parallel CFD code that solves the three-dimensional compressible Navier-Stokes equations on unstructured and hybrid grids
Yales2 (Fortran2003) https://www.coria-cfd.fr/index.php/YALES2
• YALES2 aims at solving two-phase combustion problems, from primary atomization to pollutant prediction, on massive complex meshes
Warp3D (Fortran77) http://www.warp3d.net/
• A research code for the solution of large-scale, 3-D solid models subjected to static and dynamic loads
ABINIT (Fortran90) https://www.abinit.org
• ABINIT is a software suite to calculate the optical, mechanical, vibrational, and other observable properties of materials
Results
Experimental setup
Compiled with icc 17.0.4
Intel Skylake (Intel® Xeon® Platinum 8170 CPU @ 2.10 GHz)
Multiple (around 30) executions to be statistically meaningful and avoid outliers
PGO performance comparison
Original version
ICC’s PGO
ASSIST’s PGO-like mode (loop count transformation)
Results of other transformations
Block Vectorization
Specialization
Results on NAS
Speedups of ICC’s PGO versus the loop count transformation, compared to the original version
Number of loops processed with the loop count transformation:
BT.B 34
CG.B 11
DC.B 5
EP.B 2
FT.B 6
IS.B 14
LU.B 49
MG.B 18
SP.B 79
UA.B 80
Results are not significant for some benchmarks: many loop bounds have been hard-coded.
Results on AVBP, Yales2 & Warp3D
Speedups of ICC’s PGO versus the loop count transformation, compared to the original version
Number of loops processed with the loop count transformation:
1D_COFFE 122
3D_Cylinder 162
SIMPLE 158
NASA 149
test_68 57
Results on AVBP(model = SIMPLE)
Speedup by function before and after applying function/loop specialization and block vectorization
Results on AVBP(model = SIMPLE)
Execution time by function before and after applying function/loop specialization and block vectorization
Results on ABINIT(Ti-256)
Speedup with function specialization + tiling, versus specialization only, versus ICC’s PGO, compared to the original version
                    Time (sec)   Speedup
Original version       1.14       1.00
icc's PGO              1.14       1.00
ASSIST Spe             1.10       1.04
ASSIST Spe+Tiling      0.65       1.75
IV - Conclusion
Conclusion
A framework performing selective source-to-source transformations/optimizations guided by static/dynamic performance analyses.
An open-source FDO tool
• Harnessing static and dynamic analyses from MAQAO
• Defining transformations on a per-architecture basis, either automatically or by the user
• Transformations done directly or via pragmas
Encouraging results
• Using the loop count transformation alone is already competitive with Intel’s PGO
• Block vectorization only needs a static analysis of the binary and provides significant speedups when the compiler fails to vectorize efficiently
• Automatic specialization improves both maintainability and performance
Future work
Enhance our FDO tool
• Keep working on function/loop specialization, both annotation-driven and automatic, using feedback from MAQAO tools
• Use more data from dynamic feedback (hardware counters, static analyses)
• Enable the tool to launch MAQAO modules (auto-tuning mode) based on the detected opportunities
Unified view of source- and binary-level analyses
• Help application developers understand the gap between how the code should run and how it actually performs
Continue to work with our application-developer partners on code-maintainability features
Keep adding other transformations based on MAQAO’s research work to detect more optimization opportunities
• Use multiple datasets as input
• Detect values for specialization
• …
Thanks
Any questions?
Requirements
Find a compiler infrastructure allowing source-to-source transformations and handling the Fortran, C and C++ languages
         License  C  C++  Fortran  Source-to-source  Documentation  Weakness
GNU      OSI      ✓  ✓    ✓        ~                 ~              GPL license; misses information in the AST
Cetus    GPL      ✓  x    x        ✓                 ✓              Handles only C
Par4All  MIT      ✓  x    ✓        ✓                                Only for parallelism
LLVM     BSD      ✓  ✓    ~        ~                 ~              No Fortran when we started (now a first version of Flang)
Rose     BSD      ✓  ✓    ✓        ✓                 ✓              EDG license for C/C++
Orio     BSD      ~  x    x        ~                 x              Only a subset of C / to other languages
✓ requirement met
~ theoretically possible / weak
x requirement not met
Transformations
Unroll
• Unrolls the body of a loop by a factor N
• Reduces the instructions that control the loop
• Reduces branch penalties
• Helps the compiler to vectorize
Transformations
Full Unroll
• The loop is replaced by its fully unrolled body
• Same advantages as above
• Removes the loop overhead
Transformations
Interchange
• Better access pattern to array elements
• Moving from column-major to row-major order, or the inverse
Transformations
Strip Mine
• Reorganizes a loop to iterate over blocks of data sized to fit in the cache
Transformations
Tiling / Blocking
• Strip mining applied to two or more dimensions
Transformations
Generic Block Vectorization
• If the loop bound is not known
The loop is specialized by checking the modulo of a given input
Results
AVBP (SIMPLE): block vectorization versus function or loop specialization, execution time and speedup (compared to the original version)
                 Original   Function spec.       Loop spec.           Block vectorization (best case)
                 time(s)    Speedup  time(s)     Speedup  time(s)     Speedup  time(s)
grad_4obj         3.862      1.62     2.38        1.55     2.49        2.04     1.89
scatter_o_add     3.78       0.85     4.44        1.21     3.13        0.97     3.88
scatter_add       4.164      1.00     4.16        0.99     4.22        1.38     3.01
scatter_o_sub     2.63       0.98     2.69        1.00     2.62        1.21     2.17
gather_o_cpy     16.324      0.81    20.12        1.04    15.68        1.28    12.76
balance_cor       0.492      1.00     0.49        1.00     0.49        1.24     0.39
central           0.86       1.35     0.64        1.59     0.54        1.85     0.46
central_nv        0.945      1.60     0.59        1.21     0.78        2.65     0.36
mass_product      2.238      1.02     2.84        1.27     2.69        2.58     1.49
laxwe             2.278      0.79     2.23        0.83     1.80        1.51     0.88