Intel® Parallel Studio XE 2016 Composer Edition · 2015-10-28

Intel® Software Conference 2015

Intel® Parallel Studio XE 2016 Composer Edition

Copyright © 2014, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others. Optimization Notice

Introduction Overview, schedule, generic changes for Parallel Studio XE 2016


Intel® Parallel Studio XE 2016 – Components

Tools                                                  Composer Edition     Professional Edition   Cluster Edition
Intel® C++ compiler                                    ☑                    ☑                      ☑
Intel® Fortran compiler                                ☑                    ☑                      ☑
Intel® Math Kernel Library                             ☑                    ☑                      ☑
Intel® Threading Building Blocks library               ☑                    ☑                      ☑
Intel® Integrated Performance Primitives               ☑                    ☑                      ☑
Intel® Cilk™ Plus parallel model                       ☑                    ☑                      ☑
OpenMP* 4.0                                            ☑                    ☑                      ☑
Intel® Advisor XE                                                           ☑                      ☑
Intel® Inspector XE                                                         ☑                      ☑
Intel® VTune™ Amplifier XE                                                  ☑                      ☑
Intel® MPI library                                                                                 ☑
Intel® Trace Analyzer and Collector                                                                ☑
Rogue Wave IMSL* Library                               Bundled & Add-on     Add-on                 Add-on

New!  Intel® Data Analytics Acceleration Library (Intel® DAAL): ☑ ☑ (two editions checked)


Intel® Parallel Studio XE Composer Edition – Component Versions

Component                                               2015 Update 2                          2016 Beta (mid April 2015)             2016 Release (Q3/2015)
Intel® C++ Compiler (ICC)                               15.0 (L: 15.0.2.164, W: 15.0.2.179)    16.0 (L: 16.0.0.036, W: 16.0.0.042)    16.0 (L: 16.0.0.???, W: 16.0.0.???)
Intel® Math Kernel Library (Intel® MKL)                 11.2.2                                 11.3                                   11.3
Intel® Integrated Performance Primitives (Intel® IPP)   8.2                                    9.0                                    9.0
Intel® Threading Building Blocks (Intel® TBB)           4.3                                    4.4                                    4.4
Debug solution for Linux*                               Intel-extended GDB based on v. 7.8     Intel-ext. GDB based on 7.8            Intel-ext. GDB based on 7.xxx

L: Linux* build, W: Windows* build

§  Beta test just started (April, WW14/15)

§  Please contact presenter to join beta testing!


Operating System and IDE Support

Overall support unchanged from Intel® Parallel Studio XE 2015

§  Linux*
    §  Debian* 6 no longer supported
    §  CentOS* 6.5 (64-bit only): Intel® Graphics Technology only
    §  Red Hat* 5.x: still supported but deprecated!

§  Windows*
    §  Latent support for Windows* 10 and Microsoft Visual Studio 2015*
    §  Microsoft Visual Studio 2013* Shell replaces Visual Studio 2010* Shell

§  OS X*
    §  Latest OS X* and Xcode* supported - currently OS X* 10.10 (Yosemite) and Xcode* 6.x
    §  32-bit Mac hardware *not* supported
    §  32-bit Mac application development is supported


Intel® Parallel Studio XE 2016 Layout

<Installation Directory>
    compiler_and_libraries_<version>.<update>.<pkg>
    documentation_<version>
    ide_support_<version>
    samples_<version>
    debugger_<version>
    parallel_studio_xe_<version>.<update>.<pkg_psxe>
    system_studio_<version>.<update>.<pkg_iss>
    inde_<version>.<update>.<pkg_inde>
    advisor_<version>
    inspector_<version>
    trace_analyzer_and_collector_<version>
    vtune_amplifier_<version>

Callouts: Shared components · Consistent version, update, pkg ids · Symbolic links


Intel® Parallel Studio XE 2016 Layout: Compiler and Libraries for specific target OSes

<Installation Directory>
    compiler_and_libraries_<version>.<update>.<pkg>
        <target OS>
            bin
                <host arch>[_<target arch>]
            compiler
                include
                    <target_arch>
                    cilk
                lib
                    <target_arch>[_<target_os_subset>]
            ipp
            mkl
            tbb
            mpi
    documentation_<version>
    ide_support_<version>
    samples_<version>
    debugger_<version>
    parallel_studio_xe_<version>.<update>.<pkg_psxe>

Callouts: Shared components · Symbolic links


Licensing Changes: New License File

Intel® Parallel Studio XE 2016 employs new feature-based names

Feature-based names require issuing a new license file
§  Customer must obtain a new license file to use Intel® Parallel Studio XE 2016

§  New license file supports all releases

§  Contains both new feature-based codes (i.e. supports Intel® Parallel Studio XE 2016) and previous product-based feature codes (i.e. supports Intel® Parallel Studio XE 2015 and earlier)

Existing license files provide ongoing support for releases prior to Intel® Parallel Studio XE 2016


Changing Option Names on Linux* Starting with "-o"

Switch names starting with -o are renamed to start with -qo, except "-o <object files>"

§  Change made for compatibility with the GCC and LLVM switch naming convention

§  The key new reason, however, is the popular ccache utility, which doesn't work with option names like "-openmp"

    §  Really a design issue in ccache

Renaming Samples

Old Name        New Name
-opt-report     -qopt-report
-openmp         -qopenmp
-opt-malloc     -qopt-malloc
-offload        -qoffload

§  Change process started with the 16.0 compiler release:

    §  Names without the 'q' prefix are still accepted

    §  Release version might print a deprecation message


New Features of all Compilers OpenMP, vectorization, optimization reports



OpenMP* 4.1 Extensions

Support for Features in OpenMP* 4.1 Technical Report 3

§  Non-structured data allocation
    §  omp target [enter | exit] data

§  Asynchronous offload
    §  nowait clause on omp target

§  Dependence (signal)
    §  depend clause on omp target

§  Map clause extensions
    §  Modifiers always and delete

Available for C/C++ and Fortran

Note: Standard not released yet – very likely in Q4/2015
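The extensions listed above can be sketched in a few lines of C++. This is a minimal illustration, not code from the presentation: the function name `scale_on_target` is hypothetical, and the sketch assumes a compiler that accepts these clauses (they were later standardized in OpenMP 4.5). Without OpenMP support, or without an offload device, the pragmas are ignored or fall back to the host and the loop still produces the same result.

```cpp
#include <vector>

// Hypothetical helper (not from the slides): scales a vector on the default
// target device, or on the host when offload or OpenMP is unavailable.
std::vector<float> scale_on_target(std::vector<float> v, float factor) {
    float *p = v.data();
    const int n = static_cast<int>(v.size());
    if (n == 0) return v;

    // Non-structured data allocation: the device copy is created here and
    // lives until the matching "exit data" below.
    #pragma omp target enter data map(to: p[0:n])

    // Asynchronous offload: nowait makes the target region a deferred task,
    // depend orders it against other tasks that touch p.
    #pragma omp target map(tofrom: p[0:n]) nowait depend(inout: p[0:n])
    for (int i = 0; i < n; ++i)
        p[i] *= factor;

    // Wait for the deferred target task before reading the data back.
    #pragma omp taskwait

    // Copy the results back and release the device copy.
    #pragma omp target exit data map(from: p[0:n])
    return v;
}
```

Calling `scale_on_target({1.0f, 2.0f, 3.0f}, 2.0f)` returns the doubled values whether or not the region is actually offloaded.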


Improvements in Vectorization: Intel® Cilk™ Plus and OpenMP* 4.0

simdlen (i.e. vectorlength) and safelen for loops
§  Usable with #pragma simd (Intel® Cilk™ Plus) and omp simd (OpenMP*)

Array reductions
§  Fortran only (available in Beta update)

User-defined reductions
§  Supported for parallel in C/C++ for POD types. No support for Fortran, SIMD, or non-POD types (C++)

omp simd collapse(N) clause
§  Available in a Beta update

FP-model honoring for simd loops
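As an illustration of simdlen and safelen on an OpenMP simd loop, the sketch below uses a loop whose only dependence spans 16 iterations. The function name and the lag of 16 are assumptions made for this example. safelen(16) tells the compiler that vectors of up to 16 iterations are safe, and simdlen(8) states a preferred vector length within that limit. A compiler without OpenMP SIMD support ignores the pragma and runs the loop serially with the same result.

```cpp
#include <vector>

// Hypothetical example: each element depends on the one LAG positions earlier.
constexpr int LAG = 16;

std::vector<int> lagged_increment(std::vector<int> b) {
    const int n = static_cast<int>(b.size());
    int *p = b.data();
    // The dependence distance is LAG (16), so vectors of up to 16 iterations
    // are safe; ask for 8 lanes as the preferred length.
    #pragma omp simd safelen(16) simdlen(8)
    for (int i = LAG; i < n; ++i)
        p[i] = p[i - LAG] + 1;
    return b;
}
```

Starting from 40 zeros, elements 16 to 31 become 1 and elements 32 to 39 become 2, exactly as in serial execution.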


Improvements in Vectorization: Other internal improvements

Alignment analysis
§  Information propagation improved
§  __assume_aligned() fixed

Memory reference analysis
§  Resolved all “subscript/dereference too complex” cases
§  More convoluted cases optimized to use vector loads

Improvements for AVX512
§  conflict/compress/expand idioms improved

Improved optimization reports

Uniformity analysis and handling
§  Scalar control flow and scalar computations
§  Benefits to memory reference analysis

Local target control supported
§  Vectorization properly targeted, e.g.

#include <immintrin.h>
void foo1(float *y, float *a, float *b, int n) {
    if (_may_i_use_cpu_feature(_FEATURE_AVX2)) {
        for (int i=0; i < n; ++i)
            y[i] = a[i]*y[i] + b[i];   // use FMA
    } else {
        for (int i=0; i < n; ++i)
            y[i] = a[i]*y[i] + b[i];
    }
}


Loop Blocking Pragma/Directive

§  Syntax:

C++
#pragma block_loop [clause[,clause]...]
#pragma noblock_loop

Fortran
!DIR$ BLOCK_LOOP [clause[[,] clause]...]
!DIR$ NOBLOCK_LOOP

clause:
factor ( expr )
level ( levels )
private ( var1 [,var2 ]... )

§  BLOCK_LOOP enables greater control over optimizations on a specific DO/for loop inside a nested loop

§  Uses the loop blocking technique to separate loops with large iteration counts into smaller iteration groups

§  Smaller groups can increase efficiency of cache space use and augment performance

§  Works seamlessly with other directives including SIMD


Loop Blocking Sample

Original Source Code:

#pragma block_loop factor(250) level(2)
for (i=0; i < m; i++) {
    for (j=0; j < m; j++) {
        c[i] += a[i][j]*b[j];
    }
}

Outline of code after compiler loop transformations:

for (jj=0; jj < m/250+1; jj++) {
    for (i=0; i < m; i++) {
        for (j=jj*250; j < min((jj+1)*250,m); j++) {
            c[i] += a[i][j]*b[j];
        }
    }
}

Note: It is not always safe to interchange the iteration variables, because of dependencies between statements and the order in which they execute. The compiler performs this safety check!
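To make the transformation concrete, here is a hand-written C++ sketch of the same matrix-vector product in plain and blocked form. The function names and the `Matrix` alias are assumptions for illustration. The compiler-generated code may differ, but the blocked version visits the same (i, j) pairs and, for each c[i], adds the terms in the same order, so both functions return identical results.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// Plain version: for each row, walk the whole of b.
std::vector<double> matvec_plain(const Matrix& a, const std::vector<double>& b) {
    std::vector<double> c(a.size(), 0.0);
    for (std::size_t i = 0; i < a.size(); ++i)
        for (std::size_t j = 0; j < b.size(); ++j)
            c[i] += a[i][j] * b[j];
    return c;
}

// Blocked version: the j loop is split into blocks of `block` columns
// (block must be > 0), and the block loop is moved outermost so that one
// block of b stays in cache while all rows are processed.
std::vector<double> matvec_blocked(const Matrix& a, const std::vector<double>& b,
                                   std::size_t block) {
    std::vector<double> c(a.size(), 0.0);
    for (std::size_t jj = 0; jj < b.size(); jj += block) {
        const std::size_t j_end = std::min(jj + block, b.size());
        for (std::size_t i = 0; i < a.size(); ++i)
            for (std::size_t j = jj; j < j_end; ++j)
                c[i] += a[i][j] * b[j];
    }
    return c;
}
```

The interchange is safe here because each c[i] is updated independently; this is the kind of dependence check the compiler performs before applying the directive.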


Ordered Blocks in SIMD Contexts

Syntax:
§  C++
    #pragma omp ordered [simd] newline   or   #pragma simdoff
    structured code block
§  Fortran
    !$omp ordered [simd]
    structured code block
    !$omp end ordered

Semantics:
§  The ordered construct with the simd clause specifies a structured block in the simd loop or SIMD function that will be executed in the order of the loop iterations or sequence of calls to SIMD functions.

Rules:
§  #pragma simdoff / #pragma omp ordered simd is only allowed inside a SIMD loop or SIMD-enabled function.
§  A simdoff region must be a single-entry and single-exit code block
§  The strict ordered execution is only guaranteed for the block itself
§  Execution remains weakly ordered with respect to code outside of the block or other ordered blocks
§  Data dependencies between statements of the same block will be correctly resolved
§  Other non-vector dependencies originating in an ordered block still lead to undefined behavior


Ordered Examples

OK:

for _Simd (i = 0; i < N; i++) {
    ...
    #pragma simdoff
    { a[indices[i]] += b[i]; }        // index conflict
    ...
    #pragma simdoff
    { if (c[i] > 0) q[j++] = b[i]; }  // compress
    ...
    #pragma simdoff
    {
        lock(L)                       // atomic update
        if (x > 10) x = 0;
        unlock(L)
    }
    ...
    #pragma simdoff
    { a[indices[i]] += b[i]; }        // still OK
}

Not OK:

for _Simd (i = 0; i < N; i++) {
    ...
    #pragma simdoff
    { if (c[i] > 0) q[j++] = b[i]; }  // compress
    ...
    #pragma simdoff
    {
        if (c[i] > 0)                 // Order will change
            q[j++] = d[i];            // compared to serial
    }
}


Fighting Data Dependencies inside Loop: Using #pragma simdoff and array reductions

#pragma simd
for(int i=0; i < VL; i++) {
    …
    val = values[i];
    grp = groups[i];
    #pragma simdoff   // index conflict
    {
        g_total[grp] += val;
    }
    …
}

Solution: array reductions

#pragma simd reduction(+:g_total)
for(int i=0; i < VL; i++) {
    …
    val = values[i];
    grp = groups[i];
    g_total[grp] += val;
    …
}

[Figure: each SIMD lane accumulates (+=) into its own private copy of g_total; the private copies are then reduced into g_total.]

grp (indices):   0 3 2 3 0 2 1 2
v (values):      5 7 8 9 3 6 5 3
g_total:         8 5 17 16
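What the array reduction does conceptually can be written out by hand. The sketch below is an assumption-laden illustration, not the compiler's implementation: the function name `grouped_totals` and the default of 8 lanes are hypothetical. It keeps one private copy of the totals per lane, so that no two lanes ever update the same element, and then sums the private copies.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: emulates reduction(+:g_total) with per-lane private copies.
std::vector<int> grouped_totals(const std::vector<int>& groups,
                                const std::vector<int>& values,
                                std::size_t num_groups,
                                std::size_t lanes = 8) {
    // One private, zero-initialized copy of the totals per conceptual SIMD lane.
    std::vector<std::vector<int>> priv(lanes, std::vector<int>(num_groups, 0));

    // Iteration i is handled by lane i % lanes; lanes never share a copy,
    // so repeated group indices cannot conflict.
    for (std::size_t i = 0; i < values.size(); ++i)
        priv[i % lanes][groups[i]] += values[i];

    // Final step: reduce the private copies into the result.
    std::vector<int> total(num_groups, 0);
    for (const auto& copy : priv)
        for (std::size_t g = 0; g < num_groups; ++g)
            total[g] += copy[g];
    return total;
}
```

With the slide's data (groups 0 3 2 3 0 2 1 2 and values 5 7 8 9 3 6 5 3) the function returns 8, 5, 17, 16, matching the g_total row above.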


Adjacent Gathers

§  Three basic forms
    §  Support for non-unit-strided accesses
    §  Support for indirect accesses
    §  Support for stencil codes

§  Replace series of gathers with a series of vector loads and sequence of permutes

§  In stencil case reduce number of gathers/loads

§  Support is case-driven: important cases get priority
    §  Not a generic solution
    §  Simple cases supported in 15.0
    §  Much more in 16.0

§  Submit your important cases!

Non-unit-strided accesses:

for (int i=start_idx; i<end_idx; i++) {
    TYPE acc_x_0 = 0, acc_y_0 = 0, acc_z_0 = 0;
    for (int j=b_start_idx; j<b_end_idx; j+=1) {
        TYPE dt_x_0 = In_X[(j+0)] - In_X[i];
        TYPE dt_y_0 = In_Y[(j+0)] - In_Y[i];
        TYPE dt_z_0 = In_Z[(j+0)] - In_Z[i];
        acc_x_0 += s_0*dt_x_0;
        acc_y_0 += s_0*dt_y_0;
        acc_z_0 += s_0*dt_z_0;
    }
    Out_V[3*(i+0)+0] += delta_t * acc_x_0;
    Out_V[3*(i+0)+1] += delta_t * acc_y_0;
    Out_V[3*(i+0)+2] += delta_t * acc_z_0;
}

Indirect accesses:

for (int k = 0; k < numneighs; k++) {
    const int j = neighs[k];
    double xj = x[j * PAD + 0];
    double yj = x[j * PAD + 1];
    double zj = x[j * PAD + 2];
    …
}

Stencil codes:

do 2900 i=ibeg+iriter,iend,4
  do 2070 j=jbeg-1,jend+1
    . . .
    dqm = (v1(i,j,k) - v1(i,j-1,k)) * dx2bi(j)
    dqp = (v1(i,j+1,k) - v1(i,j,k)) * dx2bi(j+1)
    dq(j,1) = max (dqm * dqp, zro) * sign (one, dqm + dqp) / max(abs(dqm + dqp ), tiny)
    . . .

Don’t rely too much on the compiler: use SoA layout if possible
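The SoA advice can be illustrated with a small C++ sketch. All type and function names here are hypothetical. In the array-of-structures (AoS) layout, reading only the x components touches every third double, which is the strided pattern that needs gathers or load-plus-permute sequences. In the structure-of-arrays (SoA) layout the same values are contiguous and can be read with plain vector loads.

```cpp
#include <vector>

// Array of structures: x, y, z of one point are adjacent in memory.
struct PointAoS { double x, y, z; };

// Structure of arrays: all x values are adjacent, then all y, then all z.
struct PointsSoA { std::vector<double> x, y, z; };

// Convert AoS data to SoA once, up front.
PointsSoA to_soa(const std::vector<PointAoS>& pts) {
    PointsSoA s;
    for (const PointAoS& p : pts) {
        s.x.push_back(p.x);
        s.y.push_back(p.y);
        s.z.push_back(p.z);
    }
    return s;
}

// Strided access: consecutive iterations read addresses 3 doubles apart.
double sum_x_aos(const std::vector<PointAoS>& pts) {
    double sum = 0.0;
    for (const PointAoS& p : pts) sum += p.x;
    return sum;
}

// Unit-stride access: consecutive iterations read consecutive doubles.
double sum_x_soa(const PointsSoA& pts) {
    double sum = 0.0;
    for (double v : pts.x) sum += v;
    return sum;
}
```

Both functions compute the same sum; only the memory access pattern seen by the vectorizer differs.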


How it works

for (int i = 0; i < size; ++i) {
    for (int j = i + 1; j < size; ++j) {
        xij = xi - data[3 * j];
        yij = yi - data[3 * j + 1];
        zij = zi - data[3 * j + 2];
        ...

[Figure: instead of 3 gathers for data[3*j], data[3*j+1] and data[3*j+2], the data array is read into registers with vector loads and rearranged with permutes, blends and shuffles.]

From 30% to 48% speed-up on KNC for size equal to 10000

KNC sequence is:
•  3 pairs of loadunpacklpd/loadunpackhpd
•  6 cross-lane permutations
•  5 blends
•  2 in-lane shuffles or swizzles


Intel® Advisor XE - Vectorization Advisor: Data Driven Vectorization Design

Have you:
§  Recompiled with AVX2, but seen little benefit?
§  Wondered where to start adding vectorization?
§  Recoded intrinsics for each new architecture?
§  Struggled with cryptic compiler vectorization messages?

Breakthrough for vectorization design
§  What vectorization will pay off the most?
§  What is blocking vectorization and why?
§  Are my loops vector friendly?
§  Will reorganizing data increase performance?
§  Is it safe to just use pragma simd?

More Performance, Fewer Machine Dependencies


Annotated Source Listing (ASL) (available in Beta Update 1)

A modified copy of a source code file with each line numbered and compiler diagnostics inserted after the corresponding lines

The listing file can be generated in either plain text or HTML format

Example:

1 int* foo(int* a, int* b, int upperbound){
2
3   int* c = new int[upperbound];
4   #pragma omp parallel for
    OpenMP DEFINED LOOP WAS PARALLELIZED
5   for (int i = 0; i < upperbound; ++i) {
    LOOP BEGIN at Test/library.cpp(5,2)
    <Peeled>
    LOOP END
    LOOP BEGIN at Test/library.cpp(5,2)
       remark #25460: No loop optimizations reported
    LOOP END
6     c[i] = a[i] + b[i];
7   }
8   return c;
9 }


Annotated Source Listing (ASL) Compiler Options

[/Q | -q]opt-report-annotate=[text | html]

§  Enable annotated source listing using specified format (Default: Disabled)

§  When enabled without format specification, format defaults to: text

[/Q | -q]opt-report-annotate-position=[caller | callee | both]

§  Enable annotated source listing and specify site where optimization messages appear for inlined cases of loop optimizations (Default: Disabled)

§  When enabled without position specification, site defaults to: caller


What’s New in Intel® Fortran Compiler XE 2016 Intel® Fortran Compiler 16.0



New and Changed Features

Submodules from Fortran 2008

IMPURE ELEMENTAL from Fortran 2008

Further C Interoperability from Fortran 2015

Other New Features

§  ASYNCHRONOUS communication

§  -fpp-name option

§  VS2013 Shell

§  Uninitialized Variable Run-time Detection


Submodules (F2008) – The Problem

module bigmod
  …
contains
  subroutine sub1
    …<implementation of sub1>
  function func2
    …<implementation of func2>
  subroutine sub47
    …<implementation of sub47>
  …
end module bigmod

! Source source1.f90
use bigmod
…
call sub1

! Source source2.f90
use bigmod
…
x = func2(…)

! Source source47.f90
use bigmod
…
call sub47

Some edit to bigmod: the module must be recompiled, and so must every source file that uses it (source1.f90, source2.f90, …, source47.f90).


Submodules (F2008) – The Solution

Changes in the submodule do not force recompilation of uses of the module – as long as the interface does not change

module bigmod
  …
  interface
    module subroutine sub1
      …
    module function func2
      …
    module subroutine sub47
      …
  end interface
end module bigmod

submodule (bigmod) bigmod_submod
contains
  module subroutine sub1
    … <implementation of sub1>
  module function func2
    … <implementation of func2>
  module subroutine sub47
    … <implementation of sub47>
end submodule bigmod_submod


Further C Interoperability (F2015)

TS 29113 on “Further Interoperability of Fortran with C” to be part of Fortran 2015. Motivations include:

§  Support needed for MPI-3 Fortran 2008 language binding

    §  See chapter 17.1 of MPI 3.0 Standard

§  Provide Fortran equivalent of C’s “void*” – assumed type and rank

§  Enable C code to manipulate array descriptors

§  Extend interoperable interfaces to ALLOCATABLE, POINTER, OPTIONAL, assumed shape, character assumed length

§  Extend ASYNCHRONOUS attribute beyond I/O

§  Relaxed restrictions


C-Descriptor Structure - CFI_cdesc_t

§  The new functionality requires the C code to have access to all the components describing the Fortran objects (arrays in general)

§  A structure containing (simplified):

Type              Name        Value
void *            base_addr   Base address of object
size_t            elem_len    Storage size of a single element
int               version     CFI_VERSION number
CFI_rank_t        rank        Number of dimensions
CFI_type_t        type        Number identifying the intrinsic, interoperable type
CFI_attribute_t   attribute   Identifies whether object is allocatable, a pointer etc.
CFI_dim_t         dim[]       For each dimension (rank): lower bound, extent and stride
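How C code can use such a descriptor to walk a non-contiguous array section is sketched below. The structs `MiniDim` and `MiniDesc` are simplified stand-ins invented for this example; they are not the real `CFI_dim_t` and `CFI_cdesc_t` from `ISO_Fortran_binding.h`, and the stride is assumed to be stored in bytes. The point is only the address arithmetic: base address plus, per dimension, (subscript minus lower bound) times stride.

```cpp
#include <cstddef>

// Simplified, hypothetical stand-ins for CFI_dim_t / CFI_cdesc_t.
struct MiniDim {
    std::ptrdiff_t lower_bound;  // first valid subscript
    std::ptrdiff_t extent;       // number of elements in this dimension
    std::ptrdiff_t sm;           // stride in bytes between successive elements
};

struct MiniDesc {
    void*       base_addr;       // address of the first element
    std::size_t elem_len;        // size of one element in bytes
    int         rank;            // number of dimensions used in dim[]
    MiniDim     dim[2];
};

// Address of the element with the given subscripts (one per dimension).
void* element_address(const MiniDesc& d, const std::ptrdiff_t* subscripts) {
    char* addr = static_cast<char*>(d.base_addr);
    for (int r = 0; r < d.rank; ++r)
        addr += (subscripts[r] - d.dim[r].lower_bound) * d.dim[r].sm;
    return addr;
}

// Demo: a column-major 4x3 int array holding 0..11, viewed through a
// descriptor for every second row, similar to the section y(1::2,:).
int strided_pick(std::ptrdiff_t row, std::ptrdiff_t col) {
    static int data[12];
    for (int i = 0; i < 12; ++i) data[i] = i;
    const std::ptrdiff_t int_size = static_cast<std::ptrdiff_t>(sizeof(int));
    MiniDesc d{data, sizeof(int), 2,
               {{0, 2, 2 * int_size},    // rows: every second element
                {0, 3, 4 * int_size}}};  // columns: 4 elements apart
    const std::ptrdiff_t sub[2] = {row, col};
    return *static_cast<int*>(element_address(d, sub));
}
```

Row 1, column 2 of the section maps to offset 1*2 + 2*4 = 10 in the underlying array, so no copy of the section is needed.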


Sample: MPI-3 Fortran MPI_F08 Language Binding

Fortran interface for MPI_Send routine (as defined in the MPI_F08 module from MPI-3.0)

SUBROUTINE MPI_Send( buf, count, datatype, dest, &
                     tag, comm, ierror )
  TYPE(*), DIMENSION(..), INTENT(IN) :: buf
  INTEGER, INTENT(IN)                :: count, dest, tag
  TYPE(MPI_Datatype), INTENT(IN)     :: datatype
  TYPE(MPI_Comm), INTENT(IN)         :: comm
  INTEGER, OPTIONAL, INTENT(OUT)     :: ierror
END SUBROUTINE MPI_Send
…
CALL MPI_Send(y(1::2,:), size(y(1::2,:),KIND=c_int), MPI_INT, dest, tag, MPI_COMM_WORLD )


ASYNCHRONOUS Attribute

§  Guarantee of correct asynchronous operations

§  Fortran has no pointer aliasing

§  Compilers tend to aggressively re-order code
    §  Compiler can move the code xx=buf(…) above the MPI_Wait()

REAL :: buf(100,100)
TYPE(MPI_Request) :: req
TYPE(MPI_Status) :: status
... ! Code that involves buf
BLOCK
  ASYNCHRONOUS :: buf
  CALL MPI_Irecv( buf, size(buf), MPI_REAL, src, tag, &
                  MPI_COMM_WORLD, req )
  ... ! Overlapped computation that does not involve buf
  CALL MPI_Wait( req, status )
  xx = buf(2,3) ! Without ASYNCHRONOUS, compiler could
                ! move code before MPI_Wait call
END BLOCK


Uninitialized Variable Run-time Detection

Uninitialized variable checking using the [Q]init option is extended to local, automatic, and allocated variables of intrinsic numeric type

Example:

 4 real, allocatable, dimension(:) :: A
 5
 6 ALLOCATE(A(N))
 7
 8 do i = 1, N
 9   Total = Total + A(I)
10 enddo

$ ifort -init=arrays,snan -g -traceback sample.F90 -o sample.exe
$ sample.exe
forrtl: error (182): floating invalid - possible uninitialized real/complex variable.
Image         PC                 Routine   Line   Source
...
sample.exe    0000000000402E12   MAIN__    9      sample.F90
...
Aborted (core dumped)


Other New Features

OpenMP 4.1

•  TARGET NOWAIT – current task may continue execution without waiting for the target to finish

•  TARGET DEPEND – treated as if DEPEND had been specified for implicit TASK construct enclosing TARGET

[NO]BLOCK_LOOP enables or disables loop blocking for the following loop

New -fpp-name option

•  Lets you supply your own fpp preprocessor

VS2013 Shell replaces VS2010 Shell on Windows


C/C++ Specific New Features Intel® C/C++ Compiler 16.0



Compile Time Improvements

•  Intrinsic headers emmintrin.h, immintrin.h etc. provided by Intel

•  Very large files: thousands of prototypes for SSE, AVX, … intrinsics like

extern __m128 _mm_shuffle_ps(__m128, __m128, unsigned int);
extern __m128 _mm_unpackhi_ps(__m128, __m128);

•  Opened and parsed before each compilation – takes much time!

•  Prototypes are now automatically disabled in headers

•  Use -D__INTEL_COMPILER_USE_INTRINSIC_PROTOTYPES to restore old behaviour (enhanced type checking)


ANSI C11 Standard Support

New Feature Support
§  Unicode strings

§  C11 anonymous unions

New Keyword Support

_Alignas          _Alignof
_Static_assert    _Thread_local
_Noreturn         _Generic

_Generic Example

#define pow(X, Y) _Generic((X), long double: powl, \
                                default: pow,      \
                                float: powf)(X, Y)


C++14 Standard Support

§  Generic Lambdas

§  Generalized lambda captures

§  Digit Separators

§  [[deprecated]] attribute

§  Function return type deduction

§  Member initializers and aggregates

§  Feature test macros

Reference: C++14 FDIS http://www.open-std.org/JTC1/SC22/WG21/docs/papers/2013/n3690.pdf


New in C++14

Generic lambdas

auto glambda = [] (auto a) { return a; };

Generalized lambda captures

int x = 4;
int z = [&r = x, y = x+1] {
    r += 2;       // set x to 6; "R is for Renamed Ref"
    return y+2;   // return 7 to initialize z
}();              // invoke lambda

Function return type deduction

auto foo(int i) {
    if (i == 1) return i;
    else return foo(i-1)+i;
}
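The three C++14 features above can be combined into one small compilable unit. The names `identity`, `triangular` and `capture_demo` are chosen for this sketch and do not come from the slides; the code assumes a compiler running in C++14 mode or later (for example with -std=c++14).

```cpp
#include <utility>

// Generic lambda: a single closure type whose call operator is a template.
auto identity = [](auto a) { return a; };

// Function return type deduction: deduced as int from the first return,
// which is why the recursive call in the second return is allowed.
auto triangular(int i) {
    if (i == 1) return i;
    else return triangular(i - 1) + i;
}

// Generalized lambda capture: r is a reference to x, y is a new variable
// initialized from x + 1 at the point where the lambda is created.
std::pair<int, int> capture_demo() {
    int x = 4;
    int z = [&r = x, y = x + 1] {
        r += 2;        // modifies x through the reference, x becomes 6
        return y + 2;  // y was captured as 5, so this yields 7
    }();
    return {x, z};
}
```

`triangular(4)` evaluates to 10 and `capture_demo()` returns the pair (6, 7), matching the comments on the slide.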


C++14: Feature Test Macros

See Technical Report: https://isocpp.org/std/standing-documents/sd-6-sg10-feature-test-recommendations

Test for existence of a header file:

#if __has_include("shared_mutex")
// use standard header here
#elif __has_include("boost/shared_mutex.h")
// use BOOST header
#endif

Test for existence of compiler feature:

#ifndef __cpp_constexpr
// no constexpr functionality available
#elif __cpp_constexpr == 200704
// c++11 constexpr functionality available
#else
// c++14 constexpr functionality available
#endif
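The same tests can be wrapped in functions so that a program can report what it was compiled with. This is a sketch under assumptions: the function names are hypothetical, and the extra `defined(__has_include)` guard is added so the code also compiles on compilers that lack `__has_include` altogether.

```cpp
// Returns 0 when no constexpr support is reported, 11 for C++11-level
// constexpr, and 14 for anything newer (C++14 relaxed constexpr or later).
int constexpr_level() {
#ifndef __cpp_constexpr
    return 0;
#elif __cpp_constexpr == 200704
    return 11;
#else
    return 14;
#endif
}

// Reports whether the standard <shared_mutex> header can be included.
bool has_shared_mutex_header() {
#if defined(__has_include)
#  if __has_include(<shared_mutex>)
    return true;
#  else
    return false;
#  endif
#else
    return false;  // cannot test; assume the header is absent
#endif
}
```

Because the macros are evaluated by the preprocessor, the returned values are fixed at compile time and depend on the compiler and the selected language standard.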


GNU* and Microsoft* Compatibility

GNU* compatibility
§  Enable C11 or C++14 support via options: -std=c11 -std=c++14

§  Supports same C++14 and C11 features, except _Atomic (used in <stdatomic.h>), as GNU* 5.x (not released yet)

§  Standards support matches installed GNU* version (i.e. g++ in your PATH)

Microsoft* compatibility
§  Enable C11 or C++14 features (beyond Microsoft* reference compiler) via options:

/Qstd=c11 /Qstd=c++11 /Qstd=c++14

§  Supports same C++14 and C11 features as Microsoft* Visual C++ 2015 (not released yet)

§  Compatible by default; features match system’s reference Microsoft* compiler


Other New Features and Enhancements C/C++ [1]

§  SIMD (two operand) operators support with SSE types

    + - * / & | ^ += -= *= /= &= |= ^= == != > < >= <=

    §  128 and 256-bit SIMD types only
    §  operands must be of same type

    __m128i x,y,z;
    x = y + z;

§  Compiler option control of honoring parentheses
    §  -f[no-]protect-parens   /Qprotect-parens[-]

    Enable/disable (DEFAULT) optimizer honoring of parentheses around floating-point expressions (including complex and decimal)
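Why honoring parentheses matters for floating-point code can be shown with two functions that differ only in where the parentheses sit. The function names are made up for this sketch, and the stated results assume IEEE 754 double arithmetic evaluated in double precision without value-unsafe optimizations (for example, no fast-math style reassociation).

```cpp
// Adds a and b first, then c.
double sum_left(double a, double b, double c)  { return (a + b) + c; }

// Adds b and c first, then a.
double sum_right(double a, double b, double c) { return a + (b + c); }
```

With a = 1e16, b = -1e16 and c = 1.0, `sum_left` yields 1.0 because a and b cancel exactly before c is added, while `sum_right` yields 0.0 because c is lost when it is added to b (neighbouring doubles near 1e16 are 2 apart). An optimizer that is allowed to ignore the parentheses may turn one form into the other and silently change the result.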


Other New Features and Enhancements C/C++ [2]

§  Decimal floating point extension support now for Windows C/C++ too

    §  See document “ISO/IEC JTC 1/SC 22/WG 14 N1912” at http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1912.pdf

§  C++ wrappers for Intel® AVX512 vector operations
    §  Operations (=, -, *, /, ^, |, &, …), math (erf, sqrt, pow, …)

    §  Integer classes (common AVX512 ISA):
        §  M512vec
        §  I[s,u]64vec8 (8 signed or unsigned 64-bit integers)
        §  I[s,u]32vec16 (16 signed or unsigned 32-bit integers)

    §  FP classes
        §  F64vec8 (8 doubles)
        §  F32vec16 (16 floats)

    §  Integer classes (AVX512BW)
        §  I[s,u]16vec32 (32 words)
        §  I[s,u]8vec64 (64 bytes)


Intel® Cilk™ Plus Combined Parallel/SIMD loops

§  New combined Parallel + Vector loop

_Cilk_for _Simd (int i = 0; i < N; ++i)
    // Do something

or

#pragma simd
_Cilk_for (int i = 0; i < N; ++i)
    // Do something

§  Combined loop gives both parallelism (using threads) and vectorization

§  Behaves approximately like this pair of nested loops

_Cilk_for (int i_1 = 0; i_1 < N; i_1 += M)
    for _Simd (int i = i_1; i < i_1 + M; ++i)   // same as #pragma simd
        // Do something

§  The chunk size, M, is determined by the compiler and runtime
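The nested-loop equivalence can be tried out without a Cilk Plus compiler. The sketch below swaps in OpenMP pragmas for the Cilk Plus keywords (`parallel for` on the chunk loop, `simd` on the inner loop), so it illustrates the structure rather than the Cilk Plus implementation. The function name and the explicit `chunk` parameter are assumptions; in the combined loop the chunk size M is chosen by the compiler and runtime. The inner bound is clamped so the last chunk does not run past N.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical helper: multiplies every element by s, chunk by chunk.
// chunk must be positive.
std::vector<float> scale_chunked(std::vector<float> a, float s, int chunk) {
    const int n = static_cast<int>(a.size());
    float *p = a.data();

    // Outer loop: chunks are independent and can be spread over threads.
    #pragma omp parallel for
    for (int i1 = 0; i1 < n; i1 += chunk) {
        const int end = std::min(i1 + chunk, n);  // clamp the last chunk

        // Inner loop: iterations within a chunk are vectorized.
        #pragma omp simd
        for (int i = i1; i < end; ++i)
            p[i] *= s;
    }
    return a;
}
```

Compiled without OpenMP the pragmas are ignored and the function still returns the same values, which makes the decomposition easy to check.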


Performance Libraries Intel® Math Kernel Library,

Intel® Integrated Performance Primitives,

Intel® Threading Building Blocks



Intel® Math Kernel Library 11.3 Beta

Additional Sparse Matrix Vector Multiplication API

§  New two-stage API for Sparse BLAS level 2 and 3 routines

MKL-MPI wrappers

§  All MPI implementations are API-compatible, but MPI implementations are not ABI-compatible

§  MKL-MPI wrapper solves this problem by providing an MPI-independent ABI to MKL

Optimized HPCG (High Performance Conjugate Gradients) benchmark

§  Designed to be more representative of common application workloads


Intel® Math Kernel Library 11.3 Beta (continued)

Support for Small Matrix Multiplication

§  A single call executes multiple independent ?GEMM operations simultaneously

Support for Philox4x32-10 and ARS-5 RNG

§  Two new pseudorandom number generators with a period of 2^128, highly optimized for multithreaded environments

Sparse Solver SMP improvements

§  Significantly improved overall scalability for Intel Xeon Phi coprocessors and scalability of the solving step for Intel Xeon processors


Intel® Integrated Performance Primitives 9.0 Beta New Features

Additional optimization for Intel® processors with Intel® AVX2 instructions support

§  Intel® AVX2: computer vision, image processing optimization

New APIs to support external threading

New APIs to support external memory allocation

Improved CPU dispatcher
§  Auto-initialization. No need for CPU initialization call in static libraries

§  Code dispatching based on CPU features

Optimized cryptography functions to support SM2/SM3/SM4 algorithm

Custom dynamic library building tool


IPP Focus Areas and Supporting Domains

Domain               Primitives
Image Processing     3500
Signal Processing    2100
Computer Vision      700
Cryptography         600
Color Correction     500
Vector Math          400
Data Compression     150
String Processing    100
IPP Core             25

Focus areas: Image Processing & Computer Vision, Data Compression, String Processing, Cryptography


New Intel® IPP Package Structure


OpenCV at a Glance: > 500 functions

Legend: OCV only / Covered by IPP

§  General image processing functions
§  Image pyramids
§  Image descriptors
§  Camera calibration, stereo, 3D
§  Segmentation
§  Transforms
§  Features
§  Tracking
§  Utilities and data structures
§  Fitting
§  Machine learning: detection, recognition
§  Matrix math

Intel® IPP covers ~60%


Performance vs. Promise: OpenCV 3.0 Performance Increases with IPP Optimizations

§  Intel® IPP for OpenCV (ICV) is a subset of Intel® IPP. It contains about 750 functions that are integrated into OpenCV 3.0.

§  OpenCV 3.0 turns on ICV usage by default for the x86 configuration.

§  ICV 8.2 gives a 1.7x speed-up (geometric mean) on Haswell and 1.6x on Baytrail vs. the original “plain” OpenCV


Intel® Threading Building Blocks – What’s New

Fully supported tbb::task_arena

§  Task arenas provide improved control over workload isolation and the degree of concurrency.

Dynamic replacement of standard memory allocation routines for OS X*.

§  Utilize the powerful TBB scalable allocator easily on OS X*

Binary files for 64-bit Android* applications were added as part of the Linux* OS package.

Improvements to the Flow Graph features

§  Don’t forget to check out Flow Graph Designer

Several improvements to examples and documentation


GFX Compiler: Offload Compiler for Intel® HD Graphics


Intel® Graphics Technology

Programming Model Features
§  Shared Virtual Memory (available in Beta update)

§  Some OpenMP* 4.0

§  Improved asynchronous programming support

Performance Improvements
§  Shared Local Memory

§  Tuned for 5th Generation Intel® Core™ processor

§  Improved vectorization for Gen target

Usability
§  gfx_sys_check tool

§  Improved debugging support


Adding (some) OpenMP* 4.0 Offload Support

bool Sobel::execute_offload() {
    int w = COLOR_CHANNEL_NUM * image_width;
    float *outp = this->output;
    float *img = this->image;
    int iw = image_width;
    int ih = image_height;
    #pragma omp target map(to: ih, iw, w) \
            map(tofrom: img[0:iw*ih*COLOR_CHANNEL_NUM], \
                        outp[0:iw*ih*COLOR_CHANNEL_NUM])
    #pragma omp parallel for collapse(2)
    for (int i = 1; i < ih - 1; i++) {
        for (int k = COLOR_CHANNEL_NUM; k < (iw - 1) * COLOR_CHANNEL_NUM; k++) {
            float gx = 1 * img[k + (i - 1) * w -1 * 4]
                     + 2 * img[k + (i - 1) * w +0 * 4]
                     + 1 * img[k + (i - 1) * w +1 * 4]
                     - 1 * img[k + (i + 1) * w -1 * 4]
                     - 2 * img[k + (i + 1) * w +0 * 4]
                     - 1 * img[k + (i + 1) * w +1 * 4];
            float gy = 1 * img[k + (i - 1) * w -1 * 4]
                     - 1 * img[k + (i - 1) * w +1 * 4]
                     + 2 * img[k + (i + 0) * w -1 * 4]
                     - 2 * img[k + (i + 0) * w +1 * 4]
                     + 1 * img[k + (i + 1) * w -1 * 4]
                     - 1 * img[k + (i + 1) * w +1 * 4];
            outp[i * w + k] = sqrtf(gx * gx + gy * gy) / 2.0;
        }
    }
    return true;
}

Usability:
-  Only a subset is supported
-  ‘tofrom’ and ‘to’ map to ‘pin’
-  -qopenmp-offload=gfx must be used to change the compiler default omp target from MIC to GFX


Intel® MIC Architecture Enhancements for Offloading



Offload Features for Intel® Xeon Phi™

New Features available for C/C++

§  Offload of structures with pointer members - enabling offload inside member function

§  Offload with MIC-only memory allocation - new modifiers targetptr and preallocated

§  Offload using Streams - new stream clause and associated APIs

Performance Improvements for C/C++ and Fortran

§  Asynchronous offload

§  Memory Allocation and Data Transfers


Intel® Advanced Vector Extensions 512 (Intel® AVX-512) New instruction set extension for next generation Intel® MIC architecture ( code name Knights Landing – KNL) and future Intel® Xeon architecture



AVX-512 - Greatly increased Register File

SSE:       XMM0-15, 16 bytes
AVX2:      YMM0-15, 32 bytes
AVX-512:   ZMM0-31, 64 bytes

Vector Registers                 IA32 (32bit)    Intel64 (64bit)
SSE (1999)                       8 x 128bit      16 x 128bit
AVX and AVX-2 (2011 / 2013)      8 x 256bit      16 x 256bit
AVX-512 (2014 – KNL)             8 x 512bit      32 x 512bit


The Intel® AVX-512 Subsets [1]

AVX-512F: 512-bit Foundation instructions common between MIC and Xeon
§  Comprehensive vector extension for HPC and enterprise
§  All the key AVX-512 features: masking, broadcast…
§  32-bit and 64-bit integer and floating-point instructions
§  Promotion of many AVX and AVX2 instructions to AVX-512
§  Many new instructions added to accelerate HPC workloads

AVX-512CD: Conflict Detection instructions
§  Allow vectorization of loops with possible address conflict
§  Will show up on Xeon

AVX-512ER and AVX-512PF: extensions for exponential and prefetch operations
§  Fast (28 bit) instructions for exponential, reciprocal and transcendentals (as well as RSQRT)
§  New prefetch instructions: gather/scatter prefetches and PREFETCHWT1


The Intel® AVX-512 Subsets [2]

AVX-512DQ: Double and Quad word instructions
§  All of (packed) 32-bit/64-bit operations AVX-512F doesn’t provide
§  Close 64-bit gaps like VPMULLQ: packed 64x64 → 64
§  Extend mask architecture to word and byte (to handle vectors)
§  Packed/scalar converts of signed/unsigned to SP/DP

AVX-512BW: Byte and Word instructions
§  Extend packed (vector) instructions to byte and word (16 and 8 bit) data types
    §  MMX/SSE2/AVX2 re-promoted to AVX512 semantics
§  Mask operations extended to 32/64 bits to adapt to number of objects in 512 bits
§  Permute architecture extended to words (VPERMW, VPERMI2W, …)

AVX-512VL: Vector Length extensions
§  Vector length orthogonality
    §  Support for 128 and 256 bits instead of full 512 bits
§  Not a new instruction set but an attribute of existing 512-bit instructions

Other New Instructions

MPX: Intel® MPX – Intel Memory Protection Extension
§  Set of instructions to implement checking a pointer against its bounds
§  Pointer Checker support in HW (today a SW-only solution of e.g. Intel Compilers)
§  Debug and security features

SHA: Intel® SHA – Intel Secure Hash Algorithm
§  Fast implementation of cryptographic hashing algorithm as defined by NIST FIPS PUB 180

CLFLUSHOPT: Single instruction – flush a cache line
§  Needed for future memory technologies

XSAVE{S,C}: Save and restore extended processor state


AVX-512 – KNL and future Xeon

§  KNL and future Xeon architecture share a large set of instructions
    §  but sets are not identical

§  Subsets are represented by individual feature flags (CPUID)

[Figure: instruction sets per processor generation, with the sets shared by KNL and future Xeon marked as the Common Instruction Set]

NHM:                     SSE*
SNB:                     SSE*, AVX
HSW:                     SSE*, AVX, AVX2
Future Xeon Phi (KNL):   SSE*, AVX, AVX2*, AVX-512F, AVX-512CD, AVX-512ER, AVX-512PF
Future Xeon:             SSE*, AVX, AVX2, AVX-512F, AVX-512CD, AVX-512BW, AVX-512DQ, AVX-512VL, MPX, SHA, …


Intel® Compiler Processor Switches

Switch               Description
-xmic-avx512         KNL only; already in 14.0
-xcore-avx512        Future Xeon only; already in 15.0.1
-xcommon-avx512      AVX-512 subset common to both; already in 15.0.2
-m, -march, /arch    Not yet!
-ax<…-avx512>        Same as for “-x<…-avx512>”
-mmic                No – not for KNL


Memory Model of next Generation Intel® MIC Architecture Code Name Knights Landing ( KNL)



KNL Memory Modes


High Bandwidth On-Chip Memory API

•  API is open-sourced (BSD licenses)

•  https://github.com/memkind

•  Uses jemalloc API underneath
    •  http://www.canonware.com/jemalloc/
    •  https://www.facebook.com/notes/facebook-engineering/scalable-memory-allocation-using-jemalloc/480222803919

Malloc replacement:

#include <memkind.h>
hbw_check_available()
hbw_malloc, _calloc, _realloc, … (memkind_t kind, …)
hbw_free()
hbw_posix_memalign()
hbw_get_size(), _psize()

ld … -ljemalloc -lnuma -lmemkind -lpthread
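A typical usage pattern is to request high-bandwidth memory when it is available and fall back to the regular heap otherwise. The sketch below is an assumption-based illustration: `USE_HBW` is a hypothetical build switch invented here so that the code also compiles on machines without memkind, the `FastBuffer` type and function names are made up, and the hbw_* calls are taken from memkind's `hbwmalloc.h` header as shipped in the open-source project. Only `hbw_check_available`, `hbw_malloc` and `hbw_free` are used.

```cpp
#include <cstddef>
#include <cstdlib>
#if defined(USE_HBW)       // hypothetical switch: define only when memkind is installed
#include <hbwmalloc.h>
#endif

// Remembers which allocator owns the memory so it is freed correctly.
struct FastBuffer {
    double* data;
    bool    in_hbw;
};

FastBuffer fast_alloc(std::size_t count) {
#if defined(USE_HBW)
    if (hbw_check_available() == 0) {             // 0 means HBW memory exists
        void* p = hbw_malloc(count * sizeof(double));
        if (p) return {static_cast<double*>(p), true};
    }
#endif
    // Fallback: ordinary heap memory.
    return {static_cast<double*>(std::malloc(count * sizeof(double))), false};
}

void fast_free(FastBuffer& buf) {
#if defined(USE_HBW)
    if (buf.in_hbw) { hbw_free(buf.data); buf.data = nullptr; return; }
#endif
    std::free(buf.data);
    buf.data = nullptr;
}

// Usage example: fill the buffer with 0, 1, 2, ... and return the sum.
double fast_buffer_demo(std::size_t count) {
    FastBuffer buf = fast_alloc(count);
    if (!buf.data) return -1.0;                   // allocation failed
    double sum = 0.0;
    for (std::size_t i = 0; i < count; ++i) {
        buf.data[i] = static_cast<double>(i);
        sum += buf.data[i];
    }
    fast_free(buf);
    return sum;
}
```

When built with `USE_HBW` defined, the program has to be linked against memkind as shown in the ld line above; without it, the sketch behaves like a plain malloc/free wrapper.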


Debugging Intel Extended Gnu Debugger (GDB-IA)



Debugging Enhancements

Enhancements to Intel® version of GDB, the GNU* Project Debugger (Linux* only)

§  Improved OpenMP* support (tasks, task dependencies, teams & barriers)

§  Added Fortran intrinsic support (e.g. ASSOCIATED, ALLOCATED, UBOUND, …)

Improved debugging support for Intel® Graphics Technology


Summary / Call to Action



Next Steps

Register and Try our Beta Release
§  Visit: bit.ly/psxe2016beta

Send Feedback!
§  Report issues via Intel Premier - https://premier.intel.com/

§  Please participate in our Beta Surveys, we value all comments!

Remember: New 2016 versions of all Intel® Parallel Studio XE tools – not only compiler and libraries!


Legal Disclaimer & Optimization Notice INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

Copyright © 2015, Intel Corporation. All rights reserved. Intel, Pentium, Xeon, Xeon Phi, Core, VTune, Cilk, and the Intel logo are trademarks of Intel Corporation in the U.S. and other countries.

Optimization Notice

Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804
