Intel® C++ Compiler 14.0 within Intel System Studio

Post on 11-Mar-2020


Page 1: Intel C++ Compiler 14.0 within Intel System Studio• Intel® C++ Compiler 14.0 technical training for System & Application code running Linux*, Android* & Tizen™ • In-depth explanation

Intel® C++ Compiler 14.0 within Intel System Studio


Copyright© 2013, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.

What you will learn from this slide deck

• Intel® C++ Compiler 14.0 technical training for system and application code running Linux*, Android* & Tizen™
• In-depth explanation of compiler specifics for each development environment mentioned above
• Please see subsequent slide decks for in-depth technical training on other components

Compatibility to Standards

The Intel C++ Compiler provides the following language conformances:

• ANSI/ISO standard for C language compilation (ISO/IEC 9899:1990)
• ANSI/ISO standard (ISO/IEC 14882:1998) for the C++ language

Common Optimization Switches

Linux*

Description                                        Switch
Disable optimization                               -O0
Optimize for speed (no code size increase)         -O1
Optimize for speed (default)                       -O2
High-level loop optimization                       -O3
Create symbols for debugging                       -g
Multi-file inter-procedural optimization           -ipo
Profile guided optimization (multi-step build)     -prof-gen, -prof-use
Optimize for speed across the entire program       -fast (same as: -ipo -O3 -no-prec-div -static -xHost)

**Warning: the definition of -fast changes over time

Optimizations for latest generation Intel® Atom™ Processor

• Processor-specific lightweight out-of-order instruction scheduler optimization
• Processor-specific cache management and memory preload optimizations
• Loop optimizations and vectorizer taking advantage of SSE4.2 vector instructions

Processor-specific Compiler Optimizations

Linux*

-xSSE2       -xCORE-AVX2
-xSSE3       -xCORE-AVX-I
-xSSSE3      -xATOM_SSSE3
-xSSE4.1     -xATOM_SSE4.2
-xSSE4.2
-xAVX
-xHost

• These switches imply an Intel CPU ID check
• A runtime message is issued if you try to run on an unsupported processor

Additional new compiler features

• New vectorization report level
  • Adds information about the quality of the vector code generated and does not output the text of messages
  • Report data is processed with a script to produce a summary report and a text file which intersperses vectorization messages and user code
  • Analysis script available at http://software.intel.com/en-us/articles/vecanalysis-python-script-for-annotating-intelr-compiler-vectorization-report
  • Requires Python 2.6.5 or newer
  • Specified with -vec-report7
• -vec-report6 now gives alignment information


The Seven Steps of Optimization

1. Build with optimization disabled
2. Use general optimizations
3. Use processor-specific options
4. Add interprocedural optimizations
5. Use profile-guided optimization
6. Tune automatic vectorization
7. Multithread your application


General optimization options

• -O1
  • Optimizes for code size; auto-vectorization is turned off
• -O2
  • Inlining
  • Vectorization
• -O3
  • Loop optimization
  • Data prefetching
• -ansi-alias / -restrict / -no-prec-div


ICC Atom Specific optimization

• Optimization switch -xSSSE3_ATOM for Saltwell
  – Intel Supplemental Streaming SIMD Extensions 3 (SSSE3)
  – In-order execution
  – Use of LEA for stack operations
  – Instruction reordering
  – Support for the movbe instruction (-minstruction=movbe)
  – Only use it for system development
• Optimization switch -xatom_sse4.2 for Silvermont
  – Intel® Streaming SIMD Extensions 4.2 (SSE4.2)
  – Out-of-order execution

GCC note: use -mtune=atom or -mtune=slm, not -march=atom / -march=slm


SIMD Instruction Enhancements

Extension   Instructions   Focus
SSE         70             Single-precision vectors; streaming operations
SSE2        144            Double-precision vectors; 8/16/32/64/128-bit vector integer
SSE3        13             Complex data
SSSE3       32             Decode
SSE4.1      47             Video; graphics building blocks; advanced vector instructions
SSE4.2      8              String/XML processing; POP-Count; CRC
AES-NI      7              Encryption and decryption; key generation
AVX         ~100 new       ~300 legacy SSE instructions updated; 256-bit vector; 3- and 4-operand instructions

Processors marked on the chart: Intel® Atom (Saltwell), Intel® Atom (Silvermont)


SIMD Instruction Enhancements (2)

addss: Scalar Single-FP Add
• single-precision FP data
• scalar execution mode

    x4       x3       x2       x1
    y4       y3       y2       y1
    x4       x3       x2       x1+y1

addps: Packed Single-FP Add
• single-precision FP data
• packed execution mode

    x4       x3       x2       x1
    y4       y3       y2       y1
    x4+y4    x3+y3    x2+y2    x1+y1
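
The difference between the two execution modes can be reproduced from C with the SSE intrinsics that map to these instructions. The following is a minimal sketch for x86 targets only; the function names add_scalar and add_packed are made up for illustration, and element 0 of each array plays the role of x1 / y1 (the lowest lane).

```c
#include <xmmintrin.h>  /* SSE intrinsics: _mm_add_ss, _mm_add_ps */

/* Scalar mode (addss): only the lowest lane is added;
   the upper three lanes are copied unchanged from x. */
void add_scalar(const float x[4], const float y[4], float out[4])
{
    __m128 vx = _mm_loadu_ps(x);
    __m128 vy = _mm_loadu_ps(y);
    _mm_storeu_ps(out, _mm_add_ss(vx, vy));
}

/* Packed mode (addps): all four lanes are added by one instruction. */
void add_packed(const float x[4], const float y[4], float out[4])
{
    __m128 vx = _mm_loadu_ps(x);
    __m128 vy = _mm_loadu_ps(y);
    _mm_storeu_ps(out, _mm_add_ps(vx, vy));
}
```

With x = {1, 2, 3, 4} and y = {10, 20, 30, 40}, add_scalar yields {11, 2, 3, 4} and add_packed yields {11, 22, 33, 44}.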


Approaches to introduce vectorization

Ease of use (one end of the scale)

• Compiler: fully automatic vectorization
• Cilk Plus array notation
• User-mandated vectorization (SIMD directive)
• Compiler: auto-vectorization hints (#pragma ivdep, …)
• Vector intrinsics (_mm_add_ps())
• Assembler code (addps)

Programmer control (other end of the scale)


Interprocedural Optimizations
Extends optimizations across file boundaries

Without IPO: file1.c, file2.c, file3.c and file4.c are each compiled and optimized separately.

With IPO: file1.c, file2.c, file3.c and file4.c are compiled and optimized together.

-ip     Only between modules of one source file
-ipo    Modules of multiple files / whole application


Interprocedural Optimizations (IPO)
Usage: Two-Step Process

Pass 1: Compiling (produces mock objects)

    icc -c -ipo main.c
    icc -c -ipo func1.c
    icc -c -ipo func2.c

Pass 2: Linking (produces the executable)

    icc -ipo main.o func1.o func2.o


What you should know about IPO

• -O2 and -O3 activate "almost" file-local IPO (-ip)
• IPO increases compilation time and memory usage
• IPO works for libraries too
• Inlining of functions is the most important feature of IPO, but there is much more


Profile-Guided Optimizations (PGO)

Static analysis is limited:

• How often is x > y?
• What is the size of count?
• Which code is touched how often?

Enhancements with PGO:

• More accurate branch prediction
• Better decisions on which functions to inline (helps IPO)
• Basic block movement to improve instruction cache behavior

    if (x > y)
        do_this();
    else
        do_that();

    for (i = 0; i < count; ++i)
        do_work();
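
To make the first question concrete, the sketch below counts by hand how often x > y holds over a data set. It only illustrates the kind of information a profiling run gathers; the struct and function names are hypothetical and this is not how the compiler's instrumentation is actually implemented.

```c
#include <stddef.h>

/* Hypothetical counters standing in for what an instrumented run records. */
struct branch_profile {
    unsigned long taken;      /* times x > y was true  */
    unsigned long not_taken;  /* times x > y was false */
};

/* Record the outcome of the x > y test for every pair in the data set. */
void profile_branch(const int *x, const int *y, size_t n,
                    struct branch_profile *p)
{
    p->taken = 0;
    p->not_taken = 0;
    for (size_t i = 0; i < n; ++i) {
        if (x[i] > y[i])
            ++p->taken;
        else
            ++p->not_taken;
    }
}

/* Percentage of executions in which the branch was taken (0 if no data). */
unsigned taken_percent(const struct branch_profile *p)
{
    unsigned long total = p->taken + p->not_taken;
    return total ? (unsigned)(p->taken * 100 / total) : 0;
}
```

A strongly skewed percentage is exactly what static analysis cannot see and what lets the compiler lay out the likely path as the fall-through case.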


PGO Usage: Three Step Process

Step 1: Compile + link to add instrumentation

    icc -prof-gen prog.c
    Result: instrumented executable prog.exe

Step 2: Execute the instrumented program

    prog.exe (on a typical dataset)
    Result: dynamic profile 12345678.dyn

Step 3: Compile + link using feedback

    icc -prof-use prog.c
    Merged .dyn files: pgopti.dpi
    Result: optimized executable prog.exe


How do I know if a loop was vectorised

• -vec-report[n]

    > icc -vec-report MultArray.c
    MultArray.c(92): (col. 5) remark: LOOP WAS VECTORIZED.

• Always vectorize if safe: #pragma vector always [assert]
• Always vectorize: #pragma simd
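
As a minimal sketch of where these pragmas go, the two hypothetical functions below place each one directly in front of the loop it applies to. The pragmas are guarded so that compilers other than the Intel compiler simply see plain loops; the function names and the loops themselves are assumptions made for illustration.

```c
/* "Always vectorize if safe": the compiler still checks that
   vectorizing this loop is legal. */
void scale(float *a, int n, float factor)
{
#ifdef __INTEL_COMPILER
#pragma vector always
#endif
    for (int i = 0; i < n; ++i)
        a[i] *= factor;
}

/* "Always vectorize": the programmer takes responsibility for
   correctness, so dst and src must not overlap. */
void add_into(float *dst, const float *src, int n)
{
#ifdef __INTEL_COMPILER
#pragma simd
#endif
    for (int i = 0; i < n; ++i)
        dst[i] += src[i];
}
```

Compiling such a file with -vec-report is one way to confirm whether the loops were actually vectorized.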


GAP – Guided Automatic Parallelization

• Uses the compiler infrastructure to help the developer
  – Vectorization, parallelization and data transformations
  – Extends diagnostic messages for failed vectorization and parallelization with specific hints to fix the problem
• Exploits multi-year experience brought into the compiler development
  – Performance tuning knowledge based on dealing with numerous applications, benchmarks and compute kernels
• Does not influence code generation


GAP – How it Works
Compiler Switches for GAP [1]

Activate GAP and optionally define guidance level

• -guide[=level]
• -guide-vec[=level]
• -guide-par[=level]
• -guide-data-trans[=level]
• The optional argument level=1,2,3,4 controls the extent of analysis: '4' is the most advanced / most detailed and is the default


Vectorization Example [1]

    void f(int n, float *x, float *y, float *z, float *d1, float *d2) {
        for (int i = 0; i < n; i++)
            z[i] = x[i] + y[i] - (d1[i]*d2[i]);
    }

GAP Message:

g.c(6): remark #30536: (LOOP) Add -no-alias-args option for better type-based disambiguation analysis by the compiler, if appropriate (the option will apply for the entire compilation). This will improve optimizations such as vectorization for the loop at line 6. [VERIFY] Make sure that the semantics of this option is obeyed for the entire compilation. [ALTERNATIVE] Another way to get the same effect is to add the "restrict" keyword to each pointer-typed formal parameter of the routine "f". This allows optimizations such as vectorization to be applied to the loop at line 6. [VERIFY] Make sure that semantics of the "restrict" pointer qualifier is satisfied: in the routine, all data accessed through the pointer must not be accessed through any other
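
The [ALTERNATIVE] in the message can be sketched as follows: the same routine with every pointer parameter qualified by the C99 restrict keyword. The name f_restrict is made up to keep it apart from the original; the caller must guarantee that the five arrays do not overlap, and the build may need C99 mode or the -restrict switch listed earlier for the keyword to be accepted.

```c
/* Variant of f with restrict-qualified parameters (C99).
   The caller promises that x, y, z, d1 and d2 do not overlap. */
void f_restrict(int n, float *restrict x, float *restrict y,
                float *restrict z, float *restrict d1, float *restrict d2)
{
    for (int i = 0; i < n; i++)
        z[i] = x[i] + y[i] - (d1[i] * d2[i]);
}
```

Unlike -no-alias-args, which applies to the entire compilation, the qualifier limits the no-overlap promise to this one routine.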


Vectorization Example [2]

    void mul(NetEnv* ne, Vector* rslt,
             Vector* den, Vector* flux1,
             Vector* flux2, Vector* num)
    {
        float *r, *d, *n, *s1, *s2;
        int i;
        r = rslt->data; d = den->data;
        n = num->data; s1 = flux1->data;
        s2 = flux2->data;
        for (i = 0; i < ne->len; ++i)
            r[i] = s1[i]*s2[i] +
                   n[i]*d[i];
    }

GAP Messages (simplified):

1. "Use a local variable to host the upper bound of the loop at line 29 (variable: ne->len) if the upper bound does not change during execution of the loop"

2. "Use #pragma ivdep to help vectorize the loop at line 29, if these arrays in the loop do not have cross-iteration dependencies: r, s1, s2, n, d"
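
Applying both hints might look like the sketch below. The NetEnv and Vector definitions are hypothetical minimal stand-ins, since the slide does not show the real types, and the function is renamed mul_hinted for illustration. The pragma is guarded so other compilers ignore it.

```c
/* Hypothetical minimal types; the real definitions are not shown. */
typedef struct { int len; } NetEnv;
typedef struct { float *data; } Vector;

void mul_hinted(NetEnv *ne, Vector *rslt, Vector *den, Vector *flux1,
                Vector *flux2, Vector *num)
{
    float *r = rslt->data, *d = den->data, *n = num->data;
    float *s1 = flux1->data, *s2 = flux2->data;
    /* Hint 1: keep the loop upper bound in a local variable. */
    const int len = ne->len;
    int i;

    /* Hint 2: assert that r, s1, s2, n and d carry no
       cross-iteration dependencies. */
#ifdef __INTEL_COMPILER
#pragma ivdep
#endif
    for (i = 0; i < len; ++i)
        r[i] = s1[i] * s2[i] + n[i] * d[i];
}
```

Both changes are only valid under the conditions the messages state: ne->len must not change inside the loop, and the arrays must really be free of cross-iteration dependencies.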


Data Transformation Example

    struct S3 {
        int a;
        int b; // hot
        double c[100];
        struct S2 *s2_ptr;
        int d; int e;
        struct S1 *s1_ptr;
        char *c_p;
        int f; // hot
    };

peel.c(22): remark #30756: (DTRANS) Splitting the structure 'S3' into two parts will improve data locality and is highly recommended. Frequently accessed fields are 'b, f'; performance may improve by putting these fields into one structure and the remaining fields into another structure. Alternatively, performance may also improve by reordering the fields of the structure. Suggested field order:'b, f, s2_ptr, s1_ptr, a, c, d, e, c_p'. [VERIFY] The suggestion is based on the field references in current compilation …

    for (ii = 0; ii < N; ii++) {
        sp->b = ii;
        sp->f = ii + 1;
        sp++;
    }
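
One way to follow the splitting suggestion is sketched below: the hot fields b and f move into their own small structure, so the loop touches densely packed data instead of striding over the large c array. The names S3_hot, S3_cold and init_hot are made up for illustration, and S1 and S2 are only forward-declared because their definitions are not shown.

```c
/* Forward declarations; S1 and S2 are not defined on the slide. */
struct S1;
struct S2;

/* Hot part: the two fields the loop writes on every iteration. */
struct S3_hot {
    int b;
    int f;
};

/* Cold part: the remaining, rarely accessed fields. */
struct S3_cold {
    int a;
    double c[100];
    struct S2 *s2_ptr;
    int d;
    int e;
    struct S1 *s1_ptr;
    char *c_p;
};

/* The same loop as above, now walking an array of hot parts only. */
void init_hot(struct S3_hot *sp, int n)
{
    for (int ii = 0; ii < n; ii++) {
        sp->b = ii;
        sp->f = ii + 1;
        sp++;
    }
}
```

Code that needs the cold fields then has to keep a parallel array of struct S3_cold (or a pointer to it), which is the price of the better locality in the hot loop.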


Intel® Cilk™ Plus - Overview

Simple keywords (task parallelism)
    Set of keywords for the expression of task parallelism:
    cilk_spawn
    cilk_sync
    cilk_for

Reducers (hyper-objects) (task parallelism)
    Reliable access to nonlocal variables without races
    cilk::reducer_opadd<int> sum(3);

Array notation (data parallelism)
    Provides data parallelism for sections of arrays or whole arrays
    mask[:] = a[:] < b[:] ? -1 : 1;

Elemental functions (data parallelism)
    Define actions that can be applied to whole or parts of arrays or scalars

Execution parameters
    Runtime system APIs, environment variables, pragmas
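
For readers unfamiliar with array notation, the statement mask[:] = a[:] < b[:] ? -1 : 1; describes an element-wise operation over whole arrays. A plain C loop with the same effect is sketched below; the int element type, the explicit length n and the name make_mask are assumptions made for illustration.

```c
/* Element-wise select: mask[i] is -1 where a[i] < b[i], otherwise 1. */
void make_mask(int *mask, const int *a, const int *b, int n)
{
    for (int i = 0; i < n; ++i)
        mask[i] = (a[i] < b[i]) ? -1 : 1;
}
```

The array-notation form states directly that the elements are independent, whereas in the loop form the compiler has to establish that itself before it can vectorize.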


Intel® Cilk™ Plus - Overview

• Intel® Cilk™ Plus (language extension to C/C++)
  • Easier task and data parallelism: 3 simple keywords (cilk_for, cilk_spawn, cilk_sync)
  • Intel® Cilk™ Plus array notation: save time with powerful vectorization
• Minimize software rework for new hardware


Compiler Reports – Optimization Report

Compiler switch: -opt-report-phase[=phase] (Linux*)

phase can be:
• ipo_inl: Interprocedural Optimization Inlining Report
• ilo: Intermediate Language Scalar Optimization
• hpo: High Performance Optimization
• hlo: High-level Optimization
• all: All optimizations (not recommended, output too verbose)

Control the level of detail in the report: -opt-report[0|1|2|3] (Linux*)

• If you do not specify the level (i.e. /Qopt-report, -opt-report), level 2 is used.

Save report output to file: -opt-report-file=[file] (Linux*)

Vectorization subset report: /Qvec-report2, -vec-report2


Optimization Report Example

    icc -O3 -opt-report-phase=hlo -opt-report-phase=hpo

LOOP INTERCHANGE in loops at line: 7 8 9

Loopnest permutation ( 1 2 3 ) --> ( 2 3 1 )

Loop at line 8 blocked by 128

Loop at line 9 blocked by 128

Loop at line 10 blocked by 128

Loop at line 10 unrolled and jammed by 4

Loop at line 8 unrolled and jammed by 4

…(10)… loop was not vectorized: not inner loop.

…(8)… loop was not vectorized: not inner loop.

…(9)… PERMUTED LOOP WAS VECTORIZED
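
The source code behind this report is not shown, so the following is only an illustration of what a loop interchange does, using a small matrix multiply as an assumed example. The size N and both function names are hypothetical, and the permutation here (swapping the two inner loops) is not necessarily the one in the report.

```c
#define N 8   /* hypothetical matrix size */

/* Original order: the innermost loop walks b down a column (stride N). */
void matmul_ijk(double c[N][N], double a[N][N], double b[N][N])
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                c[i][j] += a[i][k] * b[k][j];
}

/* Interchanged order: the innermost loop walks c and b along rows
   (stride 1), an access pattern that is easier to vectorize. */
void matmul_ikj(double c[N][N], double a[N][N], double b[N][N])
{
    for (int i = 0; i < N; i++)
        for (int k = 0; k < N; k++)
            for (int j = 0; j < N; j++)
                c[i][j] += a[i][k] * b[k][j];
}
```

Both versions accumulate into c and compute the same result; the optimizer performs this kind of reordering itself when its dependency analysis shows it is safe.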


High-Level Optimizer (HLO)

Compiler switches: -O2, -O3 (Linux*)

Loop level optimizations

• Loop unrolling, cache blocking, prefetching

More aggressive dependency analysis

• Determines whether or not it's safe to reorder or parallelize statements

Scalar replacement

• Goal is to reduce memory references by replacing them with register references
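
Scalar replacement can be pictured by doing it by hand. In the sketch below (function names are made up for illustration), the first version reads and writes sum[0] in memory on every iteration; the second keeps the running total in a local variable that can live in a register and writes memory once at the end.

```c
/* Before: sum[0] is a memory reference inside the loop. */
void accumulate_mem(float *sum, const float *a, int n)
{
    for (int i = 0; i < n; ++i)
        sum[0] += a[i];
}

/* After scalar replacement: the total is held in a local scalar
   and stored back to memory once, after the loop. */
void accumulate_reg(float *sum, const float *a, int n)
{
    float t = sum[0];
    for (int i = 0; i < n; ++i)
        t += a[i];
    sum[0] = t;
}
```

The two versions give the same result only when sum does not point into a, which is the kind of fact the dependency analysis has to establish before the compiler may apply the transformation on its own.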


Interprocedural Optimizations (IPO)
Multi-pass Optimization

• Interprocedural optimization performs a static, topological analysis of your application
• -ip: Enables interprocedural optimizations for the current source file compilation
• -ipo: Enables interprocedural optimizations across files
  • Can inline functions in separate files
  • Many small utility functions benefit especially from IPO

Enabled optimizations:

• Procedure inlining (reduced function call overhead)
• Interprocedural dead code elimination, constant propagation and procedure reordering
• Enhances optimization when used in combination with other compiler features

Linux*: -ip, -ipo


Interprocedural Optimizations (IPO)
Usage: Two-Step Process

Pass 1: Compiling (produces mock objects)

    Linux*:   icc -c -ipo main.c func1.c func2.c
    Windows*: icl -c /Qipo main.c func1.c func2.c

Pass 2: Linking (produces the executable)

    Linux*:   icc -ipo main.o func1.o func2.o
    Windows*: icl /Qipo main.obj func1.obj func2.obj


Interprocedural Optimizations
Extends optimizations across file boundaries

Without IPO: file1.c, file2.c, file3.c and file4.c are each compiled and optimized separately.

With IPO: file1.c, file2.c, file3.c and file4.c are compiled and optimized together.

/Qip, -ip      Only between modules of one source file
/Qipo, -ipo    Modules of multiple files / whole application


Auto-Vectorization
SIMD – Single Instruction Multiple Data

• Scalar mode
  – one instruction produces one result
• SIMD processing
  – with SSE or AVX instructions, one instruction can produce multiple results

    for (i=0;i<=MAX;i++)
        c[i]=a[i]+b[i];

Scalar:  a[i] + b[i] gives a[i]+b[i]

SIMD:    a[i+7] a[i+6] a[i+5] a[i+4] a[i+3] a[i+2] a[i+1] a[i]
       + b[i+7] b[i+6] b[i+5] b[i+4] b[i+3] b[i+2] b[i+1] b[i]
       = c[i+7] c[i+6] c[i+5] c[i+4] c[i+3] c[i+2] c[i+1] c[i]


Vectorization is Achieved through SIMD Instructions & Hardware

Intel® SSE
• Vector size: 128 bit
• Data types: 8, 16, 32, 64 bit integers; 32 and 64 bit floats
• VL: 2, 4, 8, 16
• Sample: Xi, Yi 32 bit int / float

    Bits 0 to 127:
    X4       X3       X2       X1
    Y4       Y3       Y2       Y1
    X4opY4   X3opY3   X2opY2   X1opY1

Intel® AVX
• Vector size: 256 bit
• Data types: 32 and 64 bit floats
• VL: 4, 8, 16
• Sample: Xi, Yi 32 bit int or float
• First introduced in 2011

    Bits 128 to 255:                      Bits 0 to 127:
    X8       X7       X6       X5         X4       X3       X2       X1
    Y8       Y7       Y6       Y5         Y4       Y3       Y2       Y1
    X8opY8   X7opY7   X6opY6   X5opY5     X4opY4   X3opY3   X2opY2   X1opY1


Comparison of Ways Applications can Take Advantage of Vectorization

Approach                                            Effort Required   Code Maintainability   Performance Potential   Scale Forward
Assembly/Intrinsics                                 Most              Least                  Best                    No
Existing libraries such as Intel® IPP, Intel® MKL   Least             Most                   Best                    Yes
Intel Compiler Auto-Vectorization                   Least             Most                   Good                    Yes
High-level Constructs                               Moderate          Most                   Best                    Yes


Compiling for Intel® AVX and SSSE3 using Intel® C++ Compiler

Compile with -xavx (/Qxavx on Windows*)

• Main speedups are for floating point
  – Integer 256-bit arithmetic instructions are coming with AVX2
  – Best if data is 32-byte aligned

-axavx (/Qaxavx) gives both SSE and AVX code paths

• Use -x (/Qx) switches to modify the default SSE code path
  – e.g. -axavx -xssse3_atom targets Intel Core i7 and Intel Atom™ processors simultaneously (/Qaxavx /Qxssse3_atom on Windows)

software.intel.com/en-us/articles/how-to-compile-for-intel-avx/

software.intel.com/en-us/articles/atom-optimized-compiler/
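
The 32-byte alignment recommendation can be met for heap data with the C11 function aligned_alloc. The sketch below is one portable way to do it, assuming a C11 runtime; the helper names alloc_avx_floats and is_avx_aligned are made up for illustration.

```c
#include <stdlib.h>  /* aligned_alloc, free (C11) */
#include <stdint.h>  /* uintptr_t */

/* Allocate room for n floats on a 32-byte boundary, the width of one
   AVX register. aligned_alloc expects the size to be a multiple of the
   alignment, so the byte count is rounded up. Returns NULL on failure;
   release the memory with free(). */
float *alloc_avx_floats(size_t n)
{
    size_t bytes = n * sizeof(float);
    size_t rounded = (bytes + 31) / 32 * 32;
    if (rounded == 0)
        rounded = 32;
    return (float *)aligned_alloc(32, rounded);
}

/* Nonzero if p sits on a 32-byte boundary. */
int is_avx_aligned(const void *p)
{
    return ((uintptr_t)p % 32) == 0;
}
```

Aligned storage alone does not tell the compiler anything; it still has to be able to see or be told that the pointers it vectorizes over are aligned.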


Compiler Based Vectorization Extension Specification

Feature | SIMD Extension

Intel® Streaming SIMD Extensions 2 (Intel® SSE2) as available in initial Pentium® 4 or compatible non-Intel processors | sse2

Intel® Streaming SIMD Extensions 3 (Intel® SSE3) as available in Pentium® 4 or compatible non-Intel processors | sse3

Supplemental Streaming SIMD Extensions 3 (SSSE3) as available in Intel® Core™2 Duo processors | ssse3

Intel® SSE4.1 as first introduced in Intel® 45nm Hi-K next generation Intel Core™ micro-architecture | sse4.1

Intel® SSE4.2 Accelerated String and Text Processing instructions supported first by Intel® Core™ i7 processors | sse4.2

Like ssse3 but also generates the MOVBE instruction that is available for the Intel® Atom™ processor and Intel® Centrino® Atom™ Processor Technology | ssse3_atom

Intel® Advanced Vector Extensions (Intel® AVX) as available in 2nd generation Intel® Core™ processor family | avx

Intel® Advanced Vector Extensions (Intel® AVX) including instructions offered by the 3rd generation Intel® Core™ processor | core-avx-i

Intel® Advanced Vector Extensions 2 (Intel® AVX2) as provided by a future Intel processor | core-avx2


Compiler Reports – Vectorization Report

Compiler switch: -vec-report<n> (Linux)

Sets the diagnostic level written to stdout:

n=0: No diagnostic information

n=1: (Default) Loops successfully vectorized

n=2: Loops not vectorized, and the reason why not

n=3: Adds dependency information

n=4: Reports only non-vectorized loops

n=5: Reports only non-vectorized loops and adds dependency info
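To see what the report levels distinguish, it helps to have one loop that is easy to vectorize and one that is not. The two functions below are an illustrative sketch (the names scale_in_place and prefix_sum are hypothetical); the exact report wording depends on the compiler version.

```c
#include <assert.h>
#include <stddef.h>

/* Every iteration is independent of the others, so this loop is a typical
   candidate for auto-vectorization. */
void scale_in_place(float *v, float a, float b, size_t n)
{
    assert(v != NULL || n == 0);
    for (size_t i = 0; i < n; ++i)
        v[i] = a * v[i] + b;
}

/* Each iteration reads the value written by the previous one (a loop-carried
   dependence), which is the kind of reason a report at n=2 or n=3 lists. */
void prefix_sum(float *v, size_t n)
{
    assert(v != NULL || n == 0);
    for (size_t i = 1; i < n; ++i)
        v[i] += v[i - 1];
}
```

Compiling such a file with -vec-report2 is expected to list the first loop as vectorized and give a dependence-related reason for the second.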


Automatic Vectorization by Compiler

Intel Compiler will auto vectorize the source code for you if it can

Pros:

• Minimal effort required

• Maintainable – source code is not changed

• Portable across Intel SIMD architectures

• Optimal performance is possible in best cases

• Scales forward!

Cons:

• Compiler is conservative; will not generate unsafe code

=> Advanced optimization techniques help to improve Data Level Parallelization using Vectorization
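One common reason for the compiler's conservatism is possible pointer aliasing. The sketch below is an assumption-labelled illustration, not taken from the slides: the first function must honor a potential overlap between dst and src, while the second uses the C99 restrict qualifier to promise that there is none.

```c
#include <assert.h>
#include <stddef.h>

/* dst and src may overlap here, so the compiler has to assume that a store to
   dst[i] can change a later src element; it must keep the sequential order or
   add a run-time overlap test before using SIMD code. */
void accumulate(float *dst, const float *src, size_t n)
{
    assert(dst != NULL && src != NULL);
    for (size_t i = 0; i < n; ++i)
        dst[i] += src[i];
}

/* restrict is the programmer's promise that the two arrays do not overlap,
   which removes the assumed dependence. Calling this function with
   overlapping arrays is undefined behavior. */
void accumulate_noalias(float *restrict dst, const float *restrict src, size_t n)
{
    assert(dst != NULL && src != NULL);
    for (size_t i = 0; i < n; ++i)
        dst[i] += src[i];
}
```

Calling accumulate(a + 1, a, 3) on an array of ones shows why the order matters: each element picks up the already updated value of its left neighbor, so a naive SIMD version would compute something different.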


Pointer Checker (C/C++)

• Out-of-bounds memory checking at runtime

– Checks before any memory access through a pointer that the pointer address is inside the object pointed to.

– Checks for accesses through pointers that have been freed.

• Enable pointer checker via compiler switches. -check-pointers=[none|write|rw]

• Enable checking for dangling pointer references: -check-pointers-dangling=[none|heap|stack|all]

• Enable checking of bounds for arrays without dimensions: -[no]check-pointers-undimensioned

• Intrinsics allow the user to get the lower/upper bounds associated with a pointer and to create / destroy bounds for a pointer.

– void * __chkp_lower_bound(void **)

– void * __chkp_upper_bound(void **)

– void * __chkp_kill_bounds(void *p)

– void * __chkp_make_bounds(void *p, size_t size)
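Conceptually, the out-of-bounds check compares the address range of each access against the bounds recorded for the pointer. The sketch below models that idea in portable C; bounds_t, make_bounds and access_ok are hypothetical names used only for illustration and are not the compiler's implementation or the __chkp_* intrinsics.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bounds record for one object. */
typedef struct {
    uintptr_t lower;  /* address of the first byte of the object */
    uintptr_t upper;  /* address of the last byte of the object  */
} bounds_t;

/* Record the bounds of an object of `size` bytes starting at p. */
bounds_t make_bounds(const void *p, size_t size)
{
    assert(p != NULL && size > 0);
    bounds_t b = { (uintptr_t)p, (uintptr_t)p + size - 1 };
    return b;
}

/* True when all access_size bytes starting at p lie inside the object. */
bool access_ok(bounds_t b, const void *p, size_t access_size)
{
    uintptr_t first = (uintptr_t)p;
    if (access_size == 0 || first < b.lower || first > b.upper)
        return false;
    return access_size - 1 <= b.upper - first;
}
```

With -check-pointers enabled the compiler inserts comparable checks automatically, so no such helper is needed in application code.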


Inlining Functions

When the compiler inlines a function call, the function's code gets inserted into the caller's instruction stream.

Benefits:

Reducing the overhead of calling a function

• writing the registers and parameters to/from the stack

• restoring the registers when the function returns

Improving performance because the optimizer can procedurally integrate the called function and can do better optimizations

– sub-expression elimination

– copy propagation

Drawbacks:

Overuse of inlining can actually make programs slower. Depending on a function's size, inlining it can cause the code size to increase, resulting in more cache misses and more pressure on the instruction cache

The speed benefits of inline functions tend to diminish as the function grows in size. At some point the overhead of the function call becomes small compared to the execution of the function body, and the benefit is lost.
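A small helper function shows where the benefit comes from. This is an illustrative sketch with hypothetical names (clamp, clamp_percent_sum); whether a given compiler actually inlines the calls depends on its heuristics and options.

```c
#include <assert.h>

/* Small helper; static inline keeps the definition visible at each call site. */
static inline int clamp(int v, int lo, int hi)
{
    assert(lo <= hi);
    return v < lo ? lo : (v > hi ? hi : v);
}

int clamp_percent_sum(int a, int b)
{
    /* Once both calls are inlined, no arguments are passed at run time and
       the constant bounds 0 and 100 can be propagated into the comparisons. */
    return clamp(a, 0, 100) + clamp(b, 0, 100);
}
```

The helper is only a few instructions long, so the call overhead would otherwise be a large share of its cost; for a long function body the same transformation mostly adds code size.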


Compiler Floating Point Model

The floating-point options let you control the optimization of floating-point instructions. These options can be used to tune performance, level of accuracy, or result consistency.

Accuracy: Produce results that are “close” to the correct value

– Measured in relative error, possibly ulps (units in the last place)

Reproducibility: Produce consistent results

– From one run to the next

– From one set of build options to another

– From one compiler to another

– From one platform to another

Performance: Produce the most efficient code possible

– Default, primary goal of Intel® Compilers

These objectives usually conflict! Wise use of compiler options lets you control the tradeoffs.
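The conflict between performance and reproducibility comes from the fact that floating-point arithmetic is not associative, so an optimization that reorders operations can change the result. The sketch below (hypothetical function names, assuming IEEE 754 double arithmetic) evaluates the same three-term sum in two orders.

```c
#include <assert.h>

static_assert(sizeof(double) == 8, "sketch assumes 64-bit IEEE 754 doubles");

/* Source order: (a + b) + c. */
double sum_left(double a, double b, double c)
{
    return (a + b) + c;
}

/* Reassociated order, which a value-unsafe optimization is allowed to pick. */
double sum_right(double a, double b, double c)
{
    return a + (b + c);
}
```

For a = 1e20, b = -1e20 and c = 1.0 the first function returns 1.0 and the second returns 0.0, because 1.0 is lost when it is added to -1e20 first.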


Compiler Floating-Point Model

The Floating-Point Compiler Switch

-fp-model keyword (Linux*)

Lets you choose the FP semantics at a coarse granularity and specify the compiler rules for

– Value safety

– FP expression evaluation

– FPU environment access

– Precise FP exceptions

– FP contractions

– Abrupt underflow (flush to zero)

    – Denormals are set to zero

    – May improve performance, esp. if HW doesn't support denormals
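What "denormals are set to zero" means can be made concrete with a value just below the smallest normal double. The following is a sketch under the assumption of IEEE 754 doubles and default (gradual underflow) settings; the function names are hypothetical.

```c
#include <assert.h>
#include <float.h>

static_assert(sizeof(double) == 8, "sketch assumes 64-bit IEEE 754 doubles");

/* Returns 1 when x is a denormal: non-zero, but smaller in magnitude than
   DBL_MIN, the smallest normal double. */
int is_denormal(double x)
{
    double mag = x < 0.0 ? -x : x;
    return mag > 0.0 && mag < DBL_MIN;
}

/* Halving the smallest normal double yields a denormal under gradual
   underflow; with abrupt underflow the result would be 0.0 instead.
   volatile keeps the division from being folded at compile time. */
double half_of_smallest_normal(void)
{
    volatile double m = DBL_MIN;
    return m / 2.0;
}
```

If code relies on such tiny values being preserved, abrupt underflow changes its results, which is why the setting is part of the floating-point model.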


Floating-Point Keywords

Controls the consistency of floating-point results by restricting certain optimizations. Values for the keyword are:

– fast[=1|2]; default is fast=1

    • Allows “value-unsafe” optimizations (=default)

    • Allows aggressive optimizations at a slight cost in accuracy or consistency.

    • Some additional approximations allowed with fast=2

– precise

    • Enables only value-safe optimizations on floating-point code.

– source

    • Implies precise and enables intermediates to be computed in source precision.

    • source is the recommended form for the majority of situations on Intel® 64 and IA-32 processors when SSE is enabled with /QxSSE2 or higher.


Floating-Point Keywords (2)

– double

    • Implies precise and enables intermediates to be computed in double or extended precision.

    • Not available in Intel® Fortran Compilers

– extended

    • Rounds intermediate results to 64-bit (extended) precision

    • Enables value-safe optimizations

– except

    • Enables floating-point exception semantics

– strict

    • Strictest mode of operation; enables both the precise and except options and disables contractions (i.e., precise + except + disable fma)


The -fp-model <key> Switch

Key | Value Safety | Expression Evaluation | FPU Environ. Access | Precise FP Exceptions | FP contract
precise | Safe | Varies | No | No | Yes
source | Safe | Source | No | No | Yes
double | Safe | Double | No | No | Yes
extended | Safe | Extended | No | No | Yes
strict | Safe | Varies | Yes | Yes | No
fast=1 (default) | Unsafe | Unknown | No | No | Yes
fast=2 | Very Unsafe | Unknown | No | No | Yes
except | */** | * | * | Yes | *
except- | * | * | * | No | *

* These modes are unaffected. -fp-model except[-] only affects the precise FP exceptions mode.

** It is illegal to specify -fp-model except in an unsafe value safety mode.


New Parallelism Method: Intel® Cilk™ Plus

An extension to C and C++ for expressing fine-grained task parallelism

• Shared-memory multiprocessing (like OpenMP)

Very simple syntax with only 3 keywords: _Cilk_spawn, _Cilk_sync, and _Cilk_for

• #include <cilk/cilk.h> to get cilk_spawn, cilk_sync, and cilk_for

Every Cilk program preserves the serial semantics

Cilk provides performance guarantees because it is based on a theoretically efficient work-stealing scheduler

Races can be prevented using reducer hyperobjects

Array Notations provide data parallelism for sections of arrays or whole arrays

Elemental Functions enable data parallelism of whole functions or operations

#pragma simd expresses vector parallelism using SIMD hardware registers
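The statement that every Cilk program preserves the serial semantics can be illustrated by removing the keywords: what remains is an ordinary C function with the same result. In the sketch below the two macros are local stand-ins, defined empty so that the file builds with any C compiler; in a real Cilk Plus build they would come from <cilk/cilk.h>. The function name sum_range is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Local stand-ins for the keywords normally provided by <cilk/cilk.h>.
   Defined empty, they turn the program into its serial equivalent. */
#define cilk_spawn
#define cilk_sync

/* Sums v[lo..hi) by splitting the range in two halves. With the real
   keywords the left half may run in parallel with the right half. */
long sum_range(const int *v, size_t lo, size_t hi)
{
    assert(v != NULL && lo <= hi);
    if (hi - lo <= 2) {
        long s = 0;
        for (size_t i = lo; i < hi; ++i)
            s += v[i];
        return s;
    }
    size_t mid = lo + (hi - lo) / 2;
    long left = cilk_spawn sum_range(v, lo, mid);
    long right = sum_range(v, mid, hi);
    cilk_sync;
    return left + right;
}
```

Because the spawned call and its continuation touch disjoint data, the parallel and the serial version compute the same sum.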


Key Files Supplied with Compiler

Linux*

Intel compiler

• icc: C/C++ compiler

• compilervars.(c)sh: Source scripts to set up the complete compiler/debugger/libraries environment

Linker driver

• xild: Invokes ld

Intel include files, libraries


Additional new compiler features

• -mtune=<ARCH> option on Linux*/OS X* to specify CPU targeting without generating instructions exclusive to that CPU

• “no_false_share” attribute to avoid false sharing in data structures

• DWARF4 support on Linux*/OS X*


Wind River* Application Cross-Build from Windows* Host

1. Set environment variables:

• WRL_TOOLCHAIN

• WRL_SYSROOT

Example: Wind River* Linux* 4.3 64-bit target

set WRL_TOOLCHAIN=<some_path>\wrl43\wrlinux-4\layers\wrll-toolchain-4.4a-341\i586\toolchain\x86-win32

set WRL_SYSROOT=<some_path>\wrl43\intel64\export\sysroot\common_pc_64-glibc_std\sysroot

Wind River* Linux 5.0.x 64-bit target

set WRL_TOOLCHAIN=<some_path>\wrl50\wrlinux-5\layers\wr-toolchain\4.6-60-win32

set WRL_SYSROOT=<some_path>\wrl50\intel64\export\sysroot\intel-xeon-core_glibc_std\bitbake_build\tmp\sysroots\intel-xeon-core

2. Build application:

• C source:

icc.exe -platform=wrl50 my_source_file.c

• C++ source:

icpc.exe -platform=wrl50 my_source_file.cpp


Summary

Intel® C++ Compiler 14.0 for applications running on Embedded OS Linux*

• High level optimizations

• Auto-vectorization/-parallelization to parallelize serial code

• Sophisticated programming methods for multithreading

• Runs on GNU environments or integrates into Eclipse (Linux*)

More information on Intel’s software offerings and services at http://software.intel.com


INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. Copyright © , Intel Corporation. All rights reserved. Intel, the Intel logo, Xeon, Core, VTune, and Cilk are trademarks of Intel Corporation in the U.S. and other countries.

Optimization Notice

Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804

Legal Disclaimer & Optimization Notice


2/19/2014


Intel® Compiler 14.0 for Android*


Content

• Introduction

• The Seven Steps of Optimization

• Android* Integration

• ARM* Neon vs. Intel SSE


Introducing the Intel® Compiler for Android*

• Based on Intel® C/C++ Compiler XE 14.0 for Linux*

• High performance C/C++ compiler

• Atom optimization

• Vectorization for loops - SIMD

• Interprocedural Optimization (IPO)

• Profile-Guided Optimizations (PGO)


What’s new

• Support for Silvermont architecture

• Optimization switch -xatom_sse4.2

• Support for Android NDK r9.

• Intel® Cilk™ Plus runtime support is enabled for Android as a technology preview.

• Features from C++11 (-std=c++0x)

• 64-bit long double type support (for compatibility with new NDKs)


When to use

• ICC can only be used for native source code

• You will get better speedup if

• The app is CPU bound (check with Intel GPA)

• The hot functions are not written in assembler

• Code can be vectorized

– Usually true for multimedia apps & games

• Code consists of many small helper functions (IPO)

• You want to multithread your application (use Intel® Cilk™ Plus)

• You want to explicitly optimize for the latest CPU generation


Android* Integration


Differences to Intel® C/C++ Compiler XE 14.0 for Linux*

- Cross-compiler: Linux host, Android* target

- Some features are removed

- OpenMP

- Android* NDK or AOSP environment is required


Integration into the build environment

Three different options to compile Android* apps

1. Standalone tool chain

– Useful for own / 3rd party build systems

– Manual compile / link / package of application

2. Using ndk-build script

– Controlled by Android.mk and Application.mk files

– Automatically compiles/links applications and stores them in the right folders for use from the Android* SDK

3. As part of the AOSP

– Automatically integrates into platform build


Option 1: Standalone tool chain

• Establish the compiler environment

source <icc-install-dir>/bin/compilervars.sh

Intel C++ compiler can be used directly in this environment

Recommended


Option 2: Using ndk-build script

• Execute the ndk-build script in the project folder

ndk-build V=1 -B NDK_TOOLCHAIN=x86-icc APP_ABI=x86


Option 3: As part of the AOSP

• Two modes available:

• ICC as default compiler

• Compile only specified modules with ICC

• Ability to force compilation for particular modules with ICC or GCC independent of the default compiler

• Recommended: start by compiling only specified modules with ICC


Option 3: As part of the AOSP (a)

• Preparations

1. Request a patch set for your particular version of the source tree from your Intel representative

2. Apply the patch to your source tree

3. Check whether ICC is already included in your source tree:

ls prebuilts/PRIVATE/icc/linux-x86/x86/x86-android-linux-13.0

If the directory is empty or missing, make a symlink from /opt/intel/CCAndroid13.0.0.006/


Option 3: As part of the AOSP (b)

• Compile specified modules with ICC

1. Edit the ICC configuration file

nano build/core/icc_config.mk

2. Specify the modules you want to compile with ICC

ICC_MODULES := libv8 libskia libskiagpu

3. Build Android as usual

source build/envsetup.sh
lunch
make


Option 3: As part of the AOSP (c)

• ICC as default compiler

1. Edit the ICC configuration file

nano build/core/icc_config.mk

2. Specify the modules you want to compile with GCC

GCC_MODULES := …

3. Change the default compiler to ICC

DEFAULT_COMPILER:=icc

4. Build Android as usual

source build/envsetup.sh
lunch
make


Option 3: As part of the AOSP (d)

• Checking the result:

• Check if your module was built with ICC. You should see several lines of output for this command:

readelf -s out/target/product/redhookbay/system/lib/libskia.so | grep intel

• Check if the Intel libraries are copied to the device:

adb shell
root@android:/ # ls /system/lib/libsvml.so


Intel Libraries

• ICC comes with four optimized libraries

• The final binary requires access to these libraries

• Options

1. Include them in the OS image

2. Link statically

3. Copy them to the application directory

Library Description

libintlc.so Intel support libraries

libimf.so Intel math library

libsvml.so Short vector math library

libirng.so Random number generator


Option 1: Include into OS Image

• Libraries in the /system/lib folder are loaded automatically

• Remount the filesystem read/write

adb shell mount -o remount,rw /system

• Push the libraries to the target

cd /opt/intel/CCAndroid13.0.0.005/lib
adb push libintlc.so /system/lib
adb push libimf.so /system/lib
adb push libsvml.so /system/lib
adb push libirng.so /system/lib

Applications will automatically load the needed libraries

Recommended


Option 2: Link statically

• Best choice for single binary

• Default option

• If the libraries should not be linked statically, use the option -shared-intel

Recommended


Option 3: Copy the libraries to the application directory

• Part of the Android* SDK/NDK functionality

• Add to the Android.mk file in the jni folder (repeat the block for libimf, libsvml, and libirng):

include $(CLEAR_VARS)
LOCAL_MODULE := libintlc
LOCAL_SRC_FILES := libintlc.so
include $(PREBUILT_SHARED_LIBRARY)

• Local libraries are not loaded automatically; load them manually from Java:

System.loadLibrary("intlc");
System.loadLibrary("imf");
System.loadLibrary("svml");
System.loadLibrary("irng");
System.loadLibrary("hello-jni");


Compatibility GCC / ICC

[Diagram: file1.c is compiled with GCC to file1.o and file2.c with ICC to file2.o; GCC or ICC then links both object files into a single executable.]


Using GAP on Android*

• Same option set as on Linux

• Recommended option set:

-guide -diag-disable 30761

• Using GAP with a standalone tool chain is recommended

• Using GAP with ndk-build

    – GAP generates no code, so the linking phase will fail or use outdated object files


Using PGO on Android*

• Generated data files need a storage location

• By default they are stored in the application directory, which is usually write protected on Android*

• Specify a different storage location in the Android.mk file:

LOCAL_CFLAGS := -prof-gen -prof-dir /sdcard

• The application needs write permission on the sdcard. Add to AndroidManifest.xml:

<uses-permission android:name="android.permission.WRITE_EXTERNAL_STORAGE" />


Using PGO on Android* (2)

• Data files are only generated when the application exits

• Applications usually do not exit on Android*

• Option 1: Call exit from Java:

System.exit(0);

• Option 2: Explicitly dump PGO data from native code:

#include <pgouser.h>

_PGOPTI_Prof_Dump_All();

• Option 3: Use environment variables to make regular dumps:

export INTEL_PROF_DUMP_INTERVAL=5000

export INTEL_PROF_DUMP_CUMULATIVE=1
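For the explicit dump from native code, it can be convenient to wrap the call so that the same source also builds without instrumentation. The sketch below assumes that the compiler defines the macro _PGO_INSTRUMENT for -prof-gen builds; that macro and the helper name dump_profile_checkpoint are assumptions of this example, only pgouser.h and _PGOPTI_Prof_Dump_All() are named in the slides.

```c
/* Assumption: _PGO_INSTRUMENT is defined only for instrumented
   (-prof-gen) builds, so the header is pulled in only there. */
#ifdef _PGO_INSTRUMENT
#include <pgouser.h>
#endif

/* Hypothetical helper: call it at a point where the interesting work is
   done, for example from a native function invoked by the Java side.
   Returns 1 if a dump was requested, 0 in non-instrumented builds. */
int dump_profile_checkpoint(void)
{
#ifdef _PGO_INSTRUMENT
    _PGOPTI_Prof_Dump_All();
    return 1;
#else
    return 0;
#endif
}
```

In a regular release build the function compiles to a plain return, so the call sites do not have to be removed after profiling.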


Using Intel® Cilk™ Plus

• Change your STL to GNU shared and add exception support in your Application.mk file:

APP_STL := gnustl_shared
APP_GNUSTL_FORCE_CPP_FEATURES := exceptions rtti

• Include the Cilk runtime library in the app by adding to the Android.mk file:

include $(CLEAR_VARS)
LOCAL_MODULE := cilkrts.so
LOCAL_SRC_FILES := ../path/to/CCAndroid/lib/cilkrts.so
include $(PREBUILT_SHARED_LIBRARY)

• Load the libraries from your Java code:

System.loadLibrary("gnustl_shared");
System.loadLibrary("cilkrts");


Using Intel® Cilk™ Plus (2)

• Add the Cilk runtime to your linker options (Android.mk):

LOCAL_LDLIBS += -lcilkrts

• In your C/C++ file include the Cilk header:

#include <cilk/cilk.h>

• And start using Cilk in your C/C++ code:

int fib(int n) {
    if (n < 2)
        return n;
    int x = cilk_spawn fib(n-1);
    int y = fib(n-2);
    cilk_sync;
    return x + y;
}


ARM* Neon vs. Intel SSE


Comparison

ARM v5 | ARM v7a | x86
32-bit | 32-bit | 32-bit
little-endian | little-endian | little-endian
Soft FP | Hardware FP | Hardware FP
64-bit vars aligned | 64-bit vars aligned | 64-bit vars packed
None | NEON | SSE

SIMD (None / NEON / SSE): This will require porting…

Normally not a problem…


Memory alignment

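Force memory alignment

struct TestStruct
{
    int mVar1;
    long long mVar2;
    int mVar3;
};

[Diagram: memory layout of TestStruct on ARM and on x86; the -malign-double option forces aligned 64-bit variables on x86.]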
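The layout difference can be checked directly with offsetof and sizeof. The sketch below reuses TestStruct from the slide; the two accessor functions are hypothetical helpers. Assuming a 4-byte int and an 8-byte long long, an ABI that aligns long long to 8 bytes gives offset 8 and size 24, while an ABI that aligns it to 4 bytes (the packed x86 case) gives offset 4 and size 16.

```c
#include <assert.h>
#include <stddef.h>

struct TestStruct {
    int       mVar1;
    long long mVar2;
    int       mVar3;
};

static_assert(offsetof(struct TestStruct, mVar1) == 0,
              "the first member starts at offset 0");

/* Byte offset of the 64-bit member; differs between the two layouts. */
size_t offset_of_mVar2(void)
{
    return offsetof(struct TestStruct, mVar2);
}

/* Total size including any padding added by the ABI. */
size_t size_of_TestStruct(void)
{
    return sizeof(struct TestStruct);
}
```

Printing both values on the ARM and the x86 build of a project is a quick way to find out whether binary data shared between the two, for example files or network packets written as raw structs, needs -malign-double or explicit packing.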

86 Intel Confidential

Porting SIMD instructions

Porting NEON instructions (ARM*) to SSE instructions (Intel)

– Fixed point arithmetic only on ARM*

– NEON native C libs can’t be reused on Intel® Atom™ based platforms

NEONvsSSE.h wraps NEON functions and intrinsics to SSE3: http://intel.ly/10JjuY4
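
To make the porting effort concrete, here is a hand-written sketch of one routine with a NEON path and an equivalent SSE path. It is not taken from NEONvsSSE.h and makes no claim about how that header maps intrinsics. The function name add_floats is hypothetical, and the scalar loop handles the remainder as well as targets that have neither instruction set.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#elif defined(__SSE__)
#include <xmmintrin.h>
#endif

/* Adds two float arrays element-wise: dst[i] = x[i] + y[i]. */
void add_floats(float *dst, const float *x, const float *y, size_t n)
{
    assert(dst && x && y);
    size_t i = 0;
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    /* ARM path: 4 floats per iteration with NEON intrinsics. */
    for (; i + 4 <= n; i += 4) {
        float32x4_t vx = vld1q_f32(x + i);
        float32x4_t vy = vld1q_f32(y + i);
        vst1q_f32(dst + i, vaddq_f32(vx, vy));
    }
#elif defined(__SSE__)
    /* x86 path: the same 4-wide operation with SSE intrinsics. */
    for (; i + 4 <= n; i += 4) {
        __m128 vx = _mm_loadu_ps(x + i);
        __m128 vy = _mm_loadu_ps(y + i);
        _mm_storeu_ps(dst + i, _mm_add_ps(vx, vy));
    }
#endif
    /* Scalar remainder (and full fallback without NEON or SSE). */
    for (; i < n; i++)
        dst[i] = x[i] + y[i];
}
```

The two vector paths differ only in type and intrinsic names here, which is the simple case. Intrinsics without a one-to-one counterpart are where a wrapper header or manual rewriting becomes necessary.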

Conclusion

- Based on the high-performance Intel® C/C++ Compiler XE 13.0 for Linux*, widely used by HPC customers for achieving better performance on IA

- Comes with a well-established support infrastructure

- Variety of optimization options available

- Integration into various parts of the Android* environment

- Can be integrated in a standalone tool chain, the NDK and the AOSP

Legal Disclaimer & Optimization Notice

INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

Copyright © , Intel Corporation. All rights reserved. Intel, the Intel logo, Xeon, Core, VTune, and Cilk are trademarks of Intel Corporation in the U.S. and other countries.

Optimization Notice

Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804

Backup

Intel® Cilk™ Plus Pragma/Directive

C/C++: #pragma simd [clause [,clause]…]

Without any clause, the directive enforces vectorization of the loop, ignoring all dependencies (even proven ones!).

void addfl(float *a, float *b, float *c, float *d, float *e, int n)
{
    #pragma simd
    for (int i = 0; i < n; i++)
        a[i] = a[i] + b[i] + c[i] + d[i] + e[i];
}

Without the SIMD directive, vectorization of this loop likely fails: there are too many pointer references for a run-time overlap check (compiler heuristic), so the compiler won’t create multiple versions here.
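
The optional clauses refine what the directive asserts. As a sketch that is not from the slides, the reduction clause tells the compiler that the accumulator may be combined across vector lanes. The function name sum_floats is hypothetical. A compiler without support for the pragma is expected to ignore it and run the loop serially.

```c
#include <assert.h>

/* Sums n floats; the reduction clause marks sum as a vector-safe accumulator. */
float sum_floats(const float *a, int n)
{
    assert(n >= 0);
    float sum = 0.0f;
#pragma simd reduction(+:sum)
    for (int i = 0; i < n; i++)
        sum += a[i];
    return sum;
}
```

Because the directive overrides the compiler's own dependency analysis, the programmer is responsible for the loop being safe to vectorize. Here the only loop-carried dependency is the accumulator, which the clause declares explicitly.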

#pragma simd Example for C/C++

void foo(float *restrict a, float *restrict b, int offmax, int n, int off[n])
{
    for (int k = 0; k < n - offmax; k++)
        a[k + off[k]] = a[k] * b[k];
}

The compiler cannot vectorize the loop, even though the arrays a and b won’t overlap (keyword restrict).

Also, multi-versioning won’t help because of the complexity of the offsets (off[]).

Using #pragma ivdep doesn’t work either, because the compiler regards accesses to off[] as inefficient here.

Solution: if, for example, all offsets are at least 4 elements, vectorization is still possible, because the vector length can be controlled via #pragma simd:

void foo(float *restrict a, float *restrict b, int offmax, int n, int off[n])
{
    #pragma simd vectorlength(4)
    for (int k = 0; k < n - offmax; k++)
        a[k + off[k]] = a[k] * b[k];
}
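
Why a vector length of 4 is safe when every offset is at least 4 can be checked with plain scalar code. The sketch below is an illustration under stated assumptions, not compiler output: foo_chunked4 emulates one SIMD iteration by performing four loads and multiplies before the four stores, and chunked_matches_scalar compares it against the sequential loop. All names, the array length N and the bound OFFMAX are hypothetical.

```c
#include <assert.h>
#include <string.h>

#define N 16       /* array length used by the self-check (assumption) */
#define OFFMAX 4   /* largest offset used by the self-check (assumption) */

/* Sequential reference: the loop from the slide, one element at a time. */
static void foo_scalar(float *a, const float *b, int offmax, int n, const int *off)
{
    for (int k = 0; k < n - offmax; k++)
        a[k + off[k]] = a[k] * b[k];
}

/* Emulates a vector loop of length 4: four loads and multiplies happen
 * before the four stores, as they would in one SIMD iteration. */
static void foo_chunked4(float *a, const float *b, int offmax, int n, const int *off)
{
    int limit = n - offmax;
    int k = 0;
    for (; k + 4 <= limit; k += 4) {
        float tmp[4];
        for (int j = 0; j < 4; j++)
            tmp[j] = a[k + j] * b[k + j];
        for (int j = 0; j < 4; j++)
            a[k + j + off[k + j]] = tmp[j];
    }
    for (; k < limit; k++)          /* scalar remainder */
        a[k + off[k]] = a[k] * b[k];
}

/* Returns 1 if both versions agree when every offset equals `offset`. */
int chunked_matches_scalar(int offset)
{
    assert(offset >= 0 && offset <= OFFMAX);
    float a1[N], a2[N], b[N];
    int off[N];
    for (int i = 0; i < N; i++) {
        a1[i] = a2[i] = (float)(i + 1);
        b[i] = 2.0f;
        off[i] = offset;
    }
    foo_scalar(a1, b, OFFMAX, N, off);
    foo_chunked4(a2, b, OFFMAX, N, off);
    return memcmp(a1, a2, sizeof a1) == 0;
}
```

With an offset of 4, each group of four stores lands beyond the four elements that were just loaded, so the results match. With an offset of 1, a store feeds the very next load, the chunked version reads stale values, and the results differ.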

Sample for movbe instruction

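#include <stdio.h>
#include <stdlib.h>

int a;

/* Reverses the byte order of x and stores the result in the global a. */
void foo (int x)
{
    a = ((x & 0xff) << 24) |
        ((x & 0xff00) << 8) |
        ((x & 0xff0000) >> 8) |
        ((x & 0xff000000) >> 24);
    return;
}

int main(int argc, char **argv)
{
    if (argc < 2)   /* the value to swap is expected as first argument */
        return 1;
    foo(atoi(argv[1]));
    printf("0x%8.8x\n", atoi(argv[1]));
    printf("0x%8.8x\n", a);
    return 0;
}

> icc -xSSSE3_ATOM -minstruction=movbe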
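
The shift-and-mask expression in the sample is a 32-bit byte swap, the operation that movbe performs as part of a load or store. The following is a minimal sketch of the same computation on an unsigned type, which avoids shifting into the sign bit of an int. The names swap_bytes32 and swap_is_involution are hypothetical, and whether a given compiler emits movbe or bswap for this code is not guaranteed.

```c
#include <assert.h>
#include <stdint.h>

/* Reverses the four bytes of x, e.g. 0x12345678 becomes 0x78563412. */
uint32_t swap_bytes32(uint32_t x)
{
    return ((x & 0x000000ffu) << 24) |
           ((x & 0x0000ff00u) << 8)  |
           ((x & 0x00ff0000u) >> 8)  |
           ((x & 0xff000000u) >> 24);
}

/* Swapping twice must restore the original value. */
int swap_is_involution(uint32_t x)
{
    uint32_t twice = swap_bytes32(swap_bytes32(x));
    assert(twice == x);
    return twice == x;
}
```

Inspecting the generated assembly with and without -minstruction=movbe is the practical way to see which instruction the compiler actually selects.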

Changes to the AOSP build system and sources

• Include Intel libraries in the build environment

• Copy required libraries to target

• Necessary changes to source code, because ICC is more strict

    – About 40 changes for the whole AOSP source tree

    – Most of the changes solve existing problems

• Optional changes for better performance

    – Loop restructuring

    – Pragma SIMD


2/19/2014

Differences ICC Linux vs. ICC Android

• The Android GCC compiler does not support native Linux thread-local storage. Thread-local storage is emulated as described in http://gcc.gnu.org/onlinedocs/gccint/Emulated-TLS.html#Emulated-TLS.

• Newer GCC uses DT_INIT_ARRAY/DT_FINI_ARRAY elements in the .dynamic section for global object initialization, as described in http://www.sco.com/developers/gabi/latest/ch5.dynamic.html#init_fini. Previously, the addresses of constructors and destructors of global objects were placed in the .ctors/.dtors sections, respectively.

• Stack alignment is different: 16 bytes on Linux, 4 bytes on Android.

• Long double is 64-bit on Android and 80-bit on Linux.

• The driver has been changed to account for differences in system library names.

• The Android NDK or the GNU tools from the Android OS workspace are required to run the compiler, and two environment variables must be set before invoking it: ANDROID_SYSROOT and ANDROID_GNU_X86_TOOLCHAIN.

• Only the IA-32 target is supported.

• OpenMP runtime support is missing.

• There is only experimental Cilk+ support in the Android compiler. C++ runtime support in Android is provided by a different set of libraries compared to Linux. The differences are in RTTI and exceptions. As a result, exceptions thrown from Cilk threads are lost.

• A Windows host is supported only in the experimental compiler.

• Pointer Checker support is limited.
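
Code that relies on extended precision can detect the long double difference listed above instead of assuming one format. The following is a sketch using only the standard <float.h> macros. The function names are hypothetical and not part of any Intel or Android API.

```c
#include <assert.h>
#include <float.h>

/* Number of mantissa bits in long double: 64 for the x87 80-bit format,
 * 53 when long double has the same format as double. */
int long_double_mantissa_bits(void)
{
    assert(FLT_RADIX == 2);   /* the bit counts above assume a binary format */
    return LDBL_MANT_DIG;
}

/* Returns 1 if long double offers more precision than double. */
int long_double_is_extended(void)
{
    return LDBL_MANT_DIG > DBL_MANT_DIG;
}
```

A numeric routine can branch on long_double_is_extended() to pick a tolerance or an algorithm that does not depend on the extra bits when they are absent.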
