ScicomP 2013 Tutorial

Intel® Xeon Phi™ Product Family Programming Models

Klaus-Dieter Oertel, May 28th 2013

Software and Services Group Intel Corporation

Agenda

• Overview

• Execution options

• Parallelization options

• Vectorization options

Programming Models and Mindsets

Range of models to meet application needs

[Diagram: a spectrum from multi-core centric to many-core centric, showing for each model which of Main(), Foo(), and MPI_*() run on the multi-core host (Xeon) and which on the many-core coprocessor (MIC).]

• Multi-Core Hosted: general-purpose serial and parallel computing

• Offload: codes with highly parallel phases

• Many-Core Hosted: highly parallel codes

• Symmetric: codes with balanced needs

Programming Models

Multicore options:

• Intel® Math Kernel Library
• MPI*
• Intel® Threading Building Blocks
• Intel® Cilk™ Plus
• OpenMP*
• Pthreads*

Vector options:

• Intel® Math Kernel Library
• Array notation: Intel® Cilk™ Plus
• Auto vectorization
• Semi-auto vectorization: #pragma (vector, ivdep, simd)
• OpenCL*
• C/C++ vector classes (F32vec16, F64vec8)
• Intrinsics

Both lists run from greatest ease of use (top) to finest control (bottom).

Comparison of Options

• Programming models that abstract the target Intel architecture

– Intel® MKL

– Intel® Cilk™ Plus

– OpenMP*

– Intel® Threading Building Blocks

– Compiler-based full autovectorization

• Technologies that may involve architecture-specific user hints

– Compiler-based full autovectorization

o Ex: explicit cache-blocking of data

– Semiautomatic vectorization with annotation

o #pragma vector, #pragma ivdep, and #pragma simd

• Programming models that require writing architecture-specific code

– C/C++ Vector Classes (F32vec16, F64vec8)

– Vector intrinsics (_mm_add_ps, addps)

– Pthreads*

Agenda

• Overview

• Execution options

– Native Execution

– Offload Execution

• Parallelization options

• Vectorization options

Intel® Xeon Phi™ Coprocessor runs as a native or MPI* compute node via IP or OFED

[Diagram: host processor and Intel® Xeon Phi™ coprocessor each run their own Linux* OS, connected by the PCI-E bus and an IB fabric. An ssh or telnet connection to the coprocessor's IP address gives a virtual terminal session. The target-side "native" application is user code linked against standard OS libraries plus any 3rd-party or Intel libraries, running on the coprocessor's support libraries, tools, drivers, and communication/application-launch support.]

Advantages:
• Simpler model
• No directives
• Easier port
• Good kernel test

Use if:
• Not serial
• Modest memory
• Complex code
• No hot spots for offloading

The Intel® Manycore Platform Software Stack (Intel® MPSS) provides Linux* on the coprocessor

• Authenticated users can treat it like another node

• Intel MPSS supplies a virtual FS and native execution

• Add -mmic to compiles to create native programs

ssh mic0 top

Mem: 298016K used, 7578640K free, 0K shrd, 0K buff, 100688K cached
CPU: 0.0% usr 0.3% sys 0.0% nic 99.6% idle 0.0% io 0.0% irq 0.0% sirq
Load average: 1.00 1.04 1.01 1/2234 7265

PID PPID USER STAT VSZ %MEM CPU %CPU COMMAND

7265 7264 fdkew R 7060 0.0 14 0.3 top

43 2 root SW 0 0.0 13 0.0 [ksoftirqd/13]

5748 1 root S 119m 1.5 226 0.0 ./sep_mic_server3.8

5670 1 micuser S 97872 1.2 0 0.0 /bin/coi_daemon --coiuser=micuser

7261 5667 root S 25744 0.3 6 0.0 sshd: fdkew [priv]

7263 7261 fdkew S 25744 0.3 241 0.0 sshd: fdkew@notty

5667 1 root S 21084 0.2 5 0.0 /sbin/sshd

5757 1 root S 6940 0.0 18 0.0 /sbin/getty -L -l /bin/noauth 1152

1 0 root S 6936 0.0 10 0.0 init

7264 7263 fdkew S 6936 0.0 6 0.0 sh -c top

sudo scp /opt/intel/composerxe/lib/mic/libiomp5.so root@mic0:/lib64

scp a.out mic0:/tmp
ssh mic0 /tmp/a.out my-args

icc -O3 -g -mmic -o nativeMIC myNativeProgram.c

Pros and Cons of Native Execution

• Pros:

– Very easy to use

o Just compile and run

o Handling is similar to any compute node (e.g. with NFS)

– Can help in the decision for or against an investment in offload programming

o Does my application already benefit from the coprocessor architecture?

o Can it be optimized for better parallelization and vectorization?

o No need to start immediately with offload programming (although it is fun)

• Cons:

– Limited memory

– Slow I/O

Agenda

• Overview

• Execution options

– Native Execution

– Offload Execution

• Parallelization options

• Vectorization options

Intel® Xeon Phi™ Coprocessor runs as an accelerator for offloaded host computation

[Diagram: the host-side offload application (user code plus offload libraries, user-level driver, and user-accessible APIs and libraries) runs on the host processor; the target-side offload application (user code plus offload libraries and user-accessible APIs and libraries) runs on the coprocessor. Each side runs its own Linux* OS, communicating over the PCI-E bus via the coprocessor's communication and application-launch support.]

Advantages:
• More memory available
• Better file access
• Host better on serial code
• Better use of resources

Use the Offload capabilities of Intel® Composer XE to access the Coprocessor

• Offload directives in source code trigger Intel Composer to compile objects for both host and coprocessor

• When the program is executed and a coprocessor is available, the offload code will run on that target

– Required data can be transferred explicitly for each offload

– Or use Virtual Shared Memory (_Cilk_shared) to match virtual addresses between host and target coprocessor

• Offload blocks initiate coprocessor computation and can be synchronous or asynchronous

C/C++:   #pragma offload target(mic) inout(A:length(2000))
Fortran: !DIR$ OFFLOAD TARGET(MIC) INOUT(A: LENGTH(2000))

C/C++:   #pragma offload_transfer target(mic) in(a: length(2000)) signal(a)
Fortran: !DIR$ OFFLOAD_TRANSFER TARGET(MIC) IN(A: LENGTH(2000)) SIGNAL(A)

C/C++:   _Cilk_spawn _Cilk_offload asynch_func();
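For illustration, a minimal sketch of the asynchronous pattern above, overlapping a transfer with host work; do_other_host_work() and process() are hypothetical placeholders:

float a[2000];

// Start the transfer and return immediately; keep the buffer allocated on the card
#pragma offload_transfer target(mic:0) in(a : length(2000) alloc_if(1) free_if(0)) signal(a)
do_other_host_work(); // overlaps with the PCI-E transfer

// Run the offload once the signaled transfer completes, then free the card buffer
#pragma offload target(mic:0) nocopy(a : length(2000) alloc_if(0) free_if(1)) wait(a)
{
    process(a, 2000);
}

Note that signal/wait require an explicit card number, hence target(mic:0).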

Offload using Explicit Copies – Modifier Example

float reduction(float *data, int numberOf)
{
    float ret = 0.f;
    #pragma offload target(mic) in(data:length(numberOf))
    {
        #pragma omp parallel for reduction(+:ret)
        for (int i = 0; i < numberOf; ++i)
            ret += data[i];
    }
    return ret;
}

Note: this copies numberOf elements of data's type (numberOf*sizeof(float) bytes) to the coprocessor, not numberOf bytes – the compiler knows data's type

Offload using Explicit Copies – Data Movement

• Default treatment of in/out variables in a #pragma offload statement

– At the start of an offload:

o Space is allocated on the coprocessor

o in variables are transferred to the coprocessor

– At the end of an offload:

o out variables are transferred from the coprocessor

o Space for both types (as well as inout) is deallocated on the coprocessor

For #pragma offload inout(pA:length(n)) {...}, the runtime performs these steps:

1. Allocate space for pA on the coprocessor
2. Copy pA from the host to the coprocessor
3. Execute the offload block
4. Copy pA back from the coprocessor to the host
5. Free the space on the coprocessor

C/C++ Offload using Explicit Copies

C/C++ syntax and semantics:

• Offload pragma
  #pragma offload <clauses> <statement>
  Allow next statement to execute on coprocessor or host CPU

• Offload transfer
  #pragma offload_transfer <clauses>
  Initiates asynchronous data transfer, or initiates and completes synchronous data transfer

• Offload wait
  #pragma offload_wait <clauses>
  Specifies a wait for a previously initiated asynchronous activity

• Keyword for variable & function definitions
  __attribute__((target(mic)))
  Compile function for, or allocate variable on, both CPU and coprocessor

• Entire blocks of code
  #pragma offload_attribute(push, target(mic)) ... #pragma offload_attribute(pop)
  Mark entire files or large blocks of code for generation on both host CPU and coprocessor

Fortran Offload using Explicit Copies

Fortran syntax and semantics:

• Offload directive
  !dir$ omp offload <clauses> <statement>
  Execute next OpenMP* parallel construct on coprocessor

  !dir$ offload <clauses> <statement>
  Execute next statement (function call) on coprocessor

• Offload transfer
  !dir$ offload_transfer <clauses>
  Initiates asynchronous data transfer, or initiates and completes synchronous data transfer

• Offload wait
  !dir$ offload_wait <clauses>
  Specifies a wait for a previously initiated asynchronous activity

• Keyword for variable/function definitions
  !dir$ attributes offload:<mic> :: <ret-name> OR <var1,var2,…>
  Compile function or variable for CPU and coprocessor

Offload using Explicit Copies – Modifiers

Clauses:

• Target specification
  target( name[:card_number] )
  Where to run construct

• Conditional offload
  if (condition)
  Boolean expression

• Inputs
  in(var-list modifiers)
  Copy from host to coprocessor

• Outputs
  out(var-list modifiers)
  Copy from coprocessor to host

• Inputs & outputs
  inout(var-list modifiers)
  Copy host to coprocessor and back when offload completes

• Non-copied data
  nocopy(var-list modifiers)
  Data is local to target

Modifiers (optional):

• Specify pointer length
  length(element-count-expr)
  Copy N elements of the pointer's type

• Control pointer memory allocation
  alloc_if( condition )
  Allocate memory to hold data referenced by pointer if condition is TRUE

• Control freeing of pointer memory
  free_if( condition )
  Free memory used by pointer if condition is TRUE

• Control target data alignment
  align( expression )
  Specify minimum memory alignment on target

Variables and pointers are restricted to scalars, structs of scalars, and arrays of scalars.
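A common use of alloc_if/free_if is letting pointer data persist on the coprocessor between offloads. A minimal sketch under that assumption – first_pass()/second_pass()/final_pass() are hypothetical, and buf is assumed to point at n valid floats on the host:

static float *buf; // must be global or static to persist between offloads

void run(int n)
{
    // First offload: allocate on the card and copy in, but do not free afterwards
    #pragma offload target(mic) in(buf : length(n) alloc_if(1) free_if(0))
    first_pass(buf, n);

    // Second offload: reuse the persistent card-side copy, no re-allocation or transfer
    #pragma offload target(mic) nocopy(buf : length(n) alloc_if(0) free_if(0))
    second_pass(buf, n);

    // Last offload: copy the results out and free the card-side buffer
    #pragma offload target(mic) out(buf : length(n) alloc_if(0) free_if(1))
    final_pass(buf, n);
}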

Offload using Explicit Copies - Rules & Limitations

• The host ↔ coprocessor data types allowed in a simple offload:

– Scalar variables of all types

o Must be globals or statics if you wish to use them with nocopy, alloc_if, or free_if (i.e. if they are to persist on the coprocessor between offload calls)

– Structs that are bit-wise copyable (no pointer data members)

– Arrays of the above types

– Pointers to the above types

• What is allowed within coprocessor code?

– All data types can be used (incl. full C++ objects)

– Any parallel programming technique (Pthreads*, Intel® TBB, OpenMP*, etc.)

– Intel® Xeon Phi™ versions of Intel® MKL
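For example, a bit-wise copyable struct (no pointer members) can be moved with a plain inout clause; a minimal sketch with a hypothetical Point type:

typedef struct { double x, y, z; } Point; // no pointer members, so bit-wise copyable

void shift_x(Point *pts, int n, double dx)
{
    #pragma offload target(mic) inout(pts : length(n))
    for (int i = 0; i < n; ++i)
        pts[i].x += dx;
}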

Heterogeneous Compiler – Implicit: Offloading using _Cilk_offload Example

// Shared variable declaration for pi
_Cilk_shared float pi;

// Shared function declaration for compute
_Cilk_shared void compute_pi(int count)
{
    int i;
    #pragma omp parallel for reduction(+:pi)
    for (i = 0; i < count; i++)
    {
        float t = (float)((i + 0.5f) / count);
        pi += 4.0f / (1.0f + t*t);
    }
}

void findpi()
{
    int count = 10000;

    // Initialize shared global variables
    pi = 0.0f;

    // Compute pi on target
    _Cilk_offload compute_pi(count);

    pi /= count;
}

Offload using Implicit Copies (1)

• Section of memory maintained at the same virtual address on both the host and Intel® Xeon Phi™ coprocessor

• Reserving the same address range on both devices allows:

– Seamless sharing of complex pointer-containing data structures

– Elimination of user marshaling and data management

– Use of simple language extensions to C/C++

[Diagram: the C/C++ executable on the host and the offload code on the coprocessor each map the "shared" region at the same address range in host memory and coprocessor memory.]

Offload using Implicit Copies (2)

• When “shared” memory is synchronized

– Automatically done around offloads (so memory is only synchronized on entry to, or exit from, an offload call)

– Only modified data is transferred between CPU and coprocessor

• Dynamic memory you wish to share must be allocated with special functions: _Offload_shared_malloc(), _Offload_shared_aligned_malloc(), _Offload_shared_free(), _Offload_shared_aligned_free()

• Allows transfer of C++ objects

– Pointers are no longer an issue when they point to “shared” data

• Well-known methods can be used to synchronize access to shared data and prevent data races within offloaded code

– E.g., locks, critical sections, etc.

This model is integrated with the Intel® Cilk™ Plus parallel extensions

Note: not supported for Fortran; available for C/C++ only
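A minimal sketch of the implicit model (names are illustrative), combining _Offload_shared_malloc() with a shared pointer and a shared function:

float *_Cilk_shared data; // shared pointer, visible on both sides

_Cilk_shared float sum(int n) // callable on host or coprocessor
{
    float s = 0.0f;
    for (int i = 0; i < n; ++i)
        s += data[i];
    return s;
}

void demo(int n)
{
    data = (float *)_Offload_shared_malloc(n * sizeof(float));
    for (int i = 0; i < n; ++i)
        data[i] = 1.0f; // written on the host...
    float s = _Cilk_offload sum(n); // ...synchronized and summed on the coprocessor
    _Offload_shared_free(data);
}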

Implicit: Keyword _Cilk_shared for Data and Functions

• Function
  int _Cilk_shared f(int x) { return x+1; }
  Versions generated for both CPU and card; may be called from either side

• Global
  _Cilk_shared int x = 0;
  Visible on both sides

• File/function static
  static _Cilk_shared int x;
  Visible on both sides, only to code within the file/function

• Class
  class _Cilk_shared x {…};
  Class methods, members, and operators are available on both sides

• Pointer to shared data
  int _Cilk_shared *p;
  p is local (not shared), can point to shared data

• A shared pointer
  int *_Cilk_shared p;
  p is shared; should only point at shared data

• Entire blocks of code
  #pragma offload_attribute(push, _Cilk_shared) ... #pragma offload_attribute(pop)
  Mark entire files or large blocks of code as _Cilk_shared using this pragma

Implicit: Offloading using _Cilk_offload

• Offloading a function call
  x = _Cilk_offload func(y);
  func executes on coprocessor if possible

  x = _Cilk_offload_to (card_num) func(y);
  func must execute on specified coprocessor

• Offloading asynchronously
  x = _Cilk_spawn _Cilk_offload func(y);
  Non-blocking offload

• Offloading a parallel for-loop
  _Cilk_offload _Cilk_for (i = 0; i < N; i++) { a[i] = b[i] + c[i]; }
  Loop executes in parallel on target. The loop is implicitly outlined as a function call.

Comparison of Offload Techniques (1)

Offload via explicit data copying vs. offload via implicit data copying:

• Language support
  Explicit: Fortran, C, C++ (C++ functions may be called, but C++ classes cannot be transferred)
  Implicit: C, C++

• Syntax
  Explicit: pragmas/directives – #pragma offload in C/C++, !dir$ omp offload directive in Fortran
  Implicit: keywords – _Cilk_shared and _Cilk_offload

• Used for…
  Explicit: offloads that transfer contiguous blocks of data
  Implicit: offloads that transfer all or parts of complex data structures, or many small pieces of data

Comparison of Offload Techniques (2)

• Offloaded data allowed
  Explicit: scalars, arrays, bit-wise copyable structures
  Implicit: all data types (pointer-based data, structures, classes, locks, etc.)

• Which data is transferred
  Explicit: user has explicit control of data movement at start of each offload directive
  Implicit: all _Cilk_shared data synchronized at start and end of _Cilk_offload statements

• When data movement occurs
  Explicit: data transfer can overlap with offload execution
  Implicit: not available

• When offload code is copied to card
  Explicit: at first #pragma offload
  Implicit: at program startup

Agenda

• Overview

• Execution options

• Parallelization options

• Vectorization options

Options for Thread Parallelism

From greatest ease of use / code maintainability to most programmer control:

• Intel® Math Kernel Library
• OpenMP*
• Intel® Threading Building Blocks
• Intel® Cilk™ Plus
• Pthreads* and other threading libraries

Choice of unified programming to target both Intel® Xeon and Intel® Xeon Phi™!

Intel® Math Kernel Library (1)

• Highly optimized, multi-threaded (when appropriate) math libraries providing functions often found in:

– Engineering and Manufacturing

– Financial Services

– Geological/Energy Industries

• Automatically optimized for the platform on which the functions are called:

– Called on the Intel® MIC Architecture, they will make best use of SIMD and parallelism

• Intel® MKL Functional Domains with Intel® MIC Architecture support:

– Linear Algebra: BLAS, LAPACK, LINPACK

– Sparse BLAS

– Fast Fourier Transforms (FFT) 1D/2D/3D FFT

– Vector Math and Statistical Libraries

Intel® Math Kernel Library Use in Offload Code

• Native execution (of course)

• Identical usage syntax on host and coprocessor

• Functions called from the host execute on the host; functions called from the coprocessor execute on the coprocessor

– User is responsible for data transfer and execution management between the two domains

[Diagram: a heterogeneous application on the host links against the host-optimized Intel® MKL; its offloaded code runs as native code on the coprocessor against the MIC-optimized Intel® MKL, on top of the Intel® MIC support stack.]

Intel® Math Kernel Library Offload Example

Calculates the new value of matrix C based on the matrix product of matrices A and B and the old value of matrix C,

C ← αAB + βC

where α and β are scalar coefficients.

void foo(…) /* Intel MKL Offload Example */
{
    float *A, *B, *C; /* Matrices */

    #pragma offload target(mic) \
        in(transa, transb, N, alpha, beta) \
        in(A:length(N*N)) \
        in(B:length(N*N)) \
        inout(C:length(N*N))
    sgemm(&transa, &transb, &N, &N, &N, &alpha,
          A, &N, B, &N, &beta, C, &N);
}

Intel® Math Kernel Library Automatic Offload

• Transparent load balancing between host and coprocessors

• Initiated by calling mkl_mic_enable() on the host before calling Intel® MKL functions that implement Automatic Offload

• Call the function from the host code

– No "_Cilk_offload" or "#pragma offload" needed

– Intel® MKL is responsible for data transfer and execution management

[Diagram: on the host side, the user's app calls the heterogeneous Intel® MKL library (host-optimized Intel® MKL); on the Intel® MIC side, an Intel® MKL worker process runs the MIC-optimized Intel® MKL on top of the Intel® MIC support stack, with transparent load balancing between the two sides.]
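A minimal Automatic Offload sketch (matrix setup omitted, LP64 integers assumed); mkl_mic_enable()/mkl_mic_disable() are declared in mkl.h, and setting the MKL_MIC_ENABLE=1 environment variable is an alternative to the API call:

#include <mkl.h>

void ao_sgemm(int n, float *A, float *B, float *C)
{
    char transa = 'N', transb = 'N';
    float alpha = 1.0f, beta = 0.0f;

    mkl_mic_enable(); // opt this process into Automatic Offload
    // Intel MKL decides how to split the work between host and coprocessor(s)
    sgemm(&transa, &transb, &n, &n, &n, &alpha, A, &n, B, &n, &beta, C, &n);
    mkl_mic_disable(); // back to host-only execution
}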

Options for Parallelism – OpenMP*

• OpenMP* works the same on the Intel® MIC Architecture as it does on the host (beyond number of cores and some affinity tuning)

• Most of the constructs in OpenMP* are compiler directives or pragmas. A utility library and environment variables control the execution context.

• OpenMP* is fundamentally the same between Fortran and C/C++

• Fork-Join Parallelism:

– Master thread calls upon a team of threads as needed

– Parallelism is added incrementally: that is, the sequential program evolves into a parallel program

[Diagram: fork-join model – a master thread forks a team of threads at each parallel region and joins them afterward.]

OpenMP* parallelism can be easily finessed into offload parallelism

• OpenMP is a simple way to add parallelism

#pragma omp parallel for
for (i = 1; i < N-1; ++i)
    B2[i] = m * B1[i-1] + n * B1[i] + p * B1[i+1];

!$OMP PARALLEL DO
do I = 2, N-1
    B2(I) = m * B1(I-1) + n * B1(I) + p * B1(I+1)
end do

Offload easily shifts that parallelism to the coprocessor:

#pragma offload target(mic : CPID) if (N > Limit)
#pragma omp parallel for
for (i = 1; i < N-1; ++i)
    B2[i] = m * B1[i-1] + n * B1[i] + p * B1[i+1];

!DIR$ OFFLOAD BEGIN target(mic : CPID) IF (N .GT. Limit)
!$OMP PARALLEL DO
do I = 2, N-1
    B2(I) = m * B1(I-1) + n * B1(I) + p * B1(I+1)
end do
!DIR$ END OFFLOAD

Options for Parallelism – OpenMP* Example - Parallel Sections

• Specify blocks of code to execute in parallel

• Almost identical functionality can be obtained using the OpenMP* 3.0 task construct

[Diagram: serial execution vs. the three sections executing in parallel.]

#pragma omp parallel sections
{
    #pragma omp section
    task1();
    #pragma omp section
    task2();
    #pragma omp section
    task3();
}

Simultaneous Host/Coprocessor Computing – Using OpenMP*

• Simply use an OpenMP* task on the host to spawn the offload call

– Then use OpenMP* for parallelism on the coprocessor

• Use other OpenMP* task calls to simultaneously run code on the host

#pragma omp parallel
#pragma omp single
{
    #pragma omp task
    #pragma offload target(mic) …
    {
        <various serial code>
        #pragma omp parallel for
        for (int i = 0; i < limit; i++)
            <parallel loop body>
    }
    #pragma omp task
    { <host code or another offload> }
}

Options for Parallelism – Intel® Threading Building Blocks (Intel® TBB)

• Intel® TBB works the same on the Intel® MIC Architecture as it does on the host (beyond number of cores)

• Consists of C++ classes and templates that implement task-based parallelism

– Threads persist in a pool, dispatched upon lighter weight “tasks”

– Makes use of “work stealing” to evenly distribute work across threads and ensure good cache behavior

• Provides a wide range of services needed to implement efficient parallel algorithms

– Generic parallel patterns

– Concurrent containers

– Synchronization primitives

– Dynamic memory management

– Task scheduling

– Thread local storage

– Etc.
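For example, a minimal parallel_reduce sketch (a hypothetical array sum, written in the lambda form):

#include <cstddef>
#include <tbb/blocked_range.h>
#include <tbb/parallel_reduce.h>

float parallel_sum(const float *a, std::size_t n)
{
    return tbb::parallel_reduce(
        tbb::blocked_range<std::size_t>(0, n),
        0.0f, // identity value for the reduction
        [=](const tbb::blocked_range<std::size_t> &r, float local) {
            for (std::size_t i = r.begin(); i != r.end(); ++i)
                local += a[i]; // each task sums its sub-range
            return local;
        },
        [](float x, float y) { return x + y; }); // combines partial sums
}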

Options for Parallelism – Intel® TBB – Services

Concurrent containers:
concurrent_hash_map; concurrent_queue; concurrent_bounded_queue; concurrent_vector; concurrent_unordered_map

Generic parallel algorithms:
parallel_for(range); parallel_reduce; parallel_for_each(begin, end); parallel_do; tbb::graph; parallel_invoke; pipeline, parallel_pipeline; parallel_sort; parallel_scan

Task scheduler:
task_group; task_structured_group; task_scheduler_init; task_scheduler_observer

Synchronization primitives:
atomic; mutex; recursive_mutex; spin_mutex; spin_rw_mutex; queuing_mutex; queuing_rw_mutex; reader_writer_lock; critical_section; condition_variable; lock_guard; unique_lock; null_mutex; null_rw_mutex

Memory allocation:
tbb_allocator; cache_aligned_allocator; scalable_allocator; zero_allocator

Threads:
tbb_thread, thread

Thread local storage:
enumerable_thread_specific; combinable

Miscellaneous:
tick_count

Example: Affinity Partitioner

• Applicable to parallel_for, parallel_reduce, and parallel_scan

void sum(int* result, const int* a, const int* b, std::size_t size)
{
    static affinity_partitioner partitioner;
    parallel_for(
        blocked_range<std::size_t>(0, size),
        [=](const blocked_range<std::size_t>& r) {
            for (std::size_t i = r.begin(); i != r.end(); ++i) {
                result[i] = a[i] + b[i];
            }
        },
        partitioner);
}

Intel® Cilk™ Plus Keywords for Task Parallelism

• Array notation provides single-core SIMD-based data parallelism

• Fine-grained task parallelism provided by new keyword extensions to C and C++

– Shared memory multiprocessing, not distributed memory

• Three keywords

_Cilk_spawn (cilk_spawn with cilk/cilk.h)

o Function call may be run in parallel with caller – determined at runtime

_Cilk_sync (cilk_sync with cilk/cilk.h)

o Caller waits for all children to complete

_Cilk_for (cilk_for with cilk/cilk.h)

o Any or all iterations may execute in parallel with one another.

o All iterations complete before program continues.

o Subject to the constraint of a single control variable

Intel® Cilk™ Plus keywords for Task Parallelism - Examples

Spawn and sync:

int fib(int n)
{
    int x, y;
    if (n < 2) return n;
    x = _Cilk_spawn fib(n-1); // non-blocking – x & y values
    y = fib(n-2);             // calculated in parallel
    _Cilk_sync;               // wait here until both done
    return x+y;
}

Parallel for:

_Cilk_for (int i = begin; i < end; i += 2)
    f(i);

_Cilk_for (T::iterator i(vec.begin()); i != vec.end(); ++i)
    g(i);

Simultaneous Host/Coprocessor Computing – Using Intel® Cilk™ Plus

• _Cilk_spawn statements in host code can be combined with statements marked with _Cilk_offload to allow asynchronous processing

– Example:

x = _Cilk_spawn _Cilk_offload func(y);
<host code runs simultaneously on the host while the _Cilk_offload code runs on the coprocessor>
_Cilk_sync; // host waits here for all spawns to complete

• Note that the following is not asynchronous, but instead blocks in this host thread until the _Cilk_for completes on the coprocessor:

_Cilk_offload _Cilk_for (i = 0; i < 10000; i++)
{
    <loop_body;>
}

Pthreads*

• Pthreads* works the same on the Intel® MIC Architecture as it does on the host (beyond number of cores and some affinity tuning)

– Just do Pthread* programming like you normally would

– Load balancing and thread safety are entirely up to you

• Avoid core 0 (threads 0-3) using the thread affinity API, as that is where the card OS and interrupt-processing occur

– High-level parallel models do this automatically

• Support for other threading libraries will require porting them to the Intel® MIC Architecture

• Be especially careful to make sure all code called by pthread* functions is marked __attribute__((target(mic))) or _Cilk_shared

Pthreads* Partial Example

#pragma offload_attribute(push, target(mic))
#include <pthread.h>
#include <pthread_affinity_np.h>
#pragma offload_attribute(pop)

__attribute__((target(mic))) const int N_THREADS(124);
__attribute__((target(mic))) pthread_t _knfThreads[N_THREADS];
__attribute__((target(mic))) KnfReductionTask _knfTasks[N_THREADS];

__attribute__((target(mic)))
void _knfReductionPThreadsCleanup()
{
    for (int i = 0; i < N_THREADS; i++) {
        _knfTasks[i].alive = false;
    }
    pthread_barrier_wait(_knfTasks[0].barrier);
    for (int i = 0; i < N_THREADS; i++) {
        pthread_join(_knfThreads[i], NULL);
    }
    pthread_barrier_destroy(_knfTasks[0].barrier);
}

void reductionPThreadsInit()
{
    #pragma offload target(mic)
    _knfReductionPThreadsInit();
}

float reduction(float *data, size_t size)
{
    float ret(0.f);
    #pragma offload target(mic) in(data:length(size))
    ret = _knfReductionPThreads(data, size);
    return ret;
}

void reductionPThreadsCleanup()
{
    #pragma offload target(mic)
    _knfReductionPThreadsCleanup();
}

Agenda

• Overview

• Execution options

• Parallelization options

• Vectorization options

Software Behind the Vectorization

float *restrict A, *B, *C;

for (i = 0; i < n; i++) {
    A[i] = B[i] + C[i];
}

• [SSE2] 4 elements at a time: addps xmm1, xmm2

• [AVX] 8 elements at a time: vaddps ymm1, ymm2, ymm3

• [IMCI] 16 elements at a time: vaddps zmm1, zmm2, zmm3

Vector (or SIMD) code computes more than one element at a time.

[Diagram: lane-by-lane SIMD operation Xi op Yi – X87 is scalar; SSE registers (bits 0-127) hold 4 single-precision elements; AVX (through bit 255) holds 8; IMCI (through bit 511) holds 16.]

Options for SIMD Width Abstraction

From greatest ease of use / code maintainability to most programmer control (depends on problem):

• Intel® MKL libraries
• Array notation: Intel® Cilk™ Plus
• Automatic vectorization
• Semiautomatic vectorization with annotation: #pragma vector, #pragma ivdep, and #pragma simd
• C/C++ vector classes (F32vec16, F64vec8)
• Vector intrinsics (_mm_add_ps, addps)
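To make the middle options concrete, a minimal sketch of the same vector-add loop written with array notation and with #pragma simd (function names are illustrative; restrict assumes C99):

// Intel® Cilk™ Plus array notation: one statement, vectorized by the compiler
void add_array_notation(float * restrict A, const float *B, const float *C, int n)
{
    A[0:n] = B[0:n] + C[0:n];
}

// Semiautomatic: #pragma simd asserts the loop is safe to vectorize
void add_pragma_simd(float * restrict A, const float *B, const float *C, int n)
{
    #pragma simd
    for (int i = 0; i < n; ++i)
        A[i] = B[i] + C[i];
}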

Legal Disclaimer & Optimization Notice

INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

Copyright © , Intel Corporation. All rights reserved. Intel, the Intel logo, Xeon, Xeon Phi, Core, VTune, and Cilk are trademarks of Intel Corporation in the U.S. and other countries.

Optimization Notice

Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804

Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.