
© 2013, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others.

Parallel Programming: The Ultimate Road to Performance

April 16, 2013


Werner Krotz-Vogel

http://software.intel.com/en-us/articles/optimization-notice

© 2009 Mathew J. Sottile, Timothy G. Mattson, and Craig E

Getting started with parallel algorithms

• Concurrency is a general concept

– … multiple activities that can occur and make progress at the same time.

• A parallel algorithm is any algorithm that uses concurrency to solve a problem of a given size in less time

• Scientific programmers have been working with parallelism since the early 80’s

– Hence we have almost 30 years of experience to draw on to help us understand parallel algorithms.



Develop & Parallelize Today for Maximum Performance

Use One Software Architecture Today. Scale Forward Tomorrow.

Cluster

Multicore Cluster

Enabling & Advancing Parallelism High Performance Parallel Programming

Code

Compiler Libraries

Parallel Models

Multicore & Many-core

Cluster

Many-core

Multicore CPU

Intel® Xeon Phi™ coprocessor

Multicore

Multicore CPU

Intel tools, libraries and parallel models extend to multicore, many-core and heterogeneous computing



Intel® Software Development Products Deliver Application Performance

Foundation of Performance, Productivity, and Standards

Advanced Performance Cluster Performance

Intel® Inspector XE, Intel® VTune™ Amplifier XE, Intel® Advisor

Intel® C/C++ and Fortran Compilers w/OpenMP

Intel® MKL, Intel® Cilk™ Plus, Intel® TBB Library, Intel® IPP Library

Intel® Trace Analyzer and Collector

Intel® MPI Library

Intel® Parallel Studio XE



A Family of Parallel Programming Models Developer Choice

Intel® Cilk™ Plus C/C++ language extensions to simplify parallelism

Open sourced

Also an Intel product

Intel® Threading Building Blocks

Widely used C++ template library for parallelism

Open sourced

Also an Intel product

Domain-Specific Libraries

Intel® Integrated Performance Primitives

Intel® Math Kernel Library

Established Standards

Message Passing Interface (MPI)

OpenMP*

Coarray Fortran

OpenCL*

Offload Extensions

Research and Development

Intel® Concurrent Collections

Intel® SPMD Parallel Compiler

Choice of high-performance parallel programming models

Applicable to Multicore and Many-core Programming



Invest in Common Tools and Programming Models

Intel® Xeon® processors are designed for intelligent performance and smart energy efficiency

Continuing to advance Intel® Xeon® processor family and instruction set (e.g., Intel® AVX, etc.)

Multicore

Intel® Xeon Phi™ coprocessors are ideal for highly parallel computing applications

Software development platforms ramping now

+

Many-core

Tomorrow

Use One Software Architecture Today. Scale Forward Tomorrow.

Code

Today

Use One Software Architecture

+



/* Intel® Math Kernel Library */
void foo()
{
    float *A, *B, *C; /* Matrices */
    sgemm(&transa, &transb, &N, &N, &N, &alpha, A, &N, B, &N, &beta, C, &N);
}

Go Parallel with High Performance Math Kernel Library Intel® Math Kernel Library (Intel® MKL)

Intel® Xeon® processor Intel® Xeon Phi™ coprocessor

Implicit automatic offloading requires no code changes; simply link with the offload MKL library.

Intel High Performance Math Kernel Library is Applicable to Multicore and Many-core Programming



Go Parallel with Intel® Cilk™ Plus

• Proven Cilk parallel model, teachable in one minute

– Parallelism in three keywords:

• cilk_spawn
• cilk_sync
• cilk_for

• Cilk™ Plus: an open specification

– Recently placed into open source by Intel for the advancement of parallel programming

Learn more at http://cilkplus.org

// Parallel function invocation, in C
cilk_for (int i=0; i<n; ++i) {
    Foo(a[i]);
}

// Parallel spawn in a recursive fibonacci computation, in C
int fib (int n)
{
    if (n < 2) return 1;
    else {
        int x, y;
        x = cilk_spawn fib(n-1);
        y = fib(n-2);
        cilk_sync;
        return x + y;
    }
}

Intel® Cilk™ Plus is Applicable to Multicore and Many-core Programming



// pragma simd: user-mandated vectorization
#pragma simd
for (i=0; i<n; i++) {
    A[i] = A[i] + B[i] + C[i];
}

// Simplify operation using array notations in C/C++:
a[:] = b[:] + c[:];

// Elemental functions, in C++ (note the reference parameter), using Cilk Plus:
__declspec (vector)
void saxpy(float a, float x, float &y)
{
    y += a * x;
}

Go Parallel with Intel® Cilk™ Plus

• Data and Task Parallelism as first class citizens in C and C++

– Vectorization via intuitive notations that automatically span MMX, SSE, AVX, and wider widths in the future, including those in the Intel® Xeon Phi™ coprocessors

• array notations
• #pragma simd controls
• elemental functions

Learn more at http://cilkplus.org

Intel® Cilk™ Plus is Applicable to Multicore and Many-core Programming



Go Parallel with Intel® Threading Building Blocks (Intel® TBB)

• A popular parallel abstraction for C++ developers

– A C++ template library
– Scalable memory allocation
– Load-balancing
– Work-stealing task scheduling
– Thread-safe pipeline
– Concurrent containers
– High-level parallel algorithms
– Numerous synchronization primitives

• Intel remains a leading participant and contributor in the TBB open source project as well as a leading supplier of TBB support and supporting tools.

// Parallel function invocation example, in C++, using TBB:
parallel_for (0, n, [=](int i) {
    Foo(a[i]);
});

Learn more at http://threadingbuildingblocks.org

Intel® TBB is Applicable to Multicore and Many-core Programming



Intel® Threading Building Blocks

Concurrent Containers: Concurrent access, and a scalable alternative to containers that are externally locked for thread safety

Miscellaneous: Thread-safe timers

Generic Parallel Algorithms: An efficient, scalable way to exploit the power of multicore without having to start from scratch

Task Scheduler: The engine that powers the parallel algorithms and employs task stealing to maximize concurrency

Synchronization Primitives: User-level and OS wrappers for mutual exclusion, ranging from atomic operations to several flavors of mutexes and condition variables

Memory Allocation: Per-thread scalable memory manager and false-sharing-free allocators

Threads: OS API wrappers

Thread Local Storage: Scalable implementation of thread-local data that supports an unlimited number of TLS

TBB Flow Graph




#include <cstdio>
#include <string>
#include "tbb/flow_graph.h"

using namespace tbb::flow;

// Node body: prints the node's name each time the node fires.
struct body {
    std::string my_name;
    body( const char *name ) : my_name(name) {}
    void operator()( continue_msg ) const {
        printf("%s\n", my_name.c_str());
    }
};

int main() {
    graph g;
    broadcast_node< continue_msg > start( g );  // every node must be bound to a graph
    continue_node< continue_msg > a( g, body("A") );
    continue_node< continue_msg > b( g, body("B") );
    continue_node< continue_msg > c( g, body("C") );
    continue_node< continue_msg > d( g, body("D") );
    continue_node< continue_msg > e( g, body("E") );
    make_edge( start, a );
    make_edge( start, b );
    make_edge( a, c );
    make_edge( b, c );
    make_edge( c, d );
    make_edge( a, e );
    // Run the whole graph three times.
    for (int i = 0; i < 3; ++i ) {
        start.try_put( continue_msg() );
        g.wait_for_all();
    }
    return 0;
}

[Figure: dependence graph of nodes A, B, C, D, and E, each running f()]

TBB Flow Graph Dependence Example



Go Parallel with Message Passing Interface (MPI) Intel® Message Passing Interface (Intel® MPI)

• Extend your cluster solutions to the Intel® Xeon Phi™ coprocessor

– E.g., Intel® Xeon Phi™ coprocessor in every node of the cluster using Intel® MPI and Intel® Threading Building Blocks and/or Intel® Cilk™ Plus on nodes

– Same model as an Intel® Xeon® processor-based cluster

Learn more at http://intel.com/go/mpi

Intel is a leading vendor of MPI implementations and tools

[Figure: clusters with multicore and many-core nodes, shown next to a multicore cluster]

MPI is applicable to Multicore and Many-core Programming



Go Parallel with Coarray Fortran Intel® Fortran Compiler

• A standard, explicit notation for data decomposition, such as that often used in message-passing models, expressed in a natural Fortran-like syntax.

• For parallel programming on both shared memory and distributed memory systems

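! Sum in Fortran, using the co-array feature.
! SYNC_ALL( WAIT=... ) follows the original Co-Array Fortran notation;
! Fortran 2008 spells a full barrier as the SYNC ALL statement.
REAL SUM[*]
CALL SYNC_ALL( WAIT=1 )
DO IMG = 2, NUM_IMAGES()
   IF (IMG == THIS_IMAGE()) THEN
      SUM = SUM + SUM[IMG-1]
   ENDIF
   CALL SYNC_ALL( WAIT=IMG )
ENDDO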

Learn more at http://intel.com/software/products

Coarray Fortran is Applicable to Multicore and Many-core Programming




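#include <stdio.h>

#define N 1000000 /* number of integration steps; value assumed, not given on the slide */

int main(void)
{
    double pi = 0.0;
    long i;

    /* The offload pragma is the one-line change; it is specific to the Intel compiler. */
    #pragma offload target (mic)
    #pragma omp parallel for reduction(+:pi)
    for (i = 0; i < N; i++) {
        double t = (double)((i + 0.5) / N);
        pi += 4.0 / (1.0 + t * t);
    }
    printf("pi = %f\n", pi / N);
    return 0;
}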

OpenMP* is Applicable to Multicore and Many-core Programming

One Line Change to Offload to the


Go Parallel with OpenMP* Intel® C/C++ and Fortran Compilers (C Example)




!dir$ omp offload target(mic)
!$omp parallel do
do i=1,10
   A(i) = B(i) * C(i)
enddo
!$omp end parallel do

Go Parallel with OpenMP* Intel® C/C++ and Fortran Compilers (Fortran Example)

OpenMP* is Applicable to Multicore and Many-core Programming

One Line Change to Offload to the




Go Parallel with C/C++ Language Extensions

• Simple Keyword Language Extensions to control offloading to Intel Xeon Phi™ coprocessor

C/C++ Language Extensions to Multicore and Many-core Programming

C/C++ Language Extensions

class _Shared common {
    int data1;
    char *data2;
    class common *next;
    void process();
};

_Shared class common obj1, obj2;
…
_Cilk_spawn _Offload obj1.process();
_Cilk_spawn obj2.process();
…



Use the Same Code for Execution on Intel® Xeon Phi™ coprocessors by Offloading

• C/C++ Offload Pragma

#pragma offload target (mic)

#pragma omp parallel for reduction(+:pi)

for (i=0; i<count; i++) {

float t = (float)((i+0.5)/count);

pi += 4.0/(1.0+t*t);

}

pi /= count;

• MKL Implicit Offload

//MKL implicit offload requires no source code changes, simply link with the offload MKL Library.

• MKL Explicit Offload

#pragma offload target (mic) \

in(transa, transb, N, alpha, beta) \

in(A:length(matrix_elements)) \

in(B:length(matrix_elements)) \

in(C:length(matrix_elements)) \

out(C:length(matrix_elements) alloc_if(0))

sgemm(&transa, &transb, &N, &N, &N, &alpha,

A, &N, B, &N, &beta, C, &N);

• Fortran Offload Directive

!dir$ omp offload target(mic)

!$omp parallel do

do i=1,10

A(i) = B(i) * C(i)

enddo

!$omp end parallel do

C/C++ Language Extensions

class _Shared common {

int data1;

char *data2;

class common *next;

void process();

};

_Shared class common obj1, obj2;

…

_Cilk_spawn _Offload obj1.process();

_Cilk_spawn obj2.process();

…



Parallelism with OpenCL* Intel® OpenCL SDK

• OpenCL* is a framework for writing programs that execute across heterogeneous platforms (e.g., CPUs, GPUs, many-core)

• Intel is a leading participant in the OpenCL* standard efforts, and a vendor of solutions and related tools with early implementations available today.

• OpenCL* addresses the needs of customers in specific segments

//Simple per element multiplication using OpenCL*:

kernel void dotprod( global const float *a,
                     global const float *b,
                     global float *c)
{
    int myid = get_global_id(0);
    c[myid] = a[myid] * b[myid];
}

Learn more at http://intel.com/go/opencl

OpenCL is applicable to multicore and many-core programming



Intel Host Processor

Multicore

Running your Application Execution on the host and Intel® Xeon Phi™ coprocessor

Intel® Xeon Phi™ coprocessor(s)

Many-core

Host Offload Library

Message Library

Target Offload Library

Message Library

Execution Flow (your application with identified compute-intensive kernels)

Without: Intel® Xeon Phi™ coprocessor(s) are absent

– Application starts and executes on host
– At each offload, the construct runs on host cores/threads
– Normal program termination on host

With: Intel® Xeon Phi™ coprocessor(s) are present

– Application starts on host and executes portions on Intel® Xeon Phi™ coprocessor(s)
– At runtime, if Intel® Xeon Phi™ coprocessor(s) are available, the target binary is loaded
– At each offload, the construct runs on the Intel® Xeon Phi™ coprocessor(s)
– At program termination, target binary is unloaded



Intel® MPI/Thread Environment Support

The Intel® MPI execution command mpirun reads argument sets from the command line:

Sections separated by ":" define an argument set (alternatively, each line in a config file specifies a set)

Host, number of nodes, and environment can be set independently in each argument set

# mpirun -env I_MPI_PIN_DOMAIN 4 -host myXEON ... : -env I_MPI_PIN_DOMAIN 16 -host myMIC

Adapt the important environment variables to the architecture

OMP_NUM_THREADS, KMP_AFFINITY for OpenMP

CILK_NWORKERS for Intel® Cilk™ Plus


* Although locality issues apply as well, multicore threading runtimes are far more expressive and richer, and have lower overhead.



Analyzing your Application Performance Analysis Tools

• Intel® VTune™ Amplifier XE performance profiler

– Analyze your multicore and many-core performance

– Analyze performance of the application in offload mode

• Support for Intel® Xeon Phi™ coprocessors includes:

– A Linux* hosted command line tool that collects events

– The VTune™ Amplifier XE graphical user interface, which displays the results collected in the previous step, highlighting bottlenecks, time spent, and other performance details



GDB* on Intel® Xeon Phi™ Coprocessor

• GDB* supports Intel® Xeon Phi™ Coprocessor

• Intel upstreams features and capabilities to GNU* community

• Broad enabling of developers and software tools ecosystem

• Available from Intel at http://software.intel.com



The GNU* Project Debugger and Intel® Xeon Phi™ Coprocessor

• Native and cross-debugger versions of GDB* exist for the Intel® Xeon Phi™ coprocessor

• It is part of the Intel® Manycore Platform Software Stack (Intel® MPSS)

• http://software.intel.com/en-us/articles/intel-manycore-platform-software-stack-mpss

• You can debug with it as either root or a user



Native debugging on the Intel® Xeon Phi™ Coprocessor with GDB*

• Run GDB* on the Intel® Xeon Phi™ Coprocessor

ssh -t mic0 /usr/bin/gdb

– To attach to a running application via the process-id

(gdb) shell pidof my_application

42

(gdb) attach 42

– To run an application directly from GDB*

(gdb) file /target/path/to/application

(gdb) start



Remote debugging with GDB* for Intel® Xeon Phi™ Coprocessor

• Run GDB* on your localhost

/usr/linux-k1om-4.7/bin/x86_64-k1om-linux-gdb

• Start gdbserver on the Intel® Xeon Phi™ Coprocessor

– To remote debug using | ssh

(gdb) target extended-remote | ssh -T mic0 gdbserver --multi IP:port

– To remote debug using stdio

(gdb) target extended-remote | ssh -T mic0 gdbserver --multi -

• To attach to a running application via the process-id (pid)

(gdb) file /local/path/to/application

(gdb) attach <remote-pid>

• To run an application directly from GDB*

(gdb) file /local/path/to/application

(gdb) set remote exec-file /target/path/to/application



Explore Intel® Xeon Phi™ Coprocessor Architecture Features

List all new vector and mask registers

(gdb) info registers zmm
k0 0x0 0
⁞
zmm31 {v16_float = {0x0 <repeats 16 times>}, v8_double = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, v64_int8 = {0x0 <repeats 64 times>}, v32_int16 = {0x0 <repeats 32 times>}, v16_int32 = {0x0 <repeats 16 times>}, v8_int64 = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, v4_uint128 = {0x0, 0x0, 0x0, 0x0}}

Disassemble Instructions

(gdb) disassemble $pc, +10
Dump of assembler code from 0x11 to 0x24:
0x0000000000000011 <foobar+17>: vpackstorelps %zmm0,-0x10(%rbp){%k1}
0x0000000000000018 <foobar+24>: vbroadcastss -0x10(%rbp),%zmm0



Intel® Software Tools Roadmap

High Performance Computing / Enterprise

Intel® Parallel Studio XE 2013 -Support for Intel® Xeon Phi™ Coprocessors (Linux)

Timeline: Q3 ’12, Q4 ’12, Q1 ’13, Q2 ’13, Q3 ’13

Legend: Alpha, Beta release window, Gold release

Intel® Cluster Studio XE 2013 -Support for Intel® Xeon Phi™ Coprocessors (Linux)

Intel® Parallel Studio XE NEXT

Intel® Cluster Studio XE NEXT

Many-Core

Data Center Tools

Beta release window for Microsoft Windows*

Intel® Xeon Phi™ Coprocessor Support for Windows* (Beta)

Intel® Xeon Phi™ Coprocessor Support for Windows* (Alpha)

Intel® Cluster Studio XE 2012



Preserve Your Development Investment Common Tools and Programming Models for Parallelism

Multicore

Many-core

Heterogeneous Computing

Intel® Cilk Plus

Intel® TBB Offload Pragmas

OpenCL*

OpenMP*

OpenMP*

Coarray

Offload Directives

Intel® MPI

Intel® MKL

C/C++

Fortran

Intel® C/C++ Compiler

Intel® Fortran Compiler

Develop Using Parallel Models that Support Heterogeneous Computing



Conclusion

• There are many parallel programming models in existence, but only a small number are actually used and standardized across platforms:

– OpenMP
– MPI
– TBB
– Cilk
– Pthreads
– OpenCL

• Everything you do to make applications run well on Intel Xeon Phi coprocessors (vectorization, parallelization) can be done with the models above (OpenMP, MPI, etc.). The same work also applies to Intel Xeon processors and typically improves performance there too.



Call to Action

• Evaluate the Intel® Software Development Products, including the family of Parallel Programming Models, for your High Performance needs:

http://www.intel.com/software/products/eval

• For product information see:

http://www.intel.com/software/products

Note: The Intel® Parallel Studio XE 2013 and Intel® Cluster Studio XE 2013 products include support for Intel® Xeon Phi™ coprocessors prior to the coprocessors being generally available.



http://software.intel.com/en-us/articles/intel-software-evaluation-center/




INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. Copyright © , Intel Corporation. All rights reserved. Intel, the Intel logo, Xeon, Core, Xeon Phi, VTune, and Cilk are trademarks of Intel Corporation in the U.S. and other countries.

Optimization Notice

Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804

Legal Disclaimer & Optimization Notice

Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
