Herbert Cornelius Intel Software Modernization Taking full advantage of modern performance technologies


Post on 22-May-2020


Page 1: SC’14 HPC Dev. Workshop Session Presentation Template · Vector Parallel Vector. 1 core no SIMD. 1 core with SIMD. all cores with SIMD. 12. ... Still makes sense under Amdahl’s

Copyright © 2015, Intel Corporation. All rights reserved.

Herbert Cornelius, Intel

Software Modernization

Taking full advantage of modern performance technologies

Page 2:

#1 Supercomputer Scaling on Top500*

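What if … the 2003-2013 CAGR held to 2023?

[Figure: two log-scale charts for the #1 system on the TOP500, with points at 2003, 2013 and 2023]

Performance: 35 TF (2003), 33 PF (2013), 31.9 EF? (2023, projected) (1)

Cores: 5.12K (2003), 3.12M (2013), 1.9B? (2023, projected) (1)

Source: TOP500 stats, www.top500.org

(1) Projected flops and cores calculated using the CAGR in flops and cores on the #1 system on the TOP500 from 2003 to 2013 and extrapolating to 2023.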

*Other logos, brands and names are the property of their respective owners.

All products, computer systems, dates and figures specified are preliminary based on current expectations, and are subject to change without notice.

Page 3:

Parallel is your path forward.

INDUSTRY TREND

Page 4:

What are the chances?

A code (probably in FORTRAN) written for this CPU will perform well on these?

IBM* 5110 in 1978
www.rugged-portable.com/history-portable-computers-rugged-bias/ibm-portable-pc-5110

(die sizes not to scale, for illustration only)

Many-Core & SIMD

Multi-Core & SIMD

Page 5:

Intel® Xeon® and Intel® Xeon Phi™ Product Families are both going parallel

Processor: Core(s) / Threads / SIMD Width (bits)

Intel® Xeon® processor 64-bit: 1 / 2 / 128
Intel® Xeon® processor 5100 series: 2 / 2 / 128
Intel® Xeon® processor 5500 series: 4 / 8 / 128
Intel® Xeon® processor 5600 series: 6 / 12 / 128
Intel® Xeon® processor code-named Sandy Bridge EP: 8 / 16 / 256
Intel® Xeon® processor code-named Ivy Bridge EP: 12 / 24 / 256
Intel® Xeon® processor code-named Haswell EP: 18 / 36 / 256
Intel® Xeon Phi™ coprocessor code-named Knights Corner: 61 / 244 / 512
Intel® Xeon Phi™ processor & coprocessor code-named Knights Landing (1): 60+ / 240+ / 512

More Cores, More Threads, Wider Vectors

Product specification for launched and shipped products available on ark.intel.com. (1) Not launched or in planning.

Parallel is the Path Forward

(die sizes not to scale, for illustration only)

Page 6:

“The Gap”

SOFTWARE MODERNIZATION

Page 7:

The Urgency of Code Modernization

“This process is referred to as parallelization, code optimization or code modernization. As systems move toward exascale levels of performance, the problem of outdated code will only grow in urgency.

. . .

Driving the urgency for code modernization is recognition that it is both a technical and an economic imperative. Code not modernized equates to system performance and financial advantage left on the table. Outdated code equates to longer run times, more test and design iterations, more modeling within a given period of time, and slower simulations. Bottom line: higher operating costs and less revenue for organizations.”

- Doug Black

Source: www.scientificcomputing.com/articles/2014/12/hpc-community-experts-weigh-code-modernization

Page 8:

Modernization is critical: But which path is best?

Device-Specific Applications (Accelerators): CUDA*, OpenACC*. Investment locked into one architecture.

Applications Suitable for Many Architectures (CPU): OpenMP*, MPI. Reusable, Portable, Scalable.

Page 9:

Modernization is critical: But which path is best?

Device-Specific Applications (Accelerators): CUDA*, OpenACC*. Investment locked into one architecture.

Applications Suitable for Many Architectures (CPU): OpenMP*, MPI. Reusable, Portable, Scalable.

Economical

Sustainable

Open

Page 10:

Amdahl’s Law

Modern Version:

Enhancing a fraction f of a computation by speedup S, the overall speedup is:

$$\mathrm{Speedup}_{\mathrm{enhanced}}(f, S) = \frac{1}{(1 - f) + \frac{f}{S}}$$

Speedup S can be achieved through either SIMD/vectorization or multi-threading.

Source: TR1593.pdf, http://minds.wisconsin.edu

Page 11:

Modernize Your Software
Performance = Parallelism on all Levels

NODES clustering

SIMD vectorization

CORES multi-threading

ILP instruction parallelism

CACHE optimization, data-layout

Simplified example for a 61-core, 8-way SIMD processor -- for illustration only

Page 12:

Analogy: Using all Resources:

Serial Scalar: 1 core, no SIMD
Vector: 1 core with SIMD
Parallel Vector: all cores with SIMD

Page 13:

Impact of using the entire CPU

2007: Intel® Xeon® Processor X5472 (codenamed Harpertown)

2009: Intel® Xeon® Processor X5570 (codenamed Nehalem)

2010: Intel® Xeon® Processor X5680 (codenamed Westmere)

2012: Intel® Xeon® Processor E5-2600 family (codenamed Sandy Bridge EP)

2013: Intel® Xeon® Processor E5-2600 v2 family (codenamed Ivy Bridge EP)

2014: Intel® Xeon® Processor E5-2600 v3 family (codenamed Haswell EP)

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. Performance measured in Intel Labs by Intel employees. For more information go to http://www.intel.com/performance .

SS: Single threaded and Scalar
VS: Vectorized and Single threaded
SP: Scalar and Parallel
VP: Vectorized and Parallelized

Example: 179x (Vectorized & Parallelized)

Page 14:

Unlock Parallel Potential
Consider an Intel® Xeon Phi™ Coprocessor for code development today

Product: Cores up to / Threads/Core / Vector Width / Peak Memory Bandwidth

Intel® Xeon® Processor E5-2600 v2 Product Family, formerly codenamed Ivy Bridge: 12 / 2 / 256-bit / 59 GB/s
Intel® Xeon® Processor E5-2600 v3 Product Family, formerly codenamed Haswell: 18 / 2 / 256-bit / 68 GB/s
Intel® Xeon Phi™ x100 Product Family, formerly codenamed Knights Corner: 61 / 4 / 512-bit / 352 GB/s
Intel® Xeon Phi™ x200 Product Family, codenamed Knights Landing: 60+ / 4 / 512-bit / >350 GB/s
Future Xeon: tba / tba / 512-bit / tba

The world is going parallel – stick with sequential code and you will fall behind.


tba tba

Xeon Phi amplifies thread, core & wide vector scaling to be ready for tomorrow’s processors

Future

(die sizes not to scale, for illustration only)

Page 15:

Application Modernization

Page 16:

Case study compliments of Hansen Experimental Physics Laboratory, Stanford University* and Colfax International*

HEATCODE – Cosmology Application
Building a Model of our Galaxy processing spectral data

Objective: Reduce days of compute to hours

Intel® Xeon® processors are parallel
Programs should be too …

• Parallel Tasks
• Utilize
  • Cores
  • Threads
  • SIMD where possible

Case Study: Tapping the Performance Reserve

http://arxiv.org/pdf/1311.4627.pdf

Page 17:

A real-life example

Colfax International* and Stanford* – HEATCODE application

• As advertised, the application ported quickly from Intel® Xeon® processors to the Intel® Xeon Phi™ coprocessor

• But performance was less than 1/3 that of two Intel® Xeon® E5 Family CPUs

Why did performance decrease?

1. Move from a modern >3 GHz out-of-order CPU to a low-power 1 GHz in-order core

2. Limited threading, no vectorization

CPU = Xeon = Intel® Xeon Processor
Xeon Phi = Intel® Xeon Phi™ coprocessor

Page 18:

What is the impact of modernization?

After optimization, the new many-core code:

• Runs up to 620x faster than its baseline!

• Runs up to 125x faster than the original Intel® Xeon® processor baseline!

• Well worth the time and effort

Results measured on Stanford University HEATCODE application in labs at Stanford University and Colfax International Configuration: Colfax CX2265i-XP5 server ( http://www.colfax-intl.com/nd/Servers/CX2265i-XP5.aspx ) based on two Intel Xeon E5-2680 processors with 64 GB of DDR3 memory at 1333 MHz. This server hosted two Intel Xeon Phi coprocessors B1QS-5110P, each with 60 cores and 8 GB of on-board GDDR5 memory. The server was running a Linux operating system CentOS6.4 on the host, and the MPSS (Intel MIC Platform Software Stack) version 2.1.4982-15 on coprocessors. Our tests were performed with one of the two compilers: Intel C++ compiler version 13.1.1 and GNU C++ compiler version 4.4.7.

Intel does not control or audit the design or implementation of third party benchmark data or Web sites referenced in this document. Intel encourages all of its customers to visit the referenced Web sites or others where similar performance benchmark data are reported and confirm whether the referenced benchmark data are accurate and reflect performance of systems available for purchase. For more complete information about performance and benchmark results, visit Performance Test Disclosure

Page 19:

Still makes sense under Amdahl’s law

The new, modern code

• Improves CPU-only performance up to 60x from baseline

• Has 1 coprocessor run ~2X the speed of 2 CPUs

A highly parallel processor will improve performance from 1-8x, and generally 1.5-3x

Any larger increase comes from software modernization -- extracting parallelism in the algorithm

Page 20:

Scaling Applications
on Intel® Xeon® and Intel® Xeon Phi™

… and the performance on HEATCODE scales per node as you add an additional Intel Xeon Phi coprocessor.

Page 21:

Performance & General Programmability

Source: http://research.colfaxinternational.com/file.axd?file=2013%2f11%2fSC13-Intel-Theater-Talk-Colfax.pdf (see optimization notice)

Page 22:

*Other names and brands may be claimed as the property of others.

Page 23:

Parallel Programming for Intel® Architecture
(or, in general, for normal CPUs)

4 considerations when writing an efficient, unconstrained parallel program

Cores, Threads: OpenMP*, TBB, Cilk™ Plus; threads, locks
Vectors: vector loops, vector functions, array notations; intrinsics
Memory, Caches: blocking algorithms
Data Layout and Alignment: AoS/SoA library, directives for alignment; manual layout, ugly code

Page 24:

OpenMP* 4.0 Industry Standard

www.openmp.org

One of the most capable high-level parallel languages, now supporting:

• SMP multi-threading
• SIMD vectorization
• Accelerator offloading

November 17, 2014: Technical Report 3 - Refinements for Accelerators and Task Creation Announced
http://openmp.org/wp/tr3-announced-refinements-for-accelerators-and-task-creation/

Page 25:

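Vector programming in OpenMP* 4.0
Mandel example: vectorize an outer loop with function calls

    // fcomplex, LIMIT, count, in_vals, ImageHeight and ImageWidth
    // are declared elsewhere (not shown on the slide).

    #pragma omp declare simd        // also generate a SIMD variant of mandel()
    int mandel(fcomplex c)
    {   // Computes number of iterations for c to escape
        fcomplex z = c;
        int iters = 0;
        while ((cabsf(z) < 2.0f) && (iters < LIMIT)) {
            z = z * z + c;
            iters++;
        }
        return iters;
    }

    #pragma omp parallel for        // rows are distributed across threads
    for (int y = 0; y < ImageHeight; ++y) {
        #pragma omp simd            // pixels of a row are processed in SIMD lanes
        for (int x = 0; x < ImageWidth; ++x) {
            count[y][x] = mandel(in_vals[y][x]);
        }
    }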

Page 26:

Intel® Software Development Tools
“Faster Code Faster”

http://software.intel.com/en-us/intel-parallel-studio-xe
https://software.intel.com/en-us/developer-tools-technical-enterprise

Cluster Edition
Professional Edition
2015

• Industry-leading performance from advanced compilers and tools
• Optimizations for IA (incl. vector/SIMD and parallel/MT)
• Industry Standard Parallel programming models
• Comprehensive optimized libraries
• Insightful analysis tools

Page 27:

Intel® Software Development Tools

Intel® Parallel Studio XE 2015 Professional Edition
• Core and Node level (SMP)

Intel® Parallel Studio XE 2015 Cluster Edition
• High Performance Cluster Development Tools for TC/HPC

Intel® Compilers
• FORTRAN, C/C++, Cilk™ Plus, OpenMP* 4.0

Intel® Libraries
• Intel® Math Kernel Library (MKL)
• Intel® Integrated Performance Primitives (IPP)
• Intel® Threading Building Blocks (TBB)
• Intel® MPI Library

Intel® Analysers

Page 28:

Growing Intel® Parallel Computing Centers Community


https://software.intel.com/ipcc

Page 29:

https://software.intel.com/en-us/articles/intel-xeon-phi-coprocessor-applications-and-solutions-catalog

Page 30:

Further Readings


https://software.intel.com/ipcc

https://software.intel.com/mic-developer

https://software.intel.com/en-us/server-developer

High Performance Parallelism Pearls: Multicore and Many-core Programming Approaches
James Reinders, Jim Jeffers

Intel® Xeon Phi™ Coprocessor Architecture and Tools: The Guide for Application Developers
Rezaur Rahman

Intel Xeon Phi Coprocessor High Performance Programming
Jim Jeffers, James Reinders

Page 31:

Meet “the book of the year”

Authors: Alexander Supalov, Andrey Semin, Michael Klemm, Chris Dahnken

Table of Contents
• Foreword by Bronis de Supinski (CTO LLNL)
• Preface
• Chapter 1: No Time to Read this Book?
• Chapter 2: Overview of Platform Architectures
• Chapter 3: Top-Down Software Optimization
• Chapter 4: Addressing System Bottlenecks
• Chapter 5: Addressing Application Bottlenecks: Distributed Memory
• Chapter 6: Addressing Application Bottlenecks: Shared Memory
• Chapter 7: Addressing Microarchitecture Bottlenecks
• Chapter 8: Application Design Implications

ISBN-13 (pbk): 978-1-4302-6496-5
ISBN-13 (electronic): 978-1-4302-6497-2

www.apress.com/9781430264965

Page 32:

Page 33:

There is even more …

Intel® Technical Computing Portfolio

Intel® based Workstations/Visualization (Embree)

Intel® Cluster Ready (ICR)

Intel® Data Center Manager (DCM)

Intel® SW Development Tools

Intel® Big Data Analytics Toolkit & Acceleration Library

Intel® Enterprise Edition Lustre* (Filesystem)

Intel® SSD (PCIe NVMe)

Intel® based Storage (NVM)

Intel® True Scale (IBA) and Omni-Path Fabric

Intel® Networking (10/40GbE)

Intel® Boards & Systems

Intel® Xeon Phi™ Processor and Coprocessor

Intel® Xeon® Processors

INTEL TECHNICAL COMPUTING & HPC PORTFOLIO

All components working “better together” for a comprehensive and high-performance end-to-end solution based on Intel technologies.

Page 34:

Configuration Information

HEATCODE Example

Measurements taken by Colfax* International in September and October of 2013. System Configuration: Colfax CX2265i-XP5 server ( http://www.colfax-intl.com/nd/Servers/CX2265i-XP5.aspx ) based on two Intel® Xeon® E5-2680 processors with 64 GB of DDR3 memory at 1,333 MHz. This server hosted two Intel® Xeon Phi™ coprocessors B1QS-5110P, each with 60 cores and 8 GB of on-board GDDR5 memory. The server was running a Linux* operating system CentOS* 6.4 on the host, and the MPSS (Intel MIC Platform Software Stack) version 2.1.4982-15 on coprocessors. Our tests were performed with one of the two compilers: Intel C++ compiler version 13.1.1 and GNU C++ compiler version 4.4.7.

Page 35:

Intel® Compilers – Performance Benefits

Page 36:

Legal DisclaimerINFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.A "Mission Critical Application" is any application in which failure of the Intel Product could result, directly or indirectly, in personal injury or death. SHOULD YOU PURCHASE OR USE INTEL'S PRODUCTS FOR ANY SUCH MISSION CRITICAL APPLICATION, YOU SHALL INDEMNIFY AND HOLD INTEL AND ITS SUBSIDIARIES, SUBCONTRACTORS AND AFFILIATES, AND THE DIRECTORS, OFFICERS, AND EMPLOYEES OF EACH, HARMLESS AGAINST ALL CLAIMS COSTS, DAMAGES, AND EXPENSES AND REASONABLE ATTORNEYS' FEES ARISING OUT OF, DIRECTLY OR INDIRECTLY, ANY CLAIM OF PRODUCT LIABILITY, PERSONAL INJURY, OR DEATH ARISING IN ANY WAY OUT OF SUCH MISSION CRITICAL APPLICATION, WHETHER OR NOT INTEL OR ITS SUBCONTRACTOR WAS NEGLIGENT INTHE DESIGN, MANUFACTURE, OR WARNING OF THE INTEL PRODUCT OR ANY OF ITS PARTS.Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined". Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. 
Do not finalize a design with this information.The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order.Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or go to: http://www.intel.com/design/literature.htm

Code names featured are used internally within Intel to identify products that are in development and not yet publicly announced for release. Customers, licensees and other third parties are not authorized by Intel to use code names in advertising, promotion or marketing of any product or services and any such use of Intel's internal code names is at the sole risk of the user.

Intel, Intel® Xeon®, Intel Xeon Phi™, Intel® True Scale, Intel® Enterprise Edition Lustre, Intel® Data Center Manager, Intel® Cluster Ready and the Intel logo are trademarks of Intel Corporation in the United States and other countries.

*Other names and brands may be claimed as the property of others.

Copyright ©2014 Intel Corporation.


Optimization Notice

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel.

Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804


Risk Factors

The above statements and any others in this document that refer to plans and expectations for the second quarter, the year and the future are forward-looking statements that involve a number of risks and uncertainties. Words such as “anticipates,” “expects,” “intends,” “plans,” “believes,” “seeks,” “estimates,” “may,” “will,” “should” and their variations identify forward-looking statements. Statements that refer to or are based on projections, uncertain events or assumptions also identify forward-looking statements. Many factors could affect Intel’s actual results, and variances from Intel’s current expectations regarding such factors could cause actual results to differ materially from those expressed in these forward-looking statements. Intel presently considers the following to be important factors that could cause actual results to differ materially from the company’s expectations. Demand for Intel's products is highly variable and, in recent years, Intel has experienced declining orders in the traditional PC market segment. Demand could be different from Intel's expectations due to factors including changes in business and economic conditions; consumer confidence or income levels; customer acceptance of Intel’s and competitors’ products; competitive and pricing pressures, including actions taken by competitors; supply constraints and other disruptions affecting customers; changes in customer order patterns including order cancellations; and changes in the level of inventory at customers. Intel operates in highly competitive industries and its operations have high costs that are either fixed or difficult to reduce in the short term.
Intel's gross margin percentage could vary significantly from expectations based on capacity utilization; variations in inventory valuation, including variations related to the timing of qualifying products for sale; changes in revenue levels; segment product mix; the timing and execution of the manufacturing ramp and associated costs; excess or obsolete inventory; changes in unit costs; defects or disruptions in the supply of materials or resources; and product manufacturing quality/yields. Variations in gross margin may also be caused by the timing of Intel product introductions and related expenses, including marketing expenses, and Intel's ability to respond quickly to technological developments and to introduce new products or incorporate new features into existing products, which may result in restructuring and asset impairment charges. Intel's results could be affected by adverse economic, social, political and physical/infrastructure conditions in countries where Intel, its customers or its suppliers operate, including military conflict and other security risks, natural disasters, infrastructure disruptions, health concerns and fluctuations in currency exchange rates. Intel’s results could be affected by the timing of closing of acquisitions, divestitures and other significant transactions. Intel's results could be affected by adverse effects associated with product defects and errata (deviations from published specifications), and by litigation or regulatory matters involving intellectual property, stockholder, consumer, antitrust, disclosure and other issues, such as the litigation and regulatory matters described in Intel's SEC filings. An unfavorable ruling could include monetary damages or an injunction prohibiting Intel from manufacturing or selling one or more products, precluding particular business practices, impacting Intel’s ability to design its products, or requiring other remedies such as compulsory licensing of intellectual property. 
A detailed discussion of these and other factors that could affect Intel’s results is included in Intel’s SEC filings, including the company’s most recent reports on Form 10-Q, Form 10-K and earnings release.

Rev. 4/15/14
