
1

Algorithmic Optimizations for Many-core Wide-vector Processors

Jongsoo Park

Parallel Computing Lab, Intel

2

Notice and Disclaimers

Notice: This document contains information on products in the design phase of development. The information here is subject to change without notice. Do not finalize a design with this information. Contact your local Intel sales office or your distributor to obtain the latest specification before placing your product order.

INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL® PRODUCTS. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY RELATING TO SALE AND/OR USE OF INTEL PRODUCTS, INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT, OR OTHER INTELLECTUAL PROPERTY RIGHT. Intel products are not intended for use in medical, life saving, or life sustaining applications. Intel may make changes to specifications, product descriptions, and plans at any time, without notice.

All products, dates, and figures are preliminary for planning purposes and are subject to change without notice.

Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined." Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them.

Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance.

The Intel products discussed herein may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.

Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or by visiting Intel's website at http://www.intel.com.

Intel® Itanium®, Intel® Xeon®, Xeon Phi™, Pentium®, Intel SpeedStep®, Intel NetBurst®, Intel®, and VTune are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries. Copyright © 2012, Intel Corporation. All rights reserved.

*Other names and brands may be claimed as the property of others.


3

Notice and Disclaimers, Continued

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to http://www.intel.com/performance

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice revision #20110804

4

Outline

• Intel's take on many-core processors
  – Architecture of Knights Corner coprocessors
  – Programming model

• Low-communication algorithm: the "segment-of-interest" fast Fourier transform

• Approximate computing

5

Intel® Xeon Phi™ Coprocessor

6

Xeon Phi™ Architecture

• 60 cores on a ring interconnect, with 512KB L2 per core (512KB x 60 = 30MB of L2 cache)
• 16-wide SIMD (SP), 8-wide SIMD (DP)
• Peak SP: 1.1GHz x 60 cores x 16 lanes x 2 (FMA) = 2.1 TFLOPS
• Peak DP: 1.1GHz x 60 cores x 8 lanes x 2 (FMA) = 1.05 TFLOPS
• Adding 2 Xeon Phis gives about 7x the peak of a 2-socket Sandy Bridge

7

Xeon Phi™ Programming Models

• OpenMP, pthreads (see the offload sketch below)
• MPI: a set of cores can directly work as an MPI process
• OpenCL: recently announced
• Cilk Plus: task-level parallelism
• ISPC: thread and SIMD parallelism expressed the same way (as in CUDA)
• Automatic parallelization by the compiler
• Other frameworks for Xeon are easy to port
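To give a flavor of the offload model (my illustration, not from the slides), a minimal sketch combining the Intel compiler's offload pragma with OpenMP inside the offloaded region; the buffer names and the scale factor are made up:

    #include <stdio.h>
    #include <stdlib.h>

    #define N (1 << 20)

    int main(void) {
        float *a = malloc(N * sizeof(float));
        float *b = malloc(N * sizeof(float));
        for (int i = 0; i < N; ++i) b[i] = (float)i;

        // Run the loop body on the coprocessor; the in/out clauses
        // describe the data transfers over PCIe.
        #pragma offload target(mic:0) in(b:length(N)) out(a:length(N))
        {
            // Inside the offload region, ordinary OpenMP fans the work
            // out across the coprocessor's cores and hardware threads.
            #pragma omp parallel for
            for (int i = 0; i < N; ++i)
                a[i] = 2.0f * b[i];
        }

        printf("a[3] = %f\n", a[3]);
        free(a);
        free(b);
        return 0;
    }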

8

Key Performance Considerations

• Parallelism
  – 60 cores
  – SIMD: 16-wide (SP), 8-wide (DP), gather/scatter, swizzling
• Memory bandwidth
  – On-chip caches, non-temporal stores
  – Spatial locality
• Memory latency hiding
  – 4 simultaneous hardware threads per core
  – Prefetching
• PCIe latency hiding
  – Asynchronous offload directives (see the sketch below)
  – Asynchronous MPI calls
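As a hedged sketch of the "asynchronous offload directives" bullet (mine, not the speaker's), using the Intel compiler's signal/wait clauses so host work can overlap the PCIe transfer and the offloaded computation; the buffer names are hypothetical:

    #include <stdio.h>
    #include <stdlib.h>

    #define N (1 << 20)

    int main(void) {
        float *src = malloc(N * sizeof(float));
        float *dst = malloc(N * sizeof(float));
        int tag;  // any address serves as a signal tag
        for (int i = 0; i < N; ++i) src[i] = (float)i;

        // Launch the offload without blocking the host; signal()
        // tags the operation so we can wait on it later.
        #pragma offload target(mic:0) in(src:length(N)) out(dst:length(N)) signal(&tag)
        {
            for (int i = 0; i < N; ++i)
                dst[i] = src[i] + 1.0f;
        }

        /* ... overlap useful host-side work here ... */

        // Block only when the result is actually needed.
        #pragma offload_wait target(mic:0) wait(&tag)

        printf("dst[0] = %f\n", dst[0]);
        free(src);
        free(dst);
        return 0;
    }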

9

Software Portability

• Correctness portability: any code that works on Xeon® works right away
• Performance portability: an optimization typically speeds up both Xeon® and Xeon Phi™

10

Performance Portability Example – "Segment-of-Interest" FFT

The same code is used for both Xeon® and Xeon Phi™:
– Same optimizations: loop interchange/tiling, unroll-and-jam, …
– Just different tiling and unroll factors (illustrated below)
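To make the idea concrete, a minimal sketch (mine, not from the talk) of how a single source can carry per-target tiling factors; the TILE values and the transpose kernel are made-up placeholders:

    // Hypothetical example: the same loop nest serves both targets;
    // only the tile size changes per platform. __MIC__ is predefined
    // by the Intel compiler when building for Xeon Phi.
    #ifdef __MIC__
    #define TILE 64   /* assumed tuning for Xeon Phi's 512KB L2 per core */
    #else
    #define TILE 32   /* assumed tuning for a Xeon core */
    #endif

    // Tiled matrix transpose: the j-loop is blocked so each tile of
    // src stays cache-resident while it is scattered into dst.
    void transpose(float *dst, const float *src, int n) {
        for (int jj = 0; jj < n; jj += TILE)
            for (int i = 0; i < n; ++i)
                for (int j = jj; j < jj + TILE && j < n; ++j)
                    dst[j * n + i] = src[i * n + j];
    }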

11

My Xeon Phi™ Optimization Flow – Design Top-down, Measure Bottom-up

Design data layout and decomposition
– Maximize locality and minimize communication/synchronization
– Consider vectorization (e.g., SOA vs. AOS)
– Single-thread optimizations depend on this

Measure single-core performance
– Check vectorization and prefetching: the -vec-report compiler flag
– Add pragmas for vectorization and prefetching when appropriate (see the sketch below)
– Convince yourself why you are achieving the measured compute efficiency: IPC and L1/L2$ misses from VTune are useful metrics

Measure thread-level scaling
– Use more cores while appropriately scaling the input
– VTune metrics to look at: load balance, remote L2$ accesses
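As an illustration (mine, not the speaker's) of the vectorization and prefetching pragmas mentioned above, a hedged sketch using Intel compiler pragmas; the loop is made up and the prefetch distance is an assumed tuning value, not a recommendation:

    // restrict plus #pragma ivdep tell the compiler the loads and
    // stores do not alias, removing a common barrier to vectorization.
    void saxpy(int n, float a,
               const float *restrict x, float *restrict y) {
        #pragma ivdep            // assert: no loop-carried dependence
        #pragma vector aligned   // promise: operands are 64-byte aligned
        #pragma prefetch x:1:64  // hint/distance are hypothetical tuning values
        for (int i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];
    }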

12

Algorithmic Optimizations

Before doing all of these, ask yourself

“Is this the right algorithm for modern processors?”

Low-communication Algorithms

14

Communication-bound Application Example: 1D FFT

Surprisingly low compute efficiency of 1D FFT on the HPCC list: ~2% efficiency on the K computer

15

Many-core wide-vector processor + communication-bound application

Estimated 1-D FFT time for 2^32 DP complex numbers with 32 nodes of (2-socket Xeon E5-2680 + 1 Xeon Phi SE10). Park et al., submitted to SC'13

16

Cooley-Tukey Factorization

Circa 1965 (also derived by Gauss, published posthumously in 1866)

N = MP → M length-P FFTs + P length-M FFTs

3 all-to-all communications
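For reference, the standard Cooley-Tukey identity behind this slide (my reconstruction, not on the slide): with $N = MP$, write $n = P n_1 + n_2$ and $k = k_1 + M k_2$, and let $\omega_N = e^{-2\pi i/N}$. Then

\[
X_{k_1 + M k_2} = \sum_{n_2=0}^{P-1} \left[ \omega_N^{n_2 k_1} \left( \sum_{n_1=0}^{M-1} x_{P n_1 + n_2}\, \omega_M^{n_1 k_1} \right) \right] \omega_P^{n_2 k_2} .
\]

The inner sums are $P$ length-$M$ FFTs, the outer sums are $M$ length-$P$ FFTs, and the $\omega_N^{n_2 k_1}$ terms are the twiddle factors; on a cluster, the data reshuffles between the stages become the all-to-all communications counted above.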

17

Segment-of-Interest (SOI) Factorization

3 all-to-alls → 1 all-to-all

Tang, Park, Kim, and Petrov, SC’12

18

Trading off computation for communication

Park et al., submitted to SC’13

Approximate Computing

20

Transcendental Functions in Synthetic Aperture Radar (SAR)

Transcendental functions are hard to vectorize.

3K x 3K image reconstructed from 2.8K pulses with 4K samples each. Xeon: Intel® Xeon® E5-2670, 2-socket, 2.6GHz

Park et al., SC’12

21

Approximate Strength Reduction in SAR

2-4x speedups by optimizing sqrt, sin, and cos.

Intel® Xeon Phi™ results: evaluation card only and not necessarily reflective of production card specifications. Xeon: Intel® Xeon® E5-2670, 2-socket, 2.6GHz

22

Strength Reduction

Before (1 multiplication per iteration):

    for i from 0 to N:
        A[i] = c * i

After (1 addition per iteration):

    t = 0
    for i from 0 to N:
        A[i] = t
        t += c
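A runnable C rendering of the two variants (my sketch, not the speaker's code); note that the incremental version accumulates floating-point rounding error, which is exactly the "approximate" flavor the next slides exploit and bound:

    #include <stdio.h>

    #define N 1000

    int main(void) {
        float a_mul[N], a_add[N];
        const float c = 0.1f;

        // Original: one multiplication per iteration.
        for (int i = 0; i < N; ++i)
            a_mul[i] = c * i;

        // Strength-reduced: one addition per iteration,
        // at the cost of accumulated rounding error.
        float t = 0.0f;
        for (int i = 0; i < N; ++i) {
            a_add[i] = t;
            t += c;
        }

        printf("i=%d: mul=%.6f add=%.6f\n", N - 1, a_mul[N-1], a_add[N-1]);
        return 0;
    }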

23

Approximate Strength Reduction (ASR)

Before (per pixel: 1 sqrt, 1 cos, 1 sin):

    for each pixel (i, j) at x:
        R = sqrt((x_i - p_i)^2 + (x_j - p_j)^2)
        bin = (R - R0) * idr
        arg = (cos(kR), sin(kR))

After (per pixel: 13 multiplications, 10 additions):

    pre-compute A, B, C, Φ, Ψ, Γ
    for each pixel (i, j) at x:
        bin = A[j] + B[i] + j*C[i]
        arg = Φ[j] * Ψ[i] * γ[i]
        γ[i] *= Γ[i]

10 additions

24

Approximate Strength Reduction (ASR)

The same transformation, annotated with precision: the original loop needs a DP sqrt plus cos and sin with DP argument reduction, while the strength-reduced loop needs only 13 SP multiplications and 10 SP additions.

25

Approximate Strength Reduction (ASR)

The same loops once more, with the key observation: every operation in the strength-reduced version is vectorizable.
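The γ[i] *= Γ[i] recurrence is the heart of the trick: a complex exponential advanced by a fixed rotation replaces per-pixel sin/cos calls. A minimal self-contained sketch (my illustration; k, d, and N are made-up values):

    #include <complex.h>
    #include <stdio.h>

    #define N 1024

    int main(void) {
        const float k = 3.0f, d = 0.01f;
        float complex arg[N];

        // gamma starts at e^{i*0}; Gamma is the fixed per-step rotation
        // e^{i*k*d}, computed with one cos/sin pair outside the loop.
        float complex gamma = 1.0f;
        const float complex Gamma = cexpf(I * k * d);

        for (int n = 0; n < N; ++n) {
            arg[n] = gamma;   // equals (cos(k*n*d), sin(k*n*d)) up to rounding
            gamma *= Gamma;   // one complex multiply instead of sin + cos
        }

        printf("arg[100] = %f + %fi\n", crealf(arg[100]), cimagf(arg[100]));
        return 0;
    }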

26

ASR Mathematics – Square Root

• Approximate sqrt by its 2nd-order Taylor series

• Apply conventional strength reduction:
  – Pre-compute constants (A, B, C, Φ, Ψ, Γ, …)
  – Incrementally compute, e.g., γ[i] *= Γ[i]
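Reconstructing the omitted math (my hedged version, since the slide's formulas did not survive the transcript): with pixels on a regular grid, $R^2$ is a quadratic polynomial in the pixel indices, so expanding the square root to second order around a reference distance $R_0$ gives

\[
R = \sqrt{R_0^2 + \delta} \;\approx\; R_0 + \frac{\delta}{2R_0} - \frac{\delta^2}{8R_0^3},
\qquad \delta = R^2 - R_0^2 ,
\]

which is a low-degree polynomial in $i$ and $j$. A polynomial along a row needs only additions once its running differences are initialized; for a quadratic $f(i) = \alpha + \beta i + \gamma i^2$,

\[
f(i+1) = f(i) + d(i), \qquad d(i+1) = d(i) + 2\gamma, \qquad d(0) = \beta + \gamma .
\]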

27

ASR in SAR

2-4x speedups by optimizing sqrt, sin, and cos.

Intel® Xeon Phi™ results: evaluation card only and not necessarily reflective of production card specifications. Xeon: Intel® Xeon® E5-2670, 2-socket, 2.6GHz

28

ASR Mathematics – Accuracy

• Approximate sqrt by its 2nd-order Taylor series

• Error increases as i or j gets larger → apply ASR per block (bounding i and j)

• Mixed precision: pre-compute constants in DP; run the main compute in SP
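Why blocking bounds the error (an added step of mine, not on the slide): the Lagrange remainder of the second-order expansion of $\sqrt{R_0^2+\delta}$ grows like $\delta^3$,

\[
\left| \sqrt{R_0^2+\delta} - \left( R_0 + \frac{\delta}{2R_0} - \frac{\delta^2}{8R_0^3} \right) \right|
\le \frac{|\delta|^3}{16\,\xi^{5/2}},
\qquad \xi \text{ between } R_0^2 \text{ and } R_0^2+\delta ,
\]

and $\delta$ is bounded by the block extent, so smaller blocks trade more pre-computation for less approximation error.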

29

Accuracy and Performance Trade-offs

20 dB higher = 0.5x error. 64x64 blocks: ~2x speedup with similar accuracy.

Measured on Intel® Xeon® E5-2670, 2-socket, 2.6GHz

30

ASR in Ultrasound Beamforming

16x64 blocks: ~2x speedup with similar accuracy. Collaboration with GE Healthcare.

Measured on Intel® Xeon® E5-2670, 2-socket, 2.6GHz

31

Accuracy-Performance Trade-offs in SOI FFT

Tang, Park, Kim, and Petrov, SC’12

32

Conclusion

Xeon Phi
– A many-core wide-vector coprocessor with software portability

Algorithmic optimizations
– Low-communication algorithms: SOI FFT
– Approximation: approximate strength reduction

Future work
– Generalization and tool support

33

Acknowledgements

Peter Tang

Intel Parallel Computing Lab: Daehyun Kim, Mikhail Smelyanskiy, Ganesh Bikshandi, Karthikeyan Vaidyanathan, and Pradeep Dubey

Intel MKL Team: Vladimir Petrov and Robert Hanek

Georgia Tech Research Institute: Thomas Benson, Daniel Campbell

Reservoir Lab: Nicolas Vasilache and Richard Lethin

DARPA UHPC project*

* This research was, in part, funded by the U.S. Government under contract number HR0011-10-3-0007. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the U.S. Government.

34

Q&A
