HPC-Lab - TUM
TRANSCRIPT
Session 1: Vectorization
M. Bader, A. Breuer
Department of Informatics V
Organization
• Lab course
• Teams: 3 students each
• 4 assignments + project phase
• Presentation of your work during big meetings (no exams)
• Mondays, 4 PM - 6 PM
• Grading
• Assignments: 10 points each, project phase: 20 points
• Presentations during big meetings
1. Vectorization
2. OpenMP
3. Xeon Phi coprocessor
4. MPI
5. Project
Big Meetings
• Attendance is obligatory
• Each group has to prepare a presentation
Date      Schedule
10/13/14  Kickoff
10/27/14  Presentation 1
11/10/14  Presentation 2
12/01/14  Presentation 3
12/15/14  Presentation 4
01/12/15  Report: Project phase
01/19/15  Presentation Project
Save The Date
• Q&A sessions: attendance is voluntary
• Invited presentations: announced more than 7 days in advance
Date Schedule
10/20/14 Q&A
11/03/14 Q&A
12/08/14 Q&A
12/22/14 Q&A
What’s “HPC”?
• Understand the physics!
• Understand the numerics!
• Understand the architectures!
• Get the solution fast (efficiently)!
Levels Of Parallelism
• Classify parallel applications according to granularity of parallelism.
Different kinds of hardware support different levels of parallelism.
• Intra-instruction level:
• Vector-computer
• SSE, AVX, MIC
• Cell BE, GPU
• …
• Instruction level
• Super-scalar architecture
• Pipelining…
• Thread level
• Shared-Memory-Machine
• Multi- and manycore processors
Sandy Bridge execution units: http://www.realworldtech.com/haswell-cpu/4/
Sandy Bridge CPU: http://www.bit-tech.net/hardware/cpus/2011/01/03/intel-sandy-bridge-review/1
Levels Of Parallelism
• Process level:
• Distributed-Memory-Machine
• Supercomputer
• Application level:
• Workstation-Cluster or -Pool
• Grid/Cloud-Computing
• Supercomputer (multi-user scheduling)
SuperMUC: https://www.lrz.de/presse/fotos/
Stampede
Von-Neumann-Principle
Pipelining
• Instruction pipelining: overlap independent instructions
• Increases instruction throughput
• A pipeline with k stages retires, after an initial fill time, one instruction per cycle starting with cycle k!
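To make the throughput claim concrete (a small worked formula, not on the original slide): the first instruction leaves a k-stage pipeline after k cycles, and each subsequent instruction one cycle later, so n instructions take

\[ T_{\text{pipe}}(n) = k + (n - 1) \ \text{cycles} \quad \text{instead of} \quad T_{\text{serial}}(n) = k \cdot n, \]

for a speedup of \( \frac{k\,n}{k + n - 1} \), which approaches k for large n.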
Memory Hierarchy
http://blog.teachbook.com.au/index.php/2012/02/memory-hierarchy/
Cache
• Transparent buffer memory
• Exploits locality:
• Temporal locality: if we access address x, we will probably access address x again in the near future.
• Spatial locality: if we access address x, we will probably also access addresses x + δ or x − δ.
http://blog.teachbook.com.au/index.php/2012/02/memory-hierarchy/
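To see spatial locality in practice, here is a minimal sketch (not from the original slides; the function name is illustrative): traversing a row-major matrix in row order touches consecutive addresses, so every element of a fetched cache line is used before the line is evicted.

#include <vector>
#include <cstddef>

// Sum a row-major n x n matrix. The row-order loop walks consecutive
// addresses (spatial locality: every element of a fetched cache line is
// used). Swapping the two loops would stride by n*sizeof(double) bytes
// per access and, for large n, miss the cache on almost every access.
double sum_row_major(const std::vector<double>& m, std::size_t n) {
  double s = 0.0;
  for (std::size_t i = 0; i < n; ++i)    // rows
    for (std::size_t j = 0; j < n; ++j)  // consecutive elements of row i
      s += m[i * n + j];
  return s;
}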
Sandy Bridge
INTEL® 64 AND IA-32 PROCESSOR ARCHITECTURES

2.2.5.1 Load and Store Operation Overview

This section provides an overview of the load and store operations.

Loads

When an instruction reads data from a memory location that has write-back (WB) type, the processor looks for it in the caches and memory. Table 2-10 shows the access lookup order and best case latency. The actual latency can vary depending on the cache queue occupancy, LLC ring occupancy, memory components, and their parameters.

The LLC is inclusive of all cache levels above it - data contained in the core caches must also reside in the LLC. Each cache line in the LLC holds an indication of the cores that may have this line in their L2 and L1 caches. If there is an indication in the LLC that other cores may hold the line of interest and its state might have to modify, there is a lookup into the L1 DCache and L2 of these cores too. The lookup is called "clean" if it does not require fetching data from the other core caches. The lookup is called "dirty" if modified data has to be fetched from the other core caches and transferred to the loading core.

The latencies shown above are the best-case scenarios. Sometimes a modified cache line has to be evicted to make space for a new cache line. The modified cache line is evicted in parallel to bringing the new data and does not require additional latency. However, when data is written back to memory, the eviction uses cache bandwidth and possibly memory bandwidth as well. Therefore, when multiple cache misses require the eviction of modified lines within a short time, there is an overall degradation in cache response time. Memory access latencies vary based on occupancy of the memory controller queues, DRAM configuration, DDR parameters, and DDR paging behavior (if the requested page is a page-hit, page-miss or page-empty).

Stores

When an instruction writes data to a memory location that has a write back memory type, the processor first ensures that it has the line containing this memory location in its L1 DCache, in Exclusive or Modified MESI state. If the cache line is not there, in the right state, the processor fetches it from the next levels

Table 2-9. Cache Parameters
Level             | Capacity                   | Associativity (ways)   | Line Size (bytes) | Write Update Policy | Inclusive
L1 Data           | 32 KB                      | 8                      | 64                | Writeback           | -
L1 Instruction    | 32 KB                      | 8                      | N/A               | N/A                 | -
L2 (Unified)      | 256 KB                     | 8                      | 64                | Writeback           | No
Third Level (LLC) | Varies, query CPUID leaf 4 | Varies with cache size | 64                | Writeback           | Yes

Table 2-10. Lookup Order and Load Latency
Level                                           | Latency (cycles)               | Bandwidth (per core per cycle)
L1 Data                                         | 4 (1)                          | 2 x 16 bytes
L2 (Unified)                                    | 12                             | 1 x 32 bytes
Third Level (LLC)                               | 26-31 (2)                      | 1 x 32 bytes
L2 and L1 DCache in other cores (if applicable) | 43 (clean hit); 60 (dirty hit) |

NOTES:
1. Subject to execution core bypass restriction shown in Table 2-8.
2. Latency of L3 varies with product segment and sku. The values apply to second generation Intel Core processor families.
Intel, Intel® 64 and IA-32 Architectures Optimization Reference Manual
• Understand your architecture: Documentation is out there!
• Example: “Internally, accesses are up to 16 bytes, with 256-bit Intel AVX instructions utilizing two 16-byte accesses. Two load operations and one store operation can be handled each cycle.” ⇒ register blocking
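As a sketch of what register blocking means for a kernel like c += a*b (a hypothetical 2x2 blocking, not taken from the slides): each operand loaded into a register is reused several times, so fewer loads per FLOP have to squeeze through the L1 bandwidth quoted above.

// 2x2 register-blocked inner kernel for c += a*b (row-major, n x n, n even).
// Each element of a and b loaded into a register is used twice, halving
// the loads per multiply-add compared to the naive triple loop.
void matmul_2x2_blocked(int n, const double* a, const double* b, double* c) {
  for (int i = 0; i < n; i += 2)
    for (int j = 0; j < n; j += 2) {
      double c00 = c[i*n + j],     c01 = c[i*n + j + 1];
      double c10 = c[(i+1)*n + j], c11 = c[(i+1)*n + j + 1];
      for (int k = 0; k < n; ++k) {
        const double a0 = a[i*n + k], a1 = a[(i+1)*n + k]; // reused twice each
        const double b0 = b[k*n + j], b1 = b[k*n + j + 1]; // reused twice each
        c00 += a0 * b0; c01 += a0 * b1;
        c10 += a1 * b0; c11 += a1 * b1;
      }
      c[i*n + j] = c00;     c[i*n + j + 1] = c01;
      c[(i+1)*n + j] = c10; c[(i+1)*n + j + 1] = c11;
    }
}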
Cache Blocking
• Example: matrix-matrix multiplication (see the loop nest below)
• Register blocking: the L1-cache bandwidth is too small for moving 3 operands per cycle (e.g. Intel Sandy Bridge: 48 bytes per cycle, but we would need 96 bytes!)
• Cache blocking: main memory is orders of magnitude slower than the caches (ca. 200 cycles vs. 10 cycles access latency)
• TLB blocking: change the data structure to avoid DTLB misses; use huge pages
for (i = 0; i < n; i++)
  for (j = 0; j < n; j++)
    for (k = 0; k < n; k++)
      c[i][j] += a[i][k] * b[k][j];
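A minimal sketch of cache blocking applied to this loop nest (the block size BS is an assumption to be tuned, e.g. so that three BS x BS blocks fit into the targeted cache level; n is assumed to be a multiple of BS to keep the sketch short):

#include <cstddef>

// Tuning parameter: side length of the square blocks.
const std::size_t BS = 64;

// Square matmul c += a*b, blocked so that the three BS x BS sub-matrices
// currently in use stay cache-resident while they are reused.
void matmul_blocked(std::size_t n, const double* a, const double* b, double* c) {
  for (std::size_t ii = 0; ii < n; ii += BS)
    for (std::size_t jj = 0; jj < n; jj += BS)
      for (std::size_t kk = 0; kk < n; kk += BS)
        // multiply the (ii,kk) block of a with the (kk,jj) block of b
        for (std::size_t i = ii; i < ii + BS; ++i)
          for (std::size_t j = jj; j < jj + BS; ++j)
            for (std::size_t k = kk; k < kk + BS; ++k)
              c[i * n + j] += a[i * n + k] * b[k * n + j];
}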
SIMD
• Parallelism starts at the core level
• Single Instruction Multiple Data (SIMD): FPUs can operate on complete registers per cycle
• Example: one 256-bit SIMD register ⇔ 4 double-precision (64-bit) operands
OPTIMIZING FOR SIMD FLOATING-POINT APPLICATIONS
6.5.1.1 Vertical versus Horizontal Computation
The majority of the floating-point arithmetic instructions in SSE/SSE2 provide greater performance gain on vertical data processing for parallel data elements. This means each element of the destination is the result of an arithmetic operation performed from the source elements in the same vertical position (Figure 6-1).
To supplement these homogeneous arithmetic operations on parallel data elements, SSE and SSE2 provide data movement instructions (e.g., SHUFPS, UNPCKLPS, UNPCKHPS, MOVLHPS, MOVHLPS, etc.) that facilitate moving data elements horizontally.
The organization of structured data has a significant impact on SIMD programming efficiency and performance. This can be illustrated using two common types of data structure organization:
• Array of Structures (AoS): an array of data structures, where each member of the structure is a scalar (Figure 6-2). Typically, a repetitive sequence of computation is applied to each element of the array, i.e. to a data structure. The computational sequence for the scalar members of the structure is likely to be non-homogeneous within each iteration. AoS is generally associated with a horizontal computation model.
• Structure of Arrays (SoA): each member of the data structure is an array, and each element of the array is a scalar (Table 6-1). A repetitive computational sequence is applied to scalar elements, and homogeneous operation can easily be achieved across consecutive iterations within the same structural member. Consequently, SoA is generally amenable to the vertical computation model.
Figure 6-1. Homogeneous Operation on Parallel Data Elements: (X3, X2, X1, X0) OP (Y3, Y2, Y1, Y0) = (X3 OP Y3, X2 OP Y2, X1 OP Y1, X0 OP Y0)
Figure 6-2. Horizontal Computation Model: one operation combines the elements X, Y, Z, W of a single register
Intel, Intel® 64 and IA-32 Architectures Optimization Reference Manual
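As a sketch of the two layouts in C++ (type and function names are illustrative, not from the manual): with SoA, a loop over one component walks a contiguous array, which is exactly the vertical computation model.

#include <vector>
#include <cstddef>

// Array of Structures (AoS): the x-components are strided in memory,
// so a SIMD add over many x values needs gather/shuffle work.
struct PointAoS { float x, y, z, w; };
std::vector<PointAoS> points_aos(1024);

// Structure of Arrays (SoA): each component is contiguous, so
// consecutive iterations load consecutive elements - ideal for
// vertical SIMD computation.
struct PointsSoA {
  std::vector<float> x, y, z, w;
  explicit PointsSoA(std::size_t n) : x(n), y(n), z(n), w(n) {}
};

// Homogeneous (vertical) operation over the SoA layout: the compiler
// can vectorize this loop directly.
void translate_x(PointsSoA& p, float dx) {
  for (std::size_t i = 0; i < p.x.size(); ++i)
    p.x[i] += dx;
}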
Theoretical Peak!
• Definition: #(floating point operations) per second
• Relative to:
• Single/double precision (SP/DP); here, DP = 64-bit floating point values
• As frequency we use the nominal clock rate, neglecting Intel® Turbo Boost Technology
• Warning: the abbreviation ”FLOPs” / ”FLOPS” is used both for ”#(floating point operations)” and for ”#(floating point operations) / second”
• Calculation:
• Calculate the #(FLOPs) per FPU and cycle
• Sum over the FPUs and multiply by the frequency (CPU cycles / second)
Peak through Vectorization
• SNB (SuperMUC, E5-2680):
• 256-bit FMUL and 256-bit FADD → 8 #(FLOPs) / (cycle and core)
• → 2.7 ∗ 10^9 ∗ 8 = 21.6 DP-GFLOPS / core
• → 8 ∗ 21.6 = 172.8 DP-GFLOPS / processor
• Haswell (my notebook, i7-4650U):
• Two 256-bit FMAs → 16 #(FLOPs) / (cycle and core)
• → 1.7 ∗ 10^9 ∗ 16 = 27.2 DP-GFLOPS / core
• → 2 ∗ 27.2 = 54.4 DP-GFLOPS / processor
• Future: Lower frequencies, longer vector units, more cores
SuperMUC: https://www.lrz.de/presse/fotos/
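The same calculation as a small code sketch (the hardware numbers are taken from the bullets above and passed in as plain inputs; nothing is queried from the machine):

#include <cstdio>

// Theoretical peak: FLOPs per cycle per core, summed over the cores,
// times the nominal frequency (Turbo Boost ignored, as above).
double peak_gflops(double ghz, int flops_per_cycle_per_core, int cores) {
  return ghz * flops_per_cycle_per_core * cores;
}

int main() {
  // Sandy Bridge E5-2680: 2.7 GHz, 4 FMUL + 4 FADD = 8 FLOPs/cycle, 8 cores
  std::printf("SNB: %.1f DP-GFLOPS\n", peak_gflops(2.7, 8, 8));  // 172.8
  // Haswell i7-4650U: 1.7 GHz, two 4-wide FMAs = 16 FLOPs/cycle, 2 cores
  std::printf("HSW: %.1f DP-GFLOPS\n", peak_gflops(1.7, 16, 2)); // 54.4
  return 0;
}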
Simple Loop: Scalar
Example: Read After Write
Code
32 // multiply input a and b
33 for( int l_i = 1; l_i < VECTOR_SIZE; l_i++ ) {
34   l_a[l_i] = l_a[l_i-1] * l_b[l_i];
35 }
code: https://github.com/TUM-I5/advanced_programming/tree/master/lectures/auto_vectorization/read_after_write.cpp
Vector-report
1 read_after_write.cpp (28): (col. 3) remark: vectorization support: unroll factor set to 4.
2 read_after_write.cpp (28): (col. 3) remark: LOOP WAS VECTORIZED.
3 read_after_write.cpp (33): (col. 3) remark: loop was not vectorized: existence of vector dependence.
4 read_after_write.cpp (34): (col. 5) remark: vector dependence: assumed FLOW dependence between l_a line 34 and l_a line 34.
5 read_after_write.cpp (39): (col. 15) remark: vectorization support: call to function _ZNSolsEd cannot be vectorized.

output: https://github.com/TUM-I5/advanced_programming/tree/master/lectures/auto_vectorization/vector_reports/intel_read_after_write_vec_report.log
[Figure: arrays a, b and c; the scalar loop processes one element per iteration, at indices 0, 1, 2, …]
Simple Loop: SIMD
Vectorized Execution (2)
// multiply input a and b
for( int l_i = 0; l_i < VECTOR_SIZE; l_i++ ) {
  l_c[l_i] = l_a[l_i] * l_b[l_i];
}
code: https://github.com/TUM-I5/advanced_programming/tree/master/lectures/auto_vectorization/simple_loop.cpp
[Figure: arrays a, b and c; the vectorized loop processes four elements per iteration, at offsets 0, 4, 8, …]
Autovectorization
• Intel compiler pragmas for Intel64 vectorization (example below):
• #pragma vector always ⇒ vectorize in any case (even if the compiler detects inefficiencies)
• #pragma ivdep ⇒ disable some dependency checks during compilation
• #pragma simd ⇒ aggressive vectorization, might result in wrong code
• Further information (SSE & AVX):
• Intel Architecture Optimization Reference Manual
• Intel C++ Compiler User’s Guide
#pragma vector always
#pragma ivdep
for(i=1;i<n;i++) {
  c[i] = s * c[i];
}
Dependency Analysis
• Loop I:
for(i=0;i<n;i++) {
  a[i] = b[i] + c[i];
}

• No dependency across iterations → can be optimally vectorized and pipelined
• Loop II:

for(i=1;i<n-1;i++) {
  c[i] = c[i] * c[i-1];
}
• Recursion! c[i-1] is still being processed when calculating c[i] → no vectorization is possible
Intrinsic Functions
• Suffixes:
• ps, pd → packed single- and double-precision floating point functions
• ss, sd → scalar single- and double-precision floating point functions
• Documentation of all functions in the chapter “Intel C++ Intrinsics Reference” of the Intel C++ Compiler User’s Guide
__m128i; __m128d; // register variables, SSE
__m256i; __m256d; // register variables, AVX

__m128d _mm_<instr>_pd(__m128d a, __m128d b)    // SSE pattern, packed double
__m256d _mm256_<instr>_pd(__m256d a, __m256d b) // AVX pattern, packed double
Example
#include <immintrin.h>

void avx_add(int length, double* a, double* b, double* c) {
  int i;
  __m256d t0, t1;
  // note: _mm256_load_pd/_mm256_store_pd require 32-byte aligned addresses,
  // and length is assumed to be a multiple of 4 here
  for (i = 0; i < length; i += 4) {
    // load four 64-bit double values each
    t0 = _mm256_load_pd(&a[i]);
    t1 = _mm256_load_pd(&b[i]);
    // add four 64-bit double values
    t0 = _mm256_add_pd(t0, t1);
    // store four 64-bit double values
    _mm256_store_pd(&c[i], t0);
  }
}
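A possible caller for avx_add (a sketch, not from the slides): the aligned load/store intrinsics above require 32-byte aligned buffers, which _mm_malloc provides.

#include <immintrin.h>
#include <cstdio>

void avx_add(int length, double* a, double* b, double* c); // defined above

int main() {
  const int n = 1024; // multiple of 4, as avx_add assumes
  // 32-byte alignment matches the 256-bit AVX registers
  double* a = (double*) _mm_malloc(n * sizeof(double), 32);
  double* b = (double*) _mm_malloc(n * sizeof(double), 32);
  double* c = (double*) _mm_malloc(n * sizeof(double), 32);
  for (int i = 0; i < n; i++) { a[i] = 1.0; b[i] = 2.0; }

  avx_add(n, a, b, c);
  std::printf("c[0] = %f\n", c[0]); // prints 3.000000

  _mm_free(a); _mm_free(b); _mm_free(c);
  return 0;
}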
MAC Cluster
• Logins: one login per participant
• Change your password through https://idportal.lrz.de/r/entry.pl: choose "Passwort ändern" (change password) on the left side, type the old password, enter your new one twice, and hit "Passwort ändern" afterwards.
• ssh [email protected] or [email protected], from within the MWN (VPN client or computing hall)
• Read the motd carefully
• Module system: module load, list, unload, info (example below)
• Documentation & policies:
• http://www.mac.tum.de/wiki/index.php/MAC_Cluster
• http://www.lrz.de/services/compute/linux-cluster/intro
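For example, a typical session with the module system might look like this (a sketch: the exact module names vary per cluster, so check module avail first):

module avail          # list the software modules available on the cluster
module load intel     # load a module, e.g. an Intel compiler environment
module list           # show the currently loaded modules
module unload intel   # remove the module again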