HPC-Lab - TUM
TRANSCRIPT
Session 1: Vectorization
M. Bader, A. Breuer
Department of Informatics V
Organization
• Lab course
• Teams: 3 students each
• 4 assignments + project phase
• Presentation of your work during big meetings (no exams)
• Mondays, 4 PM - 6 PM
• Grading
• Assignments: 10 points each, project phase: 20 points
• Presentations during big meetings
1. Vectorization
2. OpenMP
3. Xeon Phi coprocessor
4. MPI
5. Project
Big Meetings
• Attendance is obligatory
• Each group has to prepare a presentation
Date      Schedule
10/13/14  Kickoff
10/27/14  Presentation 1
11/10/14  Presentation 2
12/01/14  Presentation 3
12/15/14  Presentation 4
01/12/15  Report: Project phase
01/19/15  Presentation Project
Save The Date
• Q&A sessions: attendance is voluntary
• Invited presentations: announced more than 7 days in advance
Date Schedule
10/20/14 Q&A
11/03/14 Q&A
12/08/14 Q&A
12/22/14 Q&A
What’s “HPC”?
• Understand the physics!
• Understand the numerics!
• Understand the architectures!
• Get the solution fast (efficiently)!
Levels Of Parallelism
• Classify parallel applications according to granularity of parallelism.
Different kinds of hardware support different levels of parallelism.
• Intra-instruction level:
• Vector-computer
• SSE, AVX, MIC
• Cell BE, GPU
• …
• Instruction level
• Super-scalar architecture
• Pipelining…
• Thread level
• Shared-Memory-Machine
• Multi- and manycore processors
Sandy Bridge execution units: http://www.realworldtech.com/haswell-cpu/4/
Sandy Bridge CPU: http://www.bit-tech.net/hardware/cpus/2011/01/03/intel-sandy-bridge-review/1
Levels Of Parallelism
• Process level:
• Distributed-Memory-Machine
• Supercomputer
• Application level:
• Workstation-Cluster or -Pool
• Grid/Cloud-Computing
• Supercomputer (multi-user scheduling)
SuperMUC: https://www.lrz.de/presse/fotos/
Stampede
Von-Neumann-Principle
Pipelining
• Instruction pipelining: overlap independent instructions
• Increases instruction throughput
• A pipeline with k stages retires, after an initial fill time, one instruction per cycle starting with cycle k!
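To make the throughput claim concrete (a small worked formula, not on the original slide): the first instruction leaves a k-stage pipeline after k cycles, and each subsequent instruction one cycle later, so n instructions take

\[ T_{\text{pipe}}(n) = k + (n - 1) \ \text{cycles} \quad \text{instead of} \quad T_{\text{serial}}(n) = k \cdot n, \]

for a speedup of \( \frac{k\,n}{k + n - 1} \), which approaches k for large n.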
Memory Hierarchy
http://blog.teachbook.com.au/index.php/2012/02/memory-hierarchy/
Cache
• Transparent buffer memory
• Exploits locality:
• Temporal locality: if we access address x, we will probably access address x again in the near future.
• Spatial locality: if we access address x, we will probably also access addresses x + δ or x − δ.
http://blog.teachbook.com.au/index.php/2012/02/memory-hierarchy/
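To see spatial locality in practice, here is a minimal sketch (not from the original slides; the function name is illustrative): traversing a row-major matrix in row order touches consecutive addresses, so every element of a fetched cache line is used before the line is evicted.

#include <vector>
#include <cstddef>

// Sum a row-major n x n matrix. The row-order loop walks consecutive
// addresses (spatial locality: every element of a fetched cache line is
// used). Swapping the two loops would stride by n*sizeof(double) bytes
// per access and, for large n, miss the cache on almost every access.
double sum_row_major(const std::vector<double>& m, std::size_t n) {
  double s = 0.0;
  for (std::size_t i = 0; i < n; ++i)    // rows
    for (std::size_t j = 0; j < n; ++j)  // consecutive elements of row i
      s += m[i * n + j];
  return s;
}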
Sandy Bridge
INTEL® 64 AND IA-32 PROCESSOR ARCHITECTURES

2.2.5.1 Load and Store Operation Overview

This section provides an overview of the load and store operations.

Loads

When an instruction reads data from a memory location that has write-back (WB) type, the processor looks for it in the caches and memory. Table 2-10 shows the access lookup order and best case latency. The actual latency can vary depending on the cache queue occupancy, LLC ring occupancy, memory components, and their parameters.

The LLC is inclusive of all cache levels above it - data contained in the core caches must also reside in the LLC. Each cache line in the LLC holds an indication of the cores that may have this line in their L2 and L1 caches. If there is an indication in the LLC that other cores may hold the line of interest and its state might have to modify, there is a lookup into the L1 DCache and L2 of these cores too. The lookup is called "clean" if it does not require fetching data from the other core caches. The lookup is called "dirty" if modified data has to be fetched from the other core caches and transferred to the loading core.

The latencies shown above are the best-case scenarios. Sometimes a modified cache line has to be evicted to make space for a new cache line. The modified cache line is evicted in parallel to bringing the new data and does not require additional latency. However, when data is written back to memory, the eviction uses cache bandwidth and possibly memory bandwidth as well. Therefore, when multiple cache misses require the eviction of modified lines within a short time, there is an overall degradation in cache response time. Memory access latencies vary based on occupancy of the memory controller queues, DRAM configuration, DDR parameters, and DDR paging behavior (if the requested page is a page-hit, page-miss or page-empty).

Stores

When an instruction writes data to a memory location that has a write back memory type, the processor first ensures that it has the line containing this memory location in its L1 DCache, in Exclusive or Modified MESI state. If the cache line is not there, in the right state, the processor fetches it from the next levels

Table 2-9. Cache Parameters
Level             | Capacity                   | Associativity (ways)   | Line Size (bytes) | Write Update Policy | Inclusive
L1 Data           | 32 KB                      | 8                      | 64                | Writeback           | -
L1 Instruction    | 32 KB                      | 8                      | N/A               | N/A                 | -
L2 (Unified)      | 256 KB                     | 8                      | 64                | Writeback           | No
Third Level (LLC) | Varies, query CPUID leaf 4 | Varies with cache size | 64                | Writeback           | Yes

Table 2-10. Lookup Order and Load Latency
Level                                           | Latency (cycles)               | Bandwidth (per core per cycle)
L1 Data                                         | 4 (1)                          | 2 x 16 bytes
L2 (Unified)                                    | 12                             | 1 x 32 bytes
Third Level (LLC)                               | 26-31 (2)                      | 1 x 32 bytes
L2 and L1 DCache in other cores (if applicable) | 43 (clean hit); 60 (dirty hit) |

NOTES:
1. Subject to execution core bypass restriction shown in Table 2-8.
2. Latency of L3 varies with product segment and sku. The values apply to second generation Intel Core processor families.
Intel, Intel® 64 and IA-32 Architectures Optimization Reference Manual
• Understand your architecture: Documentation is out there!
• Example: “Internally, accesses are up to 16 bytes, with 256-bit Intel AVX instructions utilizing two 16-byte accesses. Two load operations and one store operation can be handled each cycle.” ⇒ register blocking
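As a sketch of what register blocking means for a kernel like c += a*b (a hypothetical 2x2 blocking, not taken from the slides): each operand loaded into a register is reused several times, so fewer loads per FLOP have to squeeze through the L1 bandwidth quoted above.

// 2x2 register-blocked inner kernel for c += a*b (row-major, n x n, n even).
// Each element of a and b loaded into a register is used twice, halving
// the loads per multiply-add compared to the naive triple loop.
void matmul_2x2_blocked(int n, const double* a, const double* b, double* c) {
  for (int i = 0; i < n; i += 2)
    for (int j = 0; j < n; j += 2) {
      double c00 = c[i*n + j],     c01 = c[i*n + j + 1];
      double c10 = c[(i+1)*n + j], c11 = c[(i+1)*n + j + 1];
      for (int k = 0; k < n; ++k) {
        const double a0 = a[i*n + k], a1 = a[(i+1)*n + k]; // reused twice each
        const double b0 = b[k*n + j], b1 = b[k*n + j + 1]; // reused twice each
        c00 += a0 * b0; c01 += a0 * b1;
        c10 += a1 * b0; c11 += a1 * b1;
      }
      c[i*n + j] = c00;     c[i*n + j + 1] = c01;
      c[(i+1)*n + j] = c10; c[(i+1)*n + j + 1] = c11;
    }
}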
Cache Blocking
• Example: matrix-matrix multiplication (see the loop nest below)
• Register blocking: the L1-cache bandwidth is too small for moving 3 operands per cycle (e.g. Intel Sandy Bridge: 48 bytes per cycle, but we would need 96 bytes!)
• Cache blocking: main memory is orders of magnitude slower than the caches (ca. 200 cycles vs. 10 cycles access latency)
• TLB blocking: change the data structure to avoid DTLB misses; use huge pages
for (i = 0; i < n; i++)
  for (j = 0; j < n; j++)
    for (k = 0; k < n; k++)
      c[i][j] += a[i][k] * b[k][j];
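A minimal sketch of cache blocking applied to this loop nest (the block size BS is an assumption to be tuned, e.g. so that three BS x BS blocks fit into the targeted cache level; n is assumed to be a multiple of BS to keep the sketch short):

#include <cstddef>

// Tuning parameter: side length of the square blocks.
const std::size_t BS = 64;

// Square matmul c += a*b, blocked so that the three BS x BS sub-matrices
// currently in use stay cache-resident while they are reused.
void matmul_blocked(std::size_t n, const double* a, const double* b, double* c) {
  for (std::size_t ii = 0; ii < n; ii += BS)
    for (std::size_t jj = 0; jj < n; jj += BS)
      for (std::size_t kk = 0; kk < n; kk += BS)
        // multiply the (ii,kk) block of a with the (kk,jj) block of b
        for (std::size_t i = ii; i < ii + BS; ++i)
          for (std::size_t j = jj; j < jj + BS; ++j)
            for (std::size_t k = kk; k < kk + BS; ++k)
              c[i * n + j] += a[i * n + k] * b[k * n + j];
}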
SIMD
• Parallelism starts at the core level
• Single Instruction Multiple Data (SIMD): FPUs can operate on complete registers per cycle
• Example: one 256-bit SIMD register ⇔ 4 double-precision (64-bit) operands
OPTIMIZING FOR SIMD FLOATING-POINT APPLICATIONS
6.5.1.1 Vertical versus Horizontal Computation
The majority of the floating-point arithmetic instructions in SSE/SSE2 provide greater performance gain on vertical data processing for parallel data elements. This means each element of the destination is the result of an arithmetic operation performed from the source elements in the same vertical position (Figure 6-1).
To supplement these homogeneous arithmetic operations on parallel data elements, SSE and SSE2 provide data movement instructions (e.g., SHUFPS, UNPCKLPS, UNPCKHPS, MOVLHPS, MOVHLPS, etc.) that facilitate moving data elements horizontally.
The organization of structured data has a significant impact on SIMD programming efficiency and performance. This can be illustrated using two common types of data structure organization:
• Array of Structures (AoS): an array of data structures, where each member of the structure is a scalar (Figure 6-2). Typically, a repetitive sequence of computation is applied to each element of the array, i.e. to a data structure. The computational sequence for the scalar members of the structure is likely to be non-homogeneous within each iteration. AoS is generally associated with a horizontal computation model.
• Structure of Arrays (SoA): each member of the data structure is an array, and each element of the array is a scalar (Table 6-1). A repetitive computational sequence is applied to scalar elements, and homogeneous operation can easily be achieved across consecutive iterations within the same structural member. Consequently, SoA is generally amenable to the vertical computation model.
Figure 6-1. Homogeneous Operation on Parallel Data Elements: (X3, X2, X1, X0) OP (Y3, Y2, Y1, Y0) = (X3 OP Y3, X2 OP Y2, X1 OP Y1, X0 OP Y0)
Figure 6-2. Horizontal Computation Model: one operation combines the elements X, Y, Z, W of a single register
Intel, Intel® 64 and IA-32 Architectures Optimization Reference Manual
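As a sketch of the two layouts in C++ (type and function names are illustrative, not from the manual): with SoA, a loop over one component walks a contiguous array, which is exactly the vertical computation model.

#include <vector>
#include <cstddef>

// Array of Structures (AoS): the x-components are strided in memory,
// so a SIMD add over many x values needs gather/shuffle work.
struct PointAoS { float x, y, z, w; };
std::vector<PointAoS> points_aos(1024);

// Structure of Arrays (SoA): each component is contiguous, so
// consecutive iterations load consecutive elements - ideal for
// vertical SIMD computation.
struct PointsSoA {
  std::vector<float> x, y, z, w;
  explicit PointsSoA(std::size_t n) : x(n), y(n), z(n), w(n) {}
};

// Homogeneous (vertical) operation over the SoA layout: the compiler
// can vectorize this loop directly.
void translate_x(PointsSoA& p, float dx) {
  for (std::size_t i = 0; i < p.x.size(); ++i)
    p.x[i] += dx;
}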
Theoretical Peak!
• Definition: #(floating point operations) per second
• Relative to:
• Single/double precision (SP/DP); here, DP = 64-bit floating point values
• As frequency we use the nominal clock rate, neglecting Intel® Turbo Boost Technology
• Warning: the abbreviation ”FLOPs” / ”FLOPS” is used both for ”#(floating point operations)” and for ”#(floating point operations) / second”
• Calculation:
• Calculate the #(FLOPs) per FPU and cycle
• Sum over the FPUs and multiply by the frequency (CPU cycles / second)
Peak through Vectorization
• SNB (SuperMUC, E5-2680):
• 256-bit FMUL and 256-bit FADD → 8 #(FLOPs) / (cycle and core)
• → 2.7 ∗ 10^9 ∗ 8 = 21.6 DP-GFLOPS / core
• → 8 ∗ 21.6 = 172.8 DP-GFLOPS / processor
• Haswell (my notebook, i7-4650U):
• Two 256-bit FMAs → 16 #(FLOPs) / (cycle and core)
• → 1.7 ∗ 10^9 ∗ 16 = 27.2 DP-GFLOPS / core
• → 2 ∗ 27.2 = 54.4 DP-GFLOPS / processor
• Future: Lower frequencies, longer vector units, more cores
SuperMUC: https://www.lrz.de/presse/fotos/
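The same calculation as a small code sketch (the hardware numbers are taken from the bullets above and passed in as plain inputs; nothing is queried from the machine):

#include <cstdio>

// Theoretical peak: FLOPs per cycle per core, summed over the cores,
// times the nominal frequency (Turbo Boost ignored, as above).
double peak_gflops(double ghz, int flops_per_cycle_per_core, int cores) {
  return ghz * flops_per_cycle_per_core * cores;
}

int main() {
  // Sandy Bridge E5-2680: 2.7 GHz, 4 FMUL + 4 FADD = 8 FLOPs/cycle, 8 cores
  std::printf("SNB: %.1f DP-GFLOPS\n", peak_gflops(2.7, 8, 8));  // 172.8
  // Haswell i7-4650U: 1.7 GHz, two 4-wide FMAs = 16 FLOPs/cycle, 2 cores
  std::printf("HSW: %.1f DP-GFLOPS\n", peak_gflops(1.7, 16, 2)); // 54.4
  return 0;
}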
Simple Loop: Scalar
Example: Read After Write
Code
32 // multiply input a and b
33 for( int l_i = 1; l_i < VECTOR_SIZE; l_i++ ) {
34   l_a[l_i] = l_a[l_i-1] * l_b[l_i];
35 }
code: https://github.com/TUM-I5/advanced_programming/tree/master/lectures/auto_vectorization/read_after_write.cpp
Vector-report
1 read_after_write.cpp (28): (col. 3) remark: vectorization support: unroll factor set to 4.
2 read_after_write.cpp (28): (col. 3) remark: LOOP WAS VECTORIZED.
3 read_after_write.cpp (33): (col. 3) remark: loop was not vectorized: existence of vector dependence.
4 read_after_write.cpp (34): (col. 5) remark: vector dependence: assumed FLOW dependence between l_a line 34 and l_a line 34.
5 read_after_write.cpp (39): (col. 15) remark: vectorization support: call to function _ZNSolsEd cannot be vectorized.

output: https://github.com/TUM-I5/advanced_programming/tree/master/lectures/auto_vectorization/vector_reports/intel_read_after_write_vec_report.log
[Figure: arrays a, b and c; the scalar loop processes one element per iteration, at indices 0, 1, 2, …]
Simple Loop: SIMD
Vectorized Execution (2)
// multiply input a and b
for( int l_i = 0; l_i < VECTOR_SIZE; l_i++ ) {
  l_c[l_i] = l_a[l_i] * l_b[l_i];
}
code: https://github.com/TUM-I5/advanced_programming/tree/master/lectures/auto_vectorization/simple_loop.cpp
[Figure: arrays a, b and c; the vectorized loop processes four elements per iteration, at offsets 0, 4, 8, …]
Autovectorization
• Intel compiler pragmas for Intel64 vectorization (example below):
• #pragma vector always ⇒ vectorize in any case (even if the compiler detects inefficiencies)
• #pragma ivdep ⇒ disable some dependency checks during compilation
• #pragma simd ⇒ aggressive vectorization, might result in wrong code
• Further information (SSE & AVX):
• Intel Architecture Optimization Reference Manual
• Intel C++ Compiler User’s Guide
#pragma vector always
#pragma ivdep
for(i=1;i<n;i++) {
  c[i] = s * c[i];
}
Dependency Analysis
• Loop I:
for(i=0;i<n;i++) {
  a[i] = b[i] + c[i];
}

• No dependency across iterations → can be optimally vectorized and pipelined
• Loop II:

for(i=1;i<n-1;i++) {
  c[i] = c[i] * c[i-1];
}
• Recursion! c[i-1] is still being processed when calculating c[i] → no vectorization is possible
Intrinsic Functions
• Suffixes:
• ps, pd → packed single- and double-precision floating point functions
• ss, sd → scalar single- and double-precision floating point functions
• Documentation of all functions in the chapter “Intel C++ Intrinsics Reference” of the Intel C++ Compiler User’s Guide
__m128i; __m128d; // register variables, SSE
__m256i; __m256d; // register variables, AVX

__m128d _mm_<instr>_pd(__m128d a, __m128d b)    // SSE pattern, packed double
__m256d _mm256_<instr>_pd(__m256d a, __m256d b) // AVX pattern, packed double
Example
#include <immintrin.h>

void avx_add(int length, double* a, double* b, double* c) {
  int i;
  __m256d t0, t1;
  // note: _mm256_load_pd/_mm256_store_pd require 32-byte aligned addresses,
  // and length is assumed to be a multiple of 4 here
  for (i = 0; i < length; i += 4) {
    // load four 64-bit double values each
    t0 = _mm256_load_pd(&a[i]);
    t1 = _mm256_load_pd(&b[i]);
    // add four 64-bit double values
    t0 = _mm256_add_pd(t0, t1);
    // store four 64-bit double values
    _mm256_store_pd(&c[i], t0);
  }
}
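A possible caller for avx_add (a sketch, not from the slides): the aligned load/store intrinsics above require 32-byte aligned buffers, which _mm_malloc provides.

#include <immintrin.h>
#include <cstdio>

void avx_add(int length, double* a, double* b, double* c); // defined above

int main() {
  const int n = 1024; // multiple of 4, as avx_add assumes
  // 32-byte alignment matches the 256-bit AVX registers
  double* a = (double*) _mm_malloc(n * sizeof(double), 32);
  double* b = (double*) _mm_malloc(n * sizeof(double), 32);
  double* c = (double*) _mm_malloc(n * sizeof(double), 32);
  for (int i = 0; i < n; i++) { a[i] = 1.0; b[i] = 2.0; }

  avx_add(n, a, b, c);
  std::printf("c[0] = %f\n", c[0]); // prints 3.000000

  _mm_free(a); _mm_free(b); _mm_free(c);
  return 0;
}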
MAC Cluster
• Logins: one login per participant
• Change your password through https://idportal.lrz.de/r/entry.pl: choose "Passwort ändern" (change password) on the left side, type the old password, enter your new one twice, and hit "Passwort ändern" afterwards.
• ssh [email protected] or [email protected], from within the MWN (VPN client or computing hall)
• Read the motd carefully
• Module system: module load, list, unload, info (example below)
• Documentation & policies:
• http://www.mac.tum.de/wiki/index.php/MAC_Cluster
• http://www.lrz.de/services/compute/linux-cluster/intro
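For example, a typical session with the module system might look like this (a sketch: the exact module names vary per cluster, so check module avail first):

module avail          # list the software modules available on the cluster
module load intel     # load a module, e.g. an Intel compiler environment
module list           # show the currently loaded modules
module unload intel   # remove the module again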