Lehrstuhl Informatik 3 - Prof. D. Fey
Lecture: Computer Architectures for Medical Applications
SS 2013, 25.6.-2.7.2013
3. Accelerators
3.1 Overview
Accelerator cards are typically PCIe cards that
supplement a host processor, which they require in order to operate
Today, the most common accelerators include
• GPUs (Graphics Processing Units)
AMD/ATI
Nvidia Fermi, Kepler
• Intel Xeon Phi
Knights Landing
Knights Corner
3. Accelerators
3.1 Accelerators in comparison to a modern CPU
(Comparison table; data as of Summer 2014)
3. Accelerators
3.2.1 Introduction to GPUs
A short history of graphics cards
• Originally: the graphics card was merely intended to drive the monitor
• Mid-80s: graphics cards provide 2D acceleration
• Mid-90s: first attempts at 3D acceleration:
Matrox Mystique, 3dfx Voodoo
Rasterization of polygons
3. Accelerators
3.2.1 Introduction to GPUs
A short history of graphics cards
• Originally, no standardized interface
• Vendor-specific solutions:
3dfx Glide
Matrox Simple Interface
• At the beginning of the 90s:
OpenGL establishes itself for professional use
Microsoft's Direct3D is inferior at first
– but wins market share through frequent improvements
• End of the 90s:
Graphics cards take care of coordinate transformation and lighting
– e.g. NVIDIA GeForce 256 with hardware T&L
The term "Graphics Processing Unit" is coined
3. Accelerators
3.2.1 Introduction to GPUs
2000:
• Initially, only a fixed-function pipeline (FFP) existed
• Shader programs offer more flexibility
Pixel shaders model the appearance of surfaces
Vertex shaders modify vertices (e.g. of a mesh representing some object)
• Initially, shader programs could only contain a linear list of instructions
• 2002: ATI Radeon 9700 can execute loops in shaders
Today:
• Shaders are Turing-complete (they can execute arbitrary programs)
• Manufacturers: AMD (ATI) and Nvidia
• Mass consumer market → low prices
GeForce: cheap consumer-market card, focus on single precision
Tesla: more expensive card intended for scientific computing
3. Accelerators
3.2.2 GPU Architecture: Streaming Multiprocessor
Architecture of a so-called Streaming Multiprocessor (SM)
3. Accelerators
3.2.2 GPU Architecture: Streaming Multiprocessor (Fermi)
Fermi's SM
• 32 CUDA "cores"
Each can perform integer and single-precision floating-point operations
Equivalent to a CPU core with 32-way SIMD
– Only one program counter (PC) per SM!
• 16 load/store units
• 4 Special Function Units (SFUs)
compute sin, cos, tan, exp, etc.
• Large register file
• 64 KB shared memory + L1 cache
configurable as 16 KB + 48 KB or 48 KB + 16 KB
3. Accelerators
3.2.2 GPU Architecture: Fermi
Nvidia Fermi Accelerators contain up
to 16 Streaming Multiprocessors
The accelerator also contains its own memory
3. Accelerators
3.2.2 GPU Architecture: Kepler
Streaming Multiprocessor (SMX)
• 192 SP CUDA cores per SMX: 192 fp32 ops/clock, 192 int32 ops/clock
• 64 DP CUDA cores per SMX: 64 fp64 ops/clock
• 4 warp schedulers: up to 2048 threads resident concurrently
• 32 special-function units
• 64 KB shared memory + L1 cache
• 48 KB read-only data cache
• 64K 32-bit registers
• Up to 15 SMX per chip (K20)
3. Accelerators
3.2.2 GPU Architecture: Summary
High number of compute units working in parallel
Grouped onto SMs, which follow the SIMT principle
Performance penalties for non-uniform branching (see the sketch below)
• Serialization with predication
No branch prediction
Many registers
Own main memory (up to 6 GB)
• Latency: ~400 clock cycles
• Small, fast on-chip shared-memory blocks
Caches / explicitly addressable scratchpad memory (16-48 KB)
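As an illustration of the branching penalty, consider a hypothetical kernel in which neighboring threads of the same warp take different paths; the hardware executes both paths one after the other, with the inactive lanes masked off by predication:

__global__ void divergent(float* data)
{
    int i = threadIdx.x;
    // Even and odd lanes of one warp take different branches:
    // both branch bodies are executed serially, each time with
    // the lanes of the other path predicated (masked) off.
    if (i % 2 == 0)
        data[i] = data[i] * 2.0f;
    else
        data[i] = data[i] + 1.0f;
}

If all 32 threads of a warp take the same branch, no serialization occurs.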
3. Accelerators
3.2.3 GPU Programming
Thousands of threads running in parallel
Threads are grouped into thread blocks, which make up a grid
• See next slide
Each thread block is assigned to a streaming multiprocessor
• Multiple thread blocks can be assigned to the same SM
• 32 threads (on Fermi) of a thread block are called a warp
• A warp is executed in parallel on the 32 (Fermi) CUDA cores
• The warp scheduler can switch inexpensively between different warps of a thread block
Remember: on a CPU, a thread switch is expensive, as it requires saving the thread's state (registers, etc.)
Standards for GPU programming
• CUDA (Nvidia)
• OpenCL (Khronos Group)
3. Accelerators
3.2.3 GPU Programming: Memories
Different memory types of a GPU
• Global memory
"Main memory" of the GPU
– If data from the CPU's main memory is required, it has to be transferred to the device's global memory (the PCIe bus delivers only tens of GB/s vs. hundreds of GB/s for GDDR)
Uses DRAM (GDDR) technology
Typically high latency
Higher bandwidth than CPU main memory
– Higher memory clock
• Shared memory
SRAM memory on the SM
Faster than global memory
Data can only be shared within a thread block (see the sketch below)
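A minimal sketch of how a thread block cooperates through shared memory (the kernel name and the block size of 256 are illustrative; __syncthreads() makes all loads visible to the whole block):

__global__ void smooth(const float* in, float* out)
{
    __shared__ float buf[256];          // one element per thread (block size 256 assumed)
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    buf[threadIdx.x] = in[i];           // stage data in fast on-chip memory
    __syncthreads();                    // wait until the whole block has loaded
    if (threadIdx.x > 0)                // average with the left neighbor's element
        out[i] = 0.5f * (buf[threadIdx.x] + buf[threadIdx.x - 1]);
}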
3. Accelerators
3.2.3 GPU Programming: Memories
• Local memory
A thread's own private memory
If only a small amount of memory is required, it can be allocated in the very fast register file; otherwise the memory is allocated in the much slower global memory
– Accesses to global memory are cached in the global L2 cache and the SM-local L1 cache
• Special memories
Texture memory
Constant memory
Both are read-only memories that have to be written by the CPU before a program is executed on the GPU
3. Accelerators
3.2.3 GPU Programming: Memories
Memory hierarchy
• Registers (fastest)
• Shared memory / L1 cache
Either 16 KB cache and 48 KB shared memory,
or 48 KB cache and 16 KB shared memory
• L2 cache
768 KB
• DRAM (slowest)
1-6 GB
3. Accelerators
3.2.3 GPU Programming: Thread Hierarchy
A thread block is made up of multiple threads
A grid is composed of thread blocks
3. Accelerators
3.2.3 GPU Programming: Thread Hierarchy
Each thread is assigned a unique thread identifier
• Thread IDs can have multiple dimensions (see the sketch below)
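A sketch of how a kernel typically derives a unique global index from these IDs (the kernel name and the width parameter are illustrative):

__global__ void index2d(float* out, int width)
{
    // Global 2D position = block offset within the grid
    //                    + thread offset within the block
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    out[y * width + x] = (float)(y * width + x);
}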
3. Accelerators
3.2.3 GPU Programming: CUDA
Can be programmed in C/C++
Code is separated into host and device sections through the use of qualifiers
• __host__ functions run on the host (CPU)
• __global__ functions (kernels) run on the GPU and are launched from the host
• __device__ functions run on the GPU and are callable only from GPU code
Memory type is also chosen via qualifiers
• Host memory as usual in C/C++
• __device__: global GPU memory
• __shared__: shared memory of an SM
CUDA API for memory operations
• cudaMalloc/cudaFree: memory allocation and deallocation
• cudaMemcpy: memory transfers (CPU-GPU, GPU-GPU, GPU-CPU)
Nvidia's nvcc compiles your CUDA file and produces an executable that can be run on the host
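A small sketch of where the qualifiers go (all function names are illustrative):

__device__ float square(float v)       // runs on the GPU, callable only from GPU code
{
    return v * v;
}

__global__ void squares(float* out)    // kernel: runs on the GPU, launched by the host
{
    out[threadIdx.x] = square((float)threadIdx.x);
}

__host__ void launch(float* d_out)     // runs on the host (the default without qualifier)
{
    squares<<<1, 32>>>(d_out);         // 1 block of 32 threads
}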
3. Accelerators
3.2.3 GPU Programming: CUDA
• Heterogeneous programming model
• Serial code runs on the CPU
• Parallel code runs on the GPU
• Grid and thread-block dimensions can be chosen freely:
dim3 dimBlock0(1, 12);
dim3 dimGrid0(2, 3);
Kernel0<<<dimGrid0, dimBlock0>>>(..., ..., ...);
or, respectively:
dim3 dimBlock1(1, 9);
dim3 dimGrid1(3, 2);
Kernel1<<<dimGrid1, dimBlock1>>>(..., ..., ...);
3. Accelerators
3.2.3 GPU Programming: CUDA
CUDA memory management
• CUDA API calls:
Allocation/deallocation of GPU-RAM
Transfers GPU-RAM -> CPU-RAM and CPU-RAM -> GPU-RAM
• (CUDA) kernels:
Transfers GPU-RAM <-> shared memory (SM)
cudaMalloc(..., ...);
cudaMemcpy(..., ..., cudaMemcpyDeviceToHost);
cudaMemcpy(..., ..., cudaMemcpyHostToDevice);
__shared__ float As[BLOCK_SIZE];
As[row] = GetElement(..., ...);
3. Accelerators
3.2.3 GPU Programming: CUDA
CUDA Example 1: Hello World
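The listing itself is not reproduced in the transcript; a minimal sketch of a CUDA Hello World (device-side printf requires Fermi or newer, e.g. compile with nvcc -arch=sm_20) could look like this:

#include <stdio.h>

__global__ void hello()
{
    printf("Hello World from thread %d of block %d\n",
           threadIdx.x, blockIdx.x);
}

int main()
{
    hello<<<2, 4>>>();           // launch 2 blocks of 4 threads
    cudaDeviceSynchronize();     // wait for the kernel so its output appears
    return 0;
}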
3. Accelerators
3.2.3 GPU Programming: CUDA
CUDA: Vector Addition
// Device code
__global__ void VecAdd(const float* A, const float* B, float* C, int N)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < N)   // guard: the grid may contain more threads than elements
        C[i] = A[i] + B[i];
}
3. Accelerators
3.2.3 GPU Programming: CUDA
CUDA: Vector Addition
// Host code
int main() {
    int N = ...;
    size_t size = N * sizeof(float);
    // Allocate input vectors h_A, h_B and the result vector h_C in host memory
    float* h_A = (float*) malloc(size);
    float* h_B = (float*) malloc(size);
    float* h_C = (float*) malloc(size);
    // Enter values in h_A and h_B
    ...
    // Allocate vectors in device memory
    float *d_A, *d_B, *d_C;   // note: each pointer needs its own *
    cudaMalloc((void**)&d_A, size);
    cudaMalloc((void**)&d_B, size);
    cudaMalloc((void**)&d_C, size);
    // Copy vectors from host memory to device memory
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);
3. Accelerators
3.2.3 GPU Programming: CUDA
CUDA: Vector Addition
    // Invoke kernel: round the number of blocks up so all N elements are covered
    int threadsPerBlock = 256;
    int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
    VecAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);
    // Copy the result from device memory to host memory;
    // h_C then contains the result in host memory
    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
    // Output h_C
    ...
    // Free device and host memory
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    free(h_A); free(h_B); free(h_C);
}
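Assuming the device and host code above are placed in a single file (the file name vecadd.cu is illustrative), it can be compiled and run with:

nvcc vecadd.cu -o vecadd
./vecadd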
3. Accelerators
3.2.4 GPU Summary
GPUs have a lot of simple "CUDA cores"/shaders
• Not equivalent to CPU cores!
They are programmed via function offloading
A high number of threads is required
• to efficiently use the highly parallel hardware and hide memory latency
Threads are grouped into thread blocks, thread blocks are grouped into a grid
On-chip shared memory can be used as a fast buffer
Transferring data between CPU and GPU memory happens via API calls
4. Design of an X-ray control HW/SW system
A more fine-grained presentation of the data path
A more fine-grained presentation
Specific Design Language (SDL) to SystemC mapping steps
3. Accelerators
3.4 Intel's Xeon Phi
Originally developed in 2006 as a GPU to compete with Nvidia's Tesla series
Rebranded as a coprocessor for numerical applications in 2010
Official launch in 2013
Main advantage vs. GPUs:
• No need to rewrite code (e.g. in CUDA/OpenCL); just recompile existing C, C++, or Fortran code (see the sketch below)
• "Standard" CPU programming and optimizations apply
Multicore parallelization (OpenMP), vectorization, …
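As a sketch of this "recompile only" claim: the following is plain C with OpenMP, with nothing Xeon-Phi-specific in the source. With Intel's classic toolchain it could be built natively for the card with something like icc -openmp -mmic saxpy.c (flags and file name are illustrative):

#include <stdio.h>
#include <omp.h>

#define N 1000000
static float x[N], y[N];

int main()
{
    const float a = 2.0f;
    for (int i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 2.0f; }

    // Multicore parallelization across the ~60 cores (x 4 SMT threads);
    // the loop body is also a candidate for auto-vectorization
    // onto the 512-bit VPU.
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        y[i] = a * x[i] + y[i];

    printf("y[0] = %f (max threads: %d)\n", y[0], omp_get_max_threads());
    return 0;
}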
3. Accelerators
3.4.1 Intel's Xeon Phi – Overview
• PCIe accelerator card
• 60 cores with private L1 and L2 caches
• Cache coherence using a distributed tag directory
• High-bandwidth ring interconnect
• 8 GDDR5 memory controllers (5 GHz)
3. Accelerators
3.4.2 Intel's Xeon Phi – Core Design
• Based on the original in-order Pentium (P54C) design from 1995
• Clock frequency: 1.05 GHz
• Supports 4-way SMT
– Requires 2 threads to achieve peak performance due to the pipelined decoder
• 32 KiB L1D/L1I cache: read or write 512 bit/cycle
• 512 KiB L2 cache: read or write 256 bit/cycle
• Superscalar design
– V-pipe (x64 compatibility)
– U-pipe (connects to the VPU)
• Vector Processing Unit (VPU)
– 32x 512-bit vector registers
– Supports fused multiply-add
– Vector scatter/gather
3. Accelerators
3.4.3 Intel's Xeon Phi – 512-bit VPU
Regular CPU:
• SSE (128 bit)
• AVX/AVX2 (256 bit)
Xeon Phi:
• IMCI (512 bit)
3. Accelerators
3.4.3 Intel's Xeon Phi – 512-bit VPU
Includes an Extended Math Unit for special operations
• sin, cos, rcp, etc. (cf. the GPUs' SFUs)
Supports fused multiply-add (FMA)
Mask registers for conditional execution (see the sketch below)
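A sketch of the kind of loop that mask registers enable (plain C; on the Xeon Phi the compiler can turn the if into a vector compare producing a mask, followed by a masked store, so all 16 float lanes proceed without a branch):

void threshold(float* y, const float* x, int n)
{
    for (int i = 0; i < n; i++)
        if (x[i] > 0.0f)        // becomes a per-lane mask, not a branch
            y[i] = x[i] * x[i]; // store suppressed where the mask bit is 0
}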
3. Accelerators
3.4.4 Intel's Xeon Phi – Summary
Use small, simple in-order cores to fit many cores onto the chip
Extend the simple cores with a powerful 512-bit vector processing unit (VPU) to achieve high performance (Flop/s)
A high-bandwidth ring interconnect connects the cores
Fast on-device GDDR5 memory
No CUDA or other code rewriting necessary
• Just recompile existing code
If the code was already optimized for a modern CPU (multicore, vectorization), it should automatically deliver good performance on the Xeon Phi as well