Lehrstuhl Informatik 3 - Prof. D. Fey
Lecture: Computer Architectures for Medical Applications
SS 2013, 25.6.-2.7.2013
3. Accelerators
3.1 Overview
Accelerator cards are typically PCIe cards that
supplement a host processor, which they require in order to operate
Today, the most common accelerators include
• GPUs (Graphics Processing Units)
AMD/ATI
Nvidia Fermi, Kepler
• Intel Xeon Phi
Knights Landing
Knights Corner
3. Accelerators
3.1 Accelerators in comparison to a modern CPU
(Comparison table; data as of Summer 2014)
3. Accelerators
3.2.1 Introduction to GPUs
A short history of graphics cards
• Originally: the graphics card was merely intended to drive the monitor
• Mid-80s: graphics cards provide 2D acceleration
• Mid-90s: first attempts at 3D acceleration:
Matrox Mystique, 3dfx Voodoo
Rasterization of polygons
3. Accelerators
3.2.1 Introduction to GPUs
A short history of graphics cards
• Originally, no standardized interface
• Vendor-specific solutions:
3dfx Glide
Matrox Simple Interface
• At the beginning of the 90s:
OpenGL establishes itself for professional use
Microsoft's Direct3D is inferior at first
– but wins market share through frequent improvements
• End of the 90s:
Graphics cards take care of coordinate transformation and lighting
– e.g. NVIDIA GeForce 256 with hardware T&L
The term "Graphics Processing Unit" is coined
3. Accelerators
3.2.1 Introduction to GPUs
2000:
• Initially, only a fixed-function pipeline (FFP) existed
• Shader programs offer more flexibility
Pixel shaders model the appearance of surfaces
Vertex shaders modify vertices (e.g. of a mesh representing some object)
• Initially, shader programs could only contain a linear list of instructions
• 2002: ATI Radeon 9700 can execute loops in shaders
Today:
• Shaders are Turing-complete (they can execute arbitrary programs)
• Manufacturers: AMD (ATI) and Nvidia
• Mass consumer market → low prices
GeForce: cheap consumer-market card, focus on single precision
Tesla: more expensive card intended for scientific computing
3. Accelerators
3.2.2 GPU Architecture: Streaming Multiprocessor
Architecture of a so-called Streaming Multiprocessor (SM)
3. Accelerators
3.2.2 GPU Architecture: Streaming Multiprocessor (Fermi)
Fermi's SM
• 32 CUDA "cores"
Each can perform integer and single-precision floating-point operations
Equivalent to a CPU core with 32-way SIMD
– Only one program counter (PC) per SM!
• 16 load/store units
• 4 Special Function Units (SFUs)
compute sin, cos, tan, exp, etc.
• Large register file
• 64 KB shared memory + L1 cache
configurable as 16 KB + 48 KB or 48 KB + 16 KB
3. Accelerators
3.2.2 GPU Architecture: Fermi
Nvidia Fermi Accelerators contain up
to 16 Streaming Multiprocessors
The accelerator also contains its own memory
3. Accelerators
3.2.2 GPU Architecture: Kepler
Streaming Multiprocessor (SMX)
• 192 SP CUDA cores per SMX: 192 fp32 ops/clock, 192 int32 ops/clock
• 64 DP CUDA cores per SMX: 64 fp64 ops/clock
• 4 warp schedulers: up to 2048 threads resident concurrently
• 32 special-function units
• 64 KB shared memory + L1 cache
• 48 KB read-only data cache
• 64K 32-bit registers
• Up to 15 SMX per chip (K20)
3. Accelerators
3.2.2 GPU Architecture: Summary
High number of compute units working in parallel
Grouped onto SMs, which follow the SIMT principle
Performance penalties for non-uniform branching (see the sketch below)
• Serialization with predication
No branch prediction
Many registers
Own main memory (up to 6 GB)
• Latency: ~400 clock cycles
• Small, fast on-chip shared-memory blocks
Caches / explicitly addressable scratchpad memory (16-48 KB)
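As an illustration of the branching penalty, consider a hypothetical kernel in which neighboring threads of the same warp take different paths; the hardware executes both paths one after the other, with the inactive lanes masked off by predication:

__global__ void divergent(float* data)
{
    int i = threadIdx.x;
    // Even and odd lanes of one warp take different branches:
    // both branch bodies are executed serially, each time with
    // the lanes of the other path predicated (masked) off.
    if (i % 2 == 0)
        data[i] = data[i] * 2.0f;
    else
        data[i] = data[i] + 1.0f;
}

If all 32 threads of a warp take the same branch, no serialization occurs.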
3. Accelerators
3.2.3 GPU Programming
Thousands of threads running in parallel
Threads are grouped into thread blocks, which make up a grid
• See next slide
Each thread block is assigned to a streaming multiprocessor
• Multiple thread blocks can be assigned to the same SM
• 32 threads (on Fermi) of a thread block are called a warp
• A warp is executed in parallel on the 32 (Fermi) CUDA cores
• The warp scheduler can switch inexpensively between different warps of a thread block
Remember: on a CPU, a thread switch is expensive, as it requires saving the thread's state (registers, etc.)
Standards for GPU programming
• CUDA (Nvidia)
• OpenCL (Khronos Group)
3. Accelerators
3.2.3 GPU Programming: Memories
Different memory types of a GPU
• Global memory
"Main memory" of the GPU
– If data from the CPU's main memory is required, it has to be transferred to the device's global memory (the PCIe bus delivers only tens of GB/s vs. hundreds of GB/s for GDDR)
Uses DRAM (GDDR) technology
Typically high latency
Higher bandwidth than CPU main memory
– Higher memory clock
• Shared memory
SRAM memory on the SM
Faster than global memory
Data can only be shared within a thread block (see the sketch below)
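A minimal sketch of how a thread block cooperates through shared memory (the kernel name and the block size of 256 are illustrative; __syncthreads() makes all loads visible to the whole block):

__global__ void smooth(const float* in, float* out)
{
    __shared__ float buf[256];          // one element per thread (block size 256 assumed)
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    buf[threadIdx.x] = in[i];           // stage data in fast on-chip memory
    __syncthreads();                    // wait until the whole block has loaded
    if (threadIdx.x > 0)                // average with the left neighbor's element
        out[i] = 0.5f * (buf[threadIdx.x] + buf[threadIdx.x - 1]);
}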
3. Accelerators
3.2.3 GPU Programming: Memories
• Local memory
A thread's own private memory
If only a small amount of memory is required, it can be allocated in the very fast register file; otherwise the memory is allocated in the much slower global memory
– Accesses to global memory are cached in the global L2 cache and the SM-local L1 cache
• Special memories
Texture memory
Constant memory
Both are read-only memories that have to be written by the CPU before a program is executed on the GPU
3. Accelerators
3.2.3 GPU Programming: Memories
Memory hierarchy
• Registers (fastest)
• Shared memory / L1 cache
Either 16 KB cache and 48 KB shared memory,
or 48 KB cache and 16 KB shared memory
• L2 cache
768 KB
• DRAM (slowest)
1-6 GB
3. Accelerators
3.2.3 GPU Programming: Thread Hierarchy
A thread block is made up of multiple threads
A grid is composed of thread blocks
3. Accelerators
3.2.3 GPU Programming: Thread Hierarchy
Each thread is assigned a unique thread identifier
• Thread IDs can have multiple dimensions (see the sketch below)
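A sketch of how a kernel typically derives a unique global index from these IDs (the kernel name and the width parameter are illustrative):

__global__ void index2d(float* out, int width)
{
    // Global 2D position = block offset within the grid
    //                    + thread offset within the block
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    out[y * width + x] = (float)(y * width + x);
}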
3. Accelerators
3.2.3 GPU Programming: CUDA
Can be programmed in C/C++
Code is separated into host and device sections through the use of qualifiers
• __host__ functions run on the host (CPU)
• __global__ functions (kernels) run on the GPU and are launched from the host
• __device__ functions run on the GPU and are callable only from GPU code
Memory type is also chosen via qualifiers
• Host memory as usual in C/C++
• __device__: global GPU memory
• __shared__: shared memory of an SM
CUDA API for memory operations
• cudaMalloc/cudaFree: memory allocation and deallocation
• cudaMemcpy: memory transfers (CPU-GPU, GPU-GPU, GPU-CPU)
Nvidia's nvcc compiles your CUDA file and produces an executable that can be run on the host
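A small sketch of where the qualifiers go (all function names are illustrative):

__device__ float square(float v)       // runs on the GPU, callable only from GPU code
{
    return v * v;
}

__global__ void squares(float* out)    // kernel: runs on the GPU, launched by the host
{
    out[threadIdx.x] = square((float)threadIdx.x);
}

__host__ void launch(float* d_out)     // runs on the host (the default without qualifier)
{
    squares<<<1, 32>>>(d_out);         // 1 block of 32 threads
}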
3. Accelerators
3.2.3 GPU Programming: CUDA
• Heterogeneous programming model
• Serial code runs on the CPU
• Parallel code runs on the GPU
• Grid and thread-block dimensions can be chosen freely:
dim3 dimBlock0(1, 12);
dim3 dimGrid0(2, 3);
Kernel0<<<dimGrid0, dimBlock0>>>(..., ..., ...);
or, respectively:
dim3 dimBlock1(1, 9);
dim3 dimGrid1(3, 2);
Kernel1<<<dimGrid1, dimBlock1>>>(..., ..., ...);
3. Accelerators
3.2.3 GPU Programming: CUDA
CUDA memory management
• CUDA API calls:
Allocation/deallocation of GPU-RAM
Transfers GPU-RAM -> CPU-RAM and CPU-RAM -> GPU-RAM
• (CUDA) kernels:
Transfers GPU-RAM <-> shared memory (SM)
cudaMalloc(..., ...);
cudaMemcpy(..., ..., cudaMemcpyDeviceToHost);
cudaMemcpy(..., ..., cudaMemcpyHostToDevice);
__shared__ float As[BLOCK_SIZE];
As[row] = GetElement(..., ...);
3. Accelerators
3.2.3 GPU Programming: CUDA
CUDA Example 1: Hello World
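The listing itself is not reproduced in the transcript; a minimal sketch of a CUDA Hello World (device-side printf requires Fermi or newer, e.g. compile with nvcc -arch=sm_20) could look like this:

#include <stdio.h>

__global__ void hello()
{
    printf("Hello World from thread %d of block %d\n",
           threadIdx.x, blockIdx.x);
}

int main()
{
    hello<<<2, 4>>>();           // launch 2 blocks of 4 threads
    cudaDeviceSynchronize();     // wait for the kernel so its output appears
    return 0;
}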
3. Accelerators
3.2.3 GPU Programming: CUDA
CUDA: Vector Addition
// Device code
__global__ void VecAdd(const float* A, const float* B, float* C, int N)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < N)   // guard: the grid may contain more threads than elements
        C[i] = A[i] + B[i];
}
3. Accelerators
3.2.3 GPU Programming: CUDA
CUDA: Vector Addition
// Host code
int main() {
    int N = ...;
    size_t size = N * sizeof(float);
    // Allocate input vectors h_A, h_B and the result vector h_C in host memory
    float* h_A = (float*) malloc(size);
    float* h_B = (float*) malloc(size);
    float* h_C = (float*) malloc(size);
    // Enter values in h_A and h_B
    ...
    // Allocate vectors in device memory
    float *d_A, *d_B, *d_C;   // note: each pointer needs its own *
    cudaMalloc((void**)&d_A, size);
    cudaMalloc((void**)&d_B, size);
    cudaMalloc((void**)&d_C, size);
    // Copy vectors from host memory to device memory
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);
3. Accelerators
3.2.3 GPU Programming: CUDA
CUDA: Vector Addition
    // Invoke kernel: round the number of blocks up so all N elements are covered
    int threadsPerBlock = 256;
    int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
    VecAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);
    // Copy the result from device memory to host memory;
    // h_C then contains the result in host memory
    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
    // Output h_C
    ...
    // Free device and host memory
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    free(h_A); free(h_B); free(h_C);
}
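Assuming the device and host code above are placed in a single file (the file name vecadd.cu is illustrative), it can be compiled and run with:

nvcc vecadd.cu -o vecadd
./vecadd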
3. Accelerators
3.2.4 GPU Summary
GPUs have a lot of simple "CUDA cores"/shaders
• Not equivalent to CPU cores!
They are programmed via function offloading
A high number of threads is required
• to efficiently use the highly parallel hardware and hide memory latency
Threads are grouped into thread blocks, thread blocks are grouped into a grid
On-chip shared memory can be used as a fast buffer
Transferring data between CPU and GPU memory happens via API calls
4. Design of an X-ray control HW/SW system
A more fine-grained presentation of the data path
A more fine-grained presentation
Specific Design Language (SDL) to SystemC mapping steps
3. Accelerators
3.4 Intel's Xeon Phi
Originally developed in 2006 as a GPU to compete with Nvidia's Tesla series
Rebranded as a coprocessor for numerical applications in 2010
Official launch in 2013
Main advantage vs. GPUs:
• No need to rewrite code (e.g. in CUDA/OpenCL); just recompile existing C, C++, or Fortran code (see the sketch below)
• "Standard" CPU programming and optimizations apply
Multicore parallelization (OpenMP), vectorization, …
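As a sketch of this "recompile only" claim: the following is plain C with OpenMP, with nothing Xeon-Phi-specific in the source. With Intel's classic toolchain it could be built natively for the card with something like icc -openmp -mmic saxpy.c (flags and file name are illustrative):

#include <stdio.h>
#include <omp.h>

#define N 1000000
static float x[N], y[N];

int main()
{
    const float a = 2.0f;
    for (int i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 2.0f; }

    // Multicore parallelization across the ~60 cores (x 4 SMT threads);
    // the loop body is also a candidate for auto-vectorization
    // onto the 512-bit VPU.
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        y[i] = a * x[i] + y[i];

    printf("y[0] = %f (max threads: %d)\n", y[0], omp_get_max_threads());
    return 0;
}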
3. Accelerators
3.4.1 Intel's Xeon Phi – Overview
• PCIe accelerator card
• 60 cores with private L1 and L2 caches
• Cache coherence using a distributed tag directory
• High-bandwidth ring interconnect
• 8 GDDR5 memory controllers (5 GHz)
3. Accelerators
3.4.2 Intel's Xeon Phi – Core Design
• Based on the original in-order Pentium (P54C) design from 1995
• Clock frequency: 1.05 GHz
• Supports 4-way SMT
– Requires 2 threads to achieve peak performance due to the pipelined decoder
• 32 KiB L1D/L1I cache: read or write 512 bit/cycle
• 512 KiB L2 cache: read or write 256 bit/cycle
• Superscalar design
– V-pipe (x64 compatibility)
– U-pipe (connects to the VPU)
• Vector Processing Unit (VPU)
– 32x 512-bit vector registers
– Supports fused multiply-add
– Vector scatter/gather
3. Accelerators
3.4.3 Intel's Xeon Phi – 512-bit VPU
Regular CPU:
• SSE (128 bit)
• AVX/AVX2 (256 bit)
Xeon Phi:
• IMCI (512 bit)
3. Accelerators
3.4.3 Intel's Xeon Phi – 512-bit VPU
Includes an Extended Math Unit for special operations
• sin, cos, rcp, etc. (cf. the GPUs' SFUs)
Supports fused multiply-add (FMA)
Mask registers for conditional execution (see the sketch below)
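A sketch of the kind of loop that mask registers enable (plain C; on the Xeon Phi the compiler can turn the if into a vector compare producing a mask, followed by a masked store, so all 16 float lanes proceed without a branch):

void threshold(float* y, const float* x, int n)
{
    for (int i = 0; i < n; i++)
        if (x[i] > 0.0f)        // becomes a per-lane mask, not a branch
            y[i] = x[i] * x[i]; // store suppressed where the mask bit is 0
}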
3. Accelerators
3.4.4 Intel's Xeon Phi – Summary
Use small, simple in-order cores to fit many cores onto the chip
Extend the simple cores with a powerful 512-bit vector processing unit (VPU) to achieve high performance (Flop/s)
A high-bandwidth ring interconnect connects the cores
Fast on-device GDDR5 memory
No CUDA or other code rewriting necessary
• Just recompile existing code
If the code was already optimized for a modern CPU (multicore, vectorization), it should automatically deliver good performance on the Xeon Phi as well