1
Trial Lecture
The Use of GPUs for High-Performance Computing
12 October 2010
Magnus Jahre
2
Graphics Processors (GPUs)
• Modern computers are graphics intensive
• Advanced 3D graphics require a significant amount of computation
Graphics Card (Source: nvidia.com)
Solution: Add a Graphics Processor (GPU)
3
High-Performance Computing
High-Performance Computing (HPC): efficient use of computers for computationally intensive problems in science or engineering
General-Purpose Programming on GPUs (GPGPU)
[Chart: example applications plotted by Processing Demand vs. Communication Demand, ranging from Office Applications through Computational Computer Architecture and Molecular Dynamics Simulation to Weather Forecasting and Climate Modeling. Third dimension: Main Memory Capacity]
4
Outline
• GPU Evolution
• GPU Programming
• GPU Architecture
• Achieving High GPU Performance
• Future Trends
• Conclusions
5
GPU EVOLUTION
6
First GPUs: Fixed Hardware
[Blythe 2008]
[Diagram: the fixed-function graphics pipeline. Vertex Data and Texture Maps feed the stages Vertex Processing, Rasterization, Fragment Processing, and Framebuffer Operations, producing the Depth Buffer and Color Buffer]
7
Programmable Shaders
[Diagram: the same pipeline, with the Vertex Processing and Fragment Processing stages now programmable]
Motivation: More flexible graphics processing
8
GPGPU with Programmable Shaders
[Diagram: the programmable pipeline again, repurposed for general-purpose computation]
• Use a graphics library to gain access to the GPU
• Use color values to encode data
• The effect of the fixed-function stages must be accounted for
9
Functional Unit Utilization
[Diagram: the pipeline with separate, fixed pools of Vertex Processing and Fragment Processing units]
10
Functional Unit Utilization
[Diagram: unit utilization under a vertex-intensive shader, a fragment-intensive shader, and a unified shader. With fixed pools, one unit type idles while the other saturates; a unified shader keeps all units busy in both cases]
11
Unified Shader Architecture
• Exploits parallelism:
  – Data parallelism
  – Task parallelism
• Data-parallel processing (SIMD/SIMT)
• Hides memory latencies
• High bandwidth
The architecture naturally supports GPGPU
[Diagram: clusters of streaming processors (SPs), each cluster with its own memory, fed by a Thread Scheduler and connected through an interconnect to on-chip memory or cache and to off-chip DRAM memory]
12
GPU PROGRAMMING
13
GPGPU Tool Support
[Timeline, 2000-2010: programmable shaders give way to unified shaders, and GPGPU tools appear in sequence: Sh, PeakStream, Accelerator, GPU++, CUDA, OpenCL. The number of GPU papers at the Supercomputing conference grows over the same period]
14
Compute Unified Device Architecture (CUDA)
• Most code is normal C++ code
• Code to run on the GPU is organized in kernels
• The CPU sets up and manages the computation
__global__ void vector_add(float* a, float* b, float* c) {
    int idx = threadIdx.x;
    c[idx] = a[idx] + b[idx];
}

int main() {
    int N = 512;
    // ...
    vector_add<<<1, N>>>(a_d, b_d, c_d);
    // ...
}
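A minimal sketch of what the elided host-side setup (the `// ...` lines) typically contains; the buffer names follow the slide's a_d, b_d, c_d convention, and error checking is omitted for brevity:

#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

int main() {
    const int N = 512;
    const size_t bytes = N * sizeof(float);

    // Allocate and fill host buffers
    float* a_h = (float*)malloc(bytes);
    float* b_h = (float*)malloc(bytes);
    float* c_h = (float*)malloc(bytes);
    for (int i = 0; i < N; i++) { a_h[i] = (float)i; b_h[i] = 2.0f * i; }

    // Allocate device buffers in GPU global memory
    float *a_d, *b_d, *c_d;
    cudaMalloc(&a_d, bytes);
    cudaMalloc(&b_d, bytes);
    cudaMalloc(&c_d, bytes);

    // Copy inputs to the device, launch one block of N threads, copy the result back
    cudaMemcpy(a_d, a_h, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(b_d, b_h, bytes, cudaMemcpyHostToDevice);
    vector_add<<<1, N>>>(a_d, b_d, c_d);
    cudaMemcpy(c_h, c_d, bytes, cudaMemcpyDeviceToHost);

    printf("c[10] = %.1f\n", c_h[10]);  // expect 30.0

    cudaFree(a_d); cudaFree(b_d); cudaFree(c_d);
    free(a_h); free(b_h); free(c_h);
    return 0;
}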
15
Thread/Data Organization
• Hierarchical thread organization:
  – Grid
  – Block
  – Thread
• A block can have a maximum of 512 threads
• 1D, 2D, and 3D mappings are possible (sketched after the diagram)
[Diagram: a 2D grid of blocks, Block (0,0) through Block (1,2), and a 1D grid of blocks, Block (0) and Block (1)]
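As an illustration of a 2D mapping, a hypothetical per-element kernel over a 1024x1024 array could be configured and launched like this (the kernel name, data_d, and the sizes are illustrative, not from the slides):

// Example per-element kernel: each thread derives its global 2D coordinates
__global__ void process(float* data, int width) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    data[y * width + x] *= 2.0f;  // example per-element work
}

// 2D launch: 16x16 = 256 threads per block (within the 512-thread limit),
// in a 64x64 grid of blocks covering 1024x1024 elements.
dim3 threadsPerBlock(16, 16);
dim3 numBlocks(64, 64);
process<<<numBlocks, threadsPerBlock>>>(data_d, 1024);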
16
Vector Addition Example
[Diagram: the input vectors A and B are copied from CPU main memory to GPU global memory, the SPs compute C = A + B using local memory, and C is copied back to main memory]
17
Terminology: Warp
A collection of concurrently processed threads is called a warp.
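On NVIDIA GPUs of this generation the warp size is 32 threads; a small sketch that queries it at run time instead of hard-coding it:

#include <cuda_runtime.h>
#include <stdio.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);                  // properties of device 0
    printf("Warp size: %d threads\n", prop.warpSize);   // 32 on NVIDIA GPUs to date
    return 0;
}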
18
Vector Addition Profile
• Only 11% of GPU time is used to add vectors
• The arithmetic intensity of the problem is too low
• Overlapping data copy and computation could help (sketched below)
Operation    | % GPU time
vector_add   | 11%
memcpyHtoD   | 58%
memcpyDtoH   | 32%
Hardware: NVIDIA NVS 3100M
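A sketch of how the copies and the computation could be overlapped with CUDA streams. It assumes the host buffers were allocated with cudaMallocHost (pinned memory, so asynchronous copies can proceed) and uses a block-indexed variant of the kernel, since the slide's kernel only handles a single block; chunk sizes are illustrative:

// Block-indexed variant of vector_add (the slide's version indexes with
// threadIdx.x only and is limited to one block)
__global__ void vector_add_n(float* a, float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

// Pipeline chunks through two streams: while one stream's kernel runs,
// the other stream's copies can use the DMA engine.
const int CHUNK = 1 << 16;
cudaStream_t stream[2];
cudaStreamCreate(&stream[0]);
cudaStreamCreate(&stream[1]);

for (int off = 0, s = 0; off < N; off += CHUNK, s ^= 1) {
    int n = (N - off < CHUNK) ? (N - off) : CHUNK;
    size_t bytes = n * sizeof(float);
    cudaMemcpyAsync(a_d + off, a_h + off, bytes, cudaMemcpyHostToDevice, stream[s]);
    cudaMemcpyAsync(b_d + off, b_h + off, bytes, cudaMemcpyHostToDevice, stream[s]);
    vector_add_n<<<(n + 255) / 256, 256, 0, stream[s]>>>(a_d + off, b_d + off, c_d + off, n);
    cudaMemcpyAsync(c_h + off, c_d + off, bytes, cudaMemcpyDeviceToHost, stream[s]);
}
cudaDeviceSynchronize();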
19
Will GPUs Save the World?
• Careful optimization of both CPU and GPU code reduces the performance difference between GPUs and CPUs substantially [Lee et al., ISCA 2010]
• GPGPU has provided nice speedups for problems that fit the architecture
• Metric challenge: The practitioner needs performance per developer hour
20
GPU ARCHITECTURE
21
NVIDIA Tesla Architecture
Figure reproduced from [Lindholm et al.; 2008]
22
Control Flow
• The threads in a warp all execute the same instruction
• Branching is efficient if all threads in a warp branch in the same direction
• Divergent branches within a warp cause serial execution of both paths (contrasted in the sketch below)
[Diagram: at a divergent IF, the threads for which the condition is true execute first while the false-condition threads are masked off, then the roles swap]
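Two illustrative kernels contrasting the cases (hypothetical code; warp size assumed to be 32):

// Divergent: even and odd threads in the same warp take different paths,
// so the hardware runs both paths serially with threads masked off.
__global__ void divergent(float* d) {
    int i = threadIdx.x;
    if (i % 2 == 0) d[i] *= 2.0f;   // even lanes
    else            d[i] += 1.0f;   // odd lanes of the same warp
}

// Uniform: the branch condition is constant within each 32-thread warp,
// so every warp takes exactly one path and nothing is serialized.
__global__ void uniform(float* d) {
    int i = threadIdx.x;
    if ((i / 32) % 2 == 0) d[i] *= 2.0f;  // even-numbered warps
    else                   d[i] += 1.0f;  // odd-numbered warps
}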
23
Modern DRAM Interfaces
• Maximize bandwidth with 3D organization
• Repeated requests to the row buffer are very efficient
[Diagram: DRAM is organized as multiple banks, each a 2D array of rows and columns; the row address loads a row into the Row Buffer, and the column address selects a word within it]
24
Access Coalescing
• Global memory accesses from all threads in a half-warp are combined into a single memory transaction
• All memory elements in a segment are accessed
• Segment size can be halved if only the lower or upper half is used
Assumes Compute Capability 1.2 or higher
[Diagram: Threads 0-7 access consecutive words at addresses 128-156 and are served by a single transaction; accesses to addresses 112-124, which fall outside that segment, require a second transaction]
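Illustrative kernels contrasting a coalesced and a strided global memory access pattern (hypothetical code):

// Coalesced: thread k of a half-warp touches the k-th consecutive word,
// so all 16 accesses fall in one aligned segment and need one transaction.
__global__ void copy_coalesced(const float* in, float* out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i];
}

// Strided: the half-warp's accesses are spread over many segments, so the
// hardware must issue many transactions and most fetched bytes are wasted.
__global__ void copy_strided(const float* in, float* out, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i * stride];
}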
25
Bank Conflicts
• Memory banks can service requests independently
• Bank conflict: more than one thread accesses a bank concurrently
• Strided access patterns can cause bank conflicts (see the sketch below)
[Diagram: Threads 0-7 mapped onto memory Banks 0-7]
Stride-two accesses give a 2-way bank conflict
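The classic remedy is to pad shared memory so strided or column-wise accesses spread across the banks. A sketch based on the well-known padded-tile transpose pattern (assumes a square matrix whose side is a multiple of 16; this is illustrative code, not from the slides):

__global__ void transpose(const float* in, float* out, int width) {
    // The +1 padding column shifts each row of the tile by one bank, so
    // reading a tile column no longer maps all threads to the same bank.
    __shared__ float tile[16][16 + 1];

    int x = blockIdx.x * 16 + threadIdx.x;
    int y = blockIdx.y * 16 + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * width + x];  // coalesced load
    __syncthreads();

    // Write the transposed tile; the column read of `tile` is conflict-free
    // thanks to the padding.
    x = blockIdx.y * 16 + threadIdx.x;
    y = blockIdx.x * 16 + threadIdx.y;
    out[y * width + x] = tile[threadIdx.x][threadIdx.y];
}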
26
NVIDIA Fermi
• Next-generation computing chip from NVIDIA
• Aims to alleviate important bottlenecks:
  – Improved double-precision floating-point support
  – Cache hierarchy
  – Concurrent kernel execution
• More problems can be solved efficiently on a GPU
Figure reproduced from [NVIDIA; 2010]
27
ACHIEVING HIGH GPU PERFORMANCE
28
Which problems fit the GPU model?
• Fine-grained data parallelism available
• Sufficient arithmetic intensity
• Sufficiently regular data access patterns
It’s all about organizing data
Optimized memory system use enables high performance
29
Increase Computational Intensity
• Memory types:
  – On-chip shared memory: small and fast
  – Off-chip global memory: large and slow
• Technique: tiling (sketched below)
  – Choose the tile size such that a tile fits in shared memory
  – Increases locality by reducing the reuse distance
A × B = C
[Diagram: tiles of A and B are reused across the computation of a tile of C]
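A sketch of the tiling technique for the matrix multiplication in the figure (assumes square N×N matrices with N a multiple of the tile size; illustrative code):

#define TILE 16

__global__ void matmul_tiled(const float* A, const float* B, float* C, int N) {
    // Each block computes one TILE x TILE tile of C. Staging the needed
    // tiles of A and B in shared memory means every loaded element is
    // reused TILE times, raising the arithmetic intensity.
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < N / TILE; t++) {
        As[threadIdx.y][threadIdx.x] = A[row * N + (t * TILE + threadIdx.x)];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();                       // wait until both tiles are loaded
        for (int k = 0; k < TILE; k++)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();                       // wait before overwriting the tiles
    }
    C[row * N + col] = acc;
}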
30
Memory Layout
• Exploit coalescing to achieve high bandwidth
• Linear (unit-stride) access is necessary
• Solution: Tiling
A × B = C (assume row-major storage)
[Diagram: row-wise accesses are coalesced, column-wise accesses are not]
31
Avoid Branching Inside Warps
Assume 2 threads per warp
[Diagram: two reduction trees summing 8 elements across warps W1-W4. In the first, all iterations diverge; in the second, only one iteration diverges]
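A sketch of the two addressing schemes for a block-level sum reduction (256 threads per block assumed; illustrative code):

__global__ void reduce_sum(const float* in, float* out) {
    __shared__ float s[256];
    int t = threadIdx.x;
    s[t] = in[blockIdx.x * 256 + t];
    __syncthreads();

    // Divergent scheme: the active threads (t % (2*stride) == 0) are spread
    // out, so warps contain both active and inactive threads at every step:
    //   for (int stride = 1; stride < 256; stride *= 2) {
    //       if (t % (2 * stride) == 0) s[t] += s[t + stride];
    //       __syncthreads();
    //   }

    // Contiguous scheme: the active threads are always a contiguous prefix,
    // so whole warps retire together and only the final steps diverge.
    for (int stride = 128; stride > 0; stride >>= 1) {
        if (t < stride) s[t] += s[t + stride];
        __syncthreads();
    }
    if (t == 0) out[blockIdx.x] = s[0];
}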
32
Automation
• Thread resource usage must be balanced against the number of concurrent threads [Ryoo et al.; PPoPP 2008] (a brute-force sweep is sketched below)
  – Avoid saturation
  – The sweet spot varies between devices
  – The sweet spot varies with problem size
• Auto-tuning a 3D FFT [Nukada et al.; SC 2009]
  – Balances resource consumption against parallelism through kernel radix and ordering
  – The best number of thread blocks is chosen automatically
  – Inserts padding to avoid shared memory bank conflicts
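A brute-force flavor of such tuning can be sketched with CUDA events; my_kernel, data_d, and N are hypothetical placeholders, not from the cited papers:

// Time a hypothetical kernel at several block sizes and keep the fastest.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

float best_ms = 1e30f;
int best_threads = 0;
for (int threads = 64; threads <= 512; threads += 64) {
    int blocks = (N + threads - 1) / threads;
    cudaEventRecord(start);
    my_kernel<<<blocks, threads>>>(data_d, N);   // hypothetical kernel
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms;
    cudaEventElapsedTime(&ms, start, stop);
    if (ms < best_ms) { best_ms = ms; best_threads = threads; }
}
// best_threads now holds the empirically fastest configuration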
33
Case Study: Molecular Dynamics Simulation with NAMD
Simulate the interaction of atoms according to the laws of atomic physics and quantum chemistry [Phillips; SC 2009]
34
Key Performance Enablers
• Careful division of labor between GPU and CPU:
  – GPU: short-range non-bonded forces
  – CPU: long-range electrostatic forces and coordinate updates
• Overlap CPU and GPU execution through asynchronous kernel execution
• Use event recording to track progress in asynchronously executing streams (sketched below)
[Phillips et al., SC2008]
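A sketch of the overlap pattern just described, not NAMD's actual code; the kernel, function, and variable names are hypothetical:

// Launch the short-range force kernel asynchronously, mark its completion
// with an event, and keep the CPU busy until the event reports done.
cudaStream_t stream;
cudaStreamCreate(&stream);
cudaEvent_t forces_done;
cudaEventCreate(&forces_done);

short_range_forces<<<grid, block, 0, stream>>>(atoms_d, forces_d);  // GPU work
cudaEventRecord(forces_done, stream);      // marker after the kernel in the stream

compute_long_range_electrostatics();       // CPU work runs concurrently
while (cudaEventQuery(forces_done) == cudaErrorNotReady) {
    do_more_cpu_work();                    // keep the CPU busy while polling
}
// GPU results are now ready to be combined with the CPU's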
35
CPU/GPU Cooperation in NAMD
[Phillips et al., SC2008]
[Timeline diagram: CPU and GPU rows over time; the GPU computes remote and then local forces (f) while the CPU supplies coordinates (x) and performs the update step after the local work]
36
Challenges
• Completely restructuring legacy software systems is prohibitive
• Batch-processing software is unaware of GPUs
• Interoperability issues with pinning main memory pages for DMA
[Phillips et al., SC2008]
37
FUTURE TRENDS
38
Accelerator Integration
• The industry is moving towards integrating CPUs and GPUs on the same chip:
  – AMD Fusion [Brookwood; 2010]
  – Intel Sandy Bridge (fixed-function GPU)
• Are other accelerators appropriate?
  – Single-chip Heterogeneous Computing: Does the Future Include Custom Logic, FPGAs, and GPUs? [Chung et al.; MICRO 2010]
AMD Fusion (figure reproduced from [Brookwood; 2010])
39
Vector Addition Revisited
Start-up and shut-down data transfers are the main bottleneck
Fusion eliminates these overheads by storing values in the on-chip cache
Using accelerators becomes more feasible
40
Memory System Scalability
• Current CPU bottlenecks:
  – The number of pins on a chip grows slowly
  – Off-chip bandwidth grows slowly
• Integration only helps if there is sufficient on-chip cooperation to avoid a significant increase in off-chip bandwidth demand
• Conflicting requirements:
  – GPU: high bandwidth, not latency sensitive
  – CPU: high bandwidth, can be latency sensitive
41
CONCLUSIONS
42
Conclusions
• GPUs can offer a significant speedup for problems that fit the model
• Tool support and flexible architectures increase the number of problems that fit the model
• CPU/GPU on-chip integration can reduce GPU start-up overheads
43
Thank You
Visit our website: http://research.idi.ntnu.no/multicore/
44
References
• Debunking the 100X GPU vs. CPU Myth: An Evaluation of Throughput Computing on CPU and GPU; Lee et al.; ISCA; 2010
• Programming Massively Parallel Processors; Kirk and Hwu; Morgan Kaufmann; 2010
• NVIDIA's Next Generation CUDA Compute Architecture: Fermi; White Paper; NVIDIA; 2010
• AMD Fusion Family of APUs: Enabling a Superior Immersive PC Experience; White Paper; AMD; 2010
• Multi-core Programming with OpenCL: Performance and Portability; Fagerlund; Master's Thesis; NTNU; 2010
• Complexity-Effective Memory Access Scheduling for Many-Core Accelerator Architectures; Yuan et al.; MICRO; 2009
• Auto-Tuning 3-D FFT Library for CUDA GPUs; Nukada and Matsuoka; SC; 2009
• Programming Graphics Processing Units (GPUs); Bakke; Master's Thesis; NTNU; 2009
• Adapting a Message-Driven Parallel Application to GPU-Accelerated Clusters; Phillips et al.; SC; 2008
• Rise of the Graphics Processor; Blythe; Proceedings of the IEEE; 2008
• NVIDIA Tesla: A Unified Graphics and Computing Architecture; Lindholm et al.; IEEE Micro; 2008
• Optimization Principles and Application Performance Evaluation of a Multithreaded GPU using CUDA; Ryoo et al.; PPoPP; 2008
45
EXTRA SLIDES
46
Complexity-Effective Memory Access Scheduling
• The on-chip interconnect may interleave requests from different thread processors
• Row locality is destroyed
• Solution: an order-preserving interconnect arbitration policy combined with in-order scheduling
[Yuan et al.; MICRO 2009]
[Diagram: a request queue holding Req 0 Row A, Req 0 Row B, Req 1 Row A, Req 1 Row B. Naive in-order scheduling switches rows before nearly every request; out-of-order scheduling groups the requests to the same row and needs fewer row switches. The proposed scheme achieves the performance of out-of-order scheduling with less complex in-order scheduling]