
Page 1: General Purpose Computing on Graphics Processing Units: Optimization Strategy

Henry Au
Space and Naval Warfare Center Pacific
henry.au@navy.mil
09/12/12

Distribution Statement

Page 2: Outline

▼ Background
▼ NVIDIA’s CUDA
▼ Decomposition & Porting
▼ CUDA Optimizations
▼ GPU Results
▼ Conclusion

Page 3: Background

▼ Parallel Programming on GPUs
  General-Purpose Computation on Graphics Processing Units (GPGPU)
  Compute Unified Device Architecture (CUDA)
  Open Computing Language (OpenCL™)

Page 4: Background

▼ GPUs vs. CPUs
  GPU and CPU cores are not equivalent
  A CPU core is faster and more robust, but CPUs have far fewer cores
  A GPU core is neither as robust nor as fast, but handles repetitive tasks quickly
▼ NVIDIA GeForce GTX 470
  448 cores
  Memory bandwidth: 133.9 GB/s
  544.32 GFLOPS double precision
▼ Intel Core i7-965
  4 cores
  Memory bandwidth: 25.6 GB/s
  69.23 GFLOPS double precision

Page 5: CUDA by NVIDIA

▼ Compute Unified Device Architecture
  Low- and high-level APIs available
  C for CUDA
  High-latency memory transfers
  Limited cache
  Scalable programming model
  Requires NVIDIA graphics cards

Page 6: Decomposition and Porting

▼ Amdahl’s and Gustafson’s Laws
▼ Estimate the speed-up before porting:

  Amdahl:    S = 1 / (γ + (1 − γ) / P)
  Gustafson: S = γ + (1 − γ) P

  where P is the amount of parallel scaling achieved and γ is the fraction of the algorithm that is serial.
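As a worked example (illustrative numbers, not from the slides): if γ = 0.10 of the algorithm is serial and P = 448 (one thread per GTX 470 core), Amdahl’s law bounds the speed-up at

  S = 1 / (0.10 + 0.90 / 448) ≈ 9.8

so the serial fraction, not the core count, dominates what porting can achieve.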

Page 7: Decomposition and Porting

▼ TAU Profile
  Determine call paths and consider subroutine calls
  Pay attention to large for loops and redundant computations
▼ Visual Studio 2008
  Initialize profile: TAU_PROFILE("StartFor", "Main", TAU_USER);
  Place timers (see the sketch below):
    − TAU_START("FunctionName")
    − TAU_STOP("FunctionName")
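A minimal sketch of how these TAU macros might wrap a hot spot, assuming the TAU headers and libraries are configured for the build; the function and its loop are hypothetical stand-ins for a per-frame filter:

#include <TAU.h>
#include <vector>

// Hypothetical hot spot singled out by the call-path profile.
void processFrame(float* pixels, int n) {
    TAU_START("processFrame");      // start the named timer
    for (int i = 0; i < n; ++i)
        pixels[i] *= 0.5f;          // placeholder per-pixel work
    TAU_STOP("processFrame");       // stop the named timer
}

int main(int argc, char** argv) {
    TAU_PROFILE("StartFor", "Main", TAU_USER);  // initialize profiling, as on the slide
    std::vector<float> pixels(720 * 480, 1.0f);
    processFrame(pixels.data(), static_cast<int>(pixels.size()));
    return 0;
}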

Page 8: Decomposition and Porting

▼ CUDA Overhead
  High latency is associated with memory transfers
  It can be hidden with large amounts of mathematical computation
  Reduce device-to-host memory transfers (see the sketch below):
    − Prefer a few large transfers over many small ones
    − Perform serial tasks using the parallel processors rather than copying back
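A hedged sketch contrasting many small host-to-device copies with a single large one; the buffer names and sizes are illustrative, not from the slides:

#include <cuda_runtime.h>

int main() {
    const int kChunks = 1024, kChunkFloats = 256;
    const size_t total = kChunks * kChunkFloats * sizeof(float);

    float* host = new float[kChunks * kChunkFloats]();  // zero-initialized
    float* dev = nullptr;
    cudaMalloc((void**)&dev, total);

    // Slow pattern: per-chunk copies pay the transfer latency 1024 times.
    for (int i = 0; i < kChunks; ++i) {
        cudaMemcpy(dev + i * kChunkFloats, host + i * kChunkFloats,
                   kChunkFloats * sizeof(float), cudaMemcpyHostToDevice);
    }

    // Preferred pattern: one large copy pays the latency once.
    cudaMemcpy(dev, host, total, cudaMemcpyHostToDevice);

    cudaFree(dev);
    delete[] host;
    return 0;
}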

Page 9: CUDA Optimizations

▼ Thread and Block Occupancy
  Varies depending on the graphics card
▼ Page-Locked Memory
  cudaHostAlloc()
  A limited resource that should not be overused
▼ Streams
  A queue of GPU operations, such as computation “kernels” and memory copies
▼ Asynchronous Memory Calls
  Ensure non-blocking calls: cudaMemcpyAsync() or a kernel launch

Page 10: Thread Occupancy

▼ Ensure enough threads are operating at the same time (see the launch sketch below)
  256 threads per block
  Maximum of 1024 threads per block
  Monitor occupancy

[Chart: ALF Threads Per Block vs. Frames Per Second — ALF FPS plotted for block sizes from 4 to 1024 threads; x-axis: Threads Per Block (0–1200), y-axis: ALF Frames Per Second Processed (40–90)]
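A minimal sketch of a launch configuration using the slide’s 256-threads-per-block choice; the kernel and names are illustrative:

#include <cuda_runtime.h>

__global__ void scale(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= 2.0f;    // placeholder per-element work
}

int main() {
    const int n = 1 << 20;
    float* dev = nullptr;
    cudaMalloc((void**)&dev, n * sizeof(float));

    const int threadsPerBlock = 256;    // tune per card; hardware maximum is 1024
    const int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;  // cover all n
    scale<<<blocks, threadsPerBlock>>>(dev, n);

    cudaDeviceSynchronize();
    cudaFree(dev);
    return 0;
}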

Page 11: CUDA Optimizations

▼ Page-Locked Host Memory
  cudaHostAlloc() vs. malloc vs. new (see the sketch below)

[Chart: Processing Time (ms) vs. Data Processed (MB) for buffers allocated with new, malloc, and cudaHostAlloc; x-axis: Data (0–14 MB), y-axis: Processing Time (0–14 ms)]
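A hedged sketch of allocating page-locked (pinned) host memory with cudaHostAlloc() alongside an ordinary pageable allocation; the 8 MB size is illustrative, chosen from within the chart’s range:

#include <cuda_runtime.h>
#include <cstdlib>
#include <cstring>

int main() {
    const size_t bytes = 8 * 1024 * 1024;   // 8 MB

    float* pageable = static_cast<float*>(std::malloc(bytes));  // ordinary heap
    std::memset(pageable, 0, bytes);

    float* pinned = nullptr;
    cudaHostAlloc((void**)&pinned, bytes, cudaHostAllocDefault); // page-locked
    std::memset(pinned, 0, bytes);

    float* dev = nullptr;
    cudaMalloc((void**)&dev, bytes);

    cudaMemcpy(dev, pageable, bytes, cudaMemcpyHostToDevice); // staged through a pinned buffer internally
    cudaMemcpy(dev, pinned, bytes, cudaMemcpyHostToDevice);   // direct DMA; typically faster

    cudaFree(dev);
    cudaFreeHost(pinned);    // pinned memory has its own free call
    std::free(pageable);
    return 0;
}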

Page 12: CUDA Optimizations

▼ Stream Structure, Non-Optimized
  Processing time: 49.5 ms

cudaMemcpyAsync(dataA0, stream0, HostToDevice)
cudaMemcpyAsync(dataB0, stream0, HostToDevice)
kernel<<<blocks, threads, stream0>>>(result0, dataA0, dataB0)
cudaMemcpyAsync(result0, stream0, DeviceToHost)
cudaMemcpyAsync(dataA1, stream1, HostToDevice)
cudaMemcpyAsync(dataB1, stream1, HostToDevice)
kernel<<<blocks, threads, stream1>>>(result1, dataA1, dataB1)
cudaMemcpyAsync(result1, stream1, DeviceToHost)

Page 13: CUDA Optimizations

▼ Stream Structure, Optimized
  Processing time: 49.4 ms

cudaMemcpyAsync(dataA0, stream0, HostToDevice)
cudaMemcpyAsync(dataA1, stream1, HostToDevice)
cudaMemcpyAsync(dataB0, stream0, HostToDevice)
cudaMemcpyAsync(dataB1, stream1, HostToDevice)
kernel<<<blocks, threads, stream0>>>(result0, dataA0, dataB0)
kernel<<<blocks, threads, stream1>>>(result1, dataA1, dataB1)
cudaMemcpyAsync(result0, stream0, DeviceToHost)
cudaMemcpyAsync(result1, stream1, DeviceToHost)

Page 14: CUDA Optimizations

▼ Stream Structure, Optimized & Modified (see the runnable sketch below)
  Processing time: 41.1 ms

cudaMemcpyAsync(dataA0, stream0, HostToDevice)
cudaMemcpyAsync(dataA1, stream1, HostToDevice)
cudaMemcpyAsync(dataB0, stream0, HostToDevice)
cudaMemcpyAsync(dataB1, stream1, HostToDevice)
kernel<<<blocks, threads, stream0>>>(result0, dataA0, dataB0)
cudaMemcpyAsync(result0, stream0, DeviceToHost)
kernel<<<blocks, threads, stream1>>>(result1, dataA1, dataB1)
cudaMemcpyAsync(result1, stream1, DeviceToHost)
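A runnable sketch of this issue order using the real CUDA signatures (the slide’s pseudocode omits the size, direction, and shared-memory arguments); the buffer names, sizes, and the add-style kernel are illustrative stand-ins:

#include <cuda_runtime.h>

__global__ void kernel(float* result, const float* a, const float* b, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        result[i] = a[i] + b[i];   // placeholder for the real computation
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);
    const int threads = 256, blocks = (n + threads - 1) / threads;

    cudaStream_t stream0, stream1;
    cudaStreamCreate(&stream0);
    cudaStreamCreate(&stream1);

    // Page-locked host buffers are required for truly asynchronous copies.
    float *hA0, *hB0, *hR0, *hA1, *hB1, *hR1;
    float** hostBufs[] = {&hA0, &hB0, &hR0, &hA1, &hB1, &hR1};
    for (float** p : hostBufs)
        cudaHostAlloc((void**)p, bytes, cudaHostAllocDefault);

    float *dA0, *dB0, *dR0, *dA1, *dB1, *dR1;
    float** devBufs[] = {&dA0, &dB0, &dR0, &dA1, &dB1, &dR1};
    for (float** p : devBufs)
        cudaMalloc((void**)p, bytes);

    // Issue order from the slide: both streams' input copies first, then each
    // stream's kernel immediately followed by its copy-back.
    cudaMemcpyAsync(dA0, hA0, bytes, cudaMemcpyHostToDevice, stream0);
    cudaMemcpyAsync(dA1, hA1, bytes, cudaMemcpyHostToDevice, stream1);
    cudaMemcpyAsync(dB0, hB0, bytes, cudaMemcpyHostToDevice, stream0);
    cudaMemcpyAsync(dB1, hB1, bytes, cudaMemcpyHostToDevice, stream1);
    kernel<<<blocks, threads, 0, stream0>>>(dR0, dA0, dB0, n);
    cudaMemcpyAsync(hR0, dR0, bytes, cudaMemcpyDeviceToHost, stream0);
    kernel<<<blocks, threads, 0, stream1>>>(dR1, dA1, dB1, n);
    cudaMemcpyAsync(hR1, dR1, bytes, cudaMemcpyDeviceToHost, stream1);

    cudaStreamSynchronize(stream0);
    cudaStreamSynchronize(stream1);

    for (float** p : devBufs) cudaFree(*p);
    for (float** p : hostBufs) cudaFreeHost(*p);
    cudaStreamDestroy(stream0);
    cudaStreamDestroy(stream1);
    return 0;
}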

Page 15: CUDA Optimizations

▼ Stream structure is not always beneficial
  Its overhead can result in a performance reduction
  Profile to compare kernel execution time against data transfer time (see the timing sketch below):
    − NVIDIA Visual Profiler
    − cudaEventRecord()
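A minimal sketch of timing a kernel with CUDA events; the kernel and sizes are illustrative, and the same pattern bracketing a cudaMemcpy would time the transfer instead:

#include <cuda_runtime.h>
#include <cstdio>

__global__ void work(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] += 1.0f;   // placeholder computation
}

int main() {
    const int n = 1 << 20;
    float* dev = nullptr;
    cudaMalloc((void**)&dev, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);                // mark start on the default stream
    work<<<(n + 255) / 256, 256>>>(dev, n);
    cudaEventRecord(stop, 0);                 // mark stop on the default stream
    cudaEventSynchronize(stop);               // wait until the kernel has finished

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);   // elapsed GPU time in milliseconds
    printf("kernel: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(dev);
    return 0;
}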

Page 16: GPU Results

[Chart: ALF Processing Speed (Frames Per Second) vs. Optimization Stage — data points: (0, 65.14), (1, 67.05), (2, 81.74), (3, 82.88), (4, 85.51)]

▼ Optimization Stages
  0: No optimizations (65 FPS)
  1: Page-locked memory (67 FPS)
  2: Asynchronous GPU calls (81 FPS)
  3: Non-optimized streaming (82 FPS)
  4: Optimized streaming (85 FPS)

Page 17: GPU Results

▼ ALF CPU vs. GPU Processing

[Chart: Adaptive Linear Filter processing speed (FPS) vs. image height, 4:3 aspect ratio; the GPU series matches the 85.51 FPS optimized result from Page 16]

Image Height  CPU FPS  GPU FPS
624           77.64    92.78
720           67.78    85.51
1248          20.05    31.95
1440          17.07    26.13
1872          8.92     15.23
2160          7.59     12.91

Page 18: Conclusion

▼ Test various threads-per-block allocations
▼ Use page-locked memory for data transfers
  Use asynchronous memory transfers and non-blocking calls
▼ Ensure proper coordination of streams
  Exploit both data parallelism and task parallelism

Page 19: Questions

QUESTIONS?

Page 20: References

▼ Amdahl, G., "Validity of the Single Processor Approach to Achieving Large Scale Computing Capabilities." AFIPS Spring Joint Computer Conference, 1967.
▼ CUDA C Best Practices Guide, Ver. 4.0. NVIDIA, May 2011.
▼ Gustafson, J., "Reevaluating Amdahl's Law." Communications of the ACM, Vol. 31, No. 5, May 1988.
▼ Sanders, J., and Kandrot, E., CUDA by Example: An Introduction to General-Purpose GPU Programming. Addison-Wesley, 2010.
▼ NVIDIA CUDA C Programming Guide, Ver. 4.0. NVIDIA, 5/6/2011.
▼ TAU User Guide. Department of Computer and Information Science, University of Oregon, Advanced Computing Laboratory, 2011.