1
Trial Lecture
The Use of GPUs for High-Performance Computing
12 October 2010
Magnus Jahre
2
Graphics Processors (GPUs)
• Modern computers are graphics intensive
• Advanced 3D graphics require a significant amount of computation
Graphics Card (Source: nvidia.com)
Solution: Add a Graphics Processor (GPU)
3
High-Performance Computing
High-Performance Computing (HPC): efficient use of computers for computationally intensive problems in science or engineering
General-Purpose Programming on GPUs (GPGPU)
[Chart: example applications plotted by Processing Demand vs. Communication Demand, ranging from Office Applications through Computational Computer Architecture and Molecular Dynamics Simulation to Weather Forecasting and Climate Modeling. Third dimension: Main Memory Capacity]
4
Outline
• GPU Evolution
• GPU Programming
• GPU Architecture
• Achieving High GPU Performance
• Future Trends
• Conclusions
5
GPU EVOLUTION
6
First GPUs: Fixed Hardware
[Blythe 2008]
[Diagram: the fixed-function graphics pipeline. Vertex Data and Texture Maps feed the stages Vertex Processing, Rasterization, Fragment Processing, and Framebuffer Operations, producing the Depth Buffer and Color Buffer]
7
Programmable Shaders
[Diagram: the same pipeline, with the Vertex Processing and Fragment Processing stages now programmable]
Motivation: More flexible graphics processing
8
GPGPU with Programmable Shaders
[Diagram: the programmable pipeline again, repurposed for general-purpose computation]
• Use a graphics library to gain access to the GPU
• Use color values to encode data
• The effect of the fixed-function stages must be accounted for
9
Functional Unit Utilization
[Diagram: the pipeline with separate, fixed pools of Vertex Processing and Fragment Processing units]
10
Functional Unit Utilization
[Diagram: unit utilization under a vertex-intensive shader, a fragment-intensive shader, and a unified shader. With fixed pools, one unit type idles while the other saturates; a unified shader keeps all units busy in both cases]
11
Unified Shader Architecture
• Exploits parallelism:
  – Data parallelism
  – Task parallelism
• Data-parallel processing (SIMD/SIMT)
• Hides memory latencies
• High bandwidth
The architecture naturally supports GPGPU
[Diagram: clusters of streaming processors (SPs), each cluster with its own memory, fed by a Thread Scheduler and connected through an interconnect to on-chip memory or cache and to off-chip DRAM memory]
12
GPU PROGRAMMING
13
GPGPU Tool Support
[Timeline, 2000-2010: programmable shaders give way to unified shaders, and GPGPU tools appear in sequence: Sh, PeakStream, Accelerator, GPU++, CUDA, OpenCL. The number of GPU papers at the Supercomputing conference grows over the same period]
14
Compute Unified Device Architecture (CUDA)
• Most code is normal C++ code
• Code to run on the GPU is organized in kernels
• The CPU sets up and manages the computation
__global__ void vector_add(float* a, float* b, float* c) {
    int idx = threadIdx.x;
    c[idx] = a[idx] + b[idx];
}

int main() {
    int N = 512;
    // ...
    vector_add<<<1, N>>>(a_d, b_d, c_d);
    // ...
}
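A minimal sketch of what the elided host-side setup (the `// ...` lines) typically contains; the buffer names follow the slide's a_d, b_d, c_d convention, and error checking is omitted for brevity:

#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

int main() {
    const int N = 512;
    const size_t bytes = N * sizeof(float);

    // Allocate and fill host buffers
    float* a_h = (float*)malloc(bytes);
    float* b_h = (float*)malloc(bytes);
    float* c_h = (float*)malloc(bytes);
    for (int i = 0; i < N; i++) { a_h[i] = (float)i; b_h[i] = 2.0f * i; }

    // Allocate device buffers in GPU global memory
    float *a_d, *b_d, *c_d;
    cudaMalloc(&a_d, bytes);
    cudaMalloc(&b_d, bytes);
    cudaMalloc(&c_d, bytes);

    // Copy inputs to the device, launch one block of N threads, copy the result back
    cudaMemcpy(a_d, a_h, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(b_d, b_h, bytes, cudaMemcpyHostToDevice);
    vector_add<<<1, N>>>(a_d, b_d, c_d);
    cudaMemcpy(c_h, c_d, bytes, cudaMemcpyDeviceToHost);

    printf("c[10] = %.1f\n", c_h[10]);  // expect 30.0

    cudaFree(a_d); cudaFree(b_d); cudaFree(c_d);
    free(a_h); free(b_h); free(c_h);
    return 0;
}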
15
Thread/Data Organization
• Hierarchical thread organization:
  – Grid
  – Block
  – Thread
• A block can have a maximum of 512 threads
• 1D, 2D, and 3D mappings are possible (sketched after the diagram)
[Diagram: a 2D grid of blocks, Block (0,0) through Block (1,2), and a 1D grid of blocks, Block (0) and Block (1)]
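As an illustration of a 2D mapping, a hypothetical per-element kernel over a 1024x1024 array could be configured and launched like this (the kernel name, data_d, and the sizes are illustrative, not from the slides):

// Example per-element kernel: each thread derives its global 2D coordinates
__global__ void process(float* data, int width) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    data[y * width + x] *= 2.0f;  // example per-element work
}

// 2D launch: 16x16 = 256 threads per block (within the 512-thread limit),
// in a 64x64 grid of blocks covering 1024x1024 elements.
dim3 threadsPerBlock(16, 16);
dim3 numBlocks(64, 64);
process<<<numBlocks, threadsPerBlock>>>(data_d, 1024);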
16
Vector Addition Example
[Diagram: the input vectors A and B are copied from CPU main memory to GPU global memory, the SPs compute C = A + B using local memory, and C is copied back to main memory]
17
Terminology: Warp
A collection of concurrently processed threads is called a warp.
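On NVIDIA GPUs of this generation the warp size is 32 threads; a small sketch that queries it at run time instead of hard-coding it:

#include <cuda_runtime.h>
#include <stdio.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);                  // properties of device 0
    printf("Warp size: %d threads\n", prop.warpSize);   // 32 on NVIDIA GPUs to date
    return 0;
}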
18
Vector Addition Profile
• Only 11% of GPU time is used to add vectors
• The arithmetic intensity of the problem is too low
• Overlapping data copy and computation could help (sketched below)
Operation    | % GPU time
vector_add   | 11%
memcpyHtoD   | 58%
memcpyDtoH   | 32%
Hardware: NVIDIA NVS 3100M
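A sketch of how the copies and the computation could be overlapped with CUDA streams. It assumes the host buffers were allocated with cudaMallocHost (pinned memory, so asynchronous copies can proceed) and uses a block-indexed variant of the kernel, since the slide's kernel only handles a single block; chunk sizes are illustrative:

// Block-indexed variant of vector_add (the slide's version indexes with
// threadIdx.x only and is limited to one block)
__global__ void vector_add_n(float* a, float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

// Pipeline chunks through two streams: while one stream's kernel runs,
// the other stream's copies can use the DMA engine.
const int CHUNK = 1 << 16;
cudaStream_t stream[2];
cudaStreamCreate(&stream[0]);
cudaStreamCreate(&stream[1]);

for (int off = 0, s = 0; off < N; off += CHUNK, s ^= 1) {
    int n = (N - off < CHUNK) ? (N - off) : CHUNK;
    size_t bytes = n * sizeof(float);
    cudaMemcpyAsync(a_d + off, a_h + off, bytes, cudaMemcpyHostToDevice, stream[s]);
    cudaMemcpyAsync(b_d + off, b_h + off, bytes, cudaMemcpyHostToDevice, stream[s]);
    vector_add_n<<<(n + 255) / 256, 256, 0, stream[s]>>>(a_d + off, b_d + off, c_d + off, n);
    cudaMemcpyAsync(c_h + off, c_d + off, bytes, cudaMemcpyDeviceToHost, stream[s]);
}
cudaDeviceSynchronize();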
19
Will GPUs Save the World?
• Careful optimization of both CPU and GPU code reduces the performance difference between GPUs and CPUs substantially [Lee et al., ISCA 2010]
• GPGPU has provided nice speedups for problems that fit the architecture
• Metric challenge: The practitioner needs performance per developer hour
20
GPU ARCHITECTURE
21
NVIDIA Tesla Architecture
Figure reproduced from [Lindholm et al.; 2008]
22
Control Flow
• The threads in a warp all execute the same instruction
• Branching is efficient if all threads in a warp branch in the same direction
• Divergent branches within a warp cause serial execution of both paths (contrasted in the sketch below)
[Diagram: at a divergent IF, the threads for which the condition is true execute first while the false-condition threads are masked off, then the roles swap]
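Two illustrative kernels contrasting the cases (hypothetical code; warp size assumed to be 32):

// Divergent: even and odd threads in the same warp take different paths,
// so the hardware runs both paths serially with threads masked off.
__global__ void divergent(float* d) {
    int i = threadIdx.x;
    if (i % 2 == 0) d[i] *= 2.0f;   // even lanes
    else            d[i] += 1.0f;   // odd lanes of the same warp
}

// Uniform: the branch condition is constant within each 32-thread warp,
// so every warp takes exactly one path and nothing is serialized.
__global__ void uniform(float* d) {
    int i = threadIdx.x;
    if ((i / 32) % 2 == 0) d[i] *= 2.0f;  // even-numbered warps
    else                   d[i] += 1.0f;  // odd-numbered warps
}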
23
Modern DRAM Interfaces
• Maximize bandwidth with 3D organization
• Repeated requests to the row buffer are very efficient
[Diagram: DRAM is organized as multiple banks, each a 2D array of rows and columns; the row address loads a row into the Row Buffer, and the column address selects a word within it]
24
Access Coalescing
• Global memory accesses from all threads in a half-warp are combined into a single memory transaction
• All memory elements in a segment are accessed
• Segment size can be halved if only the lower or upper half is used
Assumes Compute Capability 1.2 or higher
[Diagram: Threads 0-7 access consecutive words at addresses 128-156 and are served by a single transaction; accesses to addresses 112-124, which fall outside that segment, require a second transaction]
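Illustrative kernels contrasting a coalesced and a strided global memory access pattern (hypothetical code):

// Coalesced: thread k of a half-warp touches the k-th consecutive word,
// so all 16 accesses fall in one aligned segment and need one transaction.
__global__ void copy_coalesced(const float* in, float* out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i];
}

// Strided: the half-warp's accesses are spread over many segments, so the
// hardware must issue many transactions and most fetched bytes are wasted.
__global__ void copy_strided(const float* in, float* out, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i * stride];
}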
25
Bank Conflicts
• Memory banks can service requests independently
• Bank conflict: more than one thread accesses a bank concurrently
• Strided access patterns can cause bank conflicts (see the sketch below)
[Diagram: Threads 0-7 mapped onto memory Banks 0-7]
Stride-two accesses give a 2-way bank conflict
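The classic remedy is to pad shared memory so strided or column-wise accesses spread across the banks. A sketch based on the well-known padded-tile transpose pattern (assumes a square matrix whose side is a multiple of 16; this is illustrative code, not from the slides):

__global__ void transpose(const float* in, float* out, int width) {
    // The +1 padding column shifts each row of the tile by one bank, so
    // reading a tile column no longer maps all threads to the same bank.
    __shared__ float tile[16][16 + 1];

    int x = blockIdx.x * 16 + threadIdx.x;
    int y = blockIdx.y * 16 + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * width + x];  // coalesced load
    __syncthreads();

    // Write the transposed tile; the column read of `tile` is conflict-free
    // thanks to the padding.
    x = blockIdx.y * 16 + threadIdx.x;
    y = blockIdx.x * 16 + threadIdx.y;
    out[y * width + x] = tile[threadIdx.x][threadIdx.y];
}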
26
NVIDIA Fermi
• Next-generation computing chip from NVIDIA
• Aims to alleviate important bottlenecks:
  – Improved double-precision floating-point support
  – Cache hierarchy
  – Concurrent kernel execution
• More problems can be solved efficiently on a GPU
Figure reproduced from [NVIDIA; 2010]
27
ACHIEVING HIGH GPU PERFORMANCE
28
Which problems fit the GPU model?
• Fine-grained data parallelism available
• Sufficient arithmetic intensity
• Sufficiently regular data access patterns
It’s all about organizing data
Optimized memory system use enables high performance
29
Increase Computational Intensity
• Memory types:
  – On-chip shared memory: small and fast
  – Off-chip global memory: large and slow
• Technique: tiling (sketched below)
  – Choose the tile size such that a tile fits in shared memory
  – Increases locality by reducing the reuse distance
A × B = C
[Diagram: tiles of A and B are reused across the computation of a tile of C]
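A sketch of the tiling technique for the matrix multiplication in the figure (assumes square N×N matrices with N a multiple of the tile size; illustrative code):

#define TILE 16

__global__ void matmul_tiled(const float* A, const float* B, float* C, int N) {
    // Each block computes one TILE x TILE tile of C. Staging the needed
    // tiles of A and B in shared memory means every loaded element is
    // reused TILE times, raising the arithmetic intensity.
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < N / TILE; t++) {
        As[threadIdx.y][threadIdx.x] = A[row * N + (t * TILE + threadIdx.x)];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();                       // wait until both tiles are loaded
        for (int k = 0; k < TILE; k++)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();                       // wait before overwriting the tiles
    }
    C[row * N + col] = acc;
}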
30
Memory Layout
• Exploit coalescing to achieve high bandwidth
• Linear (unit-stride) access is necessary
• Solution: Tiling
A × B = C (assume row-major storage)
[Diagram: row-wise accesses are coalesced, column-wise accesses are not]
31
Avoid Branching Inside Warps
Assume 2 threads per warp
[Diagram: two reduction trees summing 8 elements across warps W1-W4. In the first, all iterations diverge; in the second, only one iteration diverges]
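A sketch of the two addressing schemes for a block-level sum reduction (256 threads per block assumed; illustrative code):

__global__ void reduce_sum(const float* in, float* out) {
    __shared__ float s[256];
    int t = threadIdx.x;
    s[t] = in[blockIdx.x * 256 + t];
    __syncthreads();

    // Divergent scheme: the active threads (t % (2*stride) == 0) are spread
    // out, so warps contain both active and inactive threads at every step:
    //   for (int stride = 1; stride < 256; stride *= 2) {
    //       if (t % (2 * stride) == 0) s[t] += s[t + stride];
    //       __syncthreads();
    //   }

    // Contiguous scheme: the active threads are always a contiguous prefix,
    // so whole warps retire together and only the final steps diverge.
    for (int stride = 128; stride > 0; stride >>= 1) {
        if (t < stride) s[t] += s[t + stride];
        __syncthreads();
    }
    if (t == 0) out[blockIdx.x] = s[0];
}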
32
Automation
• Thread resource usage must be balanced against the number of concurrent threads [Ryoo et al.; PPoPP 2008] (a brute-force sweep is sketched below)
  – Avoid saturation
  – The sweet spot varies between devices
  – The sweet spot varies with problem size
• Auto-tuning a 3D FFT [Nukada et al.; SC 2009]
  – Balances resource consumption against parallelism through kernel radix and ordering
  – The best number of thread blocks is chosen automatically
  – Inserts padding to avoid shared memory bank conflicts
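A brute-force flavor of such tuning can be sketched with CUDA events; my_kernel, data_d, and N are hypothetical placeholders, not from the cited papers:

// Time a hypothetical kernel at several block sizes and keep the fastest.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

float best_ms = 1e30f;
int best_threads = 0;
for (int threads = 64; threads <= 512; threads += 64) {
    int blocks = (N + threads - 1) / threads;
    cudaEventRecord(start);
    my_kernel<<<blocks, threads>>>(data_d, N);   // hypothetical kernel
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms;
    cudaEventElapsedTime(&ms, start, stop);
    if (ms < best_ms) { best_ms = ms; best_threads = threads; }
}
// best_threads now holds the empirically fastest configuration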
33
Case Study: Molecular Dynamics Simulation with NAMD
Simulate the interaction of atoms according to the laws of atomic physics and quantum chemistry [Phillips; SC 2009]
34
Key Performance Enablers
• Careful division of labor between GPU and CPU:
  – GPU: short-range non-bonded forces
  – CPU: long-range electrostatic forces and coordinate updates
• Overlap CPU and GPU execution through asynchronous kernel execution
• Use event recording to track progress in asynchronously executing streams (sketched below)
[Phillips et al., SC2008]
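A sketch of the overlap pattern just described, not NAMD's actual code; the kernel, function, and variable names are hypothetical:

// Launch the short-range force kernel asynchronously, mark its completion
// with an event, and keep the CPU busy until the event reports done.
cudaStream_t stream;
cudaStreamCreate(&stream);
cudaEvent_t forces_done;
cudaEventCreate(&forces_done);

short_range_forces<<<grid, block, 0, stream>>>(atoms_d, forces_d);  // GPU work
cudaEventRecord(forces_done, stream);      // marker after the kernel in the stream

compute_long_range_electrostatics();       // CPU work runs concurrently
while (cudaEventQuery(forces_done) == cudaErrorNotReady) {
    do_more_cpu_work();                    // keep the CPU busy while polling
}
// GPU results are now ready to be combined with the CPU's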
35
CPU/GPU Cooperation in NAMD
[Phillips et al., SC2008]
[Timeline diagram: CPU and GPU rows over time; the GPU computes remote and then local forces (f) while the CPU supplies coordinates (x) and performs the update step after the local work]
36
Challenges
• Completely restructuring legacy software systems is prohibitive
• Batch-processing software is unaware of GPUs
• Interoperability issues with pinning main memory pages for DMA
[Phillips et al., SC2008]
37
FUTURE TRENDS
38
Accelerator Integration
• The industry is moving towards integrating CPUs and GPUs on the same chip:
  – AMD Fusion [Brookwood; 2010]
  – Intel Sandy Bridge (fixed-function GPU)
• Are other accelerators appropriate?
  – Single-chip Heterogeneous Computing: Does the Future Include Custom Logic, FPGAs, and GPUs? [Chung et al.; MICRO 2010]
AMD Fusion (figure reproduced from [Brookwood; 2010])
39
Vector Addition Revisited
Start-up and shut-down data transfers are the main bottleneck
Fusion eliminates these overheads by storing values in the on-chip cache
Using accelerators becomes more feasible
40
Memory System Scalability
• Current CPU bottlenecks:
  – The number of pins on a chip grows slowly
  – Off-chip bandwidth grows slowly
• Integration only helps if there is sufficient on-chip cooperation to avoid a significant increase in off-chip bandwidth demand
• Conflicting requirements:
  – GPU: high bandwidth, not latency sensitive
  – CPU: high bandwidth, can be latency sensitive
41
CONCLUSIONS
42
Conclusions
• GPUs can offer a significant speedup for problems that fit the model
• Tool support and flexible architectures increase the number of problems that fit the model
• CPU/GPU on-chip integration can reduce GPU start-up overheads
43
Thank You
Visit our website: http://research.idi.ntnu.no/multicore/
44
References
• Debunking the 100X GPU vs. CPU Myth: An Evaluation of Throughput Computing on CPU and GPU; Lee et al.; ISCA; 2010
• Programming Massively Parallel Processors; Kirk and Hwu; Morgan Kaufmann; 2010
• NVIDIA's Next Generation CUDA Compute Architecture: Fermi; White Paper; NVIDIA; 2010
• AMD Fusion Family of APUs: Enabling a Superior Immersive PC Experience; White Paper; AMD; 2010
• Multi-core Programming with OpenCL: Performance and Portability; Fagerlund; Master's Thesis; NTNU; 2010
• Complexity-Effective Memory Access Scheduling for Many-Core Accelerator Architectures; Yuan et al.; MICRO; 2009
• Auto-Tuning 3-D FFT Library for CUDA GPUs; Nukada and Matsuoka; SC; 2009
• Programming Graphics Processing Units (GPUs); Bakke; Master's Thesis; NTNU; 2009
• Adapting a Message-Driven Parallel Application to GPU-Accelerated Clusters; Phillips et al.; SC; 2008
• Rise of the Graphics Processor; Blythe; Proceedings of the IEEE; 2008
• NVIDIA Tesla: A Unified Graphics and Computing Architecture; Lindholm et al.; IEEE Micro; 2008
• Optimization Principles and Application Performance Evaluation of a Multithreaded GPU using CUDA; Ryoo et al.; PPoPP; 2008
45
EXTRA SLIDES
46
Complexity-Effective Memory Access Scheduling
• The on-chip interconnect may interleave requests from different thread processors
• Row locality is destroyed
• Solution: an order-preserving interconnect arbitration policy combined with in-order scheduling
[Yuan et al.; MICRO 2009]
[Diagram: a request queue holding Req 0 Row A, Req 0 Row B, Req 1 Row A, Req 1 Row B. Naive in-order scheduling switches rows before nearly every request; out-of-order scheduling groups the requests to the same row and needs fewer row switches. The proposed scheme achieves the performance of out-of-order scheduling with less complex in-order scheduling]