Exploiting Parallelism on GPUs
Matt Mukerjee, David Naylor


Post on 22-Feb-2016



TRANSCRIPT

Page 1: Exploiting Parallelism on GPUs

Exploiting Parallelism on GPUs

Matt Mukerjee, David Naylor

Page 2

Parallelism on GPUs

• $100 NVIDIA video card → 192 cores
  – (Build Blacklight for ~$2000???)
• Incredibly low power
• Ubiquitous
• Question: Use for general computation?
  – General Purpose GPU (GPGPU)


Page 3

GPU Hardware

• Very specific constraints
  – Designed to be SIMD (e.g. shaders)
  – Zero-overhead thread scheduling
  – Little caching (compared to CPUs)
• Constantly stalled on memory access
• MASSIVE # of threads / core
• Much finer-grained threads ("kernels")

Page 4

CUDA Architecture

Page 5

Thread Blocks

• GPUs are SIMD
• How does multithreading work?
• Threads that branch are halted, then run
• Single Instruction Multiple…?

Page 6

CUDA is an SIMT architecture

• Single Instruction Multiple Thread
• Threads in a block execute the same instruction

(Figure: multi-threaded instruction unit)

Page 7

Observation: Fitting the data structures needed by the threads in one multiprocessor requires application-specific tuning.

Page 8

Example: MapReduce on CUDA

Too big for cache on one SM!

Page 9

Problem: Only one code branch within a block executes at a time.

Page 10

Enhancing SIMT

Page 11

Problem: If two multiprocessors share a cache line, there are more memory accesses than necessary.

Page 12

Data Reordering