GPUs and Accelerators Jonathan Coens Lawrence Tan Yanlin Li


Post on 12-Feb-2016



TRANSCRIPT

Page 1: GPUs and Accelerators

GPUs and Accelerators

Jonathan Coens
Lawrence Tan
Yanlin Li

Page 2: GPUs and Accelerators

Outline
• Graphics Processing Units
  o Features
  o Motivation
  o Challenges
• Accelerator
  o Methodology
  o Performance Evaluation
  o Discussion
• Rigel
  o Methodology
  o Performance Evaluation
  o Discussion
• Conclusion

Page 3: GPUs and Accelerators

Graphics Processing Units (GPU)
• GPU
  o Special purpose processors designed to render 3D scenes
  o In almost every desktop today
• Features
  o Highly parallel processors
  o Better floating point performance than CPUs
    ATI Radeon x1900 - 250 Gflops
• Motivation
  o Use GPUs for general purpose programming
• Challenges
  o Difficult for programmer to program
  o Trade off between programmability and performance

[Figure: GeForce 6600GT (NV43) GPU]

Page 4: GPUs and Accelerators

Accelerator: Using Data Parallelism to Program GPUs for General Purpose Uses

• Methodology
  o Data parallelism to program GPU (SIMD)
  o Parallel Array C# object
  o No aspects of GPU are exposed to the programmer
  o Programmer only needs to know how to use the Parallel Array
  o Accelerator takes care of the conversion to pixel shader code
  o Parallel programs can be represented as DAGs

[Figures: Simplified block diagram for a GPU; Expression DAG with shader breaks marked]
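The idea that array operations are recorded as a DAG and only evaluated on demand can be sketched in a few lines. This is an illustration only: the class and function names below (`ParallelArray`, `Node`, `evaluate`) are hypothetical and are not the real Accelerator C# API, and the sketch evaluates on the CPU rather than generating pixel shader code.

```python
import operator
from dataclasses import dataclass

# Hypothetical illustration only: these names are NOT the real Accelerator API.
OPS = {"add": operator.add, "mul": operator.mul}


@dataclass(frozen=True)
class Node:
    """One vertex of the expression DAG."""
    op: str              # "const", "add", or "mul"
    inputs: tuple = ()   # child nodes (empty for constants)
    data: tuple = ()     # element values (constants only)


class ParallelArray:
    """Operators only record DAG nodes; nothing is computed until to_list()."""

    def __init__(self, values=(), node=None):
        self.node = node if node is not None else Node("const", data=tuple(values))

    def _binary(self, op, other):
        return ParallelArray(node=Node(op, (self.node, other.node)))

    def __add__(self, other):
        return self._binary("add", other)

    def __mul__(self, other):
        return self._binary("mul", other)

    def to_list(self):
        return evaluate(self.node)


def evaluate(node):
    """Walk the DAG and apply each operation element-wise.

    A real system would emit GPU code at this point; this sketch simply
    runs on the CPU and does not cache shared sub-expressions.
    """
    if node.op == "const":
        return list(node.data)
    left, right = (evaluate(child) for child in node.inputs)
    return [OPS[node.op](a, b) for a, b in zip(left, right)]


a = ParallelArray([1.0, 2.0, 3.0])
b = ParallelArray([4.0, 5.0, 6.0])
c = (a + b) * a        # builds a DAG; node `a` is shared by two parents
result = c.to_list()   # evaluation happens only here
```

The programmer-facing code at the bottom never mentions the GPU, which is the point the slide makes: the library sees the whole expression before it has to decide how to run it.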

Page 5: GPUs and Accelerators

Accelerator: Using Data Parallelism to Program GPUs for General Purpose Uses

Performance Evaluation

Performance of Accelerator versus hand-coded pixel shader programs on a GeForce 7800 GTX and an ATI x1800. Performance is shown as speedup relative to the C++ version of the programs.

Speedup of Accelerator programs on various GPUs compared to C++ programs running on a CPU.
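For reference, "speedup relative to the C++ version" is read here in the usual sense, as a ratio of execution times for the same program. The symbols below are our own notation, not taken from the slides:

```latex
S = \frac{T_{\text{C++ on CPU}}}{T_{\text{GPU}}}
\qquad\text{so } S > 1 \text{ means the GPU version is faster.}
```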

Page 6: GPUs and Accelerators

Rigel: 1024-core Accelerator

Specific Architecture
• SPMD programming model
• Global address space
• RISC instruction set
• Write-back cache
• Cores laid out in clusters of 8, each cluster with local cache
• Custom cores (optimized for space / power)
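A minimal sketch of what SPMD over a global address space looks like from the programmer's side: every core runs the same function and uses its core id to pick its share of one shared array. The core count and cluster size come from the slide; the function names and the sum-of-squares workload are assumptions made for illustration, and the cores are simulated sequentially.

```python
CORES_PER_CLUSTER = 8   # from the slide
TOTAL_CORES = 1024      # from the slide


def cluster_of(core_id):
    """Map a core id to its cluster (hypothetical numbering scheme)."""
    return core_id // CORES_PER_CLUSTER


def spmd_kernel(core_id, num_cores, data):
    """The single program every core runs; core_id selects its slice.

    `data` stands in for the global address space: all cores index the
    same array instead of receiving private copies.
    """
    chunk = (len(data) + num_cores - 1) // num_cores   # ceiling division
    start = core_id * chunk
    return sum(x * x for x in data[start:start + chunk])


def run_spmd(data, num_cores=TOTAL_CORES):
    """Simulate the cores one after another and combine their partial sums."""
    return sum(spmd_kernel(core, num_cores, data) for core in range(num_cores))


total = run_spmd(list(range(10)), num_cores=4)
```

Unlike the SIMD model on the GPU slides, each simulated core here could take its own branches inside `spmd_kernel`; only the program text is shared.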

Hierarchical Task Queueing
• Single queue from programmer's perspective
• Architecture handles distributing tasks
• Customizable via API
  o Task granularity
  o Static vs. dynamic scheduling
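One way to picture "a single queue from the programmer's perspective" backed by a hierarchy is a global queue that feeds per-cluster local queues in batches. The sketch below is single-threaded and entirely hypothetical: the class, its methods, and the `batch_size` knob are illustrative assumptions, not Rigel's API, and `batch_size` is not necessarily what the slide means by task granularity.

```python
from collections import deque


class HierarchicalTaskQueue:
    """Hypothetical two-level queue: one global queue, one local queue per cluster."""

    def __init__(self, num_clusters, batch_size):
        self.global_queue = deque()
        self.local_queues = [deque() for _ in range(num_clusters)]
        self.batch_size = batch_size   # tasks moved per global access (assumed knob)
        self.global_fetches = 0        # counts accesses to the shared queue

    def enqueue(self, task):
        """Programmer-facing call: looks like a single queue."""
        self.global_queue.append(task)

    def dequeue(self, cluster_id):
        """Serve from the cluster's local queue, refilling it in a batch when empty."""
        local = self.local_queues[cluster_id]
        if not local and self.global_queue:
            self.global_fetches += 1
            for _ in range(min(self.batch_size, len(self.global_queue))):
                local.append(self.global_queue.popleft())
        return local.popleft() if local else None


queue = HierarchicalTaskQueue(num_clusters=2, batch_size=4)
for task in range(8):
    queue.enqueue(task)

first = queue.dequeue(0)    # cluster 0 pulls tasks 0..3 locally, returns 0
second = queue.dequeue(1)   # cluster 1 pulls tasks 4..7 locally, returns 4
third = queue.dequeue(0)    # served from cluster 0's local queue
```

Larger batches mean fewer trips to the shared queue but a coarser distribution of work across clusters, which is the kind of trade-off a granularity setting would control.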

Page 7: GPUs and Accelerators

Rigel's Performance

Fairly Successful
• Achieved speedup utilizing all 1024 cores
• Hierarchical task structure effectively scaled to 1024

Issues
• Cache coherence!
  o Memory invalidate broadcasts slow system down (barrier flags, task enqueue / dequeue variables)
  o Not done in hardware... Lazy-evaluation write-through barriers at cluster level
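To make the contention problem concrete, here is a sketch of a two-level barrier in which cores first synchronize inside their cluster and only one representative per cluster touches the shared, global barrier. This is an assumption about how cluster-level synchronization could reduce traffic on shared flags; it is not a description of Rigel's actual mechanism, and it uses Python threads purely as stand-ins for cores.

```python
import threading


class HierarchicalBarrier:
    """Hypothetical two-level barrier: per-cluster barriers plus one global barrier."""

    def __init__(self, num_clusters, cores_per_cluster):
        self.local = [threading.Barrier(cores_per_cluster) for _ in range(num_clusters)]
        self.global_barrier = threading.Barrier(num_clusters)

    def wait(self, cluster_id):
        local = self.local[cluster_id]
        # Step 1: gather inside the cluster. Barrier.wait() hands each thread
        # a distinct index, so exactly one thread per cluster sees 0.
        if local.wait() == 0:
            # Step 2: only that representative uses the shared global barrier.
            self.global_barrier.wait()
        # Step 3: the rest of the cluster is released once its representative returns.
        local.wait()


def run_demo(num_clusters=2, cores_per_cluster=4):
    """Return, for each thread, how many arrivals it observed after the barrier."""
    barrier = HierarchicalBarrier(num_clusters, cores_per_cluster)
    arrived, seen = [], []
    lock = threading.Lock()

    def worker(cluster_id):
        with lock:
            arrived.append(cluster_id)
        barrier.wait(cluster_id)
        with lock:
            seen.append(len(arrived))

    threads = [
        threading.Thread(target=worker, args=(cluster,))
        for cluster in range(num_clusters)
        for _ in range(cores_per_cluster)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return seen


observed = run_demo()
```

With 128 clusters of 8 cores, a scheme like this would have 128 participants on the global barrier instead of 1024, at the cost of an extra local synchronization step.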

Page 8: GPUs and Accelerators

Improving Rigel

1. Will the hierarchical task structure continue to scale? If not, where will the limit be? (Think multiple cache levels but with processor tasks)

2. How could we implement barriers or queues to avoid contention, but still scale? (Is memory managed cache coherence appropriate?)

3. Is specialized hardware the way to go (clusters of 8 custom cores), or can this be replaced by general purpose cores?

Page 9: GPUs and Accelerators

Generic and Custom Accelerators

• Difficult to make generic enough programming interface between programmer and multi-core system
  o GPUs are limited by SIMD programming model
  o Specific hardware platforms still have issues for SPMD
• Efficiently scaling for more cores is still an issue

How do we solve these issues?