Larrabee: A many-core GPU architecture.

Post on 19-Dec-2015



GPUs vs CPUs: Price, performance, and evolution.

Definitions

CPU (Central Processing Unit) – general-purpose processor able to execute computer programs.

GPU (Graphics Processing Unit) – dedicated graphics rendering device.

Price and Performance

The NVIDIA GeForce 6800 Ultra can reach 40 GFLOPS, whereas a 3 GHz Intel Pentium 4 reaches only 6 GFLOPS. [1]

More impressive still, current cards such as the ATI Radeon HD 5870, AMD FireStream 9250, and NVIDIA GeForce 9800 run at between 1 and 3 TFLOPS.

Reasons for this include highly parallel vector processing, fast onboard memory, and a constrained pipeline that streams data without stalls.

Evolution

GPU performance has approximately doubled every 6 months since the mid-1990s.

CPU performance doubles every 18 months on average (the figure commonly attributed to Moore’s law).
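To see what these two doubling periods imply, the sketch below compounds both rates over the same time span. The function name `growth_factor` and the 36-month horizon are illustrative assumptions; the doubling periods are simply the figures quoted above, not measurements.

```python
def growth_factor(months: float, doubling_period_months: float) -> float:
    """Return the performance multiple after `months` of steady doubling."""
    return 2.0 ** (months / doubling_period_months)


# Doubling periods quoted in the text, in months.
GPU_DOUBLING = 6
CPU_DOUBLING = 18

horizon = 36  # illustrative horizon (assumption), in months
gpu = growth_factor(horizon, GPU_DOUBLING)  # six doublings
cpu = growth_factor(horizon, CPU_DOUBLING)  # two doublings
print(f"After {horizon} months: GPU x{gpu:.0f}, CPU x{cpu:.0f}, gap x{gpu / cpu:.0f}")
# prints "After 36 months: GPU x64, CPU x4, gap x16"
```

Under these assumed rates the gap itself keeps widening: every 18 months the GPU gains a further factor of four over the CPU.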

Current trends: How we use GPUs.

Alternative applications

GPUs are increasingly used for scientific computing with data-parallel algorithms. Examples include:

Clustering

GPU clustering to simulate the dispersion of airborne contaminants in New York City.

Image Stitching

Fast seamless stitching and tone-mapping of gigapixel images. (~1 hour on a notebook PC)

Molecular Dynamics

Molecular dynamics to evaluate forces between atoms that do not share bonds.
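What makes the molecular dynamics case data-parallel is that the force on each atom can be computed independently of the others. The following is a minimal pure-Python sketch of that structure, not the method used in any cited work: the Lennard-Jones potential, the parameters `epsilon` and `sigma`, and the names `lj_force_on` and `all_forces` are assumptions chosen for illustration.

```python
def lj_force_on(i, positions, bonded, epsilon=1.0, sigma=1.0):
    """Net force on atom i from every atom it does not share a bond with.

    Assumes a Lennard-Jones potential U(r) = 4*eps*((s/r)**12 - (s/r)**6).
    """
    xi, yi, zi = positions[i]
    fx = fy = fz = 0.0
    for j, (xj, yj, zj) in enumerate(positions):
        if j == i or frozenset((i, j)) in bonded:
            continue  # skip the atom itself and its bonded partners
        dx, dy, dz = xi - xj, yi - yj, zi - zj
        r2 = dx * dx + dy * dy + dz * dz
        s6 = (sigma * sigma / r2) ** 3
        # -dU/dr divided by r, so it scales the displacement vector directly.
        scale = 24.0 * epsilon * (2.0 * s6 * s6 - s6) / r2
        fx += scale * dx
        fy += scale * dy
        fz += scale * dz
    return (fx, fy, fz)


def all_forces(positions, bonded):
    """Each list element is independent: on a GPU, one thread per atom."""
    return [lj_force_on(i, positions, bonded) for i in range(len(positions))]


positions = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (0.0, 2.0, 0.0)]
bonded = {frozenset((0, 1))}  # atoms 0 and 1 share a bond, so that pair is skipped
forces = all_forces(positions, bonded)
```

The list comprehension in `all_forces` is the part a GPU would spread across its processing elements, since no iteration reads another iteration's result.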

Architecture: How it is built.

Key differences

TYPICAL GPU

Ordered sequence of rendering steps. Fixed hardware dedicated to each step.

LARRABEE

Runs most of its rendering pipeline in software on multiple general-purpose x86 cores.

This allows the rendering pipeline to be reconfigured dynamically. Hence, we are able to skip steps or allocate extra resources when required.
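The idea of a pipeline that lives in software can be sketched as an ordinary list of stage functions: reconfiguring the pipeline is then just editing the list. The stage names and the helpers `make_stage` and `run_pipeline` below are hypothetical and do not describe Larrabee's actual renderer.

```python
def make_stage(name):
    """Build a placeholder stage that records its own name in the trace."""
    def stage(trace):
        return trace + [name]
    return stage


# Hypothetical stage names, used only for illustration.
STAGE_NAMES = ("vertex", "geometry", "rasterize", "pixel", "blend")
STAGES = {name: make_stage(name) for name in STAGE_NAMES}


def run_pipeline(order, stages=STAGES):
    """Run the named stages in the given order and return the trace."""
    trace = []
    for name in order:
        trace = stages[name](trace)
    return trace


full = list(STAGE_NAMES)
# A frame that needs neither geometry processing nor blending skips both steps.
lean = [name for name in full if name not in ("geometry", "blend")]

print(run_pipeline(full))  # ['vertex', 'geometry', 'rasterize', 'pixel', 'blend']
print(run_pipeline(lean))  # ['vertex', 'rasterize', 'pixel']
```

In fixed-function hardware the equivalent of `full` is wired into the chip; here the order is plain data that can change from frame to frame.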

Larrabee CPU Core

The Larrabee core is “derived” from the Pentium processor.

One scalar unit for single-value operations and one vector unit that applies an operation to multiple data elements at once.

32KB L1 data cache and 32KB L1 instruction cache.

256KB L2 cache per core; the L2 caches are connected by a ring network.

Details

The 32KB L1 cache is four times larger than the 8KB L1 cache of the original Pentium.

The larger cache supports four-way multithreading: each core holds four hardware threads so that it can switch between them with little overhead. (Not to be confused with simultaneous multithreading.)

The 256KB L2 caches are connected by a ring network. If a core is unable to find data in its own L2 cache, it places a request on the ring network; the data is located elsewhere (for example, in another core's L2 cache) and is eventually brought into the requesting core's L2.
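The lookup described above can be sketched as a toy model. Several things here are simplifying assumptions rather than Larrabee's actual protocol: each L2 is a plain dictionary, the request travels in one direction only, cache coherence and eviction are ignored, and the name `ring_lookup` is hypothetical.

```python
def ring_lookup(l2_caches, requester, address, memory):
    """Return (value, source) for a read issued by core `requester`.

    l2_caches: one dict per core, in ring order (toy model of the L2 caches).
    memory:    dict standing in for main memory.
    """
    local = l2_caches[requester]
    if address in local:
        return local[address], "local L2"

    n = len(l2_caches)
    # Visit the other cores in ring order, starting with the next neighbour.
    for step in range(1, n):
        core = (requester + step) % n
        if address in l2_caches[core]:
            value, source = l2_caches[core][address], f"core {core} L2"
            break
    else:
        # No core on the ring holds the line, so fall back to memory.
        value, source = memory[address], "memory"

    local[address] = value  # the data ends up in the requester's own L2
    return value, source


memory = {0x40: "A", 0x80: "B"}
l2 = [{}, {}, {0x40: "A"}, {}]  # four cores; only core 2 holds address 0x40

print(ring_lookup(l2, 0, 0x40, memory))  # ('A', 'core 2 L2')
print(ring_lookup(l2, 0, 0x40, memory))  # ('A', 'local L2')
print(ring_lookup(l2, 1, 0x80, memory))  # ('B', 'memory')
```

The second call hits locally because the first call copied the data into core 0's L2, which is the "eventually find the data in its L2" behaviour in miniature.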

Larrabee uses a rendering technique called binning, which divides the screen into regions, assigns each polygon to the regions it overlaps, and renders the polygons region by region.
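A minimal sketch of the binning step follows, assuming square tiles, integer pixel coordinates, and a conservative bounding-box overlap test. The 64-pixel tile size and the name `bin_triangles` are illustrative assumptions, not details taken from the source.

```python
def bin_triangles(triangles, width, height, tile=64):
    """Map each screen tile (tx, ty) to the indices of triangles that may touch it.

    triangles: list of three (x, y) integer pixel coordinates each.
    A triangle is binned by its bounding box, so a tile may list a triangle
    that only comes near it (conservative, never misses an overlap).
    """
    cols = (width + tile - 1) // tile
    rows = (height + tile - 1) // tile
    bins = {(tx, ty): [] for ty in range(rows) for tx in range(cols)}

    for index, tri in enumerate(triangles):
        xs = [x for x, _ in tri]
        ys = [y for _, y in tri]
        # Clamp the bounding box to the screen, then convert to tile indices.
        tx0 = max(0, min(xs)) // tile
        tx1 = min(width - 1, max(xs)) // tile
        ty0 = max(0, min(ys)) // tile
        ty1 = min(height - 1, max(ys)) // tile
        for ty in range(ty0, ty1 + 1):
            for tx in range(tx0, tx1 + 1):
                bins[(tx, ty)].append(index)
    return bins


triangles = [
    [(10, 10), (50, 20), (30, 60)],     # fits inside the top-left tile
    [(60, 60), (130, 70), (70, 130)],   # spans several tiles, partly off-screen
]
bins = bin_triangles(triangles, width=256, height=128)
print(bins[(0, 0)])  # [0, 1]
print(bins[(3, 0)])  # []
```

Once the bins are filled, each region can be rendered on its own from its list of triangles, so separate regions can be handed to separate cores.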

Benefits of Larrabee

Game physics

Real-time ray tracing

Image and video processing

Physical simulation

Extended rendering capabilities

References

[1] Z. Fan, F. Qiu, A. Kaufman, S. Yoakum-Stover. 2004. GPU Cluster for High Performance Computing. ACM/IEEE Supercomputing Conference 2004, November 6-12, Pittsburgh, PA.

[2] L. Seiler et al. 2008. Larrabee: A Many-Core x86 Architecture for Visual Computing. ACM Transactions on Graphics, vol. 27, no. 3, Article 18, August 2008.