GPGPUs - Data Parallel Accelerators (MSc)

TRANSCRIPT

  • 8/8/2019 GPUs DP Accelerators MSC

    1/166

GPGPUs - Data Parallel Accelerators

    Dezső Sima

    Oct. 20, 2009

    Dezső Sima 2009, Ver. 1.0

  • 8/8/2019 GPUs DP Accelerators MSC

    2/166

Contents

    1. Introduction

    2. Basics of the SIMT execution

    3. Overview of GPGPUs

    4. Overview of data parallel accelerators

    5. Microarchitecture of GPGPUs (examples)

    5.1 AMD/ATI RV870 (Cypress)

    5.2 Nvidia Fermi

    5.3 Intel's Larrabee

    6. References

  • 8/8/2019 GPUs DP Accelerators MSC

    3/166

    1. The emergence of GPGPUs

  • 8/8/2019 GPUs DP Accelerators MSC

    4/166

Representation of objects by triangles (vertices, edges, surfaces)

    Vertices

    have three spatial coordinates and carry supplementary information necessary to render the object, such as

    color, texture,

    reflectance properties etc.

    1. Introduction (1)

  • 8/8/2019 GPUs DP Accelerators MSC

    5/166

    Main types of shaders in GPUs

Shaders

    Vertex shaders: transform each vertex's 3D position in the virtual space to the 2D coordinate at which it appears on the screen.

    Pixel shaders (fragment shaders): calculate the color of the pixels.

    Geometry shaders: can add or remove vertices from a mesh.

    1. Introduction (2)

  • 8/8/2019 GPUs DP Accelerators MSC

    6/166

    DirectX version Pixel SM Vertex SM Supporting OS

    8.0 (11/2000) 1.0, 1.1 1.0, 1.1 Windows 2000

8.1 (10/2001) 1.2, 1.3, 1.4 1.0, 1.1 Windows XP / Windows Server 2003

    9.0 (12/2002) 2.0 2.0

    9.0a (3/2003) 2_A, 2_B 2.x

    9.0c (8/2004) 3.0 3.0 Windows XP SP2

    10.0 (11/2006) 4.0 4.0 Windows Vista

    10.1 (2/2008) 4.1 4.1 Windows Vista SP1/Windows Server 2008

    11 (in development) 5.0 5.0

Table: Pixel/vertex shader models (SM) supported by subsequent versions of DirectX and MS's OSs [18], [21]

    1. Introduction (3)

  • 8/8/2019 GPUs DP Accelerators MSC

    7/166

    Convergence of important features of the vertex and pixel shader models

Subsequent shader models typically introduce a number of new/enhanced features.

    Shader model 2 [19]

    Different precision requirements

Vertex shader: FP32 (coordinates); Pixel shader: FX24 (3 colors x 8)

    Different instructions

    Different resources (e.g. registers)

Differences between the vertex and pixel shader models in subsequent shader models concerning precision requirements, instruction sets and programming resources.

    Shader model 3 [19]

Unified precision requirements for both shaders (FP32), with the option to specify partial precision (FP16 or FP24) by adding a modifier to the shader code

    Different instructions

    Different resources (e.g. registers)

    1. Introduction (4)

  • 8/8/2019 GPUs DP Accelerators MSC

    8/166

    Shader model 4 (introduced with DirectX10) [20]

Unified precision requirements for both shaders (FP32), with the possibility to use new data formats.

    Unified instruction set

    Unified resources (e.g. temporary and constant registers)

    Shader architectures of GPUs prior to SM4

GPUs prior to SM4 (DirectX 10) have separate vertex and pixel units with different features.

    Drawback of having separate units for vertex and pixel shading

    Inefficiency of the hardware implementation

    (Vertex shaders and pixel shaders often have complementary load patterns [21]).

    1. Introduction (5)

  • 8/8/2019 GPUs DP Accelerators MSC

    9/166

    DirectX version Pixel SM Vertex SM Supporting OS

    8.0 (11/2000) 1.0, 1.1 1.0, 1.1 Windows 2000

8.1 (10/2001) 1.2, 1.3, 1.4 1.0, 1.1 Windows XP / Windows Server 2003

    9.0 (12/2002) 2.0 2.0

    9.0a (3/2003) 2_A, 2_B 2.x

    9.0c (8/2004) 3.0 3.0 Windows XP SP2

    10.0 (11/2006) 4.0 4.0 Windows Vista

    10.1 (2/2008) 4.1 4.1 Windows Vista SP1/Windows Server 2008

    11 (in development) 5.0 5.0

Table: Pixel/vertex shader models (SM) supported by subsequent versions of DirectX and MS's OSs [18], [21]

    1. Introduction (6)

  • 8/8/2019 GPUs DP Accelerators MSC

    10/166

    Unified shader model (introduced in the SM 4.0 of DirectX 10.0)

The same (programmable) processor can be used to implement all shaders:

    the vertex shader,

    the pixel shader and

    the geometry shader (new feature of SM 4).

    Unified, programmable shader architecture

    1. Introduction (7)

  • 8/8/2019 GPUs DP Accelerators MSC

    11/166

    Figure: Principle of the unified shader architecture [22]

    1. Introduction (8)

  • 8/8/2019 GPUs DP Accelerators MSC

    12/166

Based on its FP32 computing capability and the large number of FP units available, the unified shader is a prospective candidate for speeding up HPC!

GPUs with unified shader architectures are also termed

    GPGPUs

    (General Purpose GPUs)

    or

    cGPUs (computational GPUs)

    1. Introduction (9)

  • 8/8/2019 GPUs DP Accelerators MSC

    13/166

Figure: Peak SP FP performance of Nvidia's GPUs vs. Intel P4 and Core 2 processors [11]

    1. Introduction (10)


  • 8/8/2019 GPUs DP Accelerators MSC

    14/166

Figure: Bandwidth values of Nvidia's GPUs vs. Intel's P4 and Core 2 processors [11]

    1. Introduction (11)


  • 8/8/2019 GPUs DP Accelerators MSC

    15/166

    Figure: Contrasting the utilization of the silicon area in CPUs and GPUs [11]

    1. Introduction (12)

  • 8/8/2019 GPUs DP Accelerators MSC

    16/166

    2. Basics of the SIMT execution


  • 8/8/2019 GPUs DP Accelerators MSC

    17/166

    Main alternatives of data parallel execution

    Data parallel execution

SIMD execution: one-dimensional data parallel execution, i.e. it performs the same operation on all elements of given FX/FP input vectors. Needs an FX/FP SIMD extension of the ISA. E.g. 2nd and 3rd generation superscalars.

    SIMT execution: two-dimensional data parallel execution, i.e. it performs the same operation on all elements of given FX/FP input arrays (matrices); is massively multithreaded, and provides data dependent flow control as well as barrier synchronization. Needs an FX/FP SIMT extension of the ISA and the API. E.g. GPGPUs, data parallel accelerators.

    Figure: Main alternatives of data parallel execution

    2. Basics of the SIMT execution (1)


  • 8/8/2019 GPUs DP Accelerators MSC

    18/166

    Scalar execution SIMD execution SIMT execution

Domain of execution: single data elements

    Domain of execution: elements of vectors

    Domain of execution: elements of matrices

    (at the programming level)

    Figure: Domains of execution in case of scalar, SIMD and SIMT execution

    2. Basics of the SIMT execution (2)

    Remark

SIMT execution is also termed SPMD (Single Program Multiple Data) execution (Nvidia).

    Scalar, SIMD and SIMT execution

    2 Basics of the SIMT execution (3)

  • 8/8/2019 GPUs DP Accelerators MSC

    19/166

    Key components of the implementation of SIMT execution

    Data parallel execution Massive multithreading

    Data dependent flow control

    Barrier synchronization

    2. Basics of the SIMT execution (3)

    2 Basics of the SIMT execution (4)

  • 8/8/2019 GPUs DP Accelerators MSC

    20/166

Data parallel execution

    Performed by SIMT cores.

    SIMT cores execute the same instruction stream on a number of ALUs (i.e. all ALUs of a SIMT core typically perform the same operation).

    SIMT cores are the basic building blocks of GPGPUs or data parallel accelerators.

    During SIMT execution 2-dimensional matrices will be mapped to blocks of SIMT cores.

    Figure: Basic layout of a SIMT core (a Fetch/Decode unit feeding a row of ALUs)

    2. Basics of the SIMT execution (4)

  • 8/8/2019 GPUs DP Accelerators MSC

    21/166

Remark 1

    Different manufacturers designate SIMT cores differently, such as

    streaming multiprocessor (Nvidia),

    superscalar shader processor (AMD),

    wide SIMD processor, CPU core (Intel).

    2. Basics of the SIMT execution (5)

    2 Basics of the SIMT execution (6)

  • 8/8/2019 GPUs DP Accelerators MSC

    22/166

Each ALU is allocated a working register set (RF).

    Figure: Main functional blocks of a SIMT core (Fetch/Decode unit, ALUs and their register files (RF))

    2. Basics of the SIMT execution (6)

    2 Basics of the SIMT execution (7)

  • 8/8/2019 GPUs DP Accelerators MSC

    23/166

SIMT ALUs typically perform RRR operations, that is,

    ALUs take their operands from and write the calculated results to the register set (RF) allocated to them.

    Figure: Principle of operation of the SIMT ALUs

    2. Basics of the SIMT execution (7)

    2 Basics of the SIMT execution (8)

  • 8/8/2019 GPUs DP Accelerators MSC

    24/166

    Remark 2

Actually, the register sets (RF) allocated to each ALU are given parts of a large enough register file.

    Figure: Allocation of distinct parts of a large register set as workspaces of the ALUs

    2. Basics of the SIMT execution (8)

    2 Basics of the SIMT execution (9)

  • 8/8/2019 GPUs DP Accelerators MSC

    25/166

Basic operation of recent SIMT ALUs

    SIMT ALUs

    are pipelined, capable of starting a new operation every clock cycle (more precisely, every shader clock cycle),

    execute basically SP FP MADD (single precision, i.e. 32-bit, Multiply-Add) instructions of the form a×b+c,

    need a few clock cycles, e.g. 2 or 4 shader cycles, to present the results of the SP FP MADD operations to the RF.

    That is, without further enhancements

    their peak performance is 2 SP FP operations/cycle

    2. Basics of the SIMT execution (9)

    2 Basics of the SIMT execution (10)

  • 8/8/2019 GPUs DP Accelerators MSC

    26/166

    Additional operations provided by SIMT ALUs

FX operations and FX/FP conversions, DP FP operations,

    trigonometric functions (usually supported by special functional units).

    2. Basics of the SIMT execution (10)

    2 Basics of the SIMT execution (11)

  • 8/8/2019 GPUs DP Accelerators MSC

    27/166

Massive multithreading

    Aim of massive multithreading: to speed up computations by increasing the utilization of available computing resources in case of stalls (e.g. due to cache misses).

    Principle

    Suspend stalled threads from execution and allocate ready-to-run threads for execution.

    When a large enough number of threads are available, long stalls can be hidden.

    2. Basics of the SIMT execution (11)

    2 Basics of the SIMT execution (12)

  • 8/8/2019 GPUs DP Accelerators MSC

    28/166

    Multithreading is implemented by

creating and managing parallel executable threads for each data element of the execution domain.

    Figure: Parallel executable threads for each element of the execution domain

    Same instructions for all data elements

    2. Basics of the SIMT execution (12)

    2 Basics of the SIMT execution (13)

  • 8/8/2019 GPUs DP Accelerators MSC

    29/166

Effective implementation of multithreading:

    thread switches, called context switches, should not cause cycle penalties.

    Achieved by

    providing separate contexts (register space) for each thread, and

    implementing a zero-cycle context switch mechanism.

    2. Basics of the SIMT execution (13)

    2. Basics of the SIMT execution (14)

  • 8/8/2019 GPUs DP Accelerators MSC

    30/166

Figure: Providing separate thread contexts for each thread allocated for execution in a SIMT ALU (the register file (RF) of the SIMT core holds per-ALU context areas (CTX); a context switch selects the actual context)

    2. Basics of the SIMT execution (14)

    2. Basics of the SIMT execution (15)

  • 8/8/2019 GPUs DP Accelerators MSC

    31/166

    Data dependent flow control

    Implemented by SIMT branch processing

In SIMT processing both paths of a branch are executed one after the other, such that

    for each path the prescribed operations are executed only on those data elements which fulfill the data condition given for that path (e.g. xi > 0).

    Example

    2. Basics of the SIMT execution (15)
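As a minimal illustration (a hypothetical CUDA kernel, not taken from the slides), the fragment below produces exactly this situation: within a warp, the threads whose element fulfills x[i] > 0 execute the first path while the others are masked out, then the remaining threads execute the second path.

        __global__ void branchExample(const float *x, float *y)
        {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (x[i] > 0.0f)
                y[i] = x[i] * 2.0f;   // executed first, only by threads fulfilling the condition
            else
                y[i] = 0.0f;          // executed afterwards, only by the remaining threads
        }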

    2. Basics of the SIMT execution (16)

  • 8/8/2019 GPUs DP Accelerators MSC

    32/166

    Figure: Execution of branches [24]

    The given condition will be checked separately for each thread

    2. Basics of the SIMT execution (16)

    2. Basics of the SIMT execution (17)

  • 8/8/2019 GPUs DP Accelerators MSC

    33/166

    Figure: Execution of branches [24]

First all ALUs meeting the condition execute the prescribed three operations, then all ALUs missing the condition execute the next two operations.

    2. Basics of the SIMT execution (17)

    2. Basics of the SIMT execution (18)

  • 8/8/2019 GPUs DP Accelerators MSC

    34/166

    Figure: Resuming instruction stream processing after executing a branch [24]


    2. Basics of the SIMT execution (19)

  • 8/8/2019 GPUs DP Accelerators MSC

    35/166

    Barrier synchronization

Makes all threads wait until all prior instructions have completed before executing the next instruction.

    Implemented e.g. in AMD's Intermediate Language (IL) by the fence threads instruction [10].

    Remark

    In the R600 ISA this instruction is coded by setting the BARRIER field of the Control Flow (CF) instruction format [7].


    2. Basics of the SIMT execution (20)

  • 8/8/2019 GPUs DP Accelerators MSC

    36/166

Principle of SIMT execution

    Each kernel invocation executes a grid of thread blocks (Block(i,j)).

    Figure: Hierarchy of threads [25] (the host invokes kernel0(), kernel1(); each executes on the device as a grid of thread blocks)


  • 8/8/2019 GPUs DP Accelerators MSC

    37/166

    3. Overview of GPGPUs

    3. Overview of GPGPUs (1)

  • 8/8/2019 GPUs DP Accelerators MSC

    38/166

    Basic implementation alternatives of the SIMT execution

GPGPUs: programmable GPUs with appropriate programming environments; have display outputs. E.g. Nvidia's 8800 and GTX lines, AMD's HD 38xx, HD 48xx lines.

    Data parallel accelerators: dedicated units supporting data parallel execution with an appropriate programming environment; no display outputs, have larger memories than GPGPUs. E.g. Nvidia's Tesla lines, AMD's FireStream lines.

    Figure: Basic implementation alternatives of the SIMT execution


    3. Overview of GPGPUs (2)

  • 8/8/2019 GPUs DP Accelerators MSC

    39/166

    GPGPUs

Nvidia's line: 90 nm: G80 → (shrink) 65 nm: G92 → (enhanced arch.) G200 → (shrink, enhanced arch.) 40 nm: Fermi

    AMD/ATI's line: 80 nm: R600 → (shrink) 55 nm: RV670 → (enhanced arch.) RV770 → (shrink, enhanced arch.) 40 nm: RV870

    Figure: Overview of Nvidia's and AMD/ATI's GPGPU lines

  • 8/8/2019 GPUs DP Accelerators MSC

    40/166

    3. Overview of GPGPUs (4)

  • 8/8/2019 GPUs DP Accelerators MSC

    41/166

    8800 GTS 8800 GTX 8800 GT GTX 260 GTX 280

    Core G80 G80 G92 GT200 GT200

    Introduction 11/06 11/06 10/07 6/08 6/08

    IC technology 90 nm 90 nm 65 nm 65 nm 65 nm

    Nr. of transistors 681 mtrs 681 mtrs 754 mtrs 1400 mtrs 1400 mtrs

Die area 480 mm2 480 mm2 324 mm2 576 mm2 576 mm2

    Core frequency 500 MHz 575 MHz 600 MHz 576 MHz 602 MHz

    Computation

    No.of ALUs 96 128 112 192 240

    Shader frequency 1.2 GHz 1.35 GHz 1.512 GHz 1.242 GHz 1.296 GHz

    No. FP32 inst./cycle 3* (but only in a few issue cases) 3 3

Peak FP32 performance 346 GFLOPS 512 GFLOPS 508 GFLOPS 715 GFLOPS 933 GFLOPS

Peak FP64 performance 77/76 GFLOPS

    Memory

    Mem. transfer rate (eff) 1600 Mb/s 1800 Mb/s 1800 Mb/s 1998 Mb/s 2214 Mb/s

    Mem. interface 320-bit 384-bit 256-bit 448-bit 512-bit

    Mem. bandwidth 64 GB/s 86.4 GB/s 57.6 GB/s 111.9 GB/s 141.7 GB/s

    Mem. size 320 MB 768 MB 512 MB 896 MB 1.0 GB

    Mem. type GDDR3 GDDR3 GDDR3 GDDR3 GDDR3

    Mem. channel 6*64-bit 6*64-bit 4*64-bit 8*64-bit 8*64-bit

    Mem. contr. Crossbar Crossbar Crossbar Crossbar Crossbar

    System

Multi-GPU techn. SLI SLI SLI SLI SLI

    Interface PCIe x16 PCIe x16 PCIe 2.0x16 PCIe 2.0x16 PCIe 2.0x16

    MS Direct X 10 10 10 10.1 subset 10.1 subset

Table: Main features of Nvidia's GPGPUs
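As a quick sanity check of the peak FP32 figures (using the ALU counts, shader clocks and the 3 FP32 operations/cycle, i.e. MADD plus a dual-issued MUL, listed above):

        GTX 280: 240 ALUs × 3 FP32 operations/cycle × 1.296 GHz ≈ 933 GFLOPS
        GTX 260: 192 ALUs × 3 FP32 operations/cycle × 1.242 GHz ≈ 715 GFLOPS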

    3. Overview of GPGPUs (5)

  • 8/8/2019 GPUs DP Accelerators MSC

    42/166

    HD 2900XT HD 3850 HD 3870 HD 4850 HD 4870

Core R600 RV670 RV670 RV770 RV770

    Introduction 5/07 11/07 11/07 5/08 5/08

    IC technology 80 nm 55 nm 55 nm 55 nm 55 nm

    Nr. of transistors 700 mtrs 666 mtrs 666 mtrs 956 mtrs 956 mtrs

Die area 408 mm2 192 mm2 192 mm2 260 mm2 260 mm2

    Core frequency 740 MHz 670 MHz 775 MHz 625 MHz 750 MHz

    Computation

    No. of ALUs 320 320 320 800 800

    Shader frequency 740 MHz 670 MHz 775 MHz 625 MHz 750 MHz

    No. FP32 inst./cycle 2 2 2 2 2

Peak FP32 performance 471.6 GFLOPS 429 GFLOPS 496 GFLOPS 1000 GFLOPS 1200 GFLOPS

Peak FP64 performance 200 GFLOPS 240 GFLOPS

    Memory

    Mem. transfer rate (eff) 1600 Mb/s 1660 Mb/s 2250 Mb/s 2000 Mb/s 3600 Mb/s (GDDR5)

Mem. interface 512-bit 256-bit 256-bit 256-bit 256-bit

Mem. bandwidth 105.6 GB/s 53.1 GB/s 72.0 GB/s 64 GB/s 118 GB/s

    Mem. size 512 MB 256 MB 512 MB 512 MB 512 MB

    Mem. type GDDR3 GDDR3 GDDR4 GDDR3 GDDR3/GDDR5

    Mem. channel 8*64-bit 8*32-bit 8*32-bit 4*64-bit 4*64-bit

    Mem. contr. Ring bus Ring bus Ring bus Crossbar Crossbar

    System

Multi-GPU techn. CrossFire CrossFire X CrossFire X CrossFire X CrossFire X

    Interface PCIe x16 PCIe 2.0x16 PCIe 2.0x16 PCIe 2.0x16 PCIe 2.0x16

    MS Direct X 10 10.1 10.1 10.1 10.1

Table: Main features of AMD/ATI's GPGPUs

    3. Overview of GPGPUs (6)

  • 8/8/2019 GPUs DP Accelerators MSC

    43/166

    Price relations (as of 10/2008)

    Nvidia

    GTX260 ~ 300 $

    GTX280 ~ 600 $

    AMD/ATI

    HD4850 ~ 200 $

    HD4870 na

  • 8/8/2019 GPUs DP Accelerators MSC

    44/166

    4. Overview of data parallel accelerators

    4. Overview of data parallel accelerators (1)

  • 8/8/2019 GPUs DP Accelerators MSC

    45/166

    Implementation alternatives of data parallel accelerators

Data parallel accelerators

    On-card implementation (recent implementations): e.g. GPU cards, data-parallel accelerator cards.

    On-die integration (future implementations): e.g. Intel's Heavendahl, AMD's Torrenza integration technology, AMD's Fusion integration technology.

    Trend: from on-card implementation towards on-die integration.

    Figure: Implementation alternatives of dedicated data parallel accelerators

    4. Overview of data parallel accelerators (2)

  • 8/8/2019 GPUs DP Accelerators MSC

    46/166

On-card accelerators

    Card implementations: single cards fitting into a free PCI-E x16 slot of the host computer. E.g. Nvidia Tesla C870, Nvidia Tesla C1060, AMD FireStream 9170.

    Desktop implementations: usually dual cards mounted into a box, connected to an adapter card that is inserted into a free PCI-E x16 slot of the host PC through a cable. E.g. Nvidia Tesla D870.

    1U server implementations: usually 4 cards mounted into a 1U server rack, connected to two adapter cards that are inserted into two free PCI-E x16 slots of a server through two switches and two cables. E.g. Nvidia Tesla S870, Nvidia Tesla S1070, AMD FireStream 9250.

    Figure: Implementation alternatives of on-card accelerators

    4. Overview of data parallel accelerators (3)

  • 8/8/2019 GPUs DP Accelerators MSC

    47/166

    Figure: Main functional units of Nvidias Tesla C870 card [2]

    FB: Frame Buffer

    4. Overview of data parallel accelerators (4)

  • 8/8/2019 GPUs DP Accelerators MSC

    48/166

    Figure: Nvidas Tesla C870 andAMDs FireStream 9170 cards [2], [3]

    4. Overview of data parallel accelerators (5)

  • 8/8/2019 GPUs DP Accelerators MSC

    49/166

    Figure: Tesla D870 desktop implementation [4]

    4. Overview of data parallel accelerators (6)

  • 8/8/2019 GPUs DP Accelerators MSC

    50/166

    Figure: Nvidias Tesla D870 desktop implementation [4]

    4. Overview of data parallel accelerators (7)

  • 8/8/2019 GPUs DP Accelerators MSC

    51/166

    Figure: PCI-E x16 host adapter card of Nvidias Tesla D870 desktop [4]

    4. Overview of data parallel accelerators (8)

  • 8/8/2019 GPUs DP Accelerators MSC

    52/166

    Figure: Concept of Nvidias Tesla S870 1U rack server [5]

    4. Overview of data parallel accelerators (9)

  • 8/8/2019 GPUs DP Accelerators MSC

    53/166

    Figure: Internal layout of Nvidias Tesla S870 1U rack [6]

    4. Overview of data parallel accelerators (10)

  • 8/8/2019 GPUs DP Accelerators MSC

    54/166

    Figure: Connection cable between Nvidias Tesla S870 1U rack and the adapter cardsinserted into PCI-E x16 slots of the host server [6]

    4. Overview of data parallel accelerators (11)

  • 8/8/2019 GPUs DP Accelerators MSC

    55/166

NVidia Tesla family

    Card:
    C870 (6/07): G80-based, 1.5 GB GDDR3, 0.519 TFLOPS
    C1060 (6/08): GT200-based, 4 GB GDDR3, 0.936 TFLOPS

    Desktop:
    D870 (6/07): G80-based, 2×C870 incl., 3 GB GDDR3, 1.037 TFLOPS

    1U Server:
    S870 (6/07): G80-based, 4×C870 incl., 6 GB GDDR3, 2.074 TFLOPS
    S1070 (6/08): GT200-based, 4×C1060 incl., 16 GB GDDR3, 3.744 TFLOPS

    CUDA:
    Version 1.0 (6/07), Version 1.01 (11/07), Version 2.0 (6/08)

    Figure: Overview of Nvidia's Tesla family

    4. Overview of data parallel accelerators (12)

  • 8/8/2019 GPUs DP Accelerators MSC

    56/166

AMD FireStream family

    Card:
    9170 (announced 11/07, shipped 6/08): RV670-based, 2 GB GDDR3, 500 GFLOPS FP32, ~200 GFLOPS FP64
    9250 (announced 6/08, shipped 10/08): RV770-based, 1 GB GDDR3, 1 TFLOPS FP32, ~300 GFLOPS FP64

    Stream Computing SDK:
    Version 1.0 (12/07): Brook+, ACML (AMD Core Math Library), CAL (Compute Abstraction Layer); Rapid Mind

    Figure: Overview of AMD/ATI's FireStream family

    4. Overview of data parallel accelerators (13)

  • 8/8/2019 GPUs DP Accelerators MSC

    57/166

    Nvidia Tesla cards AMD FireStream cards

    Core type C870 C1060 9170 9250

    Based on G80 GT200 RV670 RV770

    Introduction 6/07 6/08 11/07 6/08

    Core

    Core frequency 600 MHz 602 MHz 800 MHz 625 MHz

ALU frequency 1350 MHz 1296 MHz 800 MHz 625 MHz

    No. of ALUs 128 240 320 800

    Peak FP32 performance 518 GFLOPS 933 GFLOPS 512 GFLOPS 1 TFLOPS

    Peak FP64 performance ~200 GFLOPS ~250 GFLOPS

    Memory

    Mem. transfer rate (eff) 1600 Mb/s 1600 Mb/s 1600 Mb/s 1986 Mb/s

    Mem. interface 384-bit 512-bit 256-bit 256-bit

    Mem. bandwidth 76.8 GB/s 102 GB/s 51.2 GB/s 63.5 GB/s

    Mem. size 1.5 GB 4 GB 2 GB 1 GB

    Mem. type GDDR3 GDDR3 GDDR3 GDDR3

    System

    Interface PCI-E x16 PCI-E 2.0x16 PCI-E 2.0x16 PCI-E 2.0x16

    Power (max) 171 W 200 W 150 W 150 W

Table: Main features of Nvidia's and AMD/ATI's data parallel accelerator cards

    4. Overview of data parallel accelerators (14)

  • 8/8/2019 GPUs DP Accelerators MSC

    58/166

    Price relations (as of 10/2008)

    Nvidia Tesla

    C870 ~ 1500 $

    D870 ~ 5000 $

    S870 ~ 7500 $

    C1060 ~ 1600 $

    S1070 ~ 8000 $

    AMD/ATI FireStream

    9170 ~ 800 $ 9250 ~ 800 $

  • 8/8/2019 GPUs DP Accelerators MSC

    59/166

    5. Microarchitecture of GPGPUs (examples)

    5.1 AMD/ATI RV870 (Cypress)

    5.2 Nvidia Fermi

    5.3 Intels Larrabee

  • 8/8/2019 GPUs DP Accelerators MSC

    60/166

    5.1 AMD/ATI RV870

    5.1 AMD/ATI RV870 (1)

  • 8/8/2019 GPUs DP Accelerators MSC

    61/166

    OpenCL 1.0 compliant

    AMD/ATI RV870 (Cypress) Radeon 5870 graphics card

Introduction: Sept. 22, 2009. Availability: now.

    Performance figures:

    SP FP performance: 2.72 TFLOPS

    DP FP performance: 544 GFLOPS (1/5 of SP FP performance)

    5.1 AMD/ATI RV870 (2)

  • 8/8/2019 GPUs DP Accelerators MSC

    62/166

    Radeon series/5800

ATI Radeon HD 4870 ATI Radeon HD 5850 ATI Radeon HD 5870

    Manufacturing Process 55-nm 40-nm 40-nm

    # of Transistors 956 million 2.15 billion 2.15 billion

    Core Clock Speed 750MHz 725MHz 850MHz

# of Stream Processors 800 1440 1600

    Compute Performance 1.2 TFLOPS 2.09 TFLOPS 2.72 TFLOPS

    Memory Type GDDR5 GDDR5 GDDR5

    Memory Clock 900MHz 1000MHz 1200MHz

    Memory Data Rate 3.6 Gbps 4.0 Gbps 4.8 Gbps

    Memory Bandwidth 115.2 GB/sec 128 GB/sec 153.6 GB/sec

    Max Board Power 160W 170W 188W

    Idle Board Power 90W 27W 27W

    Figure: Radeon Series/5800 [42]

    5.1 AMD/ATI RV870 (3)

  • 8/8/2019 GPUs DP Accelerators MSC

    63/166

    Radeon 4800 series/5800 series comparison

ATI Radeon HD 4870 ATI Radeon HD 5850 ATI Radeon HD 5870

    Manufacturing Process 55-nm 40-nm 40-nm

    # of Transistors 956 million 2.15 billion 2.15 billion

    Core Clock Speed 750MHz 725MHz 850MHz

# of Stream Processors 800 1440 1600

    Compute Performance 1.2 TFLOPS 2.09 TFLOPS 2.72 TFLOPS

    Memory Type GDDR5 GDDR5 GDDR5

    Memory Clock 900MHz 1000MHz 1200MHz

    Memory Data Rate 3.6 Gbps 4.0 Gbps 4.8 Gbps

    Memory Bandwidth 115.2 GB/sec 128 GB/sec 153.6 GB/sec

    Max Board Power 160W 170W 188W

    Idle Board Power 90W 27W 27W

    Figure: Radeon Series/5800 [42]

    5.1 AMD/ATI RV870 (4)

  • 8/8/2019 GPUs DP Accelerators MSC

    64/166

Architecture overview

    20 cores, 16 ALUs/core, 5 EUs/ALU → 1600 EUs (stream processing units)

    Memory: 8×32 = 256-bit GDDR5, 153.6 GB/s

    Figure: Architecture overview [42]

    5.1 AMD/ATI RV870 (5)
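As a quick check of the figures quoted above (using the 850 MHz core clock of the HD 5870 and one FP32 MADD, i.e. 2 operations, per EU per cycle):

        20 cores × 16 ALUs/core × 5 EUs/ALU = 1600 EUs
        1600 EUs × 2 FP32 operations/cycle × 0.85 GHz = 2720 GFLOPS = 2.72 TFLOPS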

  • 8/8/2019 GPUs DP Accelerators MSC

    65/166

    The 5870 card

    Figure: The 5870 card [41]

  • 8/8/2019 GPUs DP Accelerators MSC

    66/166

    5.2 Nvidia Fermi

    5.2 Nvidia Fermi (1)

  • 8/8/2019 GPUs DP Accelerators MSC

    67/166

    NVidias Fermi

Introduced: Sept. 30, 2009 at NVidia's GPU Technology Conference. Available: Q1 2010.

    5.2 Nvidia Fermi (2)

  • 8/8/2019 GPUs DP Accelerators MSC

    68/166

Fermi's overall structure

    16 cores (Streaming Multiprocessors), each core with 32 ALUs

    6x dual-channel GDDR5 (384-bit)

    Figure: Fermi's overall structure [40]

    5.2 Nvidia Fermi (3)

  • 8/8/2019 GPUs DP Accelerators MSC

    69/166

Layout of a core (SM)

    1 SM includes 32 ALUs (called CUDA cores by NVidia).

    Figure: Layout of a core [40]

    5.2 Nvidia Fermi (4)

  • 8/8/2019 GPUs DP Accelerators MSC

    70/166

    A single ALU (Cuda core)

SP FP: 32-bit; FX: 32-bit

    DP FP: needs 2 clock cycles, i.e. the DP FP performance is 1/2 of the SP FP performance!

    IEEE 754-2008-compliant

    Figure: A single ALU [40]

    5.2 Nvidia Fermi (5)

  • 8/8/2019 GPUs DP Accelerators MSC

    71/166

    Fermis system architecture

    Figure: Fermis system architecture [39]

    5.2 Nvidia Fermi (6)

  • 8/8/2019 GPUs DP Accelerators MSC

    72/166

    Contrasting Fermi and GT 200

    Figure: Contrasting Fermi and GT 200 [39]

    5.2 Nvidia Fermi (7)

  • 8/8/2019 GPUs DP Accelerators MSC

    73/166

The execution of programs utilizing GPGPUs

    Each kernel invocation executes a grid of thread blocks (Block(i,j)).

    Figure: Hierarchy of threads [25] (the host invokes kernel0(), kernel1(); each executes on the device as a grid of thread blocks)

    5.2 Nvidia Fermi (8)

  • 8/8/2019 GPUs DP Accelerators MSC

    74/166

    Global scheduling in Fermi

    Figure: Global scheduling in Fermi [39]

    5.2 Nvidia Fermi (9)

  • 8/8/2019 GPUs DP Accelerators MSC

    75/166

    Microarchitecture of a Fermi core

    5.2 Nvidia Fermi (10)

  • 8/8/2019 GPUs DP Accelerators MSC

    76/166

    Principle of operation of the G80/G92/Fermi GPGPUs

    5.2 Nvidia Fermi (11)

  • 8/8/2019 GPUs DP Accelerators MSC

    77/166

    Work scheduling

    Scheduling thread blocks for execution

    Segmenting thread blocks into warps

    Scheduling warps for execution

    Principle of operation of the G80/G92 GPGPUs

    The key point of operation is work scheduling


    5.2 Nvidia Fermi (12)


  • 8/8/2019 GPUs DP Accelerators MSC

    78/166

CUDA Thread Block

    All threads in a block execute the same kernel program (SPMD).

    Programmer declares the block: block size 1 to 512 concurrent threads; block shape 1D, 2D, or 3D; block dimensions in threads.

    Threads have thread id numbers within the block; the thread program uses the thread id to select work and address shared data.

    Threads in the same block share data and synchronize while doing their share of the work.

    Threads in different blocks cannot cooperate.

    Each block can execute in any order relative to other blocks!

Figure: CUDA Thread Block (Thread Id #: 0 1 2 3 ... m; Thread program)

    Courtesy: John Nickolls, NVIDIA

    linois.edu/ece498/al/lectures/lecture4%20cuda%20threads%20part2%20spring%202009.ppt#316,2

    Thread scheduling in NVidias GPGPUs

    5.2 Nvidia Fermi (13)

  • 8/8/2019 GPUs DP Accelerators MSC

    79/166


Scheduling thread blocks for execution

    Figure: Assigning thread blocks to streaming multiprocessors (SM) for execution [12]

    Up to 8 blocks can be assigned to an SM for execution.

    TPC: Thread Processing Cluster (Texture Processing Cluster)

    A TPC has 2 SMs in the G80/G92 and 3 SMs in the G200.

    A device may run thread blocks sequentially or even in parallel, if it has enough resources for this, or usually by a combination of both.

    5.2 Nvidia Fermi (14)

  • 8/8/2019 GPUs DP Accelerators MSC

    80/166


    Segmenting thread blocks into warps

Threads are scheduled for execution in groups of 32 threads, called warps.

    For scheduling, each thread block is subdivided into warps.

    At any point of time up to 24 warps can be maintained by the scheduler.

    Figure: Segmenting thread blocks in warps [12]

    Remark

The number of threads constituting a warp is an implementation decision and not part of the CUDA programming model.

    5.2 Nvidia Fermi (15)
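A hedged sketch (hypothetical kernel, not from the slides) of how a thread can derive its warp and lane from its index; warpSize is the CUDA built-in that exposes the implementation-dependent warp width (32 on the GPUs discussed here):

        __global__ void warpInfo(int *warpOf, int *laneOf)
        {
            int tid  = threadIdx.x;                      // linear index within a 1D thread block
            int warp = tid / warpSize;                   // which warp of the block the thread belongs to
            int lane = tid % warpSize;                   // position of the thread inside its warp
            int gid  = blockIdx.x * blockDim.x + tid;    // global thread index
            warpOf[gid] = warp;
            laneOf[gid] = lane;
        }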

  • 8/8/2019 GPUs DP Accelerators MSC

    81/166

Scheduling warps for execution

    Figure: Scheduling warps for execution [12] (the SM multithreaded warp scheduler issues, over time, e.g. warp 8 instruction 11, warp 1 instruction 42, warp 3 instruction 95, warp 8 instruction 12, warp 3 instruction 96, ...)

    The warp scheduler is a zero-overhead scheduler:

    Only those warps are eligible for execution whose next instruction has all operands available.

    Eligible warps are scheduled coarse grained (not indicated in the figure), priority based.

    All threads in a warp execute the same instruction when selected.

    4 clock cycles are needed to dispatch the same instruction to all threads in the warp (G80).

  • 8/8/2019 GPUs DP Accelerators MSC

    82/166

    5.3 Intels Larrabee

    5.3 Intels Larrabee (1)

  • 8/8/2019 GPUs DP Accelerators MSC

    83/166

    Larrabee

Part of Intel's Tera-Scale Initiative.

    Brief history:

    Project started ~ 2005
    First unofficial public presentation: 03/2006 (withdrawn)
    First brief public presentation: 09/07 (Otellini) [29]
    First official public presentations: in 2008 (e.g. at SIGGRAPH [27])
    Due in ~ 2009

    Performance (targeted): 2 TFLOPS

    Objectives:

    Not a single product but a base architecture for a number of different products.
    High end graphics processing, HPC

    5.3 Intels Larrabee (2)

  • 8/8/2019 GPUs DP Accelerators MSC

    84/166

    NI: New Instructions

    Figure: Positioning of Larrabeein Intels product portfolio [28]

5.3 Intels Larrabee (3)

  • 8/8/2019 GPUs DP Accelerators MSC

    85/166

    Figure: First public presentation of Larrabee at IDF Fall 2007 [29]

    5.3 Intels Larrabee (4)

  • 8/8/2019 GPUs DP Accelerators MSC

    86/166

    Figure: Block diagram of the Larrabee [30]

    Basic architecture

    Cores: In order x86 IA cores augmented with new instructions

    L2 cache: fully coherent

    Ring bus: 1024 bits wide

    5.3 Intels Larrabee (5)

  • 8/8/2019 GPUs DP Accelerators MSC

    87/166

    Figure: Block diagram of Larrabees cores [31]

    5.3 Intels Larrabee (6)

  • 8/8/2019 GPUs DP Accelerators MSC

    88/166

    Larrabee microarchitecture [27]

Derived from that of the Pentium's in-order design

    5.3 Intels Larrabee (7)

  • 8/8/2019 GPUs DP Accelerators MSC

    89/166

Figure: The ancestor of Larrabee's cores [28]

Main extensions:

    64-bit instructions

    4-way multithreaded (with 4 register sets)

    addition of a 16-wide (16x32-bit) VU

    increased L1 caches (32 KB vs 8 KB)

    access to its 256 KB local subset of a coherent L2 cache

    ring network to access the coherent L2 $ and allow interprocessor communication.

    5.3 Intels Larrabee (8)

  • 8/8/2019 GPUs DP Accelerators MSC

    90/166

    New instructions allow explicit cache control

    the L2 cache can be used as a scratchpad memory while remaining fullycoherent.

    to prefetch data into the L1 and L2 caches

    to control the eviction of cache lines by reducing their priority.

    5.3 Intels Larrabee (9)

  • 8/8/2019 GPUs DP Accelerators MSC

    91/166

    The Scalar Unit

    supports the full ISA of the Pentium(it can run existing code including OS kernels and applications)

    bit count

    bit scan (it finds the next bit set within a register).

    provides new instructions, e.g. for

    5.3 Intels Larrabee (10)

  • 8/8/2019 GPUs DP Accelerators MSC

    92/166

    Figure: Block diagram of the Vector Unit [31]

    The Vector Unit

VU scatter-gather instructions
    (load a VU vector register from 16 non-contiguous data locations from anywhere in the on-die L1 cache without penalty, or store a VU register similarly).

    Numeric conversions
    8-bit and 16-bit integer and 16-bit FP data can be read from the L1 $ or written into the L1 $, with conversion to 32-bit integers without penalty. The L1 D$ thus becomes an extension of the register file.

    Mask registers
    have one bit per lane, to control which lanes of a vector register or memory data are read or written and which remain untouched.

    5.3 Intels Larrabee (11)

  • 8/8/2019 GPUs DP Accelerators MSC

    93/166

    Figure: Layout of the 16-wide vector ALU [31]

    ALUs execute integer, SP and DP FP instructions

    Multiply-add instructions are available.

    ALUs

    5.3 Intels Larrabee (12)

  • 8/8/2019 GPUs DP Accelerators MSC

    94/166

    Task scheduling

Task scheduling is performed entirely by software, rather than by hardware as in Nvidia's or AMD/ATI's GPGPUs.

    5.3 Intels Larrabee (13)

  • 8/8/2019 GPUs DP Accelerators MSC

    95/166

    SP FP performance

2 operations/cycle × 16 ALUs

    → 32 operations/core/cycle

    At present no data available for the clock frequency or the number of cores in Larrabee.

    Assuming a clock frequency of 2 GHz and 32 cores

    SP FP performance: 2 TFLOPS
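Spelling out the arithmetic behind this estimate (assumed clock frequency and core count as stated above, not confirmed figures):

        16 ALUs × 2 SP FP operations/cycle = 32 operations/core/cycle
        32 operations/core/cycle × 32 cores × 2 GHz = 2048 GFLOPS ≈ 2 TFLOPS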

    5.3 Intels Larrabee (14)

  • 8/8/2019 GPUs DP Accelerators MSC

    96/166

    Figure: Larrabees software stack (Source Intel)

Larrabee's Native C/C++ compiler allows many available apps to be recompiled and run correctly with no modifications.

    6. References

    6. References (1)

  • 8/8/2019 GPUs DP Accelerators MSC

    97/166

    [2]: NVIDIA Tesla C870 GPU Computing Board, Board Specification, Jan. 2008, Nvidia

    [1]: Torricelli F., AMD in HPC, HPC07,http://www.altairhyperworks.co.uk/html/en-GB/keynote2/Torricelli_AMD.pdf

    [3] AMD FireStream 9170,http://ati.amd.com/technology/streamcomputing/product_firestream_9170.html

    [4]: NVIDIA Tesla D870 Deskside GPU Computing System, System Specification, Jan. 2008,Nvidia,http://www.nvidia.com/docs/IO/43395/D870-SystemSpec-SP-03718-001_v01.pdf

    [5]: Tesla S870 GPU Computing System, Specification, Nvida,http://jp.nvidia.com/docs/IO/43395/S870-BoardSpec_SP-03685-001_v00b.pdf

    [6]: Torres G., Nvidia Tesla Technology, Nov. 2007,http://www.hardwaresecrets.com/article/495

    [7]: R600-Family Instruction Set Architecture, Revision 0.31, May 2007, AMD

    [8]: Zheng B., Gladding D., Villmow M., Building a High Level Language Compiler for GPGPU,ASPLOS 2006, June 2008

    [9]: Huddy R., ATI Radeon HD2000 Series Technology Overview, AMD Technology Day, 2007http://ati.amd.com/developer/techpapers.html

    [10]: Compute Abstraction Layer (CAL) Technology Intermediate Language (IL),

    Version 2.0, Oct. 2008, AMD

    6. References (2)

  • 8/8/2019 GPUs DP Accelerators MSC

    98/166

    [11]: Nvidia CUDA Compute Unified Device Architecture Programming Guide, Version 2.0,June 2008, Nvidia

    [12]: Kirk D. & Hwu W. W., ECE498AL Lectures 7: Threading Hardware in G80, 2007,

    University of Illinois, Urbana-Champaign, http://courses.ece.uiuc.edu/ece498/al1/lectures/lecture7-threading%20hardware.ppt#256,1,ECE 498AL Lectures 7:Threading Hardware in G80

    [13]: Kogo H., R600 (Radeon HD2900 XT), PC Watch, June 26 2008,http://pc.watch.impress.co.jp/docs/2008/0626/kaigai_3.pdf

    [14]: Nvidia G80, Pc Watch, April 16 2007,http://pc.watch.impress.co.jp/docs/2007/0416/kaigai350.htm

    [15]: GeForce 8800GT (G92), PC Watch, Oct. 31 2007,http://pc.watch.impress.co.jp/docs/2007/1031/kaigai398_07.pdf

    [16]: NVIDIA GT200 and AMD RV770, PC Watch, July 2 2008, http://pc.watch.impress.co.jp/docs/2008/0702/kaigai451.htm

    [17]: Shrout R., Nvidia GT200 Revealed GeForce GTX 280 and GTX 260 Review,

    PC Perspective, June 16 2008,http://www.pcper.com/article.php?aid=577&type=expert&pid=3

    [18]: http://en.wikipedia.org/wiki/DirectX

    [19]: Dietrich S., Shader Model 3.0, April 2004, Nvidia,http://www.cs.umbc.edu/~olano/s2004c01/ch15.pdf

    [20]: Microsoft DirectX 10: The Next-Generation Graphics API, Technical Brief, Nov. 2006,

    Nvidia, http://www.nvidia.com/page/8800_tech_briefs.html

    6. References (3)

  • 8/8/2019 GPUs DP Accelerators MSC

    99/166

    [21]: Patidar S. & al., Exploiting the Shader Model 4.0 Architecture, Center forVisual Information Technology, IIIT Hyderabad,http://research.iiit.ac.in/~shiben/docs/SM4_Skp-Shiben-Jag-PJN_draft.pdf

    [22]: Nvidia GeForce 8800 GPU Architecture Overview, Vers. 0.1, Nov. 2006, Nvidia,http://www.nvidia.com/page/8800_tech_briefs.html

    [24]: Fatahalian K., From Shader Code to a Teraflop: How Shader Cores Work,

    Workshop: Beyond Programmable Shading: Fundamentals, SIGGRAPH 2008,

    [25]: Kanter D., NVIDIAs GT200: Inside a Parallel Processor, 09-08-2008

    [23]: Graphics Pipeline Rendering History, Aug. 22 2008, PC Watch,http://pc.watch.impress.co.jp/docs/2008/0822/kaigai_06.pdf

    [26]: Nvidia CUDA Compute Unified Device Architecture Programming Guide,Version 1.1, Nov. 2007, Nvidia

    [27]: Seiler L. & al., Larrabee: A Many-Core x86 Architecture for Visual Computing,ACM Transactions on Graphics, Vol. 27, No. 3, Article No. 18, Aug. 2008

    [29]: Shrout R., IDF Fall 2007 Keynote, Sept. 18, 2007, PC Perspective,http://www.pcper.com/article.php?aid=453

    [28]: Kogo H., Larrabee, PC Watch, Oct. 17, 2008,http://pc.watch.impress.co.jp/docs/2008/1017/kaigai472.htm

    6. References (4)

  • 8/8/2019 GPUs DP Accelerators MSC

    100/166

    [30]: Stokes J., Larrabee: Intels biggest leap ahead since the Pentium Pro,Aug. 04. 2008, http://arstechnica.com/news.ars/post/20080804-larrabee-intels-biggest-leap-ahead-since-the-pentium-pro.html

[31]: Shimpi A. L. & Wilson D., Intel's Larrabee Architecture Disclosure: A Calculated First Move, Anandtech, Aug. 4. 2008, http://www.anandtech.com/showdoc.aspx?i=3367&p=2

    [32]: Hester P., Multi_Core and Beyond: Evolving the x86 Architecture, Hot Chips 19,Aug. 2007, http://www.hotchips.org/hc19/docs/keynote2.pdf

    [33]: AMD Stream Computing, User Guide, Oct. 2008, Rev. 1.2.1

    http://ati.amd.com/technology/streamcomputing/Stream_Computing_User_Guide.pdf

    [34]: Doggett M., Radeon HD 2900, Graphics Hardware Conf. Aug. 2007,http://www.graphicshardware.org/previous/www_2007/presentations/doggett-radeon2900-gh07.pdf

    [35]: Mantor M., AMDs Radeon Hd 2900, Hot Chips 19, Aug. 2007,http://www.hotchips.org/archives/hc19/2_Mon/HC19.03/HC19.03.01.pdf

[36]: Houston M., Anatomy of AMD's TeraScale Graphics Engine, SIGGRAPH 2008, http://s08.idav.ucdavis.edu/houston-amd-terascale.pdf

    [37]: Mantor M., Entering the Golden Age of Heterogeneous Computing, PEEP 2008,http://ati.amd.com/technology/streamcomputing/IUCAA_Pune_PEEP_2008.pdf

    6. References (5)

  • 8/8/2019 GPUs DP Accelerators MSC

    101/166

    [38]: Kogo H., RV770 Overview, PC Watch, July 02 2008,http://pc.watch.impress.co.jp/docs/2008/0702/kaigai_09.pdf

    [39]: Kanter D., Inside Fermi: Nvidia's HPC Push, Real World Technologies Sept 30 2009,http://www.realworldtech.com/includes/templates/articles.cfm?ArticleID=RWT093009110932&mode=print

    [40]: Wasson S., Inside Fermi: Nvidia's 'Fermi' GPU architecture revealed,Tech Report, Sept 30 2009, http://techreport.com/articles.x/17670/1

    [41]: Wasson S., AMD's Radeon HD 5870 graphics processor,

    Tech Report, Sept 23 2009, http://techreport.com/articles.x/17618/1

    [42]: Bell B., ATI Radeon HD 5870 Performance Preview ,Firing Squad, Sept 22 2009, http://www.firingsquad.com/hardware/ati_radeon_hd_5870_performance_preview/default.asp

  • 8/8/2019 GPUs DP Accelerators MSC

    102/166

    5.3 Intels Larrabee

    5.2 Intels Larrabee (1)

  • 8/8/2019 GPUs DP Accelerators MSC

    103/166

    Larrabee

    Part of Intels Tera-Scale Initiative.

    Project started ~ 2005

First unofficial public presentation: 03/2006 (withdrawn)
    First brief public presentation: 09/07 (Otellini) [29]

    First official public presentations: in 2008 (e.g. at SIGGRAPH [27])

    Due in ~ 2009

    Performance (targeted):

    2 TFlops

    Brief history:

    Objectives:

    Not a single product but a base architecture for a number of different products.

    High end graphics processing, HPC

    5.2 Intels Larrabee (2)

  • 8/8/2019 GPUs DP Accelerators MSC

    104/166

    NI: New Instructions

    Figure: Positioning of Larrabeein Intels product portfolio [28]

    5.2 Intels Larrabee (3)

  • 8/8/2019 GPUs DP Accelerators MSC

    105/166

    Figure: First public presentation of Larrabee at IDF Fall 2007 [29]

    5.2 Intels Larrabee (4)

  • 8/8/2019 GPUs DP Accelerators MSC

    106/166

    Figure: Block diagram of the Larrabee [30]

    Basic architecture

    Cores: In order x86 IA cores augmented with new instructions

    L2 cache: fully coherent

    Ring bus: 1024 bits wide

    5.2 Intels Larrabee (5)

  • 8/8/2019 GPUs DP Accelerators MSC

    107/166

    Figure: Block diagram of Larrabees cores [31]

    5.2 Intels Larrabee (6)

  • 8/8/2019 GPUs DP Accelerators MSC

    108/166

    Larrabee microarchitecture [27]

Derived from that of the Pentium's in-order design

    5.2 Intels Larrabee (7)

  • 8/8/2019 GPUs DP Accelerators MSC

    109/166

Figure: The ancestor of Larrabee's cores [28]

Main extensions:

    64-bit instructions

    4-way multithreaded (with 4 register sets)

    addition of a 16-wide (16x32-bit) VU

    increased L1 caches (32 KB vs 8 KB)

    access to its 256 KB local subset of a coherent L2 cache

    ring network to access the coherent L2 $ and allow interprocessor communication.

    5.2 Intels Larrabee (8)

  • 8/8/2019 GPUs DP Accelerators MSC

    110/166

    New instructions allow explicit cache control

    the L2 cache can be used as a scratchpad memory while remaining fullycoherent.

    to prefetch data into the L1 and L2 caches

    to control the eviction of cache lines by reducing their priority.

    5.2 Intels Larrabee (9)

  • 8/8/2019 GPUs DP Accelerators MSC

    111/166

    The Scalar Unit

    supports the full ISA of the Pentium(it can run existing code including OS kernels and applications)

    bit count

    bit scan (it finds the next bit set within a register).

    provides new instructions, e.g. for

    Mask registers

    5.2 Intels Larrabee (10)

  • 8/8/2019 GPUs DP Accelerators MSC

    112/166

    Figure: Block diagram of the Vector Unit [31]

    The Vector Unit

VU scatter-gather instructions
    (load a VU vector register from 16 non-contiguous data locations from anywhere in the on-die L1 cache without penalty, or store a VU register similarly).

    Numeric conversions
    8-bit and 16-bit integer and 16-bit FP data can be read from the L1 $ or written into the L1 $, with conversion to 32-bit integers without penalty. The L1 D$ thus becomes an extension of the register file.

Mask registers

have one bit per lane, to control which lanes of a vector register or memory data are read or written and which remain untouched.

    5.2 Intels Larrabee (11)

  • 8/8/2019 GPUs DP Accelerators MSC

    113/166

    Figure: Layout of the 16-wide vector ALU [31]

    ALUs execute integer, SP and DP FP instructions

    Multiply-add instructions are available.

    ALUs

    5.2 Intels Larrabee (12)

  • 8/8/2019 GPUs DP Accelerators MSC

    114/166

    Task scheduling

Task scheduling is performed entirely by software, rather than by hardware as in Nvidia's or AMD/ATI's GPGPUs.

    5.2 Intels Larrabee (13)

  • 8/8/2019 GPUs DP Accelerators MSC

    115/166

    SP FP performance

2 operations/cycle × 16 ALUs

    → 32 operations/core/cycle

    At present no data available for the clock frequency or the number of cores in Larrabee.

    Assuming a clock frequency of 2 GHz and 32 cores

    SP FP performance: 2 TFLOPS

    5.2 Intels Larrabee (14)

  • 8/8/2019 GPUs DP Accelerators MSC

    116/166

    Figure: Larrabees software stack (Source Intel)

Larrabee's Native C/C++ compiler allows many available apps to be recompiled and run correctly with no modifications.

  • 8/8/2019 GPUs DP Accelerators MSC

    117/166

  • 8/8/2019 GPUs DP Accelerators MSC

    118/166

    4. Overview of data parallel accelerators (13)

  • 8/8/2019 GPUs DP Accelerators MSC

    119/166

    Price relations (as of 10/2008)

    Nvidia Tesla

    C870 ~ 1500 $

    D870 ~ 5000 $

    S870 ~ 7500 $

    C1060 ~ 1600 $

    S1070 ~ 8000 $

    AMD/ATI FireStream

    9170 ~ 800 $ 9250 ~ 800 $

  • 8/8/2019 GPUs DP Accelerators MSC

    120/166

    5. Microarchitecture and operation

    5.1 Nvidias GPGPU line

    5.2 AMD/ATIs GPGPU line

    5.3 Intels Larrabee

  • 8/8/2019 GPUs DP Accelerators MSC

    121/166

    5.1 Nvidias GPGPU line

    Microarchitecture of GPUs

    5.1 Nvidias GPGPU line (1)

  • 8/8/2019 GPUs DP Accelerators MSC

    122/166

Microarchitecture of GPGPUs

    3-level microarchitectures: microarchitectures inheriting the structure of programmable GPUs. E.g. Nvidia's and AMD/ATI's GPGPUs.

    Two-level microarchitectures: dedicated microarchitectures developed a priori to support both graphics and HPC. E.g. Intel's Larrabee.

    Figure: Alternative layouts of microarchitectures of GPGPUs

    Microarchitecture of GPUs


    5.1 Nvidias GPGPU line (2)

  • 8/8/2019 GPUs DP Accelerators MSC

    123/166

Figure: Simplified block diagram of recent 3-level GPUs/data parallel accelerators
    (host CPU and host memory attached via the North Bridge; PCI-E x16 interface; Command Processor Unit and Work Scheduler; a Core Block Array of Core Blocks, each with cores and L1 caches; an interconnection network to the L2 caches and memory controllers of the Global Memory; hub with display controller)

    (Data parallel accelerators do not include display controllers.)

    CB: Core Block
    CBA: Core Block Array
    IN: Interconnection Network
    MC: Memory Controller

    5.1 Nvidias GPGPU line (3)

  • 8/8/2019 GPUs DP Accelerators MSC

    124/166

Table: Terminologies used with GPGPUs/data parallel accelerators

    In these slides | Nvidia | AMD/ATI

    Core (SIMT core) | SM, Streaming Multiprocessor, Multithreaded processor | Shader processor, Thread processor

    CB (Core Block) | TPC (Texture Processor Cluster) | Multiprocessor, SIMD Array, SIMD Engine, SIMD core, SIMD

    CBA (Core Block Array) | SPA (Streaming Processor Array) | -

    ALU (Arithmetic Logic Unit) | Streaming Processor, Thread Processor, Scalar ALU | Stream Processing Unit, Stream Processor

    Microarchitecture of Nvidias GPGPUs

    5.1 Nvidias GPGPU line (4)

  • 8/8/2019 GPUs DP Accelerators MSC

    125/166

Microarchitecture of Nvidia's GPGPUs

    GPGPUs based on 3-level microarchitectures

    Nvidia's line: 90 nm: G80 → (shrink) 65 nm: G92 → (enhanced arch.) G200

    AMD/ATI's line: 80 nm: R600 → (shrink) 55 nm: RV670 → (enhanced arch.) RV770

    Figure: Overview of Nvidia's and AMD/ATI's GPGPU lines

    5.1 Nvidias GPGPU line (5)

  • 8/8/2019 GPUs DP Accelerators MSC

    126/166

    G80/G92

    Microarchitecture

    5.1 Nvidias GPGPU line (6)

  • 8/8/2019 GPUs DP Accelerators MSC

    127/166

Figure: Overview of the G80 [14]

    5.1 Nvidias GPGPU line (7)

  • 8/8/2019 GPUs DP Accelerators MSC

    128/166

Figure: Overview of the G92 [15]

    5.1 Nvidias GPGPU line (8)

  • 8/8/2019 GPUs DP Accelerators MSC

    129/166

Figure: The Core Block of the G80/G92 [14], [15]

    5.1 Nvidias GPGPU line (9)

  • 8/8/2019 GPUs DP Accelerators MSC

    130/166

Figure: Block diagram of G80/G92 cores [14], [15]

    Streaming Processors: SIMT ALUs

    Individual components of the core

    5.1 Nvidias GPGPU line (10)

  • 8/8/2019 GPUs DP Accelerators MSC

    131/166

SM Register File (RF)

    8K registers (each 4 bytes wide) deliver 4 operands/clock.

    The Load/Store pipe can also read/write the RF.

    Figure: Register File [12]

    5.1 Nvidias GPGPU line (11)

  • 8/8/2019 GPUs DP Accelerators MSC

    132/166

Programmer's view of the Register File

    There are 8192 and 16384 registers in each SM in the G80 and the GT200 resp.

    This is an implementation decision, not part of CUDA.

    Registers are dynamically partitioned across all thread blocks assigned to the SM.

    Once assigned to a thread block, a register is NOT accessible by threads in other blocks.

    Each thread in the same block only accesses registers assigned to itself.

    Figure: The programmer's view of the Register File [12] (e.g. 4 thread blocks vs. 3 thread blocks sharing the RF)


    5.1 Nvidias GPGPU line (12)

  • 8/8/2019 GPUs DP Accelerators MSC

    133/166

The Constant Cache

    Immediate address constants

    Indexed address constants

    Constants are stored in DRAM and cached on chip (L1 per SM).

    A constant value can be broadcast to all threads in a warp:

    an extremely efficient way of accessing a value that is common for all threads in a block!

    Figure: The constant cache [12]
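A minimal sketch of the usage pattern described above (hypothetical names, not from the slides): a value placed in constant memory is cached per SM and broadcast to every thread of a warp that reads it.

        __constant__ float coeff;                        // resides in device constant memory, cached on chip

        __global__ void scale(const float *in, float *out)
        {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            out[i] = coeff * in[i];                      // same constant read by all threads -> broadcast
        }

        // host side, before the kernel launch:
        //   cudaMemcpyToSymbol(coeff, &hostCoeff, sizeof(float));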


    5.1 Nvidias GPGPU line (13)

  • 8/8/2019 GPUs DP Accelerators MSC

    134/166

Shared Memory

Each SM has 16 KB of Shared Memory, organized as 16 banks of 32-bit words.

    CUDA uses Shared Memory as shared storage visible to all threads in a thread block (read and write access).

    It is not used explicitly for pixel shader programs.

    Figure: Shared Memory [12]
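A hedged sketch of declaring and using this per-block shared memory (illustrative kernel; assumes it is launched with TILE = 256 threads per block):

        #define TILE 256

        __global__ void reverseTile(const float *in, float *out)
        {
            __shared__ float tile[TILE];              // lives in the SM's 16 KB shared memory
            int i = blockIdx.x * TILE + threadIdx.x;

            tile[threadIdx.x] = in[i];                // every thread writes one element
            __syncthreads();                          // wait until the whole tile is loaded

            out[i] = tile[TILE - 1 - threadIdx.x];    // read an element written by another thread
        }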


    5.1 Nvidias GPGPU line (14)

  • 8/8/2019 GPUs DP Accelerators MSC

    135/166

A program needs to manage the global, constant and texture memory spaces visible to kernels through calls to the CUDA runtime.

    This includes memory allocation and deallocation as well as invoking data transfers between the CPU and GPU.
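A hedged host-side sketch of this management cycle (illustrative names and sizes); it uses only standard CUDA runtime calls (cudaMalloc, cudaMemcpy, cudaFree):

        #include <cuda_runtime.h>

        void runOnDevice(const float *hostIn, float *hostOut, int n)
        {
            float *devIn, *devOut;
            size_t bytes = n * sizeof(float);

            cudaMalloc((void **)&devIn,  bytes);                        // allocate global memory
            cudaMalloc((void **)&devOut, bytes);
            cudaMemcpy(devIn, hostIn, bytes, cudaMemcpyHostToDevice);   // CPU -> GPU transfer

            // ... kernel launch would go here ...

            cudaMemcpy(hostOut, devOut, bytes, cudaMemcpyDeviceToHost); // GPU -> CPU transfer
            cudaFree(devIn);                                            // deallocation
            cudaFree(devOut);
        }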

    5.1 Nvidias GPGPU line (15)

  • 8/8/2019 GPUs DP Accelerators MSC

    136/166

Figure: Major functional blocks of G80/G92 ALUs [14], [15]

    Barrier synchronization

    5.1 Nvidias GPGPU line (16)

  • 8/8/2019 GPUs DP Accelerators MSC

    137/166

synchronization is achieved by calling the void __syncthreads() intrinsic function [11];

    it is used to coordinate memory accesses at synchronization points;

    at synchronization points the execution of the threads is suspended until all threads reach this point (barrier synchronization).
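A hedged sketch of the typical use of this barrier (hypothetical kernel; assumes a launch with 256 threads per block, a power of two): every phase that reads data written by other threads of the block is separated from the writing phase by __syncthreads().

        __global__ void blockSum(const float *in, float *out)
        {
            __shared__ float partial[256];                      // one slot per thread of the block
            int tid = threadIdx.x;
            partial[tid] = in[blockIdx.x * blockDim.x + tid];
            __syncthreads();                                    // barrier: all loads done before any read

            for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
                if (tid < stride)
                    partial[tid] += partial[tid + stride];      // tree reduction within the block
                __syncthreads();                                // barrier after every reduction step
            }
            if (tid == 0)
                out[blockIdx.x] = partial[0];                   // one result per thread block
        }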

    Principle of operation

    5.1 Nvidias GPGPU line (17)

  • 8/8/2019 GPUs DP Accelerators MSC

    138/166

Based on Nvidia's data parallel computing model

    Nvidia's data parallel computing model is specified at different levels of abstraction:

    at the Instruction Set Architecture level (ISA) (not disclosed),

    at the intermediate level (at the level of APIs) (not discussed here),

    at the high-level programming language level, by means of CUDA.

    CUDA [11]

    5.1 Nvidias GPGPU line (18)

  • 8/8/2019 GPUs DP Accelerators MSC

    139/166

CUDA is a programming language and programming environment that

    allows explicit data parallel execution on an attached massively parallel device (GPGPU); its underlying principle is to allow the programmer to target portions of the source code for execution on the GPGPU,

    is defined as a set of C-language extensions.

    The key element of the language is the notion of the kernel.

    A kernel is specified by

    5.1 Nvidias GPGPU line (19)

  • 8/8/2019 GPUs DP Accelerators MSC

    140/166

using the __global__ declaration specifier,

    a number of associated CUDA threads,

    a domain of execution (grid, blocks) using the <<<...>>> execution configuration syntax.

    Execution of kernels

when called, a kernel is executed N times in parallel by N associated CUDA threads, as opposed to only once as in the case of regular C functions.

    Example

    5.1 Nvidias GPGPU line (20)

  • 8/8/2019 GPUs DP Accelerators MSC

    141/166

    adds two vectors A and B of size N and stores the result into vector C
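A minimal sketch in the spirit of the VecAdd example of the CUDA programming guide [11] (names and launch configuration are assumptions):

        // Kernel definition: executed in parallel by N CUDA threads
        __global__ void VecAdd(const float *A, const float *B, float *C)
        {
            int i = threadIdx.x;      // one-dimensional thread index
            C[i] = A[i] + B[i];
        }

        int main()
        {
            ...
            // Kernel invocation: one thread block of N threads
            VecAdd<<<1, N>>>(A, B, C);
            ...
        }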

    Remark

The thread index threadIdx is a vector of up to 3 components, so that threads can be identified within a one-, two- or three-dimensional thread block.

The above sample code adds the vectors A and B by executing the invoked threads (identified by a one-dimensional index i) in parallel on the attached massively parallel GPGPU, rather than by executing embedded loops on the conventional CPU.


    5.1 Nvidias GPGPU line (21)

  • 8/8/2019 GPUs DP Accelerators MSC

    142/166

    The kernel concept is enhanced by three key abstractions

    the thread concept,

    the memory concept and

    the synchronization concept.

    The thread concept

    5.1 Nvidias GPGPU line (22)

  • 8/8/2019 GPUs DP Accelerators MSC

    143/166

    based on a three level hierarchy of threads

    grids

    thread blocks

    threads

    The hierarchy of threads

    5.1 Nvidias GPGPU line (23)

  • 8/8/2019 GPUs DP Accelerators MSC

    144/166

Each kernel invocation is executed as a grid of thread blocks (Block(i,j)).

    Figure: Hierarchy of threads [25] (the host invokes kernel0(), kernel1(); each executes on the device as a grid of thread blocks)
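A hedged sketch of this three-level hierarchy in CUDA source terms (hypothetical 2D example): the launch defines the grid of thread blocks, and each thread locates itself within the hierarchy via blockIdx, blockDim and threadIdx.

        __global__ void fill(float *data, int width)
        {
            int col = blockIdx.x * blockDim.x + threadIdx.x;   // position within the grid (x)
            int row = blockIdx.y * blockDim.y + threadIdx.y;   // position within the grid (y)
            data[row * width + col] = 1.0f;
        }

        void launchFill(float *devData, int width, int height)
        {
            dim3 threadsPerBlock(16, 16);                      // 2D thread block
            dim3 numBlocks(width / 16, height / 16);           // 2D grid of Block(i,j); assumes divisibility
            fill<<<numBlocks, threadsPerBlock>>>(devData, width);
        }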

  • 8/8/2019 GPUs DP Accelerators MSC

    145/166

    The memory concept

    5.1 Nvidias GPGPU line (25)

  • 8/8/2019 GPUs DP Accelerators MSC

    146/166

Threads have

private registers (R/W access)

per block shared memory (R/W access)

per grid global memory (R/W access)

per block constant memory (R access)

per TPC texture memory (R access)

The global, constant and texture memory spaces can be read from or written to by the CPU and are persistent across kernel launches by the same application.

Shared memory is organized into banks (16 banks in version 1)

    Figure: Memory concept [26] (revised)
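At the CUDA language level these memory spaces appear roughly as sketched below (the variable names, sizes and the assumed block size of 256 threads are illustrative only; texture memory, accessed through texture references, is omitted):

    __constant__ float coeff[16];                 // constant memory (read-only in kernels)
    __device__   float bias[1024];                // global memory (R/W), declared at file scope

    __global__ void scale(float* gOut, const float* gIn)   // gIn/gOut also point into global memory
    {
        __shared__ float tile[256];               // per-block shared memory (R/W), organized into banks
        int   i   = threadIdx.x;                  // assumes blockDim.x == 256
        float val = gIn[blockIdx.x * blockDim.x + i];     // 'val' is held in a private register
        tile[i]   = val * coeff[i % 16];
        gOut[blockIdx.x * blockDim.x + i] = tile[i] + bias[i % 1024];
    }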

Mapping of the memory spaces of the programming model to the memory spaces of the streaming processor

5.1 Nvidia's GPGPU line (26)

  • 8/8/2019 GPUs DP Accelerators MSC

    147/166

Mapping of the memory spaces of the programming model to the memory spaces of the streaming processor

    Streaming Multiprocessor 1 (SM 1)

A thread block is scheduled for execution on a particular multithreaded SM.

An SM incorporates 8 Execution Units (designated as Processors in the figure).

SMs are the fundamental processing units for CUDA thread blocks.

    Figure: Memory spaces of the SM [7]

    The synchronization concept

5.1 Nvidia's GPGPU line (27)

  • 8/8/2019 GPUs DP Accelerators MSC

    148/166

synchronization is achieved by calling the __syncthreads() intrinsic function, declared as void __syncthreads();

used to coordinate memory accesses at synchronization points,

at synchronization points the execution of the threads is suspended until all threads reach this point (barrier synchronization)

    Barrier synchronization
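A minimal sketch of barrier synchronization within a thread block: each 256-element segment of an array is reversed in place, and the barrier guarantees that every element has been written into shared memory before any thread reads an element written by another thread (the kernel name and the block size of 256 are assumptions):

    __global__ void reverseBlock(float* d_data)
    {
        __shared__ float buf[256];               // per-block shared memory, assumes blockDim.x == 256
        int i    = threadIdx.x;
        int base = blockIdx.x * blockDim.x;

        buf[i] = d_data[base + i];               // every thread writes one element
        __syncthreads();                         // barrier: wait until the whole block has written

        d_data[base + i] = buf[255 - i];         // safe to read an element written by another thread
    }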

    GT200

5.1 Nvidia's GPGPU line (28)

  • 8/8/2019 GPUs DP Accelerators MSC

    149/166

5.1 Nvidia's GPGPU line (29)

  • 8/8/2019 GPUs DP Accelerators MSC

    150/166

    Figure: Block diagram of the GT200 [16]

5.1 Nvidia's GPGPU line (30)

  • 8/8/2019 GPUs DP Accelerators MSC

    151/166

Figure: The Core Block of the GT200 [16]

5.1 Nvidia's GPGPU line (31)

  • 8/8/2019 GPUs DP Accelerators MSC

    152/166

Figure: Block diagram of the GT200 cores [16]

Streaming Multiprocessors: SIMT cores

5.1 Nvidia's GPGPU line (32)

  • 8/8/2019 GPUs DP Accelerators MSC

    153/166

    Figure: Major functional blocks of GT200 ALUs [16]

5.1 Nvidia's GPGPU line (33)

  • 8/8/2019 GPUs DP Accelerators MSC

    154/166

Figure: Die shot of the GT200 [17]

    6. References

[1]: Torricelli F., AMD in HPC, HPC07, http://www.altairhyperworks.co.uk/html/en-GB/keynote2/Torricelli_AMD.pdf

    6. References (1)

  • 8/8/2019 GPUs DP Accelerators MSC

    155/166

[2]: NVIDIA Tesla C870 GPU Computing Board, Board Specification, Jan. 2008, Nvidia

[3]: AMD FireStream 9170, http://ati.amd.com/technology/streamcomputing/product_firestream_9170.html

[4]: NVIDIA Tesla D870 Deskside GPU Computing System, System Specification, Jan. 2008, Nvidia, http://www.nvidia.com/docs/IO/43395/D870-SystemSpec-SP-03718-001_v01.pdf

[5]: Tesla S870 GPU Computing System, Specification, Nvidia, http://jp.nvidia.com/docs/IO/43395/S870-BoardSpec_SP-03685-001_v00b.pdf

[6]: Torres G., Nvidia Tesla Technology, Nov. 2007, http://www.hardwaresecrets.com/article/495

[7]: R600-Family Instruction Set Architecture, Revision 0.31, May 2007, AMD

[8]: Zheng B., Gladding D., Villmow M., Building a High Level Language Compiler for GPGPU, ASPLOS 2006, June 2008

[9]: Huddy R., ATI Radeon HD2000 Series Technology Overview, AMD Technology Day, 2007, http://ati.amd.com/developer/techpapers.html

[10]: Compute Abstraction Layer (CAL) Technology Intermediate Language (IL), Version 2.0, Oct. 2008, AMD

[11]: Nvidia CUDA Compute Unified Device Architecture Programming Guide, Version 2.0, June 2008, Nvidia

    6. References (2)

  • 8/8/2019 GPUs DP Accelerators MSC

    156/166

[12]: Kirk D. & Hwu W. W., ECE498AL Lectures 7: Threading Hardware in G80, 2007, University of Illinois, Urbana-Champaign, http://courses.ece.uiuc.edu/ece498/al1/lectures/lecture7-threading%20hardware.ppt#256,1,ECE 498AL Lectures 7: Threading Hardware in G80

[13]: Kogo H., R600 (Radeon HD2900 XT), PC Watch, June 26 2008, http://pc.watch.impress.co.jp/docs/2008/0626/kaigai_3.pdf

[14]: Nvidia G80, PC Watch, April 16 2007, http://pc.watch.impress.co.jp/docs/2007/0416/kaigai350.htm

[15]: GeForce 8800GT (G92), PC Watch, Oct. 31 2007, http://pc.watch.impress.co.jp/docs/2007/1031/kaigai398_07.pdf

[16]: NVIDIA GT200 and AMD RV770, PC Watch, July 2 2008, http://pc.watch.impress.co.jp/docs/2008/0702/kaigai451.htm

[17]: Shrout R., Nvidia GT200 Revealed: GeForce GTX 280 and GTX 260 Review, PC Perspective, June 16 2008, http://www.pcper.com/article.php?aid=577&type=expert&pid=3

[18]: http://en.wikipedia.org/wiki/DirectX

[19]: Dietrich S., Shader Model 3.0, April 2004, Nvidia, http://www.cs.umbc.edu/~olano/s2004c01/ch15.pdf

[20]: Microsoft DirectX 10: The Next-Generation Graphics API, Technical Brief, Nov. 2006, Nvidia, http://www.nvidia.com/page/8800_tech_briefs.html

[21]: Patidar S. & al., Exploiting the Shader Model 4.0 Architecture, Center for Visual Information Technology, IIIT Hyderabad,

    6. References (3)

  • 8/8/2019 GPUs DP Accelerators MSC

    157/166

    http://research.iiit.ac.in/~shiben/docs/SM4_Skp-Shiben-Jag-PJN_draft.pdf

[22]: Nvidia GeForce 8800 GPU Architecture Overview, Vers. 0.1, Nov. 2006, Nvidia, http://www.nvidia.com/page/8800_tech_briefs.html

[24]: Fatahalian K., From Shader Code to a Teraflop: How Shader Cores Work, Workshop: Beyond Programmable Shading: Fundamentals, SIGGRAPH 2008

[25]: Kanter D., NVIDIA's GT200: Inside a Parallel Processor, 09-08-2008

[23]: Graphics Pipeline Rendering History, Aug. 22 2008, PC Watch, http://pc.watch.impress.co.jp/docs/2008/0822/kaigai_06.pdf

[26]: Nvidia CUDA Compute Unified Device Architecture Programming Guide, Version 1.1, Nov. 2007, Nvidia

[27]: Seiler L. & al., Larrabee: A Many-Core x86 Architecture for Visual Computing, ACM Transactions on Graphics, Vol. 27, No. 3, Article No. 18, Aug. 2008

[29]: Shrout R., IDF Fall 2007 Keynote, Sept. 18, 2007, PC Perspective, http://www.pcper.com/article.php?aid=453

[28]: Kogo H., Larrabee, PC Watch, Oct. 17, 2008, http://pc.watch.impress.co.jp/docs/2008/1017/kaigai472.htm

[30]: Stokes J., Larrabee: Intel's biggest leap ahead since the Pentium Pro, Aug. 04. 2008, http://arstechnica.com/news.ars/post/20080804-larrabee-

    6. References (4)

  • 8/8/2019 GPUs DP Accelerators MSC

    158/166

    intels-biggest-leap-ahead-since-the-pentium-pro.html

[31]: Shimpi A. L. & Wilson D., Intel's Larrabee Architecture Disclosure: A Calculated First Move, Anandtech, Aug. 4. 2008, http://www.anandtech.com/showdoc.aspx?i=3367&p=2

[32]: Hester P., Multi-Core and Beyond: Evolving the x86 Architecture, Hot Chips 19, Aug. 2007, http://www.hotchips.org/hc19/docs/keynote2.pdf

[33]: AMD Stream Computing, User Guide, Oct. 2008, Rev. 1.2.1, http://ati.amd.com/technology/streamcomputing/Stream_Computing_User_Guide.pdf

[34]: Doggett M., Radeon HD 2900, Graphics Hardware Conf., Aug. 2007, http://www.graphicshardware.org/previous/www_2007/presentations/doggett-radeon2900-gh07.pdf

[35]: Mantor M., AMD's Radeon HD 2900, Hot Chips 19, Aug. 2007, http://www.hotchips.org/archives/hc19/2_Mon/HC19.03/HC19.03.01.pdf

[36]: Houston M., Anatomy of AMD's TeraScale Graphics Engine, SIGGRAPH 2008, http://s08.idav.ucdavis.edu/houston-amd-terascale.pdf

[37]: Mantor M., Entering the Golden Age of Heterogeneous Computing, PEEP 2008, http://ati.amd.com/technology/streamcomputing/IUCAA_Pune_PEEP_2008.pdf

[38]: Kogo H., RV770 Overview, PC Watch, July 02 2008, http://pc.watch.impress.co.jp/docs/2008/0702/kaigai_09.pdf

    6. References (5)

  • 8/8/2019 GPUs DP Accelerators MSC

    159/166

    AMD/ATI RV870 (Cypress) Radeon 5870 graphics card

5.1 AMD/ATI RV870 (Cypress)

  • 8/8/2019 GPUs DP Accelerators MSC

    160/166

    OpenCL 1.0 compliant

    Introduction: Sept. 22 2009

    Availability: now

Performance figures (see the cross-check below):

    Engine clock speed: 850 MHz

    SP FP performance: 2.72 TFLOPS

    DP FP performance: 544 GFLOPS (1/5 of SP FP performance)
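As a quick cross-check of the quoted peak figures (assuming, as is usual for such peak numbers, one single-precision multiply-add, i.e. 2 FLOPs, per stream processing unit per cycle):

    1600 \times 2\,\tfrac{\text{FLOP}}{\text{cycle}} \times 0.85\,\text{GHz} = 2720\,\text{GFLOPS} \approx 2.72\,\text{TFLOPS},
    \qquad \tfrac{2.72\,\text{TFLOPS}}{5} = 544\,\text{GFLOPS (DP)}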

5.1 AMD/ATI RV870 (Cypress)

    Radeon 4800 series/5800 series comparison

  • 8/8/2019 GPUs DP Accelerators MSC

    161/166

                          ATI Radeon HD 4870    ATI Radeon HD 5850    ATI Radeon HD 5870
Manufacturing Process     55 nm                 40 nm                 40 nm
# of Transistors          956 million           2.15 billion          2.15 billion
Core Clock Speed          750 MHz               725 MHz               850 MHz
# of Stream Processors    800                   1440                  1600
Compute Performance       1.2 TFLOPS            2.09 TFLOPS           2.72 TFLOPS
Memory Type               GDDR5                 GDDR5                 GDDR5
Memory Clock              900 MHz               1000 MHz              1200 MHz
Memory Data Rate          3.6 Gbps              4.0 Gbps              4.8 Gbps
Memory Bandwidth          115.2 GB/sec          128 GB/sec            153.6 GB/sec
Max Board Power           160 W                 170 W                 188 W
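The Memory Bandwidth row follows directly from the bus width and the memory data rate; e.g. for the HD 5870, with its 256-bit (8 × 32-bit) GDDR5 interface (see the architecture overview below):

    \frac{256\,\text{bit} \times 4.8\,\text{Gbps}}{8\,\text{bit/byte}} = 153.6\,\text{GB/s}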

    RV770-RV870 Comparison

5.1 AMD/ATI RV870 (Cypress)

  • 8/8/2019 GPUs DP Accelerators MSC

    162/166

                      ATI Radeon HD 4870        ATI Radeon HD 5870       Difference
Die Size              263 mm2                   334 mm2                  1.27x
# of Transistors      956 million               2.15 billion             2.25x
# of Shaders          800                       1600                     2x
Board Power           90 W idle / 160 W load    27 W idle / 188 W max    0.3x / 1.17x

5.1 AMD/ATI RV870 (Cypress)

    Architecture overview

  • 8/8/2019 GPUs DP Accelerators MSC

    163/166

Figure: RV870 architecture overview (annotations): 8 × 32 = 256-bit GDDR5 memory interface, 153.6 GB/s; 1600 ALUs (stream processing units); 8 cores

5.1 AMD/ATI RV870 (Cypress)

    The 5870 card

  • 8/8/2019 GPUs DP Accelerators MSC

    164/166

    http://techreport.com/articles.x/17618/3

    The 5870 card

5.1 AMD/ATI RV870 (Cypress)

Nvidia's Fermi

Introduced: Sept. 30, 2009 at Nvidia's GPU Technology Conference; Available: Q1 2010

  • 8/8/2019 GPUs DP Accelerators MSC

    165/166

Introduced: Sept. 30, 2009 at Nvidia's GPU Technology Conference; Available: Q1 2010

5.2 Nvidia Fermi

Fermi's overall structure

  • 8/8/2019 GPUs DP Accelerators MSC

    166/166

Nvidia: 16 cores (Streaming Multiprocessors)

Each core: 32 ALUs (i.e., 16 × 32 = 512 ALUs in total)