GPGPUs - Data Parallel Accelerators (MSc)

TRANSCRIPT

  • 8/8/2019 GPUs DP Accelerators MSC

    1/166

GPGPUs - Data Parallel Accelerators

    Dezső Sima

    Oct. 20, 2009

    Dezső Sima 2009, Ver. 1.0

  • 8/8/2019 GPUs DP Accelerators MSC

    2/166

Contents

    1. Introduction

    2. Basics of the SIMT execution

    3. Overview of GPGPUs

    4. Overview of data parallel accelerators

    5. Microarchitecture of GPGPUs (examples)

    5.1 AMD/ATI RV870 (Cypress)

    5.2 Nvidia Fermi

    5.3 Intel's Larrabee

    6. References

  • 8/8/2019 GPUs DP Accelerators MSC

    3/166

    1. The emergence of GPGPUs

  • 8/8/2019 GPUs DP Accelerators MSC

    4/166

Representation of objects by triangles (vertices, edges, surfaces)

    Vertices

    have three spatial coordinates and carry supplementary information necessary to render the object, such as

    color, texture,

    reflectance properties etc.

    1. Introduction (1)

  • 8/8/2019 GPUs DP Accelerators MSC

    5/166

    Main types of shaders in GPUs

Shaders

    Vertex shaders: transform each vertex's 3D position in the virtual space to the 2D coordinate at which it appears on the screen.

    Pixel shaders (fragment shaders): calculate the color of the pixels.

    Geometry shaders: can add or remove vertices from a mesh.

    1. Introduction (2)

  • 8/8/2019 GPUs DP Accelerators MSC

    6/166

    DirectX version Pixel SM Vertex SM Supporting OS

    8.0 (11/2000) 1.0, 1.1 1.0, 1.1 Windows 2000

8.1 (10/2001) 1.2, 1.3, 1.4 1.0, 1.1 Windows XP / Windows Server 2003

    9.0 (12/2002) 2.0 2.0

    9.0a (3/2003) 2_A, 2_B 2.x

    9.0c (8/2004) 3.0 3.0 Windows XP SP2

    10.0 (11/2006) 4.0 4.0 Windows Vista

    10.1 (2/2008) 4.1 4.1 Windows Vista SP1/Windows Server 2008

    11 (in development) 5.0 5.0

Table: Pixel/vertex shader models (SM) supported by subsequent versions of DirectX and MS's OSs [18], [21]

    1. Introduction (3)

  • 8/8/2019 GPUs DP Accelerators MSC

    7/166

    Convergence of important features of the vertex and pixel shader models

Subsequent shader models typically introduce a number of new/enhanced features.

    Shader model 2 [19]

    Different precision requirements

Vertex shader: FP32 (coordinates); Pixel shader: FX24 (3 colors x 8)

    Different instructions

    Different resources (e.g. registers)

Differences between the vertex and pixel shader models in subsequent shader models concerning precision requirements, instruction sets and programming resources.

    Shader model 3 [19]

Unified precision requirements for both shaders (FP32), with the option to specify partial precision (FP16 or FP24) by adding a modifier to the shader code

    Different instructions

    Different resources (e.g. registers)

    1. Introduction (4)

  • 8/8/2019 GPUs DP Accelerators MSC

    8/166

    Shader model 4 (introduced with DirectX10) [20]

Unified precision requirements for both shaders (FP32), with the possibility to use new data formats.

    Unified instruction set

    Unified resources (e.g. temporary and constant registers)

    Shader architectures of GPUs prior to SM4

GPUs prior to SM4 (DirectX 10) have separate vertex and pixel units with different features.

    Drawback of having separate units for vertex and pixel shading

    Inefficiency of the hardware implementation

    (Vertex shaders and pixel shaders often have complementary load patterns [21]).

    1. Introduction (5)

  • 8/8/2019 GPUs DP Accelerators MSC

    9/166

    DirectX version Pixel SM Vertex SM Supporting OS

    8.0 (11/2000) 1.0, 1.1 1.0, 1.1 Windows 2000

8.1 (10/2001) 1.2, 1.3, 1.4 1.0, 1.1 Windows XP / Windows Server 2003

    9.0 (12/2002) 2.0 2.0

    9.0a (3/2003) 2_A, 2_B 2.x

    9.0c (8/2004) 3.0 3.0 Windows XP SP2

    10.0 (11/2006) 4.0 4.0 Windows Vista

    10.1 (2/2008) 4.1 4.1 Windows Vista SP1/Windows Server 2008

    11 (in development) 5.0 5.0

Table: Pixel/vertex shader models (SM) supported by subsequent versions of DirectX and MS's OSs [18], [21]

    1. Introduction (6)

  • 8/8/2019 GPUs DP Accelerators MSC

    10/166

    Unified shader model (introduced in the SM 4.0 of DirectX 10.0)

The same (programmable) processor can be used to implement all shaders:

    the vertex shader,

    the pixel shader and

    the geometry shader (new feature of SM 4).

    Unified, programmable shader architecture

    1. Introduction (7)

  • 8/8/2019 GPUs DP Accelerators MSC

    11/166

    Figure: Principle of the unified shader architecture [22]

    1. Introduction (8)

  • 8/8/2019 GPUs DP Accelerators MSC

    12/166

Based on its FP32 computing capability and the large number of FP units available, the unified shader is a prospective candidate for speeding up HPC!

GPUs with unified shader architectures are also termed

    GPGPUs

    (General Purpose GPUs)

    or

    cGPUs (computational GPUs)

    1. Introduction (9)

  • 8/8/2019 GPUs DP Accelerators MSC

    13/166

Figure: Peak SP FP performance of Nvidia's GPUs vs. Intel P4 and Core 2 processors [11]

    1. Introduction (10)


  • 8/8/2019 GPUs DP Accelerators MSC

    14/166

Figure: Bandwidth values of Nvidia's GPUs vs. Intel's P4 and Core 2 processors [11]

    1. Introduction (11)


  • 8/8/2019 GPUs DP Accelerators MSC

    15/166

    Figure: Contrasting the utilization of the silicon area in CPUs and GPUs [11]

    1. Introduction (12)

  • 8/8/2019 GPUs DP Accelerators MSC

    16/166

    2. Basics of the SIMT execution


  • 8/8/2019 GPUs DP Accelerators MSC

    17/166

    Main alternatives of data parallel execution

    Data parallel execution

SIMD execution: one-dimensional data parallel execution, i.e. it performs the same operation on all elements of given FX/FP input vectors. Needs an FX/FP SIMD extension of the ISA. E.g. 2nd and 3rd generation superscalars.

    SIMT execution: two-dimensional data parallel execution, i.e. it performs the same operation on all elements of given FX/FP input arrays (matrices); is massively multithreaded, and provides data dependent flow control as well as barrier synchronization. Needs an FX/FP SIMT extension of the ISA and the API. E.g. GPGPUs, data parallel accelerators.

    Figure: Main alternatives of data parallel execution

    2. Basics of the SIMT execution (1)


  • 8/8/2019 GPUs DP Accelerators MSC

    18/166

    Scalar execution SIMD execution SIMT execution

Domain of execution: single data elements

    Domain of execution: elements of vectors

    Domain of execution: elements of matrices

    (at the programming level)

    Figure: Domains of execution in case of scalar, SIMD and SIMT execution

    2. Basics of the SIMT execution (2)

    Remark

SIMT execution is also termed SPMD (Single Program Multiple Data) execution (Nvidia).

    Scalar, SIMD and SIMT execution

    2 Basics of the SIMT execution (3)

  • 8/8/2019 GPUs DP Accelerators MSC

    19/166

    Key components of the implementation of SIMT execution

    Data parallel execution Massive multithreading

    Data dependent flow control

    Barrier synchronization

    2. Basics of the SIMT execution (3)

    2 Basics of the SIMT execution (4)

  • 8/8/2019 GPUs DP Accelerators MSC

    20/166

Data parallel execution

    Performed by SIMT cores.

    SIMT cores execute the same instruction stream on a number of ALUs (i.e. all ALUs of a SIMT core typically perform the same operation).

    SIMT cores are the basic building blocks of GPGPUs or data parallel accelerators.

    During SIMT execution 2-dimensional matrices will be mapped to blocks of SIMT cores.

    Figure: Basic layout of a SIMT core (a Fetch/Decode unit feeding a row of ALUs)

    2. Basics of the SIMT execution (4)

  • 8/8/2019 GPUs DP Accelerators MSC

    21/166

Remark 1

    Different manufacturers designate SIMT cores differently, such as

    streaming multiprocessor (Nvidia),

    superscalar shader processor (AMD),

    wide SIMD processor, CPU core (Intel).

    2. Basics of the SIMT execution (5)

    2 Basics of the SIMT execution (6)

  • 8/8/2019 GPUs DP Accelerators MSC

    22/166

Each ALU is allocated a working register set (RF).

    Figure: Main functional blocks of a SIMT core (Fetch/Decode unit, ALUs and their register files (RF))

    2. Basics of the SIMT execution (6)

    2 Basics of the SIMT execution (7)

  • 8/8/2019 GPUs DP Accelerators MSC

    23/166

SIMT ALUs typically perform RRR operations, that is,

    ALUs take their operands from and write the calculated results to the register set (RF) allocated to them.

    Figure: Principle of operation of the SIMT ALUs

    2. Basics of the SIMT execution (7)

    2 Basics of the SIMT execution (8)

  • 8/8/2019 GPUs DP Accelerators MSC

    24/166

    Remark 2

Actually, the register sets (RF) allocated to each ALU are given parts of a large enough register file.

    Figure: Allocation of distinct parts of a large register set as workspaces of the ALUs

    2. Basics of the SIMT execution (8)

    2 Basics of the SIMT execution (9)

  • 8/8/2019 GPUs DP Accelerators MSC

    25/166

Basic operation of recent SIMT ALUs

    SIMT ALUs

    are pipelined, capable of starting a new operation every clock cycle (more precisely, every shader clock cycle),

    execute basically SP FP MADD (single precision, i.e. 32-bit, Multiply-Add) instructions of the form a×b+c,

    need a few clock cycles, e.g. 2 or 4 shader cycles, to present the results of the SP FP MADD operations to the RF.

    That is, without further enhancements

    their peak performance is 2 SP FP operations/cycle

    2. Basics of the SIMT execution (9)

    2 Basics of the SIMT execution (10)

  • 8/8/2019 GPUs DP Accelerators MSC

    26/166

    Additional operations provided by SIMT ALUs

FX operations and FX/FP conversions, DP FP operations,

    trigonometric functions (usually supported by special functional units).

    2. Basics of the SIMT execution (10)

    2 Basics of the SIMT execution (11)

  • 8/8/2019 GPUs DP Accelerators MSC

    27/166

Massive multithreading

    Aim of massive multithreading: to speed up computations by increasing the utilization of available computing resources in case of stalls (e.g. due to cache misses).

    Principle

    Suspend stalled threads from execution and allocate ready-to-run threads for execution.

    When a large enough number of threads are available, long stalls can be hidden.

    2. Basics of the SIMT execution (11)

    2 Basics of the SIMT execution (12)

  • 8/8/2019 GPUs DP Accelerators MSC

    28/166

    Multithreading is implemented by

creating and managing parallel executable threads for each data element of the execution domain.

    Figure: Parallel executable threads for each element of the execution domain

    Same instructions for all data elements

    2. Basics of the SIMT execution (12)

    2 Basics of the SIMT execution (13)

  • 8/8/2019 GPUs DP Accelerators MSC

    29/166

Effective implementation of multithreading:

    thread switches, called context switches, should not cause cycle penalties.

    Achieved by

    providing separate contexts (register space) for each thread, and

    implementing a zero-cycle context switch mechanism.

    2. Basics of the SIMT execution (13)

    2. Basics of the SIMT execution (14)

  • 8/8/2019 GPUs DP Accelerators MSC

    30/166

Figure: Providing separate thread contexts for each thread allocated for execution in a SIMT ALU (the register file (RF) of the SIMT core holds per-ALU context areas (CTX); a context switch selects the actual context)

    2. Basics of the SIMT execution (14)

    2. Basics of the SIMT execution (15)

  • 8/8/2019 GPUs DP Accelerators MSC

    31/166

    Data dependent flow control

    Implemented by SIMT branch processing

In SIMT processing both paths of a branch are executed one after the other, such that

    for each path the prescribed operations are executed only on those data elements which fulfill the data condition given for that path (e.g. xi > 0).

    Example

    2. Basics of the SIMT execution (15)
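As a minimal illustration (a hypothetical CUDA kernel, not taken from the slides), the fragment below produces exactly this situation: within a warp, the threads whose element fulfills x[i] > 0 execute the first path while the others are masked out, then the remaining threads execute the second path.

        __global__ void branchExample(const float *x, float *y)
        {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (x[i] > 0.0f)
                y[i] = x[i] * 2.0f;   // executed first, only by threads fulfilling the condition
            else
                y[i] = 0.0f;          // executed afterwards, only by the remaining threads
        }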

    2. Basics of the SIMT execution (16)

  • 8/8/2019 GPUs DP Accelerators MSC

    32/166

    Figure: Execution of branches [24]

    The given condition will be checked separately for each thread

    2. Basics of the SIMT execution (16)

    2. Basics of the SIMT execution (17)

  • 8/8/2019 GPUs DP Accelerators MSC

    33/166

    Figure: Execution of branches [24]

First all ALUs meeting the condition execute the prescribed three operations, then all ALUs missing the condition execute the next two operations.

    2. Basics of the SIMT execution (17)

    2. Basics of the SIMT execution (18)

  • 8/8/2019 GPUs DP Accelerators MSC

    34/166

    Figure: Resuming instruction stream processing after executing a branch [24]


    2. Basics of the SIMT execution (19)

  • 8/8/2019 GPUs DP Accelerators MSC

    35/166

    Barrier synchronization

Makes all threads wait until all prior instructions have completed before executing the next instruction.

    Implemented e.g. in AMD's Intermediate Language (IL) by the fence threads instruction [10].

    Remark

    In the R600 ISA this instruction is coded by setting the BARRIER field of the Control Flow (CF) instruction format [7].


    2. Basics of the SIMT execution (20)

  • 8/8/2019 GPUs DP Accelerators MSC

    36/166

Principle of SIMT execution

    Each kernel invocation executes a grid of thread blocks (Block(i,j)).

    Figure: Hierarchy of threads [25] (the host invokes kernel0(), kernel1(); each executes on the device as a grid of thread blocks)


  • 8/8/2019 GPUs DP Accelerators MSC

    37/166

    3. Overview of GPGPUs

    3. Overview of GPGPUs (1)

  • 8/8/2019 GPUs DP Accelerators MSC

    38/166

    Basic implementation alternatives of the SIMT execution

GPGPUs: programmable GPUs with appropriate programming environments; have display outputs. E.g. Nvidia's 8800 and GTX lines, AMD's HD 38xx, HD 48xx lines.

    Data parallel accelerators: dedicated units supporting data parallel execution with an appropriate programming environment; no display outputs, have larger memories than GPGPUs. E.g. Nvidia's Tesla lines, AMD's FireStream lines.

    Figure: Basic implementation alternatives of the SIMT execution


    3. Overview of GPGPUs (2)

  • 8/8/2019 GPUs DP Accelerators MSC

    39/166

    GPGPUs

Nvidia's line: 90 nm: G80 → (shrink) 65 nm: G92 → (enhanced arch.) G200 → (shrink, enhanced arch.) 40 nm: Fermi

    AMD/ATI's line: 80 nm: R600 → (shrink) 55 nm: RV670 → (enhanced arch.) RV770 → (shrink, enhanced arch.) 40 nm: RV870

    Figure: Overview of Nvidia's and AMD/ATI's GPGPU lines

  • 8/8/2019 GPUs DP Accelerators MSC

    40/166

    3. Overview of GPGPUs (4)

  • 8/8/2019 GPUs DP Accelerators MSC

    41/166

    8800 GTS 8800 GTX 8800 GT GTX 260 GTX 280

    Core G80 G80 G92 GT200 GT200

    Introduction 11/06 11/06 10/07 6/08 6/08

    IC technology 90 nm 90 nm 65 nm 65 nm 65 nm

    Nr. of transistors 681 mtrs 681 mtrs 754 mtrs 1400 mtrs 1400 mtrs

Die area 480 mm2 480 mm2 324 mm2 576 mm2 576 mm2

    Core frequency 500 MHz 575 MHz 600 MHz 576 MHz 602 MHz

    Computation

    No.of ALUs 96 128 112 192 240

    Shader frequency 1.2 GHz 1.35 GHz 1.512 GHz 1.242 GHz 1.296 GHz

    No. FP32 inst./cycle 3* (but only in a few issue cases) 3 3

Peak FP32 performance 346 GFLOPS 512 GFLOPS 508 GFLOPS 715 GFLOPS 933 GFLOPS

Peak FP64 performance 77/76 GFLOPS

    Memory

    Mem. transfer rate (eff) 1600 Mb/s 1800 Mb/s 1800 Mb/s 1998 Mb/s 2214 Mb/s

    Mem. interface 320-bit 384-bit 256-bit 448-bit 512-bit

    Mem. bandwidth 64 GB/s 86.4 GB/s 57.6 GB/s 111.9 GB/s 141.7 GB/s

    Mem. size 320 MB 768 MB 512 MB 896 MB 1.0 GB

    Mem. type GDDR3 GDDR3 GDDR3 GDDR3 GDDR3

    Mem. channel 6*64-bit 6*64-bit 4*64-bit 8*64-bit 8*64-bit

    Mem. contr. Crossbar Crossbar Crossbar Crossbar Crossbar

    System

Multi-GPU techn. SLI SLI SLI SLI SLI

    Interface PCIe x16 PCIe x16 PCIe 2.0x16 PCIe 2.0x16 PCIe 2.0x16

    MS Direct X 10 10 10 10.1 subset 10.1 subset

Table: Main features of Nvidia's GPGPUs
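As a quick sanity check of the peak FP32 figures (using the ALU counts, shader clocks and the 3 FP32 operations/cycle, i.e. MADD plus a dual-issued MUL, listed above):

        GTX 280: 240 ALUs × 3 FP32 operations/cycle × 1.296 GHz ≈ 933 GFLOPS
        GTX 260: 192 ALUs × 3 FP32 operations/cycle × 1.242 GHz ≈ 715 GFLOPS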

    3. Overview of GPGPUs (5)

  • 8/8/2019 GPUs DP Accelerators MSC

    42/166

    HD 2900XT HD 3850 HD 3870 HD 4850 HD 4870

Core R600 RV670 RV670 RV770 RV770

    Introduction 5/07 11/07 11/07 5/08 5/08

    IC technology 80 nm 55 nm 55 nm 55 nm 55 nm

    Nr. of transistors 700 mtrs 666 mtrs 666 mtrs 956 mtrs 956 mtrs

Die area 408 mm2 192 mm2 192 mm2 260 mm2 260 mm2

    Core frequency 740 MHz 670 MHz 775 MHz 625 MHz 750 MHz

    Computation

    No. of ALUs 320 320 320 800 800

    Shader frequency 740 MHz 670 MHz 775 MHz 625 MHz 750 MHz

    No. FP32 inst./cycle 2 2 2 2 2

Peak FP32 performance 471.6 GFLOPS 429 GFLOPS 496 GFLOPS 1000 GFLOPS 1200 GFLOPS

Peak FP64 performance 200 GFLOPS 240 GFLOPS

    Memory

    Mem. transfer rate (eff) 1600 Mb/s 1660 Mb/s 2250 Mb/s 2000 Mb/s 3600 Mb/s (GDDR5)

Mem. interface 512-bit 256-bit 256-bit 256-bit 256-bit

Mem. bandwidth 105.6 GB/s 53.1 GB/s 72.0 GB/s 64 GB/s 118 GB/s

    Mem. size 512 MB 256 MB 512 MB 512 MB 512 MB

    Mem. type GDDR3 GDDR3 GDDR4 GDDR3 GDDR3/GDDR5

    Mem. channel 8*64-bit 8*32-bit 8*32-bit 4*64-bit 4*64-bit

    Mem. contr. Ring bus Ring bus Ring bus Crossbar Crossbar

    System

Multi-GPU techn. CrossFire CrossFire X CrossFire X CrossFire X CrossFire X

    Interface PCIe x16 PCIe 2.0x16 PCIe 2.0x16 PCIe 2.0x16 PCIe 2.0x16

    MS Direct X 10 10.1 10.1 10.1 10.1

Table: Main features of AMD/ATI's GPGPUs

    3. Overview of GPGPUs (6)

  • 8/8/2019 GPUs DP Accelerators MSC

    43/166

    Price relations (as of 10/2008)

    Nvidia

    GTX260 ~ 300 $

    GTX280 ~ 600 $

    AMD/ATI

    HD4850 ~ 200 $

    HD4870 na

  • 8/8/2019 GPUs DP Accelerators MSC

    44/166

    4. Overview of data parallel accelerators

    4. Overview of data parallel accelerators (1)

  • 8/8/2019 GPUs DP Accelerators MSC

    45/166

    Implementation alternatives of data parallel accelerators

Data parallel accelerators

    On-card implementation (recent implementations): e.g. GPU cards, data-parallel accelerator cards.

    On-die integration (future implementations): e.g. Intel's Heavendahl, AMD's Torrenza integration technology, AMD's Fusion integration technology.

    Trend: from on-card implementation towards on-die integration.

    Figure: Implementation alternatives of dedicated data parallel accelerators

    4. Overview of data parallel accelerators (2)

  • 8/8/2019 GPUs DP Accelerators MSC

    46/166

On-card accelerators

    Card implementations: single cards fitting into a free PCI-E x16 slot of the host computer. E.g. Nvidia Tesla C870, Nvidia Tesla C1060, AMD FireStream 9170.

    Desktop implementations: usually dual cards mounted into a box, connected to an adapter card that is inserted into a free PCI-E x16 slot of the host PC through a cable. E.g. Nvidia Tesla D870.

    1U server implementations: usually 4 cards mounted into a 1U server rack, connected to two adapter cards that are inserted into two free PCI-E x16 slots of a server through two switches and two cables. E.g. Nvidia Tesla S870, Nvidia Tesla S1070, AMD FireStream 9250.

    Figure: Implementation alternatives of on-card accelerators

    4. Overview of data parallel accelerators (3)

  • 8/8/2019 GPUs DP Accelerators MSC

    47/166

    Figure: Main functional units of Nvidias Tesla C870 card [2]

    FB: Frame Buffer

    4. Overview of data parallel accelerators (4)

  • 8/8/2019 GPUs DP Accelerators MSC

    48/166

    Figure: Nvidas Tesla C870 andAMDs FireStream 9170 cards [2], [3]

    4. Overview of data parallel accelerators (5)

  • 8/8/2019 GPUs DP Accelerators MSC

    49/166

    Figure: Tesla D870 desktop implementation [4]

    4. Overview of data parallel accelerators (6)

  • 8/8/2019 GPUs DP Accelerators MSC

    50/166

    Figure: Nvidias Tesla D870 desktop implementation [4]

    4. Overview of data parallel accelerators (7)

  • 8/8/2019 GPUs DP Accelerators MSC

    51/166

    Figure: PCI-E x16 host adapter card of Nvidias Tesla D870 desktop [4]

    4. Overview of data parallel accelerators (8)

  • 8/8/2019 GPUs DP Accelerators MSC

    52/166

    Figure: Concept of Nvidias Tesla S870 1U rack server [5]

    4. Overview of data parallel accelerators (9)

  • 8/8/2019 GPUs DP Accelerators MSC

    53/166

    Figure: Internal layout of Nvidias Tesla S870 1U rack [6]

    4. Overview of data parallel accelerators (10)

  • 8/8/2019 GPUs DP Accelerators MSC

    54/166

    Figure: Connection cable between Nvidias Tesla S870 1U rack and the adapter cardsinserted into PCI-E x16 slots of the host server [6]

    4. Overview of data parallel accelerators (11)

  • 8/8/2019 GPUs DP Accelerators MSC

    55/166

NVidia Tesla family

    Card:
    C870 (6/07): G80-based, 1.5 GB GDDR3, 0.519 TFLOPS
    C1060 (6/08): GT200-based, 4 GB GDDR3, 0.936 TFLOPS

    Desktop:
    D870 (6/07): G80-based, 2×C870 incl., 3 GB GDDR3, 1.037 TFLOPS

    1U Server:
    S870 (6/07): G80-based, 4×C870 incl., 6 GB GDDR3, 2.074 TFLOPS
    S1070 (6/08): GT200-based, 4×C1060 incl., 16 GB GDDR3, 3.744 TFLOPS

    CUDA:
    Version 1.0 (6/07), Version 1.01 (11/07), Version 2.0 (6/08)

    Figure: Overview of Nvidia's Tesla family

    4. Overview of data parallel accelerators (12)

  • 8/8/2019 GPUs DP Accelerators MSC

    56/166

AMD FireStream family

    Card:
    9170 (announced 11/07, shipped 6/08): RV670-based, 2 GB GDDR3, 500 GFLOPS FP32, ~200 GFLOPS FP64
    9250 (announced 6/08, shipped 10/08): RV770-based, 1 GB GDDR3, 1 TFLOPS FP32, ~300 GFLOPS FP64

    Stream Computing SDK:
    Version 1.0 (12/07): Brook+, ACML (AMD Core Math Library), CAL (Compute Abstraction Layer); Rapid Mind

    Figure: Overview of AMD/ATI's FireStream family

    4. Overview of data parallel accelerators (13)

  • 8/8/2019 GPUs DP Accelerators MSC

    57/166

    Nvidia Tesla cards AMD FireStream cards

    Core type C870 C1060 9170 9250

    Based on G80 GT200 RV670 RV770

    Introduction 6/07 6/08 11/07 6/08

    Core

    Core frequency 600 MHz 602 MHz 800 MHz 625 MHz

ALU frequency 1350 MHz 1296 MHz 800 MHz 625 MHz

    No. of ALUs 128 240 320 800

    Peak FP32 performance 518 GFLOPS 933 GFLOPS 512 GFLOPS 1 TFLOPS

    Peak FP64 performance ~200 GFLOPS ~250 GFLOPS

    Memory

    Mem. transfer rate (eff) 1600 Mb/s 1600 Mb/s 1600 Mb/s 1986 Mb/s

    Mem. interface 384-bit 512-bit 256-bit 256-bit

    Mem. bandwidth 76.8 GB/s 102 GB/s 51.2 GB/s 63.5 GB/s

    Mem. size 1.5 GB 4 GB 2 GB 1 GB

    Mem. type GDDR3 GDDR3 GDDR3 GDDR3

    System

    Interface PCI-E x16 PCI-E 2.0x16 PCI-E 2.0x16 PCI-E 2.0x16

    Power (max) 171 W 200 W 150 W 150 W

Table: Main features of Nvidia's and AMD/ATI's data parallel accelerator cards

    4. Overview of data parallel accelerators (14)

  • 8/8/2019 GPUs DP Accelerators MSC

    58/166

    Price relations (as of 10/2008)

    Nvidia Tesla

    C870 ~ 1500 $

    D870 ~ 5000 $

    S870 ~ 7500 $

    C1060 ~ 1600 $

    S1070 ~ 8000 $

    AMD/ATI FireStream

    9170 ~ 800 $ 9250 ~ 800 $

  • 8/8/2019 GPUs DP Accelerators MSC

    59/166

    5. Microarchitecture of GPGPUs (examples)

    5.1 AMD/ATI RV870 (Cypress)

    5.2 Nvidia Fermi

    5.3 Intels Larrabee

  • 8/8/2019 GPUs DP Accelerators MSC

    60/166

    5.1 AMD/ATI RV870

    5.1 AMD/ATI RV870 (1)

  • 8/8/2019 GPUs DP Accelerators MSC

    61/166

    OpenCL 1.0 compliant

    AMD/ATI RV870 (Cypress) Radeon 5870 graphics card

Introduction: Sept. 22, 2009. Availability: now.

    Performance figures:

    SP FP performance: 2.72 TFLOPS

    DP FP performance: 544 GFLOPS (1/5 of SP FP performance)

    5.1 AMD/ATI RV870 (2)

  • 8/8/2019 GPUs DP Accelerators MSC

    62/166

    Radeon series/5800

ATI Radeon HD 4870 ATI Radeon HD 5850 ATI Radeon HD 5870

    Manufacturing Process 55-nm 40-nm 40-nm

    # of Transistors 956 million 2.15 billion 2.15 billion

    Core Clock Speed 750MHz 725MHz 850MHz

# of Stream Processors 800 1440 1600

    Compute Performance 1.2 TFLOPS 2.09 TFLOPS 2.72 TFLOPS

    Memory Type GDDR5 GDDR5 GDDR5

    Memory Clock 900MHz 1000MHz 1200MHz

    Memory Data Rate 3.6 Gbps 4.0 Gbps 4.8 Gbps

    Memory Bandwidth 115.2 GB/sec 128 GB/sec 153.6 GB/sec

    Max Board Power 160W 170W 188W

    Idle Board Power 90W 27W 27W

    Figure: Radeon Series/5800 [42]

    5.1 AMD/ATI RV870 (3)

  • 8/8/2019 GPUs DP Accelerators MSC

    63/166

    Radeon 4800 series/5800 series comparison

ATI Radeon HD 4870 ATI Radeon HD 5850 ATI Radeon HD 5870

    Manufacturing Process 55-nm 40-nm 40-nm

    # of Transistors 956 million 2.15 billion 2.15 billion

    Core Clock Speed 750MHz 725MHz 850MHz

# of Stream Processors 800 1440 1600

    Compute Performance 1.2 TFLOPS 2.09 TFLOPS 2.72 TFLOPS

    Memory Type GDDR5 GDDR5 GDDR5

    Memory Clock 900MHz 1000MHz 1200MHz

    Memory Data Rate 3.6 Gbps 4.0 Gbps 4.8 Gbps

    Memory Bandwidth 115.2 GB/sec 128 GB/sec 153.6 GB/sec

    Max Board Power 160W 170W 188W

    Idle Board Power 90W 27W 27W

    Figure: Radeon Series/5800 [42]

    5.1 AMD/ATI RV870 (4)

  • 8/8/2019 GPUs DP Accelerators MSC

    64/166

Architecture overview

    20 cores, 16 ALUs/core, 5 EUs/ALU → 1600 EUs (stream processing units)

    Memory: 8×32 = 256-bit GDDR5, 153.6 GB/s

    Figure: Architecture overview [42]

    5.1 AMD/ATI RV870 (5)
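As a quick check of the figures quoted above (using the 850 MHz core clock of the HD 5870 and one FP32 MADD, i.e. 2 operations, per EU per cycle):

        20 cores × 16 ALUs/core × 5 EUs/ALU = 1600 EUs
        1600 EUs × 2 FP32 operations/cycle × 0.85 GHz = 2720 GFLOPS = 2.72 TFLOPS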

  • 8/8/2019 GPUs DP Accelerators MSC

    65/166

    The 5870 card

    Figure: The 5870 card [41]

  • 8/8/2019 GPUs DP Accelerators MSC

    66/166

    5.2 Nvidia Fermi

    5.2 Nvidia Fermi (1)

  • 8/8/2019 GPUs DP Accelerators MSC

    67/166

    NVidias Fermi

Introduced: Sept. 30, 2009 at NVidia's GPU Technology Conference. Available: Q1 2010.

    5.2 Nvidia Fermi (2)

  • 8/8/2019 GPUs DP Accelerators MSC

    68/166

Fermi's overall structure

    16 cores (Streaming Multiprocessors), each core with 32 ALUs

    6x dual-channel GDDR5 (384-bit)

    Figure: Fermi's overall structure [40]

    5.2 Nvidia Fermi (3)

  • 8/8/2019 GPUs DP Accelerators MSC

    69/166

Layout of a core (SM)

    1 SM includes 32 ALUs (called CUDA cores by NVidia).

    Figure: Layout of a core [40]

    5.2 Nvidia Fermi (4)

  • 8/8/2019 GPUs DP Accelerators MSC

    70/166

    A single ALU (Cuda core)

SP FP: 32-bit; FX: 32-bit

    DP FP: needs 2 clock cycles, i.e. the DP FP performance is 1/2 of the SP FP performance!

    IEEE 754-2008-compliant

    Figure: A single ALU [40]

    5.2 Nvidia Fermi (5)

  • 8/8/2019 GPUs DP Accelerators MSC

    71/166

    Fermis system architecture

    Figure: Fermis system architecture [39]

    5.2 Nvidia Fermi (6)

  • 8/8/2019 GPUs DP Accelerators MSC

    72/166

    Contrasting Fermi and GT 200

    Figure: Contrasting Fermi and GT 200 [39]

    5.2 Nvidia Fermi (7)

  • 8/8/2019 GPUs DP Accelerators MSC

    73/166

The execution of programs utilizing GPGPUs

    Each kernel invocation executes a grid of thread blocks (Block(i,j)).

    Figure: Hierarchy of threads [25] (the host invokes kernel0(), kernel1(); each executes on the device as a grid of thread blocks)

    5.2 Nvidia Fermi (8)

  • 8/8/2019 GPUs DP Accelerators MSC

    74/166

    Global scheduling in Fermi

    Figure: Global scheduling in Fermi [39]

    5.2 Nvidia Fermi (9)

  • 8/8/2019 GPUs DP Accelerators MSC

    75/166

    Microarchitecture of a Fermi core

    5.2 Nvidia Fermi (10)

  • 8/8/2019 GPUs DP Accelerators MSC

    76/166

    Principle of operation of the G80/G92/Fermi GPGPUs

    5.2 Nvidia Fermi (11)

  • 8/8/2019 GPUs DP Accelerators MSC

    77/166

    Work scheduling

    Scheduling thread blocks for execution

    Segmenting thread blocks into warps

    Scheduling warps for execution

    Principle of operation of the G80/G92 GPGPUs

    The key point of operation is work scheduling


    5.2 Nvidia Fermi (12)


  • 8/8/2019 GPUs DP Accelerators MSC

    78/166

CUDA Thread Block

    All threads in a block execute the same kernel program (SPMD).

    Programmer declares the block: block size 1 to 512 concurrent threads; block shape 1D, 2D, or 3D; block dimensions in threads.

    Threads have thread id numbers within the block; the thread program uses the thread id to select work and address shared data.

    Threads in the same block share data and synchronize while doing their share of the work.

    Threads in different blocks cannot cooperate.

    Each block can execute in any order relative to other blocks!

Figure: CUDA Thread Block (Thread Id #: 0 1 2 3 ... m; Thread program)

    Courtesy: John Nickolls, NVIDIA

    linois.edu/ece498/al/lectures/lecture4%20cuda%20threads%20part2%20spring%202009.ppt#316,2

    Thread scheduling in NVidias GPGPUs

    5.2 Nvidia Fermi (13)

  • 8/8/2019 GPUs DP Accelerators MSC

    79/166


Scheduling thread blocks for execution

    Figure: Assigning thread blocks to streaming multiprocessors (SM) for execution [12]

    Up to 8 blocks can be assigned to an SM for execution.

    TPC: Thread Processing Cluster (Texture Processing Cluster)

    A TPC has 2 SMs in the G80/G92 and 3 SMs in the G200.

    A device may run thread blocks sequentially or even in parallel, if it has enough resources for this, or usually by a combination of both.

    5.2 Nvidia Fermi (14)

  • 8/8/2019 GPUs DP Accelerators MSC

    80/166


    Segmenting thread blocks into warps

Threads are scheduled for execution in groups of 32 threads, called warps.

    For scheduling, each thread block is subdivided into warps.

    At any point of time up to 24 warps can be maintained by the scheduler.

    Figure: Segmenting thread blocks in warps [12]

    Remark

The number of threads constituting a warp is an implementation decision and not part of the CUDA programming model.

    5.2 Nvidia Fermi (15)
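A hedged sketch (hypothetical kernel, not from the slides) of how a thread can derive its warp and lane from its index; warpSize is the CUDA built-in that exposes the implementation-dependent warp width (32 on the GPUs discussed here):

        __global__ void warpInfo(int *warpOf, int *laneOf)
        {
            int tid  = threadIdx.x;                      // linear index within a 1D thread block
            int warp = tid / warpSize;                   // which warp of the block the thread belongs to
            int lane = tid % warpSize;                   // position of the thread inside its warp
            int gid  = blockIdx.x * blockDim.x + tid;    // global thread index
            warpOf[gid] = warp;
            laneOf[gid] = lane;
        }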

  • 8/8/2019 GPUs DP Accelerators MSC

    81/166

Scheduling warps for execution

    Figure: Scheduling warps for execution [12] (the SM multithreaded warp scheduler issues, over time, e.g. warp 8 instruction 11, warp 1 instruction 42, warp 3 instruction 95, warp 8 instruction 12, warp 3 instruction 96, ...)

    The warp scheduler is a zero-overhead scheduler:

    Only those warps are eligible for execution whose next instruction has all operands available.

    Eligible warps are scheduled coarse grained (not indicated in the figure), priority based.

    All threads in a warp execute the same instruction when selected.

    4 clock cycles are needed to dispatch the same instruction to all threads in the warp (G80).

  • 8/8/2019 GPUs DP Accelerators MSC

    82/166

    5.3 Intels Larrabee

    5.3 Intels Larrabee (1)

  • 8/8/2019 GPUs DP Accelerators MSC

    83/166

    Larrabee

Part of Intel's Tera-Scale Initiative.

    Brief history:

    Project started ~ 2005
    First unofficial public presentation: 03/2006 (withdrawn)
    First brief public presentation: 09/07 (Otellini) [29]
    First official public presentations: in 2008 (e.g. at SIGGRAPH [27])
    Due in ~ 2009

    Performance (targeted): 2 TFLOPS

    Objectives:

    Not a single product but a base architecture for a number of different products.
    High end graphics processing, HPC

    5.3 Intels Larrabee (2)

  • 8/8/2019 GPUs DP Accelerators MSC

    84/166

    NI: New Instructions

    Figure: Positioning of Larrabeein Intels product portfolio [28]

5.3 Intels Larrabee (3)

  • 8/8/2019 GPUs DP Accelerators MSC

    85/166

    Figure: First public presentation of Larrabee at IDF Fall 2007 [29]

    5.3 Intels Larrabee (4)

  • 8/8/2019 GPUs DP Accelerators MSC

    86/166

    Figure: Block diagram of the Larrabee [30]

    Basic architecture

    Cores: In order x86 IA cores augmented with new instructions

    L2 cache: fully coherent

    Ring bus: 1024 bits wide

    5.3 Intels Larrabee (5)

  • 8/8/2019 GPUs DP Accelerators MSC

    87/166

    Figure: Block diagram of Larrabees cores [31]

    5.3 Intels Larrabee (6)

  • 8/8/2019 GPUs DP Accelerators MSC

    88/166

    Larrabee microarchitecture [27]

Derived from that of the Pentium's in-order design

    5.3 Intels Larrabee (7)

  • 8/8/2019 GPUs DP Accelerators MSC

    89/166

Figure: The ancestor of Larrabee's cores [28]

Main extensions:

    64-bit instructions

    4-way multithreaded (with 4 register sets)

    addition of a 16-wide (16x32-bit) VU

    increased L1 caches (32 KB vs 8 KB)

    access to its 256 KB local subset of a coherent L2 cache

    ring network to access the coherent L2 $ and allow interprocessor communication.

    5.3 Intels Larrabee (8)

  • 8/8/2019 GPUs DP Accelerators MSC

    90/166

    New instructions allow explicit cache control

    the L2 cache can be used as a scratchpad memory while remaining fullycoherent.

    to prefetch data into the L1 and L2 caches

    to control the eviction of cache lines by reducing their priority.

    5.3 Intels Larrabee (9)

  • 8/8/2019 GPUs DP Accelerators MSC

    91/166

    The Scalar Unit

    supports the full ISA of the Pentium(it can run existing code including OS kernels and applications)

    bit count

    bit scan (it finds the next bit set within a register).

    provides new instructions, e.g. for

    5.3 Intels Larrabee (10)

  • 8/8/2019 GPUs DP Accelerators MSC

    92/166

    Figure: Block diagram of the Vector Unit [31]

    The Vector Unit

VU scatter-gather instructions
    (load a VU vector register from 16 non-contiguous data locations from anywhere in the on-die L1 cache without penalty, or store a VU register similarly).

    Numeric conversions
    8-bit and 16-bit integer and 16-bit FP data can be read from the L1 $ or written into the L1 $, with conversion to 32-bit integers without penalty. The L1 D$ thus becomes an extension of the register file.

    Mask registers
    have one bit per lane, to control which lanes of a vector register or memory data are read or written and which remain untouched.

    5.3 Intels Larrabee (11)

  • 8/8/2019 GPUs DP Accelerators MSC

    93/166

    Figure: Layout of the 16-wide vector ALU [31]

    ALUs execute integer, SP and DP FP instructions

    Multiply-add instructions are available.

    ALUs

    5.3 Intels Larrabee (12)

  • 8/8/2019 GPUs DP Accelerators MSC

    94/166

    Task scheduling

Task scheduling is performed entirely by software, rather than by hardware as in Nvidia's or AMD/ATI's GPGPUs.

    5.3 Intels Larrabee (13)

  • 8/8/2019 GPUs DP Accelerators MSC

    95/166

    SP FP performance

2 operations/cycle × 16 ALUs

    → 32 operations/core/cycle

    At present no data available for the clock frequency or the number of cores in Larrabee.

    Assuming a clock frequency of 2 GHz and 32 cores

    SP FP performance: 2 TFLOPS
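Spelling out the arithmetic behind this estimate (assumed clock frequency and core count as stated above, not confirmed figures):

        16 ALUs × 2 SP FP operations/cycle = 32 operations/core/cycle
        32 operations/core/cycle × 32 cores × 2 GHz = 2048 GFLOPS ≈ 2 TFLOPS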

    5.3 Intels Larrabee (14)

  • 8/8/2019 GPUs DP Accelerators MSC

    96/166

    Figure: Larrabees software stack (Source Intel)

Larrabee's Native C/C++ compiler allows many available apps to be recompiled and run correctly with no modifications.

    6. References

    6. References (1)

  • 8/8/2019 GPUs DP Accelerators MSC

    97/166

    [2]: NVIDIA Tesla C870 GPU Computing Board, Board Specification, Jan. 2008, Nvidia

    [1]: Torricelli F., AMD in HPC, HPC07,http://www.altairhyperworks.co.uk/html/en-GB/keynote2/Torricelli_AMD.pdf

    [3] AMD FireStream 9170,http://ati.amd.com/technology/streamcomputing/product_firestream_9170.html

    [4]: NVIDIA Tesla D870 Deskside GPU Computing System, System Specification, Jan. 2008,Nvidia,http://www.nvidia.com/docs/IO/43395/D870-SystemSpec-SP-03718-001_v01.pdf

    [5]: Tesla S870 GPU Computing System, Specification, Nvida,http://jp.nvidia.com/docs/IO/43395/S870-BoardSpec_SP-03685-001_v00b.pdf

    [6]: Torres G., Nvidia Tesla Technology, Nov. 2007,http://www.hardwaresecrets.com/article/495

    [7]: R600-Family Instruction Set Architecture, Revision 0.31, May 2007, AMD

    [8]: Zheng B., Gladding D., Villmow M., Building a High Level Language Compiler for GPGPU,ASPLOS 2006, June 2008

    [9]: Huddy R., ATI Radeon HD2000 Series Technology Overview, AMD Technology Day, 2007http://ati.amd.com/developer/techpapers.html

    [10]: Compute Abstraction Layer (CAL) Technology Intermediate Language (IL),

    Version 2.0, Oct. 2008, AMD

    6. References (2)

  • 8/8/2019 GPUs DP Accelerators MSC

    98/166

    [11]: Nvidia CUDA Compute Unified Device Architecture Programming Guide, Version 2.0,June 2008, Nvidia

    [12]: Kirk D. & Hwu W. W., ECE498AL Lectures 7: Threading Hardware in G80, 2007,

    University of Illinois, Urbana-Champaign, http://courses.ece.uiuc.edu/ece498/al1/lectures/lecture7-threading%20hardware.ppt#256,1,ECE 498AL Lectures 7:Threading Hardware in G80

    [13]: Kogo H., R600 (Radeon HD2900 XT), PC Watch, June 26 2008,http://pc.watch.impress.co.jp/docs/2008/0626/kaigai_3.pdf

    [14]: Nvidia G80, Pc Watch, April 16 2007,http://pc.watch.impress.co.jp/docs/2007/0416/kaigai350.htm

    [15]: GeForce 8800GT (G92), PC Watch, Oct. 31 2007,http://pc.watch.impress.co.jp/docs/2007/1031/kaigai398_07.pdf

    [16]: NVIDIA GT200 and AMD RV770, PC Watch, July 2 2008, http://pc.watch.impress.co.jp/docs/2008/0702/kaigai451.htm

    [17]: Shrout R., Nvidia GT200 Revealed GeForce GTX 280 and GTX 260 Review,

    PC Perspective, June 16 2008,http://www.pcper.com/article.php?aid=577&type=expert&pid=3

    [18]: http://en.wikipedia.org/wiki/DirectX

    [19]: Dietrich S., Shader Model 3.0, April 2004, Nvidia,http://www.cs.umbc.edu/~olano/s2004c01/ch15.pdf

    [20]: Microsoft DirectX 10: The Next-Generation Graphics API, Technical Brief, Nov. 2006,

    Nvidia, http://www.nvidia.com/page/8800_tech_briefs.html

    6. References (3)

  • 8/8/2019 GPUs DP Accelerators MSC

    99/166

    [21]: Patidar S. & al., Exploiting the Shader Model 4.0 Architecture, Center forVisual Information Technology, IIIT Hyderabad,http://research.iiit.ac.in/~shiben/docs/SM4_Skp-Shiben-Jag-PJN_draft.pdf

    [22]: Nvidia GeForce 8800 GPU Architecture Overview, Vers. 0.1, Nov. 2006, Nvidia,http://www.nvidia.com/page/8800_tech_briefs.html

    [24]: Fatahalian K., From Shader Code to a Teraflop: How Shader Cores Work,

    Workshop: Beyond Programmable Shading: Fundamentals, SIGGRAPH 2008,

    [25]: Kanter D., NVIDIAs GT200: Inside a Parallel Processor, 09-08-2008

    [23]: Graphics Pipeline Rendering History, Aug. 22 2008, PC Watch,http://pc.watch.impress.co.jp/docs/2008/0822/kaigai_06.pdf

    [26]: Nvidia CUDA Compute Unified Device Architecture Programming Guide,Version 1.1, Nov. 2007, Nvidia

    [27]: Seiler L. & al., Larrabee: A Many-Core x86 Architecture for Visual Computing,ACM Transactions on Graphics, Vol. 27, No. 3, Article No. 18, Aug. 2008

    [29]: Shrout R., IDF Fall 2007 Keynote, Sept. 18, 2007, PC Perspective,http://www.pcper.com/article.php?aid=453

    [28]: Kogo H., Larrabee, PC Watch, Oct. 17, 2008,http://pc.watch.impress.co.jp/docs/2008/1017/kaigai472.htm

    6. References (4)

  • 8/8/2019 GPUs DP Accelerators MSC

    100/166

    [30]: Stokes J., Larrabee: Intels biggest leap ahead since the Pentium Pro,Aug. 04. 2008, http://arstechnica.com/news.ars/post/20080804-larrabee-intels-biggest-leap-ahead-since-the-pentium-pro.html

[31]: Shimpi A. L. & Wilson D., Intel's Larrabee Architecture Disclosure: A Calculated First Move, Anandtech, Aug. 4. 2008, http://www.anandtech.com/showdoc.aspx?i=3367&p=2

    [32]: Hester P., Multi_Core and Beyond: Evolving the x86 Architecture, Hot Chips 19,Aug. 2007, http://www.hotchips.org/hc19/docs/keynote2.pdf

    [33]: AMD Stream Computing, User Guide, Oct. 2008, Rev. 1.2.1

    http://ati.amd.com/technology/streamcomputing/Stream_Computing_User_Guide.pdf

    [34]: Doggett M., Radeon HD 2900, Graphics Hardware Conf. Aug. 2007,http://www.graphicshardware.org/previous/www_2007/presentations/doggett-radeon2900-gh07.pdf

    [35]: Mantor M., AMDs Radeon Hd 2900, Hot Chips 19, Aug. 2007,http://www.hotchips.org/archives/hc19/2_Mon/HC19.03/HC19.03.01.pdf

[36]: Houston M., Anatomy of AMD's TeraScale Graphics Engine, SIGGRAPH 2008, http://s08.idav.ucdavis.edu/houston-amd-terascale.pdf

    [37]: Mantor M., Entering the Golden Age of Heterogeneous Computing, PEEP 2008,http://ati.amd.com/technology/streamcomputing/IUCAA_Pune_PEEP_2008.pdf

    6. References (5)

  • 8/8/2019 GPUs DP Accelerators MSC

    101/166

    [38]: Kogo H., RV770 Overview, PC Watch, July 02 2008,http://pc.watch.impress.co.jp/docs/2008/0702/kaigai_09.pdf

    [39]: Kanter D., Inside Fermi: Nvidia's HPC Push, Real World Technologies Sept 30 2009,http://www.realworldtech.com/includes/templates/articles.cfm?ArticleID=RWT093009110932&mode=print

    [40]: Wasson S., Inside Fermi: Nvidia's 'Fermi' GPU architecture revealed,Tech Report, Sept 30 2009, http://techreport.com/articles.x/17670/1

    [41]: Wasson S., AMD's Radeon HD 5870 graphics processor,

    Tech Report, Sept 23 2009, http://techreport.com/articles.x/17618/1

    [42]: Bell B., ATI Radeon HD 5870 Performance Preview ,Firing Squad, Sept 22 2009, http://www.firingsquad.com/hardware/ati_radeon_hd_5870_performance_preview/default.asp

  • 8/8/2019 GPUs DP Accelerators MSC

    102/166

    5.3 Intels Larrabee

    5.2 Intels Larrabee (1)

  • 8/8/2019 GPUs DP Accelerators MSC

    103/166

    Larrabee

    Part of Intels Tera-Scale Initiative.

    Project started ~ 2005

First unofficial public presentation: 03/2006 (withdrawn)
    First brief public presentation: 09/07 (Otellini) [29]

    First official public presentations: in 2008 (e.g. at SIGGRAPH [27])

    Due in ~ 2009

    Performance (targeted):

    2 TFlops

    Brief history:

    Objectives:

    Not a single product but a base architecture for a number of different products.

    High end graphics processing, HPC

    5.2 Intels Larrabee (2)

  • 8/8/2019 GPUs DP Accelerators MSC

    104/166

    NI: New Instructions

    Figure: Positioning of Larrabeein Intels product portfolio [28]

    5.2 Intels Larrabee (3)

  • 8/8/2019 GPUs DP Accelerators MSC

    105/166

    Figure: First public presentation of Larrabee at IDF Fall 2007 [29]

    5.2 Intels Larrabee (4)

  • 8/8/2019 GPUs DP Accelerators MSC

    106/166

    Figure: Block diagram of the Larrabee [30]

    Basic architecture

    Cores: In order x86 IA cores augmented with new instructions

    L2 cache: fully coherent

    Ring bus: 1024 bits wide

    5.2 Intels Larrabee (5)

  • 8/8/2019 GPUs DP Accelerators MSC

    107/166

    Figure: Block diagram of Larrabees cores [31]

    5.2 Intels Larrabee (6)

  • 8/8/2019 GPUs DP Accelerators MSC

    108/166

    Larrabee microarchitecture [27]

Derived from that of the Pentium's in-order design

    5.2 Intels Larrabee (7)

  • 8/8/2019 GPUs DP Accelerators MSC

    109/166

Figure: The ancestor of Larrabee's cores [28]

Main extensions:

    64-bit instructions

    4-way multithreaded (with 4 register sets)

    addition of a 16-wide (16x32-bit) VU

    increased L1 caches (32 KB vs 8 KB)

    access to its 256 KB local subset of a coherent L2 cache

    ring network to access the coherent L2 $ and allow interprocessor communication.

    5.2 Intels Larrabee (8)

  • 8/8/2019 GPUs DP Accelerators MSC

    110/166

    New instructions allow explicit cache control

    the L2 cache can be used as a scratchpad memory while remaining fullycoherent.

    to prefetch data into the L1 and L2 caches

    to control the eviction of cache lines by reducing their priority.

    5.2 Intels Larrabee (9)

  • 8/8/2019 GPUs DP Accelerators MSC

    111/166

    The Scalar Unit

    supports the full ISA of the Pentium(it can run existing code including OS kernels and applications)

    bit count

    bit scan (it finds the next bit set within a register).

    provides new instructions, e.g. for

    Mask registers

    5.2 Intels Larrabee (10)

  • 8/8/2019 GPUs DP Accelerators MSC

    112/166

    Figure: Block diagram of the Vector Unit [31]

    The Vector Unit

VU scatter-gather instructions
    (load a VU vector register from 16 non-contiguous data locations from anywhere in the on-die L1 cache without penalty, or store a VU register similarly).

    Numeric conversions
    8-bit and 16-bit integer and 16-bit FP data can be read from the L1 $ or written into the L1 $, with conversion to 32-bit integers without penalty. The L1 D$ thus becomes an extension of the register file.

Mask registers

have one bit per lane, to control which lanes of a vector register or memory data are read or written and which remain untouched.

    5.2 Intels Larrabee (11)

  • 8/8/2019 GPUs DP Accelerators MSC

    113/166

    Figure: Layout of the 16-wide vector ALU [31]

    ALUs execute integer, SP and DP FP instructions

    Multiply-add instructions are available.

    ALUs

    5.2 Intels Larrabee (12)

  • 8/8/2019 GPUs DP Accelerators MSC

    114/166

    Task scheduling

Task scheduling is performed entirely by software, rather than by hardware as in Nvidia's or AMD/ATI's GPGPUs.

    5.2 Intels Larrabee (13)

  • 8/8/2019 GPUs DP Accelerators MSC

    115/166

    SP FP performance

2 operations/cycle × 16 ALUs

    → 32 operations/core/cycle

    At present no data available for the clock frequency or the number of cores in Larrabee.

    Assuming a clock frequency of 2 GHz and 32 cores

    SP FP performance: 2 TFLOPS

    5.2 Intels Larrabee (14)

  • 8/8/2019 GPUs DP Accelerators MSC

    116/166

    Figure: Larrabees software stack (Source Intel)

Larrabee's Native C/C++ compiler allows many available apps to be recompiled and run correctly with no modifications.

  • 8/8/2019 GPUs DP Accelerators MSC

    117/166

  • 8/8/2019 GPUs DP Accelerators MSC

    118/166

    4. Overview of data parallel accelerators (13)

  • 8/8/2019 GPUs DP Accelerators MSC

    119/166

    Price relations (as of 10/2008)

    Nvidia Tesla

    C870 ~ 1500 $

    D870 ~ 5000 $

    S870 ~ 7500 $

    C1060 ~ 1600 $

    S1070 ~ 8000 $

    AMD/ATI FireStream

    9170 ~ 800 $ 9250 ~ 800 $

  • 8/8/2019 GPUs DP Accelerators MSC

    120/166

    5. Microarchitecture and operation

    5.1 Nvidias GPGPU line

    5.2 AMD/ATIs GPGPU line

    5.3 Intels Larrabee

  • 8/8/2019 GPUs DP Accelerators MSC

    121/166

    5.1 Nvidias GPGPU line

    Microarchitecture of GPUs

    5.1 Nvidias GPGPU line (1)

  • 8/8/2019 GPUs DP Accelerators MSC

    122/166

Microarchitecture of GPGPUs

    3-level microarchitectures: microarchitectures inheriting the structure of programmable GPUs. E.g. Nvidia's and AMD/ATI's GPGPUs.

    Two-level microarchitectures: dedicated microarchitectures developed a priori to support both graphics and HPC. E.g. Intel's Larrabee.

    Figure: Alternative layouts of microarchitectures of GPGPUs

    Microarchitecture of GPUs


    5.1 Nvidias GPGPU line (2)

  • 8/8/2019 GPUs DP Accelerators MSC

    123/166

Figure: Simplified block diagram of recent 3-level GPUs/data parallel accelerators
    (host CPU and host memory attached via the North Bridge; PCI-E x16 interface; Command Processor Unit and Work Scheduler; a Core Block Array of Core Blocks, each with cores and L1 caches; an interconnection network to the L2 caches and memory controllers of the Global Memory; hub with display controller)

    (Data parallel accelerators do not include display controllers.)

    CB: Core Block
    CBA: Core Block Array
    IN: Interconnection Network
    MC: Memory Controller

    5.1 Nvidias GPGPU line (3)

  • 8/8/2019 GPUs DP Accelerators MSC

    124/166

Table: Terminologies used with GPGPUs/data parallel accelerators

    In these slides | Nvidia | AMD/ATI

    Core (SIMT core) | SM, Streaming Multiprocessor, Multithreaded processor | Shader processor, Thread processor

    CB (Core Block) | TPC (Texture Processor Cluster) | Multiprocessor, SIMD Array, SIMD Engine, SIMD core, SIMD

    CBA (Core Block Array) | SPA (Streaming Processor Array) | -

    ALU (Arithmetic Logic Unit) | Streaming Processor, Thread Processor, Scalar ALU | Stream Processing Unit, Stream Processor

    Microarchitecture of Nvidias GPGPUs

    5.1 Nvidias GPGPU line (4)

  • 8/8/2019 GPUs DP Accelerators MSC

    125/166

Microarchitecture of Nvidia's GPGPUs

    GPGPUs based on 3-level microarchitectures

    Nvidia's line: 90 nm: G80 → (shrink) 65 nm: G92 → (enhanced arch.) G200

    AMD/ATI's line: 80 nm: R600 → (shrink) 55 nm: RV670 → (enhanced arch.) RV770

    Figure: Overview of Nvidia's and AMD/ATI's GPGPU lines

    5.1 Nvidias GPGPU line (5)

  • 8/8/2019 GPUs DP Accelerators MSC

    126/166

    G80/G92

    Microarchitecture

    5.1 Nvidias GPGPU line (6)

  • 8/8/2019 GPUs DP Accelerators MSC

    127/166

Figure: Overview of the G80 [14]

    5.1 Nvidias GPGPU line (7)

  • 8/8/2019 GPUs DP Accelerators MSC

    128/166

Figure: Overview of the G92 [15]

    5.1 Nvidias GPGPU line (8)

  • 8/8/2019 GPUs DP Accelerators MSC

    129/166

Figure: The Core Block of the G80/G92 [14], [15]

    5.1 Nvidias GPGPU line (9)

  • 8/8/2019 GPUs DP Accelerators MSC

    130/166

Figure: Block diagram of G80/G92 cores [14], [15]

    Streaming Processors: SIMT ALUs

    Individual components of the core

    5.1 Nvidias GPGPU line (10)

  • 8/8/2019 GPUs DP Accelerators MSC

    131/166

SM Register File (RF)

    8K registers (each 4 bytes wide) deliver 4 operands/clock.

    The Load/Store pipe can also read/write the RF.

    Figure: Register File [12]

    5.1 Nvidias GPGPU line (11)

  • 8/8/2019 GPUs DP Accelerators MSC

    132/166

Programmer's view of the Register File

    There are 8192 and 16384 registers in each SM in the G80 and the GT200 resp.

    This is an implementation decision, not part of CUDA.

    Registers are dynamically partitioned across all thread blocks assigned to the SM.

    Once assigned to a thread block, a register is NOT accessible by threads in other blocks.

    Each thread in the same block only accesses registers assigned to itself.

    Figure: The programmer's view of the Register File [12] (e.g. 4 thread blocks vs. 3 thread blocks sharing the RF)


    5.1 Nvidias GPGPU line (12)

  • 8/8/2019 GPUs DP Accelerators MSC

    133/166

The Constant Cache

    Immediate address constants

    Indexed address constants

    Constants are stored in DRAM and cached on chip (L1 per SM).

    A constant value can be broadcast to all threads in a warp:

    an extremely efficient way of accessing a value that is common for all threads in a block!

    Figure: The constant cache [12]
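A minimal sketch of the usage pattern described above (hypothetical names, not from the slides): a value placed in constant memory is cached per SM and broadcast to every thread of a warp that reads it.

        __constant__ float coeff;                        // resides in device constant memory, cached on chip

        __global__ void scale(const float *in, float *out)
        {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            out[i] = coeff * in[i];                      // same constant read by all threads -> broadcast
        }

        // host side, before the kernel launch:
        //   cudaMemcpyToSymbol(coeff, &hostCoeff, sizeof(float));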


    5.1 Nvidias GPGPU line (13)

  • 8/8/2019 GPUs DP Accelerators MSC

    134/166

Shared Memory

Each SM has 16 KB of Shared Memory, organized as 16 banks of 32-bit words.

    CUDA uses Shared Memory as shared storage visible to all threads in a thread block (read and write access).

    It is not used explicitly for pixel shader programs.

    Figure: Shared Memory [12]
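A hedged sketch of declaring and using this per-block shared memory (illustrative kernel; assumes it is launched with TILE = 256 threads per block):

        #define TILE 256

        __global__ void reverseTile(const float *in, float *out)
        {
            __shared__ float tile[TILE];              // lives in the SM's 16 KB shared memory
            int i = blockIdx.x * TILE + threadIdx.x;

            tile[threadIdx.x] = in[i];                // every thread writes one element
            __syncthreads();                          // wait until the whole tile is loaded

            out[i] = tile[TILE - 1 - threadIdx.x];    // read an element written by another thread
        }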


    5.1 Nvidias GPGPU line (14)

  • 8/8/2019 GPUs DP Accelerators MSC

    135/166

A program needs to manage the global, constant and texture memory spaces visible to kernels through calls to the CUDA runtime.

    This includes memory allocation and deallocation as well as invoking data transfers between the CPU and GPU.
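A hedged host-side sketch of this management cycle (illustrative names and sizes); it uses only standard CUDA runtime calls (cudaMalloc, cudaMemcpy, cudaFree):

        #include <cuda_runtime.h>

        void runOnDevice(const float *hostIn, float *hostOut, int n)
        {
            float *devIn, *devOut;
            size_t bytes = n * sizeof(float);

            cudaMalloc((void **)&devIn,  bytes);                        // allocate global memory
            cudaMalloc((void **)&devOut, bytes);
            cudaMemcpy(devIn, hostIn, bytes, cudaMemcpyHostToDevice);   // CPU -> GPU transfer

            // ... kernel launch would go here ...

            cudaMemcpy(hostOut, devOut, bytes, cudaMemcpyDeviceToHost); // GPU -> CPU transfer
            cudaFree(devIn);                                            // deallocation
            cudaFree(devOut);
        }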

    5.1 Nvidias GPGPU line (15)

  • 8/8/2019 GPUs DP Accelerators MSC

    136/166

Figure: Major functional blocks of G80/G92 ALUs [14], [15]

    Barrier synchronization

    5.1 Nvidias GPGPU line (16)

  • 8/8/2019 GPUs DP Accelerators MSC

    137/166

synchronization is achieved by calling the void __syncthreads() intrinsic function [11];

    it is used to coordinate memory accesses at synchronization points;

    at synchronization points the execution of the threads is suspended until all threads reach this point (barrier synchronization).
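A hedged sketch of the typical use of this barrier (hypothetical kernel; assumes a launch with 256 threads per block, a power of two): every phase that reads data written by other threads of the block is separated from the writing phase by __syncthreads().

        __global__ void blockSum(const float *in, float *out)
        {
            __shared__ float partial[256];                      // one slot per thread of the block
            int tid = threadIdx.x;
            partial[tid] = in[blockIdx.x * blockDim.x + tid];
            __syncthreads();                                    // barrier: all loads done before any read

            for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
                if (tid < stride)
                    partial[tid] += partial[tid + stride];      // tree reduction within the block
                __syncthreads();                                // barrier after every reduction step
            }
            if (tid == 0)
                out[blockIdx.x] = partial[0];                   // one result per thread block
        }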

    Principle of operation

    5.1 Nvidias GPGPU line (17)

  • 8/8/2019 GPUs DP Accelerators MSC

    138/166

Based on Nvidia's data parallel computing model

    Nvidia's data parallel computing model is specified at different levels of abstraction:

    at the Instruction Set Architecture level (ISA) (not disclosed),

    at the intermediate level (at the level of APIs) (not discussed here),

    at the high-level programming language level, by means of CUDA.

    CUDA [11]

    5.1 Nvidias GPGPU line (18)

  • 8/8/2019 GPUs DP Accelerators MSC

    139/166

CUDA is a programming language and programming environment that

    allows explicit data parallel execution on an attached massively parallel device (GPGPU); its underlying principle is to allow the programmer to target portions of the source code for execution on the GPGPU,

    is defined as a set of C-language extensions.

    The key element of the language is the notion of the kernel.

    A kernel is specified by

    5.1 Nvidias GPGPU line (19)

  • 8/8/2019 GPUs DP Accelerators MSC

    140/166

using the __global__ declaration specifier,

    a number of associated CUDA threads,

    a domain of execution (grid, blocks) using the <<<...>>> execution configuration syntax.

    Execution of kernels

when called, a kernel is executed N times in parallel by N associated CUDA threads, as opposed to only once as in the case of regular C functions.

    Example

    5.1 Nvidias GPGPU line (20)

  • 8/8/2019 GPUs DP Accelerators MSC

    141/166

    adds two vectors A and B of size N and stores the result into vector C
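A minimal sketch in the spirit of the VecAdd example of the CUDA programming guide [11] (names and launch configuration are assumptions):

        // Kernel definition: executed in parallel by N CUDA threads
        __global__ void VecAdd(const float *A, const float *B, float *C)
        {
            int i = threadIdx.x;      // one-dimensional thread index
            C[i] = A[i] + B[i];
        }

        int main()
        {
            ...
            // Kernel invocation: one thread block of N threads
            VecAdd<<<1, N>>>(A, B, C);
            ...
        }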

    Remark

The thread index threadIdx is a vector of up to 3 components, so that threads can be identified within a one-, two- or three-dimensional thread block.

The above sample code adds the vectors A and B by executing the invoked threads (identified by a one-dimensional index i) in parallel on the attached massively parallel GPGPU, rather than by executing embedded loops on the conventional CPU.


    5.1 Nvidias GPGPU line (21)

  • 8/8/2019 GPUs DP Accelerators MSC

    142/166

    The kernel concept is enhanced by three key abstractions

    the thread concept,

    the memory concept and

    the synchronization concept.

    The thread concept

    5.1 Nvidias GPGPU line (22)

  • 8/8/2019 GPUs DP Accelerators MSC

    143/166

    based on a three level hierarchy of threads

    grids

    thread blocks

    threads

    The hierarchy of threads

    5.1 Nvidias GPGPU line (23)

  • 8/8/2019 GPUs DP Accelerators MSC

    144/166

Each kernel invocation is executed as a grid of thread blocks (Block(i,j)).

    Figure: Hierarchy of threads [25] (the host invokes kernel0(), kernel1(); each executes on the device as a grid of thread blocks)
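A hedged sketch of this three-level hierarchy in CUDA source terms (hypothetical 2D example): the launch defines the grid of thread blocks, and each thread locates itself within the hierarchy via blockIdx, blockDim and threadIdx.

        __global__ void fill(float *data, int width)
        {
            int col = blockIdx.x * blockDim.x + threadIdx.x;   // position within the grid (x)
            int row = blockIdx.y * blockDim.y + threadIdx.y;   // position within the grid (y)
            data[row * width + col] = 1.0f;
        }

        void launchFill(float *devData, int width, int height)
        {
            dim3 threadsPerBlock(16, 16);                      // 2D thread block
            dim3 numBlocks(width / 16, height / 16);           // 2D grid of Block(i,j); assumes divisibility
            fill<<<numBlocks, threadsPerBlock>>>(devData, width);
        }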

  • 8/8/2019 GPUs DP Accelerators MSC

    145/166

    The memory concept

    5.1 Nvidias GPGPU line (25)

  • 8/8/2019 GPUs DP Accelerators MSC

    146/166

Threads have

private registers (R/W access)

per block shared memory (R/W access)

per grid global memory (R/W access)

per block constant memory (R access)

per TPC texture memory (R access)

The global, constant and texture memory spaces can be read from or written to by the CPU and are persistent across kernel launches by the same application.

Shared memory is organized into banks (16 banks in version 1)

    Figure: Memory concept [26] (revised)
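At the CUDA language level these memory spaces appear roughly as sketched below (the variable names, sizes and the assumed block size of 256 threads are illustrative only; texture memory, accessed through texture references, is omitted):

    __constant__ float coeff[16];                 // constant memory (read-only in kernels)
    __device__   float bias[1024];                // global memory (R/W), declared at file scope

    __global__ void scale(float* gOut, const float* gIn)   // gIn/gOut also point into global memory
    {
        __shared__ float tile[256];               // per-block shared memory (R/W), organized into banks
        int   i   = threadIdx.x;                  // assumes blockDim.x == 256
        float val = gIn[blockIdx.x * blockDim.x + i];     // 'val' is held in a private register
        tile[i]   = val * coeff[i % 16];
        gOut[blockIdx.x * blockDim.x + i] = tile[i] + bias[i % 1024];
    }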

Mapping of the memory spaces of the programming model to the memory spaces of the streaming processor

5.1 Nvidia's GPGPU line (26)

  • 8/8/2019 GPUs DP Accelerators MSC

    147/166

Mapping of the memory spaces of the programming model to the memory spaces of the streaming processor

    Streaming Multiprocessor 1 (SM 1)

A thread block is scheduled for execution on a particular multithreaded SM.

An SM incorporates 8 Execution Units (designated as Processors in the figure).

SMs are the fundamental processing units for CUDA thread blocks.

    Figure: Memory spaces of the SM [7]

    The synchronization concept

5.1 Nvidia's GPGPU line (27)

  • 8/8/2019 GPUs DP Accelerators MSC

    148/166

synchronization is achieved by calling the __syncthreads() intrinsic function, declared as void __syncthreads();

used to coordinate memory accesses at synchronization points,

at synchronization points the execution of the threads is suspended until all threads reach this point (barrier synchronization)

    Barrier synchronization
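A minimal sketch of barrier synchronization within a thread block: each 256-element segment of an array is reversed in place, and the barrier guarantees that every element has been written into shared memory before any thread reads an element written by another thread (the kernel name and the block size of 256 are assumptions):

    __global__ void reverseBlock(float* d_data)
    {
        __shared__ float buf[256];               // per-block shared memory, assumes blockDim.x == 256
        int i    = threadIdx.x;
        int base = blockIdx.x * blockDim.x;

        buf[i] = d_data[base + i];               // every thread writes one element
        __syncthreads();                         // barrier: wait until the whole block has written

        d_data[base + i] = buf[255 - i];         // safe to read an element written by another thread
    }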

    GT200

5.1 Nvidia's GPGPU line (28)

  • 8/8/2019 GPUs DP Accelerators MSC

    149/166

5.1 Nvidia's GPGPU line (29)

  • 8/8/2019 GPUs DP Accelerators MSC

    150/166

    Figure: Block diagram of the GT200 [16]

5.1 Nvidia's GPGPU line (30)

  • 8/8/2019 GPUs DP Accelerators MSC

    151/166

Figure: The Core Block of the GT200 [16]

5.1 Nvidia's GPGPU line (31)

  • 8/8/2019 GPUs DP Accelerators MSC

    152/166

Figure: Block diagram of the GT200 cores [16]

Streaming Multiprocessors: SIMT cores

5.1 Nvidia's GPGPU line (32)

  • 8/8/2019 GPUs DP Accelerators MSC

    153/166

    Figure: Major functional blocks of GT200 ALUs [16]

5.1 Nvidia's GPGPU line (33)

  • 8/8/2019 GPUs DP Accelerators MSC

    154/166

Figure: Die shot of the GT200 [17]

    6. References

[1]: Torricelli F., AMD in HPC, HPC07, http://www.altairhyperworks.co.uk/html/en-GB/keynote2/Torricelli_AMD.pdf

    6. References (1)

  • 8/8/2019 GPUs DP Accelerators MSC

    155/166

[2]: NVIDIA Tesla C870 GPU Computing Board, Board Specification, Jan. 2008, Nvidia

[3]: AMD FireStream 9170, http://ati.amd.com/technology/streamcomputing/product_firestream_9170.html

[4]: NVIDIA Tesla D870 Deskside GPU Computing System, System Specification, Jan. 2008, Nvidia, http://www.nvidia.com/docs/IO/43395/D870-SystemSpec-SP-03718-001_v01.pdf

[5]: Tesla S870 GPU Computing System, Specification, Nvidia, http://jp.nvidia.com/docs/IO/43395/S870-BoardSpec_SP-03685-001_v00b.pdf

[6]: Torres G., Nvidia Tesla Technology, Nov. 2007, http://www.hardwaresecrets.com/article/495

[7]: R600-Family Instruction Set Architecture, Revision 0.31, May 2007, AMD

[8]: Zheng B., Gladding D., Villmow M., Building a High Level Language Compiler for GPGPU, ASPLOS 2006, June 2008

[9]: Huddy R., ATI Radeon HD2000 Series Technology Overview, AMD Technology Day, 2007, http://ati.amd.com/developer/techpapers.html

[10]: Compute Abstraction Layer (CAL) Technology Intermediate Language (IL), Version 2.0, Oct. 2008, AMD

[11]: Nvidia CUDA Compute Unified Device Architecture Programming Guide, Version 2.0, June 2008, Nvidia

    6. References (2)

  • 8/8/2019 GPUs DP Accelerators MSC

    156/166

[12]: Kirk D. & Hwu W. W., ECE498AL Lectures 7: Threading Hardware in G80, 2007, University of Illinois, Urbana-Champaign, http://courses.ece.uiuc.edu/ece498/al1/lectures/lecture7-threading%20hardware.ppt#256,1,ECE 498AL Lectures 7: Threading Hardware in G80

[13]: Kogo H., R600 (Radeon HD2900 XT), PC Watch, June 26 2008, http://pc.watch.impress.co.jp/docs/2008/0626/kaigai_3.pdf

[14]: Nvidia G80, PC Watch, April 16 2007, http://pc.watch.impress.co.jp/docs/2007/0416/kaigai350.htm

[15]: GeForce 8800GT (G92), PC Watch, Oct. 31 2007, http://pc.watch.impress.co.jp/docs/2007/1031/kaigai398_07.pdf

[16]: NVIDIA GT200 and AMD RV770, PC Watch, July 2 2008, http://pc.watch.impress.co.jp/docs/2008/0702/kaigai451.htm

[17]: Shrout R., Nvidia GT200 Revealed: GeForce GTX 280 and GTX 260 Review, PC Perspective, June 16 2008, http://www.pcper.com/article.php?aid=577&type=expert&pid=3

[18]: http://en.wikipedia.org/wiki/DirectX

[19]: Dietrich S., Shader Model 3.0, April 2004, Nvidia, http://www.cs.umbc.edu/~olano/s2004c01/ch15.pdf

[20]: Microsoft DirectX 10: The Next-Generation Graphics API, Technical Brief, Nov. 2006, Nvidia, http://www.nvidia.com/page/8800_tech_briefs.html

[21]: Patidar S. & al., Exploiting the Shader Model 4.0 Architecture, Center for Visual Information Technology, IIIT Hyderabad,

    6. References (3)

  • 8/8/2019 GPUs DP Accelerators MSC

    157/166

    http://research.iiit.ac.in/~shiben/docs/SM4_Skp-Shiben-Jag-PJN_draft.pdf

[22]: Nvidia GeForce 8800 GPU Architecture Overview, Vers. 0.1, Nov. 2006, Nvidia, http://www.nvidia.com/page/8800_tech_briefs.html

[24]: Fatahalian K., From Shader Code to a Teraflop: How Shader Cores Work, Workshop: Beyond Programmable Shading: Fundamentals, SIGGRAPH 2008

[25]: Kanter D., NVIDIA's GT200: Inside a Parallel Processor, 09-08-2008

[23]: Graphics Pipeline Rendering History, Aug. 22 2008, PC Watch, http://pc.watch.impress.co.jp/docs/2008/0822/kaigai_06.pdf

[26]: Nvidia CUDA Compute Unified Device Architecture Programming Guide, Version 1.1, Nov. 2007, Nvidia

[27]: Seiler L. & al., Larrabee: A Many-Core x86 Architecture for Visual Computing, ACM Transactions on Graphics, Vol. 27, No. 3, Article No. 18, Aug. 2008

[29]: Shrout R., IDF Fall 2007 Keynote, Sept. 18, 2007, PC Perspective, http://www.pcper.com/article.php?aid=453

[28]: Kogo H., Larrabee, PC Watch, Oct. 17, 2008, http://pc.watch.impress.co.jp/docs/2008/1017/kaigai472.htm

[30]: Stokes J., Larrabee: Intel's biggest leap ahead since the Pentium Pro, Aug. 04. 2008, http://arstechnica.com/news.ars/post/20080804-larrabee-

    6. References (4)

  • 8/8/2019 GPUs DP Accelerators MSC

    158/166

    intels-biggest-leap-ahead-since-the-pentium-pro.html

[31]: Shimpi A. L. & Wilson D., Intel's Larrabee Architecture Disclosure: A Calculated First Move, Anandtech, Aug. 4. 2008, http://www.anandtech.com/showdoc.aspx?i=3367&p=2

[32]: Hester P., Multi-Core and Beyond: Evolving the x86 Architecture, Hot Chips 19, Aug. 2007, http://www.hotchips.org/hc19/docs/keynote2.pdf

[33]: AMD Stream Computing, User Guide, Oct. 2008, Rev. 1.2.1, http://ati.amd.com/technology/streamcomputing/Stream_Computing_User_Guide.pdf

[34]: Doggett M., Radeon HD 2900, Graphics Hardware Conf., Aug. 2007, http://www.graphicshardware.org/previous/www_2007/presentations/doggett-radeon2900-gh07.pdf

[35]: Mantor M., AMD's Radeon HD 2900, Hot Chips 19, Aug. 2007, http://www.hotchips.org/archives/hc19/2_Mon/HC19.03/HC19.03.01.pdf

[36]: Houston M., Anatomy of AMD's TeraScale Graphics Engine, SIGGRAPH 2008, http://s08.idav.ucdavis.edu/houston-amd-terascale.pdf

[37]: Mantor M., Entering the Golden Age of Heterogeneous Computing, PEEP 2008, http://ati.amd.com/technology/streamcomputing/IUCAA_Pune_PEEP_2008.pdf

[38]: Kogo H., RV770 Overview, PC Watch, July 02 2008, http://pc.watch.impress.co.jp/docs/2008/0702/kaigai_09.pdf

    6. References (5)

  • 8/8/2019 GPUs DP Accelerators MSC

    159/166

    AMD/ATI RV870 (Cypress) Radeon 5870 graphics card

5.1 AMD/ATI RV870 (Cypress)

  • 8/8/2019 GPUs DP Accelerators MSC

    160/166

    OpenCL 1.0 compliant

    Introduction: Sept. 22 2009

    Availability: now

Performance figures (see the cross-check below):

    Engine clock speed: 850 MHz

    SP FP performance: 2.72 TFLOPS

    DP FP performance: 544 GFLOPS (1/5 of SP FP performance)
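As a quick cross-check of the quoted peak figures (assuming, as is usual for such peak numbers, one single-precision multiply-add, i.e. 2 FLOPs, per stream processing unit per cycle):

    1600 \times 2\,\tfrac{\text{FLOP}}{\text{cycle}} \times 0.85\,\text{GHz} = 2720\,\text{GFLOPS} \approx 2.72\,\text{TFLOPS},
    \qquad \tfrac{2.72\,\text{TFLOPS}}{5} = 544\,\text{GFLOPS (DP)}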

5.1 AMD/ATI RV870 (Cypress)

    Radeon 4800 series/5800 series comparison

  • 8/8/2019 GPUs DP Accelerators MSC

    161/166

                          ATI Radeon HD 4870    ATI Radeon HD 5850    ATI Radeon HD 5870
Manufacturing Process     55 nm                 40 nm                 40 nm
# of Transistors          956 million           2.15 billion          2.15 billion
Core Clock Speed          750 MHz               725 MHz               850 MHz
# of Stream Processors    800                   1440                  1600
Compute Performance       1.2 TFLOPS            2.09 TFLOPS           2.72 TFLOPS
Memory Type               GDDR5                 GDDR5                 GDDR5
Memory Clock              900 MHz               1000 MHz              1200 MHz
Memory Data Rate          3.6 Gbps              4.0 Gbps              4.8 Gbps
Memory Bandwidth          115.2 GB/sec          128 GB/sec            153.6 GB/sec
Max Board Power           160 W                 170 W                 188 W
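The Memory Bandwidth row follows directly from the bus width and the memory data rate; e.g. for the HD 5870, with its 256-bit (8 × 32-bit) GDDR5 interface (see the architecture overview below):

    \frac{256\,\text{bit} \times 4.8\,\text{Gbps}}{8\,\text{bit/byte}} = 153.6\,\text{GB/s}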

    RV770-RV870 Comparison

5.1 AMD/ATI RV870 (Cypress)

  • 8/8/2019 GPUs DP Accelerators MSC

    162/166

                      ATI Radeon HD 4870        ATI Radeon HD 5870       Difference
Die Size              263 mm2                   334 mm2                  1.27x
# of Transistors      956 million               2.15 billion             2.25x
# of Shaders          800                       1600                     2x
Board Power           90 W idle / 160 W load    27 W idle / 188 W max    0.3x / 1.17x

5.1 AMD/ATI RV870 (Cypress)

    Architecture overview

  • 8/8/2019 GPUs DP Accelerators MSC

    163/166

Figure: RV870 architecture overview (annotations): 8 × 32 = 256-bit GDDR5 memory interface, 153.6 GB/s; 1600 ALUs (stream processing units); 8 cores

5.1 AMD/ATI RV870 (Cypress)

    The 5870 card

  • 8/8/2019 GPUs DP Accelerators MSC

    164/166

    http://techreport.com/articles.x/17618/3

    The 5870 card

5.1 AMD/ATI RV870 (Cypress)

Nvidia's Fermi

Introduced: Sept. 30, 2009 at Nvidia's GPU Technology Conference; Available: Q1 2010

  • 8/8/2019 GPUs DP Accelerators MSC

    165/166

Introduced: Sept. 30, 2009 at Nvidia's GPU Technology Conference; Available: Q1 2010

5.2 Nvidia Fermi

Fermi's overall structure

  • 8/8/2019 GPUs DP Accelerators MSC

    166/166

Nvidia: 16 cores (Streaming Multiprocessors)

Each core: 32 ALUs (i.e., 16 × 32 = 512 ALUs in total)