massively parallel architecturesfalcou/teaching/par/accelerator.pdf · 2009-03-11 · massively...

The CELL processor

Introduction to GPGPU

Conclusion

Massively Parallel Architectures

A Take on Cell Processor and GPU Programming

Joel Falcou

Bldg. 490 - Office 104

January 20, 2009

Motivation

Harder, Better, Faster, Stronger (famous tune)
- Scientific computation demands large amounts of computing power
- Faster computation = more results now
- Biology and Health Care
- Oil and Finance
- Video Games Industry

The Silent Revolution
- Computing power: 400 GFLOPS vs 32 GFLOPS (accelerators vs. general-purpose CPUs)
- Memory bandwidth: 100-200 GB/s vs 10 GB/s
- GPUs are in everyday PCs
- Cell went from server blades to the game industry (PS3)

Motivation

When Video Games Ruled the World
- Game design has become ever more sophisticated.
- Fast GPUs lead to complex shaders for real-time effects.
- In turn, the demand for speed has led to ever-increasing innovation in card design.
- The gaming industry has overtaken the defense, finance, oil and healthcare industries as the main driving factor for high-performance processors.
- The NV40 architecture has 225 million transistors, compared to about 175 million for the Pentium 4 EE 3.2 GHz chip.

Objectives

Theory!
- Hardware architecture of GPU and Cell processor
- Pros and cons of those architectures

... and Practice
- Introduction to GPGPU
- Tools and languages
- Sample code

Architecture

Coding for the CELL

Motivation

Less is More
- General-purpose CPUs keep increasing in complexity
- Peak performance gains are slowing down
- Build more with less complex processing units

The CELL Processor
- Heterogeneous multi-core
- DSP-like coprocessors
- High memory bandwidth (200 GB/s)

Where to find it?

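The CELL Processor

Structure
- 1 PowerPC Processing Element (PPE)
- 8 Synergistic Processing Elements (SPEs)
- 1 XDR DRAM interface
- 1 4-way DMA bus

Parallelism sources
- Thread-level parallelism (TLP) over the PPE
- Thread-level parallelism (TLP) over the SPEs
- Instruction-level parallelism (ILP) inside each SPE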

Available Tools

... that work
- GCC/G++ for the Cell
- GFORTRAN for the Cell
- Use a dual-source compilation process

... that don't work
- OpenMP: bad scaling, huge executable
- Task-based MPI: huge latency, low bandwidth

Separate development

Specificities of the PPE
- All the features of a PPC core
- Supports up to two hardware threads
- Full-fledged AltiVec SIMD extension

Specificities of the SPEs
- Specialized AltiVec-like SIMD extension
- No scalar ALU
- No cache and no branch predictor

Memory and Communications

Communicating between PPE and SPEs
- SPE Local Stores (LS) are virtually mapped into PPE memory
- PPE and SPE code share the same process space
- SPE code must be 'downloaded' when the application starts

Handling SPE Local Store
- SPE LS is only 256 KB for code + data
- SPE memories aren't shared
- Explicit data transfer primitives are needed

Mailbox
- Allows transfer of small data (32 bits) between SPE and PPE
- Two mailboxes per SPE (in and out)
- Two modes: blocking (waiting) or polling
- Useful for simple synchronization (thread pool pattern)
- Primitives: spe_in_mbox_write and spe_out_mbox_read (PPE side)

Signal
- Allows transfer of small data (32 bits) between SPEs
- Two signal slots per SPE (general purpose)
- Useful for message-passing emulation with DMA transfers
- Primitives: mfc_sndsig and spu_read_signal1 / spu_read_signal2

DMA Transfers

Principles
- Keep the SPU from blocking during memory transfers
- Used to download SPE code into the SPE LS
- Up to 4 transfers can be done in parallel over the SPE bus
- Up to one upload and one download in parallel over the PPE bus
- Primitives: mfc_get, mfc_put and mfc_read_tag_status_all

Traps and Pitfalls
- Data to send/receive must be aligned on a 128-bit boundary
- Data size should be 1, 2, 4, 8 or any multiple of 16 bytes
- Limited number of DMA channels
- Double buffering must be considered
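
The alignment and size rules above are easy to get wrong, so it can help to check them on the host before issuing a transfer. The following is a minimal sketch in plain C++ that encodes only the two rules listed on this slide. The helper names (dma_aligned, dma_size_ok, dma_padded_size) are hypothetical and are not part of the Cell SDK, and real hardware imposes further limits that are not modelled here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Rule 1 from the slide: addresses sit on a 128-bit (16-byte) boundary.
inline bool dma_aligned(std::uintptr_t address) {
    return address % 16 == 0;
}

// Rule 2 from the slide: the size is 1, 2, 4 or 8 bytes,
// or a non-zero multiple of 16 bytes.
inline bool dma_size_ok(std::size_t bytes) {
    if (bytes == 1 || bytes == 2 || bytes == 4 || bytes == 8) {
        return true;
    }
    return bytes != 0 && bytes % 16 == 0;
}

// Hypothetical helper: smallest legal transfer size that can hold `bytes`.
// Small payloads round up to the next power of two, larger ones to the
// next multiple of 16.
inline std::size_t dma_padded_size(std::size_t bytes) {
    assert(bytes > 0);
    if (dma_size_ok(bytes)) {
        return bytes;
    }
    if (bytes < 8) {
        std::size_t size = 1;
        while (size < bytes) {
            size *= 2;
        }
        return size;
    }
    return (bytes + 15) / 16 * 16;
}
```

A buffer would typically be padded with dma_padded_size and allocated on a 16-byte boundary before its address and size are handed to mfc_get or mfc_put.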

The NVIDIA Architecture

Programming with CUDA

Motivation

GPU beyond 3D graphics

Data parallel algorithms leverage GPU attributes
- Large data arrays, streaming throughput
- Fine-grain SIMD parallelism
- Low-latency floating point (FP) computation

Back in the day of OpenGL GPGPU
- Limited texture size/dimension
- Limited outputs
- Lack of integers and bitwise operators
- Limited communications

The NVIDIA Products

GeForce series
- Separate HW interface
- Work as an external MPM

Tesla machines
- 8-series GPUs: 200 GFLOPS
- Stand-alone or 1U rackable unit

Inside a GPU

Hierarchical Memory
- Global Memory
- Shared Memory
- Local Memory

Processors
- High density SMP
- Support 4-way SIMD

Global View

Kernels
- A GPGPU application is made of CPU computation and GPU kernels

Grids and Blocks
- Kernel = grid of thread blocks
- All threads share the data memory space
- A thread block is a batch of threads that can cooperate

Block and Thread IDs

Threads and blocks have IDs
- Each thread decides which data to process
- Block ID: 1D or 2D
- Thread ID: 1D, 2D, or 3D

Memory Access
- Depends on the domain
- Image: 2D
- Physics: 3D
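
To make "each thread decides which data to process" concrete, here is a small host-side sketch in plain C++ (no GPU required) of the usual index arithmetic. The Dim3 struct is a hypothetical stand-in for CUDA's built-in dim3, blockIdx and threadIdx, and the function names are illustrative only. It assumes row-major storage and, as stated on the slide, a 2D block ID with a thread ID of up to 3 dimensions.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for CUDA's dim3 / blockIdx / threadIdx.
struct Dim3 {
    unsigned x, y, z;
};

// Global coordinate along one axis: offset of the block plus the
// position of the thread inside that block.
inline unsigned global_coord(unsigned block_id, unsigned block_dim,
                             unsigned thread_id) {
    assert(thread_id < block_dim);
    return block_id * block_dim + thread_id;
}

// 2D case (image): row-major offset of the pixel owned by one thread.
inline std::size_t image_offset(Dim3 block, Dim3 block_dim, Dim3 thread,
                                std::size_t width) {
    const std::size_t x = global_coord(block.x, block_dim.x, thread.x);
    const std::size_t y = global_coord(block.y, block_dim.y, thread.y);
    return y * width + x;
}

// 3D case (physics): blocks tile the x/y plane, the third thread
// dimension walks along z inside the block.
inline std::size_t volume_offset(Dim3 block, Dim3 block_dim, Dim3 thread,
                                 std::size_t width, std::size_t height) {
    assert(thread.z < block_dim.z);
    const std::size_t x = global_coord(block.x, block_dim.x, thread.x);
    const std::size_t y = global_coord(block.y, block_dim.y, thread.y);
    const std::size_t z = thread.z;
    return (z * height + y) * width + x;
}
```

In a real kernel the same expressions would be written with blockIdx, blockDim and threadIdx; the arithmetic is unchanged.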

Memory Access Patterns

Each thread can
- R/W per-thread registers
- R/W per-thread local memory
- R/W per-block shared memory
- R/W per-grid global memory
- Read only per-grid constant memory

The host can
- R/W constant memory
- R/W texture memory
- R/W global memory

Global, Constant, and Texture Memories

Global Memory
- Main means of communicating between host and device
- Contents visible to all threads

Texture and Constant
- Constants initialized by host

CUDA Processing Flow

Copy Processing Data

Create data on Host
- cudaMallocHost(): allocate (page-locked) memory on the host
- cudaMalloc(): allocate memory in the device global memory

Copy to Device
- cudaMemcpy(): copy memory between host and device
- Asynchronous variant (cudaMemcpyAsync()) available since CUDA 1.1
- Works 4-way: (host, device) x (host, device)

Example
    float *host, *device;

    cudaMallocHost(&host, sizeof(float)*64*64);
    cudaMalloc(&device, sizeof(float)*64*64);

    // The destination comes first: copy from host to device
    cudaMemcpy(device, host, sizeof(float)*64*64, cudaMemcpyHostToDevice);

Instruct the Processing

Define the device mapping
- CUDA provides built-in types for dimensions
- Define a block grid
- Define a thread grid

Run the kernel
- CUDA provides a syntax extension for calling a given function over a given grid

Example
    dim3 dimBlock(16,16);
    dim3 dimGrid(64 / dimBlock.x, 64 / dimBlock.y);

    // The kernel works on device memory, so it receives the device pointer
    device_kernel<<<dimGrid, dimBlock>>>(device, 64);
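
The example divides 64 by the block size exactly. When the data size is not a multiple of the block size, a common approach is to round the number of blocks up and let out-of-range threads do nothing. The sketch below shows that arithmetic in plain C++ on the host. The names blocks_needed and in_range are hypothetical helpers, not CUDA API.

```cpp
#include <cassert>

// Number of blocks needed to cover n elements with `block` threads each,
// rounded up so that a partial last block is included.
inline unsigned blocks_needed(unsigned n, unsigned block) {
    assert(block > 0);
    return (n + block - 1) / block;
}

// Guard a thread would evaluate before touching memory: true when the
// global index of (block_id, thread_id) addresses a real element.
inline bool in_range(unsigned block_id, unsigned block_dim,
                     unsigned thread_id, unsigned n) {
    return block_id * block_dim + thread_id < n;
}
```

With these helpers, a 70-element row and 16-thread blocks would need 5 blocks along that axis, and the last 10 threads of the fifth block would skip their work.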

Build a Parallel kernel

kernel.cu
    // Assumed to match the 16x16 thread block used at launch
    #define BLOCK_SIZE 16

    __global__ void device_kernel(float* data, size_t size)
    {
        // Block index
        int bx = blockIdx.x;
        int by = blockIdx.y;

        // Thread index
        int tx = threadIdx.x;
        int ty = threadIdx.y;

        // Row and column of the element handled by this thread
        size_t row = BLOCK_SIZE * by + ty;
        size_t col = BLOCK_SIZE * bx + tx;

        // Each thread inverts exactly one element of its block's tile
        if (row < size && col < size)
            data[row * size + col] = 255 - data[row * size + col];
    }
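
When debugging a kernel of this kind, it can be useful to have a serial reference that walks the same grid of blocks and threads on the CPU. The following plain C++ sketch does that for an inversion (255 - value) over a square array, with each (block, thread) pair handling one element. The names invert_one and invert_on_cpu and the tile edge of 16 are assumptions made for illustration. This is not the author's code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t BLOCK_SIZE = 16;  // assumed tile edge (16x16 threads)

// Work of a single thread: invert the one element it owns, if any.
inline void invert_one(std::vector<float>& data, std::size_t size,
                       std::size_t bx, std::size_t by,
                       std::size_t tx, std::size_t ty) {
    const std::size_t col = bx * BLOCK_SIZE + tx;
    const std::size_t row = by * BLOCK_SIZE + ty;
    if (row < size && col < size) {
        data[row * size + col] = 255.0f - data[row * size + col];
    }
}

// Serial stand-in for a kernel launch: visit every block of the grid,
// then every thread of the block, in nested loops.
inline std::vector<float> invert_on_cpu(std::vector<float> data,
                                        std::size_t size) {
    assert(data.size() == size * size);
    const std::size_t blocks = (size + BLOCK_SIZE - 1) / BLOCK_SIZE;
    for (std::size_t by = 0; by < blocks; ++by)
        for (std::size_t bx = 0; bx < blocks; ++bx)
            for (std::size_t ty = 0; ty < BLOCK_SIZE; ++ty)
                for (std::size_t tx = 0; tx < BLOCK_SIZE; ++tx)
                    invert_one(data, size, bx, by, tx, ty);
    return data;
}
```

Comparing the output of invert_on_cpu with the array copied back from the device is a cheap way to check that every element is touched once and only once.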

Sample Code

see mmul.*

As a Conclusion ...

Some research topics ...
- High-level tools are needed. WIP includes:
  - Algorithmic Skeletons for the Cell
  - Bulk Synchronous Parallelism for GPU
  - Architecture-independent algebra library

Some untapped domains
- Operations Research
- Cryptography/Compression
- Artificial Intelligence
