massively parallel architecturesfalcou/teaching/par/accelerator.pdf · 2009-03-11 · massively...
The CELL processor
Introduction to GPGPU
Conclusion
Massively Parallel Architectures
A Take on Cell Processor and GPU Programming
Joel Falcou - [email protected]
Building 490 - Office 104
20 January 2009
J. Falcou
Motivation
Harder, Better, Faster, Stronger (famous tune)
Scientific computation demands large amounts of computing power
Faster computation = more results now
Biology and Health Care
Oil and Finance
Video Games Industry
The Silent Revolution
Computing power: 400 GFLOPS vs 32 GFLOPS
Memory bandwidth: 100-200 GB/s vs 10 GB/s
GPUs are in everyday PCs
Cell went from server blades to the game industry (PS3)
When Video Games Ruled the World
Game design has become ever more sophisticated.
Fast GPUs lead to complex shaders for real-time effects.
In turn, the demand for speed has led to ever-increasing innovation in card design.
The gaming industry has overtaken the defense, finance, oil and healthcare industries as the main driving factor for high-performance processors.
The NV40 architecture has 225 million transistors, compared to about 175 million for the Pentium 4 EE 3.2 GHz chip.
Objectives
Theory!
Hardware architecture of GPU and Cell processor
Pros and cons of those architectures
... and Practice
Introduction to GPGPU
Tools and languages
Sample code
Architecture
Coding for the CELL
Motivation
Less is More
General-purpose CPUs keep increasing in complexity
Peak performance gains are slowing down
Build more with less complex processing units
The CELL Processor
Heterogeneous multi-core
DSP-like coprocessors
High memory bandwidth (200 GB/s)
Where to find it?
The CELL Processor
Structure
1 PowerPC Processing Element (PPE)
8 Synergistic Processing Elements (SPEs)
1 XDR DRAM interface
1 4-way DMA bus
Sources of parallelism
Thread-level parallelism (TLP) over the PPE
TLP over the SPEs
Instruction-level parallelism (ILP) inside each SPE
Available Tools
... that work
GCC/G++ for the Cell
GFORTRAN for the Cell
Use a dual-source compilation process (separate PPE and SPE sources)
... that don't work
OpenMP: bad scaling, huge executable
Task-based MPI: huge latency, low bandwidth
Separate development
Specificities of the PPE
All the features of a PPC core
Supports up to two threads
Full-fledged AltiVec SIMD extension
Specificities of the SPEs
Specialized AltiVec SIMD extension
No scalar ALU
No cache and no branch predictor
Memory and Communications
Communicating between PPE and SPEs
SPE Local Stores (LS) are mapped into the PPE virtual address space
PPE and SPE code share the same process space
SPE code must be 'downloaded' when the application starts
Handling SPE Local Store
SPE LS is only 256 KB for code + data
SPE memories aren't shared
Need for explicit data transfer primitives
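Because code and data share the 256 KB Local Store, buffer sizes have to be budgeted by hand. The sketch below is a hypothetical helper, not part of any Cell SDK: it splits whatever the program image leaves free into equal buffers, rounded down to a multiple of 16 bytes. It ignores the stack, which also lives in the Local Store, so real budgets would be smaller.

```c
#include <assert.h>

#define LS_BYTES (256u * 1024u) /* Local Store size quoted on the slide */

/* Hypothetical helper: largest size of each of n_buffers equal buffers
   once code_bytes are taken by the program, rounded down to 16 bytes. */
unsigned max_buffer_bytes(unsigned code_bytes, unsigned n_buffers)
{
    unsigned per_buffer;

    if (n_buffers == 0 || code_bytes >= LS_BYTES)
        return 0;                        /* nothing left to share out */
    per_buffer = (LS_BYTES - code_bytes) / n_buffers;
    return per_buffer - per_buffer % 16; /* keep a multiple of 16 bytes */
}
```

For instance, with 64 KB of code and two buffers, each buffer can hold at most 98304 bytes.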
Mailbox
Allows transfer of small data (32 bits) between SPE and PPE
Two mailboxes per SPE (in and out)
Two modes: waiting or polling
Useful for simple synchronization (thread pool pattern)
Primitives: spe_in_mbox_write and spe_out_mbox_read
Signal
Allows transfer of small data (32 bits) between SPEs
Two general-purpose signal slots per SPE
Useful for message-passing emulation with DMA transfers
Primitives: mfc_sndsig and spu_read_signal1 / spu_read_signal2
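The polling mode of a mailbox can be pictured with a small software model. This is only a sketch under simplifying assumptions: mailbox_t, mbox_try_write and mbox_try_read are hypothetical names, the model holds a single 32-bit entry, and it does not call the real SPE library.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical one-entry model of a 32-bit mailbox (not the real API). */
typedef struct {
    uint32_t value; /* the single 32-bit message slot */
    int      full;  /* 1 when a message is waiting to be read */
} mailbox_t;

/* Polling write: returns 1 if the message was stored, 0 if the slot is busy. */
int mbox_try_write(mailbox_t *m, uint32_t msg)
{
    if (m->full)
        return 0;
    m->value = msg;
    m->full  = 1;
    return 1;
}

/* Polling read: returns 1 and fills *msg if a message was waiting, else 0. */
int mbox_try_read(mailbox_t *m, uint32_t *msg)
{
    if (!m->full)
        return 0;
    *msg    = m->value;
    m->full = 0;
    return 1;
}
```

In waiting mode the caller would block instead of receiving 0; in polling mode it retries or does other work, which is what a thread pool typically wants.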
DMA Transfers
Principles
Keep the SPU from blocking during memory transfers
Used to download SPE code into SPE LS
Up to 4 transfers can be done in parallel over the SPE bus
Up to one upload and one download in parallel over the PPE bus
Primitives: mfc_get, mfc_put and mfc_read_tag_status_all
Traps and Pitfalls
Data to send/receive must be aligned on a 128-bit (16-byte) boundary
Data size should be 1, 2, 4, 8 or any multiple of 16 bytes
Limited number of DMA channels
Double buffering must be considered
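The size and alignment rules, and the shape of a double-buffered loop, can be sketched in portable C. Everything here is an assumption made for illustration: dma_size_ok, dma_aligned, fetch and sum_double_buffered are hypothetical names, and fetch is a plain memcpy standing in for an asynchronous mfc_get, so no real overlap happens on a CPU.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CHUNK 64 /* floats per transfer: 256 bytes, a multiple of 16 */

/* 1, 2, 4, 8 bytes or any non-zero multiple of 16 bytes. */
int dma_size_ok(size_t bytes)
{
    return bytes == 1 || bytes == 2 || bytes == 4 || bytes == 8 ||
           (bytes != 0 && bytes % 16 == 0);
}

/* A 128-bit boundary means an address divisible by 16. */
int dma_aligned(const void *p)
{
    return ((uintptr_t)p % 16) == 0;
}

/* Stand-in for an asynchronous fetch such as mfc_get. */
void fetch(float *local, const float *remote, size_t count)
{
    memcpy(local, remote, count * sizeof(float));
}

/* Sum n floats (n a multiple of CHUNK) with two buffers: while buffer
   cur is processed, buffer 1 - cur receives the next chunk. */
float sum_double_buffered(const float *remote, size_t n)
{
    float  buf[2][CHUNK];
    size_t chunks = n / CHUNK;
    float  total = 0.0f;
    int    cur = 0;

    if (chunks == 0)
        return 0.0f;
    fetch(buf[cur], remote, CHUNK);            /* prefetch first chunk */
    for (size_t c = 0; c < chunks; ++c) {
        if (c + 1 < chunks)                    /* start next transfer */
            fetch(buf[1 - cur], remote + (c + 1) * CHUNK, CHUNK);
        /* on the SPE, a wait on the tag of buf[cur] would go here */
        for (size_t i = 0; i < CHUNK; ++i)     /* compute on current chunk */
            total += buf[cur][i];
        cur = 1 - cur;                         /* swap buffers */
    }
    return total;
}
```

On the SPE, the transfer into the second buffer would run while the first is being summed, which is the whole point of paying for two buffers in the Local Store.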
The NVIDIA Architecture
Programming with CUDA
Motivation
GPU beyond 3D graphics
Data-parallel algorithms leverage GPU attributes
Large data arrays, streaming throughput
Fine-grain SIMD parallelism
Low-latency floating point (FP) computation
Back in the days of OpenGL GPGPU
Limited texture size/dimension
Limited outputs
Lack of integer and bitwise operators
Limited communications
The NVIDIA Products
GeForce series
Separate HW interface
Works as an external MPM
Tesla machines
8-series GPUs: 200 GFLOPS
Stand-alone or 1U rackable unit
Inside a GPU
Hierarchical Memory
Global Memory
Shared Memory
Local Memory
Processors
High density SMP
Supports 4-way SIMD
Global View
Kernels
A GPGPU application is made of
CPU computation
GPU kernels
Grids and Blocks
Kernel = grid of thread blocks
All threads share the data memory space
A thread block is a batch of threads that can cooperate
Block and Thread IDs
Threads and blocks have IDs
Each thread decides which data to process
Block ID: 1D or 2D
Thread ID: 1D, 2D, or 3D
Memory Access
Depends on domain
Image: 2D
Physics: 3D
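How a thread picks its data from its IDs comes down to a little integer arithmetic. The sketch below mirrors, in plain C, the usual CUDA rule blockIdx * blockDim + threadIdx for one axis, then the row-major offset of a 2D image pixel. global_coord and pixel_offset are hypothetical helper names chosen for this illustration.

```c
#include <assert.h>

/* Global coordinate along one axis, given the block ID, the number of
   threads per block on that axis, and the thread ID inside the block. */
int global_coord(int block_id, int block_dim, int thread_id)
{
    return block_id * block_dim + thread_id;
}

/* Row-major offset of pixel (x, y) in an image of the given width. */
int pixel_offset(int x, int y, int width)
{
    return y * width + x;
}
```

For example, thread 5 of block 2 with 16 threads per block handles column 37; on row 3 of a 64-pixel-wide image, that is offset 229.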
Memory Access Patterns
Each thread can
R/W per-thread registers
R/W per-thread local memory
R/W per-block shared memory
R/W per-grid global memory
Read-only per-grid constant memory
The host can
R/W constant memory
R/W texture memory
R/W global memory
Global, Constant, and Texture Memories
Global Memory
Main means of communicating between host and device
Contents visible to all threads
Texture and Constant
Constants initialized by host
Contents visible to all threads
CUDA Processing Flow
Copy Processing Data
Create data on Host
cudaMallocHost(): allocate page-locked memory on the host
cudaMalloc(): allocate memory in the device Global Memory
Copy to Device
cudaMemcpy(): copy memory between host and device
Asynchronous variant cudaMemcpyAsync() available since CUDA 1.1
Works 4-way: (host,device) X (host,device)
Example
float *host, *device;
cudaMallocHost((void**)&host, sizeof(float)*64*64);
cudaMalloc((void**)&device, sizeof(float)*64*64);
// cudaMemcpy takes the destination first, then the source
cudaMemcpy(device, host, sizeof(float)*64*64, cudaMemcpyHostToDevice);
Instruct the Processing
Define the device mapping
CUDA provides built-in types for dimension
Define a block grid
Define a thread grid
Run the kernel
CUDA provides a syntax extension for calling a given function over a given grid
Example
dim3 dimBlock(16,16);
dim3 dimGrid(64 / dimBlock.x, 64 / dimBlock.y);
// The kernel must receive the device pointer, not the host one
device_kernel<<<dimGrid, dimBlock>>>(device,64);
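In the example, 64 is an exact multiple of the block side 16, so a plain division gives the number of blocks. For sizes that do not divide evenly, a common approach is to round up and let the kernel ignore the surplus threads. The helper below is a hypothetical plain-C sketch of that rounding, not a CUDA API.

```c
#include <assert.h>

/* Number of blocks of `block` threads needed to cover n elements,
   rounding up so that a partial last block is still launched. */
unsigned blocks_needed(unsigned n, unsigned block)
{
    return (n + block - 1) / block;
}
```

With this rule a 64-wide image still needs 4 blocks of 16, while a 70-wide image needs 5, the last one only partly used.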
Build a Parallel kernel
kernel.cu
__global__ void device_kernel(float* data, size_t size)
{
  // Global column (x) and row (y) of the element handled by this thread
  size_t x = blockIdx.x * blockDim.x + threadIdx.x;
  size_t y = blockIdx.y * blockDim.y + threadIdx.y;
  // Each thread inverts exactly one element; guard against out-of-range threads
  if (x < size && y < size)
    data[y * size + x] = 255 - data[y * size + x];
}
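One way to check the indexing of an inversion kernel without a GPU is to replay the grid sequentially on the CPU. The sketch below assumes one thread per element and a block side of 16; invert_one and launch_on_cpu are hypothetical names, and the nested loops only emulate the order-independent work of the threads.

```c
#include <assert.h>
#include <stddef.h>

#define BLOCK 16 /* threads per block side, as in dimBlock(16,16) */

/* Work done by one thread, identified by its block and thread indices. */
void invert_one(float *data, size_t size, int bx, int by, int tx, int ty)
{
    size_t x = (size_t)bx * BLOCK + (size_t)tx;
    size_t y = (size_t)by * BLOCK + (size_t)ty;

    if (x < size && y < size) /* skip threads outside the image */
        data[y * size + x] = 255.0f - data[y * size + x];
}

/* Sequentially replay every thread of every block of the grid. */
void launch_on_cpu(float *data, size_t size)
{
    int blocks = (int)((size + BLOCK - 1) / BLOCK);

    for (int by = 0; by < blocks; ++by)
        for (int bx = 0; bx < blocks; ++bx)
            for (int ty = 0; ty < BLOCK; ++ty)
                for (int tx = 0; tx < BLOCK; ++tx)
                    invert_one(data, size, bx, by, tx, ty);
}
```

If every element ends up inverted exactly once, each (block, thread) pair maps to a distinct element, which is the property the GPU version relies on.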
Sample Code
see mmul.*
As a Conclusion ...
Some research topics ...
High-level tools are needed. Work in progress includes:
Algorithmic skeletons for the Cell
Bulk Synchronous Parallelism for GPU
Architecture-independent algebra library
Some untapped domains
Operational Research
Cryptography/Compression
Artificial Intelligence