MS Thesis Defense
“IMPROVING GPU PERFORMANCE BY REGROUPING CPU-MEMORY DATA”
by Deepthi Gummadi
CoE EECS Department
April 21, 2014
About Me
Deepthi Gummadi
MS in Computer Networking with Thesis
LaTeX programmer at CAPPLab since Fall 2013
Publications
“New CPU-to-GPU Memory Mapping Technique,” in IEEE SouthEast Conference 2014.
“The Impact of Thread Synchronization and Data Parallelism on Multicore Game Programming,” accepted, IEEE ICIEV 2014.
“Feasibility Study of Spider-Web Multicore/Manycore Network Architectures,” in preparation.
“Investigating Impact of Data Parallelism on Computer Game Engine,” under review, IJCVSP Journal, 2014.
Committee Members
Dr. Abu Asaduzzaman, EECS Dept.
Dr. Ramazan Asmatulu, ME Dept.
Dr. Zheng Chen, EECS Dept.
“IMPROVING GPU PERFORMANCE BY REGROUPING CPU-MEMORY DATA”
Outline ►
Introduction
Motivation
Problem Statement
Proposal
Evaluation
Experimental Results
Conclusions
Future Work
Questions? Any time, please.
Introduction
Central Processing Unit (CPU) Technology
Interprets and executes program instructions.
What is new about CPUs?
Initially, processors evolved with a sequential structure.
Around the millennium, processor designs turned from raw clock speed to parallelism.
Currently, we have multi-core on-chip CPUs.
CPU Speed Chart
Cache Memory Organization
Why do we use cache memory?
Several memory layers:
Lower-level caches – faster, used for computations.
Higher-level caches – slower, used for storage.
Intel 4-core processor
NVIDIA Graphics Processing Unit
Parallel Processing Architecture
Components: Streaming Multiprocessors, Warp Schedulers, Execution pipelines, Registers
Memory Organization: Shared memory, Global memory
GPU Memory Organization
CPU and GPU
CPU                    GPU
Low latency            High throughput, moderate latency
Cache memory           Shared memory
Optimized for MIMD     Optimized for SIMD
CPU and GPU work together to be more efficient.
CPU-GPU Computing Workflow
Step 1: CPU allocates the memory and copies the data.
cudaMalloc() cudaMemcpy()
CPU-GPU Computing Workflow
Step 2: CPU sends function parameters and instructions to GPU.
CPU-GPU Computing Workflow
Step 3: GPU executes the instructions based on received commands.
CPU-GPU Computing Workflow
Step 4: After execution, the results will be retrieved from GPU DRAM to CPU memory.
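The four workflow steps above can be sketched as a host-side CUDA C program. This is an illustrative sketch only (the kernel, array sizes, and launch configuration are assumptions, not from the thesis), and it requires nvcc and a CUDA-capable GPU to build and run:

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

/* Step 3: GPU executes the instructions (here, a trivial scaling kernel). */
__global__ void scale(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

int main(void) {
    const int n = 1024;
    float h[1024];
    for (int i = 0; i < n; i++) h[i] = (float)i;

    float *d;
    cudaMalloc(&d, n * sizeof(float));                            /* Step 1: allocate */
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);  /* Step 1: copy data */

    scale<<<(n + 255) / 256, 256>>>(d, n);                        /* Step 2: send parameters, launch */

    cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);  /* Step 4: retrieve results */
    cudaFree(d);
    printf("h[10] = %f\n", h[10]);
    return 0;
}
```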
Motivation
■ Data-level parallelism
Spatial data partitioning
Temporal data partitioning
Spatial instruction partitioning
Temporal instruction partitioning
Two Parallelization Strategies
Motivation
■ Parallelism and optimization techniques simplify programming in CUDA.
■ From the developer’s view, the memory is unified.
Problem Statement
The traditional CPU-to-GPU global memory mapping technique is not well suited to GPU shared memory.
Outline ►
Introduction
Motivation
Problem Statement
Proposal
Evaluation
Experimental Results
Conclusions
Future Work
Questions? Any time, please.
“IMPROVING GPU PERFORMANCE BY REGROUPING CPU-MEMORY DATA”
Proposal
A new CPU-to-GPU memory mapping to improve GPU shared-memory performance.
Proposed Technique
Major Steps:
Step 1: Start.
Step 2: Analyze problems; determine input parameters.
Step 3: Analyze GPU card parameters/characteristics.
Step 4: Analyze CPU and GPU memory organizations.
Step 5: Determine the number of computations and the number of threads.
Step 6: Identify/Partition the data-blocks for each thread.
Step 7: Copy/Regroup CPU data-blocks to GPU global memory.
Step 8: Stop.
Proposed Technique
Traditional Mapping
■ Data is copied directly from CPU to GPU global memory.
■ Retrieved from different global memory blocks.
■ It is difficult to store the data into GPU shared memory.
Proposed Mapping
■ Data is regrouped and then copied from CPU to GPU global memory.
■ Retrieved from consecutive global memory blocks.
■ It is easy to store the data into GPU shared memory.
Evaluation
System Parameters:
CPU: dual processor, 2.13 GHz
Fermi card: 14 SMs, 32 CUDA cores per SM
Kepler card: 13 SMs, 192 CUDA cores per SM
Evaluation
Memory sizes of the CPU and GPU cards.
Input parameters are the numbers of rows and columns; the output parameter is execution time.
Evaluation
Electric charge distribution by Laplace’s equation for a 2D problem (finite difference approximation):
ϵx(i,j)(Φi+1,j − Φi,j)/dx − ϵx(i−1,j)(Φi,j − Φi−1,j)/dx +
ϵy(i,j)(Φi,j+1 − Φi,j)/dy − ϵy(i,j−1)(Φi,j − Φi,j−1)/dy = 0
Φ = electric potential
ϵ = medium permittivity
dx, dy = spatial grid sizes
Φi,j = electric potential defined at lattice point (i, j)
ϵx(i,j), ϵy(i,j) = effective x- and y-direction permittivity defined at edges of the element cell (i, j)
Evaluation
For a uniform material, the permittivity is the same everywhere, and the equation becomes
(Φi+1,j − Φi,j)/dx − (Φi,j − Φi−1,j)/dx +
(Φi,j+1 − Φi,j)/dy − (Φi,j − Φi,j−1)/dy = 0
Outline ►
Introduction
Motivation
Problem Statement
Proposal
Evaluation
Experimental Results
Conclusions
Future Work
Questions? Any time, please.
“IMPROVING GPU PERFORMANCE BY REGROUPING CPU-MEMORY DATA”
Experimental Results
■ Conducted a study of electric charge distribution using Laplace’s equation.
■ Implemented in three versions: CPU only; GPU with shared memory; GPU without shared memory.
■ Input/Output: problem size (n, for an NxN matrix); execution time.
Experimental Results
Nn,m = 1/5 (Nn,m−1 + Nn,m+1 + Nn,m + Nn−1,m + Nn+1,m), where 1 ≤ n ≤ 8 and 1 ≤ m ≤ 8
Validation of our CUDA/C code:
Both the CPU/C and CUDA/C programs produce the same values.
Experimental Results
Impact of GPU shared memory: as the number of threads increases, the processing time decreases (up to 8x8 threads).
Beyond 8x8 threads, the GPU with shared memory shows better performance.
Experimental Results
Impact of the number of threads: for a fixed amount of shared memory, GPU processing time decreases as the number of threads increases (up to 16x16).
Beyond 16x16 threads, the Kepler card shows better performance.
Experimental Results
Impact of the amount of shared memory: as the size of GPU shared memory increases, the processing time decreases.
Experimental Results
Impact of the proposed data regrouping technique: with data regrouping and shared memory, the processing time decreases as the number of threads increases.
Comparing the GPU with and without shared memory, shared memory gives better performance at higher thread counts.
Conclusions
For fast, effective analysis of complex systems, high performance computations are necessary.
NVIDIA CUDA CPU/GPU computing proves its potential for highly parallel computations.
Traditional memory mapping follows the CPU locality principle, so the data does not fit well into GPU shared memory.
It is more beneficial to keep data in GPU shared memory than in GPU global memory.
Conclusions
To overcome this problem, we proposed a new memory mapping between the CPU and GPU to improve performance.
Implemented in three different versions.
Results indicate that the proposed CPU-to-GPU memory mapping technique decreases the overall execution time by more than 75%.
Future Extensions
■ Modeling and simulation of nanocomposites: nanocomposites require a large number of computations at high speed.
■ Aircraft applications: high performance computations are required to study mixtures of composite materials.
“IMPROVING GPU PERFORMANCE BY REGROUPING CPU-MEMORY DATA”
Questions?
“IMPROVING GPU PERFORMANCE BY REGROUPING CPU-MEMORY DATA”
Thank you
Contact:
Deepthi Gummadi
E-mail: dxgummadi@wichita.edu