Team Programming Project
Byunghyun (Byung) Jang, Ph.D. student
Northeastern University
Jul. 26, 2009
CRA-W/CDC Careers in High Performance Systems (CHiPS) Mentoring Workshop
July 25-27, 2009
National Center for Supercomputing Applications (NCSA) at University of Illinois at Urbana-Champaign (UIUC)
CHiPS - Team Programming Project
Some words about me
▪4th-year Ph.D. student
▪Born and raised in South Korea
▪34 years old (never too late to learn)
▪B.S. in mechanical engineering and M.S. in computer science
▪Full time engineer at Samsung Electronics for 3 years
▪GPGPU
▪Internship at AMD and fellowship from AMD
▪Happy
Goals
▪Understand General Purpose Computing on GPU (a.k.a. GPGPU)
▪Experience CUDA GPU programming
▪Understand how massively multi-threaded parallel programming works
▪Think about solving a problem in a parallel fashion
▪Experience the tremendous computational power of GPU
▪Experience the challenges in efficient parallel programming
Outline
▪Application 1: Image Rotation
  ▪Introduction and Design (15 min)
  ▪Preparation (5 min)
    ▪Install the skeleton code, compile test, image view test
  ▪Hands-on Programming (30 min)
    ▪Replace ??? with your own CUDA code
▪Application 2: Histogram
  ▪Introduction and Design (15 min)
  ▪Preparation (5 min)
    ▪Install the skeleton code, compile test
  ▪Hands-on Programming (40 min)
    ▪Replace ??? with your own CUDA code
▪Conclusion
Application 1: Image Rotation - Introduction -
[Figure: original input image (left) and rotated output image (right)]
▪Rotate an image by a given angle
▪A basic feature in image processing applications
▪What the application does:
Step 1. Compute the new location according to the rotation angle (trigonometric computation)
Step 2. Read the pixel value at the original location
Step 3. Write the pixel value to the new location computed at Step 1
▪Create the same number of threads as the number of pixels
▪Each thread takes care of moving one pixel
▪Our goals are
▪To understand how to use GPU for data parallelism
▪To know how to map threads to data
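The three steps above can be sketched on the CPU as follows. This is a minimal sketch, not the skeleton code: each iteration of the loop stands in for one GPU thread, and the function name `rotate_image`, the 8-bit row-by-row pixel layout, rotation about the image center, and passing in precomputed `cos_a` and `sin_a` values are all assumptions made for illustration.

```c
#include <assert.h>
#include <string.h>

/* Round to the nearest integer without needing the math library. */
static int round_to_int(float v)
{
    return v >= 0.0f ? (int)(v + 0.5f) : (int)(v - 0.5f);
}

/* CPU sketch of the per-pixel work; in a CUDA version each (x, y)
 * iteration would be handled by its own thread.
 * Assumptions (not from the skeleton code): 8-bit grayscale pixels stored
 * row by row, rotation about the image center, cos_a and sin_a are the
 * cosine and sine of the rotation angle computed by the caller, and
 * pixels that land outside the image are dropped. */
void rotate_image(const unsigned char *src, unsigned char *dst,
                  int width, int height, float cos_a, float sin_a)
{
    assert(src != NULL && dst != NULL && width > 0 && height > 0);

    const float cx = (width - 1) / 2.0f;
    const float cy = (height - 1) / 2.0f;

    /* Destination pixels that no source pixel maps to stay black. */
    memset(dst, 0, (size_t)width * (size_t)height);

    for (int y = 0; y < height; y++) {
        for (int x = 0; x < width; x++) {
            /* Step 1: compute the new location. */
            float dx = x - cx;
            float dy = y - cy;
            int nx = round_to_int(cx + dx * cos_a - dy * sin_a);
            int ny = round_to_int(cy + dx * sin_a + dy * cos_a);

            /* Steps 2 and 3: read the original pixel, write it to the
             * new location if that location is inside the image. */
            if (nx >= 0 && nx < width && ny >= 0 && ny < height)
                dst[ny * width + nx] = src[y * width + x];
        }
    }
}
```

A caller would typically pass `cosf(angle)` and `sinf(angle)` from `<math.h>`. Because every source pixel is moved independently of the others, the loop body needs no communication between iterations, which is what makes one thread per pixel a natural fit.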
Application 1: Image Rotation - Design -
[Figure: thread mapping. A 512 × 512 image is covered by a 64 × 64 grid of thread blocks, ThreadBlock(0, 0) through ThreadBlock(63, 63), each with 8 × 8 threads]
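The mapping from a thread to its pixel can be sketched in plain C as below. The sizes come from the design slide (512 × 512 image, 8 × 8 threads per block, 64 × 64 blocks); the names `Index2D`, `pixel_of_thread`, and `pixel_offset` are hypothetical, and treating `x` as the column and `y` as the row is an assumption. In a CUDA kernel the same arithmetic would use the built-in `blockIdx`, `blockDim`, and `threadIdx` variables.

```c
#include <assert.h>

/* Sizes from the design slide; the constant names are hypothetical. */
enum { IMAGE_DIM = 512, BLOCK_DIM = 8, GRID_DIM = IMAGE_DIM / BLOCK_DIM };

typedef struct { int x, y; } Index2D;

/* Global pixel coordinate handled by one thread:
 * block index * block size + thread index within the block. */
Index2D pixel_of_thread(Index2D block_idx, Index2D thread_idx)
{
    assert(block_idx.x >= 0 && block_idx.x < GRID_DIM);
    assert(block_idx.y >= 0 && block_idx.y < GRID_DIM);
    assert(thread_idx.x >= 0 && thread_idx.x < BLOCK_DIM);
    assert(thread_idx.y >= 0 && thread_idx.y < BLOCK_DIM);

    Index2D p = { block_idx.x * BLOCK_DIM + thread_idx.x,
                  block_idx.y * BLOCK_DIM + thread_idx.y };
    return p;
}

/* Row-by-row offset of a pixel in the linear image array. */
int pixel_offset(Index2D p)
{
    return p.y * IMAGE_DIM + p.x;
}
```

With these sizes, 64 × 64 blocks of 8 × 8 threads give exactly 512 × 512 threads, so every pixel is owned by exactly one thread and no bounds check is needed for the source read.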
Application 1: Image Rotation - Preparation -
1. Deploy the skeleton code in the proper directory
[..@ac ~]$ cp /tmp/projects.tar ./
[..@ac ~]$ cp /tmp/cuda.pdf ./    (download for your future reference)
[..@ac ~]$ tar -xf projects.tar
2. Request a cluster node for interactive use for 2 hours
[..@ac ~]$ qsub -I -l walltime=02:00:00
3. Compile
[..@ac ~]$ cd PROJECTS/projects/ImageRotation
[..@ac ~]$ make clean
[..@ac ~]$ make
To use printf() to debug, use “make emu=1” instead of “make”
4. Execute
[..@ac ~]$ ./ImageRotation
5. Convert the image from “pgm” to “jpg” format
[..@ac ~]$ convert data/lena_out.pgm data/lena_out.jpg
6. Download “lena_out.jpg” to your laptop to view it
Application 1: Image Rotation - Hands-on Programming -
▪ Replace ??? in the skeleton code with your own CUDA code
▪ Refer to the hints and comments in the skeleton code
▪ Talk to me if you have any questions or are done
▪ Try to finish by 2:30 pm
▪ Help others if you finish early
Application 2: Histogram - Introduction -
[Figure: input image and its output histogram. x-axis: intensity, from 0 (black) to 255 (white); y-axis: number of pixels]
▪Shows how often each pixel intensity value occurs in the image
▪A commonly used analysis tool in image processing and data mining applications
▪A serial implementation looks like this:
    data[DATA_COUNT];        // input data
    histogram[BIN_COUNT];    // histogram data
    for (int i = 0; i < BIN_COUNT; i++)
        histogram[i] = 0;            // initialization
    for (int i = 0; i < DATA_COUNT; i++)
        histogram[data[i]]++;        // update the corresponding bin
▪Access to data[] is sequential, but access to histogram[] is random because it depends on the data value. Therefore,
▪We will use fast shared memory to store a per-block sub-histogram (s_hist[]), because shared memory handles random memory access much more efficiently than global memory does
Application 2: Histogram - Design -
▪The structure of shared memory would look like the following
▪Notice that shared memory is per thread block and limited in size
[Figure: data[DATA_COUNT] divided into groups of 64 data elements; shared memory s_hist[]]
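The idea of a per-block sub-histogram can be sketched on the CPU as below. This is a sequential stand-in under stated assumptions, not the skeleton code: the function name `block_sub_histograms`, the output array `partial`, the contiguous chunk assigned to each block, and data values already lying in the range 0 to BIN_COUNT - 1 are all hypothetical. The local array `s_hist` plays the role of the block's shared memory.

```c
#include <assert.h>
#include <string.h>

enum { BIN_COUNT = 64 };   /* value from the design slides */

/* CPU sketch: each "block" histograms one contiguous chunk of data[]
 * into its own small s_hist[] and then copies it out to
 * partial[block * BIN_COUNT + bin].
 * Assumptions (hypothetical): chunk_size elements per block and data
 * values already in [0, BIN_COUNT). */
void block_sub_histograms(const unsigned char *data, int data_count,
                          int chunk_size, unsigned int *partial,
                          int num_blocks)
{
    assert(data != NULL && partial != NULL);
    assert(data_count >= 0 && chunk_size > 0 && num_blocks > 0);

    for (int block = 0; block < num_blocks; block++) {
        unsigned int s_hist[BIN_COUNT];      /* stand-in for shared memory */
        memset(s_hist, 0, sizeof s_hist);

        int begin = block * chunk_size;
        int end = begin + chunk_size;
        if (end > data_count)
            end = data_count;

        /* Random-access updates stay inside the small per-block array. */
        for (int i = begin; i < end; i++) {
            assert(data[i] < BIN_COUNT);
            s_hist[data[i]]++;
        }

        /* One sequential write of the block's result. */
        memcpy(&partial[block * BIN_COUNT], s_hist, sizeof s_hist);
    }
}
```

On a GPU, the threads of a block would update `s_hist` concurrently, so updates to the same bin have to be coordinated; the sequential loop here sidesteps that issue and only shows the data layout.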
Application 2: Histogram - Design -
▪Merging per-thread histograms into a per-block histogram
[Figure: shared memory s_hist[] (per block); d_result[] (dimensions: # of thread blocks by BIN_COUNT); final histogram with BIN_COUNT bins; BIN_COUNT = 64, THREAD_N = 192]
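The last stage, combining the per-block results in d_result[] into the final histogram, can be sketched as below. The function name `merge_histograms` and the layout `d_result[block * BIN_COUNT + bin]` are assumptions for illustration; the skeleton code may arrange the data differently.

```c
#include <assert.h>

enum { BIN_COUNT = 64 };   /* value from the design slide */

/* Sum the per-block histograms bin by bin.
 * Assumed layout (hypothetical): d_result holds num_blocks consecutive
 * histograms of BIN_COUNT counters each. */
void merge_histograms(const unsigned int *d_result, int num_blocks,
                      unsigned int *histogram)
{
    assert(d_result && histogram && num_blocks > 0);

    for (int bin = 0; bin < BIN_COUNT; bin++) {
        unsigned int sum = 0;
        for (int block = 0; block < num_blocks; block++)
            sum += d_result[block * BIN_COUNT + bin];
        histogram[bin] = sum;
    }
}
```

Each output bin depends only on its own column of d_result[], so the outer loop could itself be spread across threads, one bin per thread.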
Application 2: Histogram - Preparation -
1. Compile
[..@ac ~]$ cd PROJECTS/projects/Histogram
[..@ac ~]$ make clean
[..@ac ~]$ make
To use printf() to debug, use “make emu=1” instead of “make”
2. Execute
[..@ac ~]$ ./Histogram
3. Check the output message
“*** TEST FAILED”: something is wrong
“*** TEST PASSED”: you got it
Application 2: Histogram - Hands-on Programming -
▪ Replace ??? in the skeleton code with your own CUDA code
▪ Refer to the hints and comments in the skeleton code
▪ Talk to me if you have any questions or are done
▪ Try to finish by 3:30 pm
▪ Help others if you finish early
Conclusions
▪What we’ve learned throughout the two projects
  ▪Understood massively parallel computing on the GPU
  ▪Experienced what CUDA programming looks like
  ▪Understood how to explicitly program hardware resources
  ▪Understood the importance and challenges of parallel programming
  ▪Experienced solving a problem in a massively parallel fashion
▪The GPU is the platform of choice for data-parallel, computationally intensive applications
▪In a few years, we are likely to see many people buying a new graphics card to increase their desktop’s computing performance, not its 3D game performance