Team Programming Project

Byunghyun (Byung) Jang, Ph.D. student

Northeastern University, Jul. 26, 2009

CRA-W/CDC Careers in High Performance Systems (CHiPS)Mentoring Workshop

July 25-27, 2009, National Center for Supercomputing Applications (NCSA) at the University of Illinois at Urbana-Champaign (UIUC)

CHiPS - Team Programming Project

Some words about me

▪4th-year Ph.D. student

▪Born and raised in South Korea

▪34 years old (never too late to learn)

▪B.S. in mechanical engineering and M.S. in computer science

▪Full time engineer at Samsung Electronics for 3 years

▪GPGPU

▪Internship at AMD and fellowship from AMD

▪Happy


Goals

▪Understand General Purpose Computing on GPU (a.k.a. GPGPU)

▪Experience CUDA GPU programming

▪Understand how massively multi-threaded parallel programming works

▪Think about solving a problem in a parallel fashion

▪Experience the tremendous computational power of GPUs

▪Experience the challenges in efficient parallel programming


Outline

▪Application 1: Image Rotation

▪ Introduction and Design (15 min)

▪ Preparation (5 min): install the skeleton code, compile test, image view test

▪ Hands-on Programming (30 min): replace ??? with your own CUDA code

▪Application 2: Histogram

▪ Introduction and Design (15 min)

▪ Preparation (5 min): install the skeleton code, compile test

▪ Hands-on Programming (40 min): replace ??? with your own CUDA code

▪Conclusion


Application 1: Image Rotation - Introduction -

[Figure: original input image and rotated output image]

▪Rotate an image by a given angle

▪A basic feature in image processing applications


▪What the application does:

Step 1. Compute a new location according to the rotation angle (trigonometric computation)

Step 2. Read the pixel value at the original location

Step 3. Write the pixel value to the new location computed in Step 1

▪Create the same number of threads as the number of pixels

▪Each thread takes care of moving one pixel

▪Our goals are

▪To understand how to use a GPU for data parallelism

▪To know how to map threads to data
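The three steps above can be sketched in plain C, with a double loop standing in for the "one thread per pixel" launch. This is a minimal CPU sketch, not the skeleton code: the function names, rotating about the image centre, the bounds check, and passing cos and sin of the angle as parameters (instead of computing them per thread) are all assumptions made for illustration.

```c
/* Work done by one "thread": move the pixel at (x, y) to its rotated location.
   cos_t and sin_t are the cosine and sine of the rotation angle (assumed to be
   computed once by the caller). Rotation is about the image centre (assumption). */
static void rotate_pixel(const unsigned char *in, unsigned char *out,
                         int width, int height, int x, int y,
                         float cos_t, float sin_t)
{
    float cx = (width - 1) / 2.0f;
    float cy = (height - 1) / 2.0f;
    float dx = x - cx;
    float dy = y - cy;

    /* Step 1: compute the new location from the rotation angle */
    float fx = cx + dx * cos_t - dy * sin_t;
    float fy = cy + dx * sin_t + dy * cos_t;

    /* Pixels rotated outside the image are dropped (assumption) */
    if (fx < -0.5f || fx >= width - 0.5f || fy < -0.5f || fy >= height - 0.5f)
        return;

    int nx = (int)(fx + 0.5f);  /* round to the nearest pixel */
    int ny = (int)(fy + 0.5f);

    /* Steps 2 and 3: read the original pixel, write it to the new location */
    out[ny * width + nx] = in[y * width + x];
}

/* Emulates launching one thread per pixel: each loop iteration is one thread. */
void rotate_image(const unsigned char *in, unsigned char *out,
                  int width, int height, float cos_t, float sin_t)
{
    for (int y = 0; y < height; y++)
        for (int x = 0; x < width; x++)
            rotate_pixel(in, out, width, height, x, y, cos_t, sin_t);
}
```

Because every pixel is handled independently, the loop body maps directly onto one GPU thread. Note that this forward mapping can leave some output pixels unwritten for angles that are not multiples of 90 degrees.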


Application 1: Image Rotation - Design -

[Figure: thread mapping. The 512 x 512 image is covered by a grid of thread blocks, ThreadBlock(0, 0) through ThreadBlock(63, 63), each containing 8 x 8 threads.]
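The mapping in the figure (64 x 64 blocks of 8 x 8 threads over a 512 x 512 image) comes down to a little index arithmetic. The plain C sketch below mirrors the usual CUDA expression `blockIdx.x * blockDim.x + threadIdx.x`; the function names are hypothetical, and treating the first block coordinate as x is an assumption.

```c
/* Global coordinate along one axis for a thread inside a block.
   Mirrors the usual CUDA expression blockIdx.x * blockDim.x + threadIdx.x. */
int global_index(int block_idx, int block_dim, int thread_idx)
{
    return block_idx * block_dim + thread_idx;
}

/* Row-major offset of the pixel handled by thread (tx, ty) of block (bx, by).
   block_dim is the block edge length (8 in the slides), width the image width. */
int pixel_offset(int bx, int by, int tx, int ty, int block_dim, int width)
{
    int x = global_index(bx, block_dim, tx);
    int y = global_index(by, block_dim, ty);
    return y * width + x;
}
```

With 8 threads per block edge and 64 blocks per axis, the last thread of the last block lands on coordinate 63 * 8 + 7 = 511, so the grid covers every pixel of the 512 x 512 image exactly once.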


Application 1: Image Rotation - Preparation -

1. Deploy the skeleton code in the proper directory

[..@ac ~]$ cp /tmp/projects.tar ./

[..@ac ~]$ cp /tmp/cuda.pdf ./

[..@ac ~]$ tar -xf projects.tar

2. Request a cluster node for interactive use for 2 hours

[..@ac ~]$ qsub -I -l walltime=02:00:00

3. Compile

[..@ac ~]$ cd PROJECTS/projects/ImageRotation

[..@ac ~]$ make clean

[..@ac ~]$ make

To use printf() to debug, use “make emu=1” instead of “make”

4. Execute

[..@ac ~]$ ./ImageRotation

5. Convert the image from “pgm” to “jpg” format

[..@ac ~]$ convert data/lena_out.pgm data/lena_out.jpg

6. Download “lena_out.jpg” to your laptop to view it

Download for your future reference


Application 1: Image Rotation - Hands-on Programming -

▪ Replace ??? in the skeleton code with your own CUDA code

▪ Refer to the hints and comments in the skeleton code

▪ Talk to me if you have any questions or are done

▪ Try to finish by 2:30 pm

▪ Help others if you finish early


Application 2: Histogram - Introduction -

[Figure: input image and its output histogram. x-axis: intensity, from 0 (black) to 255 (white); y-axis: number of pixels.]

▪Shows how often each pixel intensity value occurs in the image

▪A commonly used analysis tool in image processing and data mining applications


▪A serial implementation looks like this:

unsigned char data[DATA_COUNT];      // input data (element type assumed)
unsigned int histogram[BIN_COUNT];   // histogram data (element type assumed)

for (int i = 0; i < BIN_COUNT; i++)
    histogram[i] = 0;                // initialization

for (int i = 0; i < DATA_COUNT; i++)
    histogram[data[i]]++;            // update the corresponding bin

▪Access to data[] is sequential, but access to histogram[] is random because the bin depends on each data value. Therefore:

▪We will use fast shared memory to store a per-block sub-histogram (s_hist[]), because shared memory handles random memory access much more efficiently than global memory does


Application 2: Histogram - Design -

▪The structure of shared memory looks like the following

▪Notice that shared memory is per thread block and limited in size

[Figure: data[DATA_COUNT] and shared memory s_hist[], drawn in groups of 64 data elements]


▪Merging per-thread histogram into per-block histogram

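[Figure: the per-thread histograms in shared memory s_hist[] (THREAD_N = 192 threads, BIN_COUNT = 64 bins) are merged into one histogram per block. d_result[] holds BIN_COUNT bins for each thread block, and these are merged into the final histogram of BIN_COUNT bins.]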
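The two merge levels described above (per-thread to per-block, then per-block to final) can be emulated on the CPU in plain C. This is a sketch under stated assumptions, not the skeleton code: the function names, the strided assignment of data elements to threads, the even split of data[] across blocks, and the counter types are all hypothetical. BIN_COUNT, THREAD_N, s_hist[] and d_result[] are the names used in the slides. Input values are assumed to already lie in the range 0 to BIN_COUNT - 1.

```c
#include <string.h>

#define BIN_COUNT 64   /* number of bins, from the slides */
#define THREAD_N  192  /* threads per block, from the slides */

/* Emulates one thread block. Precondition: every value in chunk is < BIN_COUNT. */
static void block_histogram(const unsigned char *chunk, int n,
                            unsigned int *block_hist)
{
    /* Stands in for shared memory s_hist[]: one small histogram per thread */
    static unsigned int s_hist[THREAD_N][BIN_COUNT];
    memset(s_hist, 0, sizeof s_hist);

    /* Each outer iteration is one thread; thread t takes every THREAD_N-th element */
    for (int t = 0; t < THREAD_N; t++)
        for (int i = t; i < n; i += THREAD_N)
            s_hist[t][chunk[i]]++;

    /* Merge the per-thread histograms into the per-block histogram */
    for (int b = 0; b < BIN_COUNT; b++) {
        unsigned int sum = 0;
        for (int t = 0; t < THREAD_N; t++)
            sum += s_hist[t][b];
        block_hist[b] = sum;
    }
}

/* d_result must hold block_n * BIN_COUNT counters, final_hist BIN_COUNT counters. */
void histogram(const unsigned char *data, int data_count, int block_n,
               unsigned int *d_result, unsigned int *final_hist)
{
    int chunk = (data_count + block_n - 1) / block_n;  /* elements per block */

    for (int blk = 0; blk < block_n; blk++) {
        int start = blk * chunk;
        int n = 0;
        if (start < data_count)
            n = (data_count - start < chunk) ? data_count - start : chunk;
        else
            start = 0;  /* empty block: keep the pointer in range */
        block_histogram(data + start, n, d_result + blk * BIN_COUNT);
    }

    /* Merge the per-block histograms into the final histogram */
    for (int b = 0; b < BIN_COUNT; b++) {
        unsigned int sum = 0;
        for (int blk = 0; blk < block_n; blk++)
            sum += d_result[blk * BIN_COUNT + b];
        final_hist[b] = sum;
    }
}
```

Giving each thread its own counters avoids two threads incrementing the same bin at once. The cost is shared memory: on a real GPU the space for THREAD_N x BIN_COUNT counters is limited, so the counter width would need to be chosen with that limit in mind.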


Application 2: Histogram - Preparation -

1. Compile

[..@ac ~]$ cd PROJECTS/projects/Histogram

[..@ac ~]$ make clean

[..@ac ~]$ make

To use printf() to debug, use “make emu=1” instead of “make”

2. Execute

[..@ac ~]$ ./Histogram

3. Check the output message

“*** TEST FAILED”: something is wrong

“*** TEST PASSED”: you got it


Application 2: Histogram - Hands-on Programming -

▪ Replace ??? in the skeleton code with your own CUDA code

▪ Refer to the hints and comments in the skeleton code

▪ Talk to me if you have any questions or are done

▪ Try to finish by 3:30 pm

▪ Help others if you finish early


Conclusions

▪What we’ve learned throughout the two projects

▪Understood massively parallel computing on GPUs

▪Experienced what CUDA programming looks like

▪Understood how to explicitly program hardware resources

▪Understood the importance and challenges of parallel programming

▪Experienced solving a problem in a massively parallel fashion

▪The GPU is the platform of choice for data-parallel, computationally intensive applications

▪In a few years, we are likely to see many people buying a new graphics card to increase the desktop’s computing performance, not to increase 3D game performance


Thank you!
