
Integral Image Computation on GPU

Marwa Chouchene, Fatma Ezahra Sayadi, Mohamed Atri, Rached Tourki
Laboratory of Electronics and Microelectronics (EμE), Faculty of Sciences Monastir, Monastir, Tunisia
[email protected], [email protected], [email protected], [email protected]

Abstract—In this paper we present an integral image algorithm that can run in real time on a Graphics Processing Unit (GPU). Our system exploits parallelism in computation via the NVIDIA CUDA programming model, a software platform for solving non-graphics problems in a massively parallel, high-performance fashion.

We compare the performance of the parallel approach running on the GPU with the sequential CPU implementation across a range of image sizes.

Index Terms—Integral image, GPU, CPU, NVIDIA CUDA.

I. INTRODUCTION

The graphics processing unit (GPU) has become an integral part of today's mainstream computing systems. The modern GPU is not only a powerful graphics engine but also a highly parallel programmable processor featuring peak arithmetic and memory bandwidth that substantially outpaces its CPU counterpart.

The growing speed of GPUs expands what can be programmed on them, enabling the resolution of increasingly complex processing problems. This has positioned the GPU as an alternative to traditional microprocessors in high-performance computing systems.

CUDA (Compute Unified Device Architecture) is a general-purpose parallel computing architecture introduced by NVIDIA in November 2007 [1]. It includes a new programming model, a new architecture, and a new instruction set.

With the increasing power of computing machines, many researchers are working to improve computer vision algorithms. Among this research is the work of Viola and Jones, which introduced the integral image, a new representation of the image used to compute Haar descriptors rapidly.

In this work, in order to evaluate our parallel integral image implementation, we executed the algorithm on images of different sizes on both the CPU and the GPU.

We start with a short presentation of the NVIDIA GPU architecture and the CUDA model. Then we give details of the integral image. Finally, we conclude this work by presenting the implementation results.

II. NVIDIA GPU AND CUDA

In this section we present the basic concepts of the GPU architecture and of programming it with CUDA.

A. Basic concepts

CUDA (Compute Unified Device Architecture) is a parallel programming model that lets software harness the computing power of a graphics card.

Three key concepts form the basis of CUDA: a hierarchy of thread groups, shared memories, and barrier synchronization. These concepts give the programmer the ability to precisely manage the parallelism offered by the graphics card. A CUDA program can then run on any number of processors, without requiring the programmer to know the details of the graphics card architecture.

The C language extensions provided by CUDA allow the programmer to define functions called kernels, which are executed on the graphics card. Each kernel is executed N times in parallel by N different CUDA threads.

The thread index is a vector of three components: threads can be identified by a one-, two-, or three-dimensional index, thus forming one-, two-, or three-dimensional blocks. Threads of the same block can communicate with each other by sharing data through shared memory and can synchronize to coordinate their memory accesses.

Moreover, a kernel can be executed by multiple blocks of identical dimensions. In this case the total number of threads is the number of threads per block multiplied by the number of blocks. These blocks are arranged within a grid of at most three dimensions, as shown in Figure 1.


Fig. 1. Illustration of a grid.
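As a minimal sketch (an assumed example, not code from the paper), the following CUDA kernel illustrates these notions: the __global__ qualifier marks a kernel, the built-in variables blockIdx, blockDim, and threadIdx identify each thread within the grid, and the <<<grid, block>>> syntax launches the threads organized in two-dimensional blocks.

// Minimal illustrative CUDA kernel: each thread handles one pixel of a
// width x height grayscale image (assumed example, not the paper's code).
__global__ void copyPixels(const unsigned char *in, unsigned char *out,
                           int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;   // column index of this thread
    int y = blockIdx.y * blockDim.y + threadIdx.y;   // row index of this thread
    if (x < width && y < height)
        out[y * width + x] = in[y * width + x];
}

// Host-side launch: a 2-D grid of 16x16-thread blocks covering the image.
// dim3 block(16, 16);
// dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
// copyPixels<<<grid, block>>>(d_in, d_out, width, height);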

Threads have access to data located in separate memories. Each thread has its own memory, called local memory. Each block has a shared memory accessible only by the threads of that block. Finally, all threads have access to data in the global memory.

There are also two other memories, namely the constant memory and texture memory.

As shown in Figure 2, the threads run on a separate physical machine (the device), while the rest of the C code runs concurrently on the host. Both entities have their own DRAM (Dynamic Random Access Memory), called device memory and host memory respectively. A CUDA program must therefore manage the data so that they are copied to the device for processing before being copied back to the host.

Fig. 2. Exchange Host/Device.
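As a hedged sketch of this host/device exchange (an illustrative example using standard CUDA runtime calls, not code from the paper; h_img and h_result are assumed host buffers allocated elsewhere), the typical pattern is:

// Illustrative host/device round trip for a width x height 8-bit image.
size_t bytes = width * height * sizeof(unsigned char);
unsigned char *d_img;                                        // device-memory pointer
cudaMalloc((void **)&d_img, bytes);                          // allocate device memory
cudaMemcpy(d_img, h_img, bytes, cudaMemcpyHostToDevice);     // copy host -> device
// ... launch one or more kernels that process d_img ...
cudaMemcpy(h_result, d_img, bytes, cudaMemcpyDeviceToHost);  // copy device -> host
cudaFree(d_img);                                             // release device memory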

B. GPU Architecture

When a CUDA program invokes a kernel from the host, the blocks of the grid are enumerated and distributed to the multiprocessors that have sufficient resources to execute them. All threads of a block are executed simultaneously on the same multiprocessor. Once all threads of a block have completed their instructions, a new block is started on the multiprocessor.

To handle hundreds of threads, the multiprocessor employs an architecture called SIMT (Single-Instruction, Multiple-Thread). The SIMT unit of a multiprocessor creates, schedules, and executes threads in groups of 32, called warps. Threads within a warp start their execution at the same instruction in the program but are free to branch and execute independently. When a multiprocessor is given one or more blocks of threads to run, it divides them into warps whose execution is scheduled by the SIMT unit. The next instruction is issued to all active threads of the warp.

A warp executes one common instruction at a time, so performance is maximal when the 32 threads of a warp agree on the execution path to take.

As shown in Figure 3 below, the on-chip memory of each multiprocessor is organized as follows [2]:

• A bank of 32-bit registers per processor
• A memory block common to all processors (the shared memory)
• A read-only cache designed to accelerate access to data in constant memory (which is also read-only), shared between the processors
• A read-only cache designed to accelerate access to data in texture memory (which is also read-only), shared between the processors.

Fig. 3. SIMT multiprocessor with shared memory on board.


Note: the device's global and local memories are readable and writable, and there is no cache to speed up access to data residing in the global and local memories.

C. CUDA Programming

The stated goal of the CUDA programming interface is to enable a programmer familiar with the C language to start programming on graphics cards.

There are four key elements at the basis of the extensions to the C language:

• Qualifiers to define on which entity (host or device) a function must be executed
• Qualifiers to define the memory in which a variable resides
• A special syntax to define the grid configuration when a kernel is launched from the host
• Built-in variables that refer to the indices and dimensions of the grid

These four elements are illustrated in the short sketch below. Each source file using one of these extensions must be compiled with the nvcc compiler.
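As a hedged illustration (an assumed example, not code from the paper), the four extensions appear as follows in a kernel that stages data through shared memory:

// Element 1: __global__ states that the function runs on the device
// and is callable from the host.
__global__ void copyThroughShared(const float *in, float *out, int n)
{
    // Element 2: __shared__ places the variable in the block's shared memory.
    __shared__ float buffer[256];

    // Element 4: threadIdx, blockIdx, and blockDim are built-in index variables.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    buffer[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();                 // barrier between the threads of the block
    if (i < n)
        out[i] = buffer[threadIdx.x];
}

// Element 3: the <<<grid, block>>> execution configuration, used from the host:
// copyThroughShared<<<(n + 255) / 256, 256>>>(d_in, d_out, n);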

Using graphics hardware for image processing and computer vision is a new research area. Image processing studies digital images in order to extract information and interpret their content. The method presented here is used to calculate the integral image.

III. INTEGRAL IMAGE

The integral image was proposed by Paul Viola and Michael Jones in 2001 and has been used for real-time object detection [3]. The integral image is constructed from an original grayscale image.

In [4], the integral image was extended; it is used to compute Haar-like features and center-surround characteristics.

In the SURF algorithm [5], [6], the integral image accelerates the calculation of first- and second-order Gaussian derivatives.

The CenSurE algorithm [7] and its improved version SUSurE [8] use two-level filters to approximate the Laplacian operator by means of the integral image.

Fig. 4. A schematic of the integral image.

An integral image is an intermediate representation of the input image. The value of the integral image at point (x, y) is equal to the sum of all pixels above and to the left of (x, y), as shown in Figure 4 and according to the formula below:

II(x, y) = \sum_{x' \le x,\; y' \le y} I(x', y')    (1)

Thus the sum of a rectangular region can be evaluated from four references to the integral image, which is calculated in a single scan of the original image [3] (Figure 5).
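As a hedged sketch of this single-scan construction (an assumed sequential reference implementation, not necessarily the authors' CPU code), the integral image can be built with a running row sum combined with the value from the previous row:

// Sequential integral image: ii(x, y) = sum of img over [0..x] x [0..y].
// rowSum accumulates the current row; adding the value of ii from the row
// above completes the two-dimensional prefix sum in a single pass.
void integralImageCPU(const unsigned char *img, unsigned int *ii,
                      int width, int height)
{
    for (int y = 0; y < height; ++y) {
        unsigned int rowSum = 0;                       // cumulative sum along row y
        for (int x = 0; x < width; ++x) {
            rowSum += img[y * width + x];
            ii[y * width + x] = rowSum +
                (y > 0 ? ii[(y - 1) * width + x] : 0); // add ii(x, y-1)
        }
    }
}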

Fig. 5. Calculating the sum of rectangle D with the integral image [7].

The sum of a two-rectangle descriptor is therefore calculated from the difference between two adjacent rectangles using 6 references to the integral image. Similarly, 8 references are needed for a three-rectangle descriptor, and 9 for a four-rectangle descriptor [9, 10, 11, 12].
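As a small illustrative helper (assumed, not taken from the paper), the sum over a rectangle with top-left corner (x1, y1) and bottom-right corner (x2, y2) follows directly from four lookups into the integral image:

// Sum of pixels in the inclusive rectangle [x1..x2] x [y1..y2],
// evaluated from four references to the integral image ii.
unsigned int rectSum(const unsigned int *ii, int width,
                     int x1, int y1, int x2, int y2)
{
    unsigned int A = (x1 > 0 && y1 > 0) ? ii[(y1 - 1) * width + (x1 - 1)] : 0;
    unsigned int B = (y1 > 0) ? ii[(y1 - 1) * width + x2] : 0;
    unsigned int C = (x1 > 0) ? ii[y2 * width + (x1 - 1)] : 0;
    unsigned int D = ii[y2 * width + x2];
    return D - B - C + A;            // inclusion-exclusion on the four corners
}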

IV. EXPERIMENTAL RESULTS

In this section, we present our algorithm to calculate the integral image. The implementation steps are shown in Figure 6:

Fig. 6. Algorithm as implemented on hardware: on the CPU, read the input image and serialize its data; transfer the data from CPU to GPU; calculate the integral image on the GPU; transfer the result from GPU back to CPU; and display the results.
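The paper does not list the kernels themselves, so the following is only an assumed sketch of one common way to parallelize this step: a first kernel computes a prefix sum along each row (one thread per row) and a second kernel accumulates along each column (one thread per column), together producing the integral image entirely in device memory.

// Assumed illustrative kernels (not the authors' implementation):
// a naive row-wise scan followed by a column-wise scan.
__global__ void scanRows(unsigned int *ii, const unsigned char *img,
                         int width, int height)
{
    int y = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per row
    if (y >= height) return;
    unsigned int sum = 0;
    for (int x = 0; x < width; ++x) {
        sum += img[y * width + x];
        ii[y * width + x] = sum;                     // horizontal prefix sum
    }
}

__global__ void scanCols(unsigned int *ii, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per column
    if (x >= width) return;
    unsigned int sum = 0;
    for (int y = 0; y < height; ++y) {
        sum += ii[y * width + x];
        ii[y * width + x] = sum;                     // vertical prefix sum
    }
}

// Host-side launches, after the image has been copied to d_img:
// scanRows<<<(height + 255) / 256, 256>>>(d_ii, d_img, width, height);
// scanCols<<<(width  + 255) / 256, 256>>>(d_ii, width, height);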


We tested our parallel algorithm on the Windows platform. The software environment includes Microsoft Visual Studio 2008 and CUDA 2.3. The hardware environment is a consumer-level PC with an Intel Core i5 M560 2.6 GHz CPU, 4 GB of RAM, and an NVIDIA GeForce 310M video card.

In order to evaluate our parallel integral image implementation, we executed our algorithm for different image sizes on both the CPU and the GPU (Table I).

TABLE I. RUNTIME COMPARISON BETWEEN THE CPU-BASED AND GPU-BASED ALGORITHMS

Image Size    CPU time (ms)    GPU time (ms)    Speedup
128*128       0.0099           0.0041           2.41
256*256       0.0129           0.0049           2.63
512*512       0.0178           0.0054           3.29

As can be seen from Table I, compared to the corresponding CPU-based serial algorithm, our algorithm achieves a clear reduction in runtime. As the image size used in the experiment increases, the speedup also increases. However, due to the memory limitations of the video chip, the image size processed at one time cannot be increased without bound.

Figure 7 gives the runtime comparison for the various kernels implemented on the GPU. The kernels are compared relative to each other, each kernel being executed once on the GPU.

Fig. 7. GPU Kernels runtime comparison

V. CONCLUSION

In this paper, a parallel integral image algorithm is presented, implemented on the GPU, and compared with the sequential CPU-based implementation. Performance results indicate that a significant speedup can be achieved: the integral image computation reaches a speedup of up to 3× compared to the CPU-based implementation. The GPU thus provides an efficient acceleration technique for image processing at a low hardware cost.

REFERENCES

[1] NVIDIA Corporation, "NVIDIA CUDA Compute Unified Device Architecture Programming Guide," Version 1.1, 2007.

[2] J. D. Owens, M. Houston, D. Luebke, et al., "GPU Computing," Proceedings of the IEEE, vol. 96, no. 5, May 2008.

[3] P. Viola, M. Jones, "Rapid object detection using a boosted cascade of simple features," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2001.

[4] R. Lienhart, A. Kuranov, V. Pisarevsky, "Empirical analysis of detection cascades of boosted classifiers for rapid object detection," Pattern Recognition, vol. 2781, pp. 297-304, 2003.

[5] H. Bay, A. Ess, T. Tuytelaars, L. Van Gool, "Speeded-up robust features (SURF)," Computer Vision and Image Understanding, vol. 110, no. 3, pp. 346-359, 2008.

[6] H. Bay, T. Tuytelaars, L. Van Gool, "SURF: Speeded up robust features," Proceedings of the European Conference on Computer Vision, Springer LNCS vol. 3951, part 1, pp. 404-417, 2006.

[7] M. Agrawal, K. Konolige, M. R. Blas, "CenSurE: Center surround extremas for realtime feature detection and matching," in D. A. Forsyth, P. H. S. Torr, and A. Zisserman, editors, ECCV (4), vol. 5305 of Lecture Notes in Computer Science, pp. 102-115, Springer, 2008.

[8] M. Ebrahimi, W. W. Mayol-Cuevas, "SUSurE: Speeded Up Surround Extrema feature detector and descriptor for realtime applications," IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 9-14, 2009.

[9] P. A. Negri, "Détection et Reconnaissance d'objets structurés : Application aux Transports Intelligents," Ph.D. thesis, defended in September 2008, University Pierre and Marie Curie - Paris VI, Institute of Intelligent Systems and Robotics.

[10] P. Lemaire, "Etude de la pertinence topologique des descripteurs d'images utilisés dans les algorithmes de détection de visages par apprentissage," Master's report, 2008.

[11] S. Zhao, "Apprentissage et Recherche par le Contenu Visuel de Catégories Sémantiques d'Objets Vidéo," Master's thesis, defended in July 2007, University Paris Descartes, Laboratory of Image and Signal Processing, CNRS, France.

[12] P. A. Negri, L. Prévost, X. Clady, "Cascade generative and discriminative classifiers for vehicle detection," 16th French Congress on Pattern Recognition and Artificial Intelligence (AFRIF-AFIA), Amiens, France, January 22-25, 2008.
