
SIFT on GPU

Changchang Wu
University of North Carolina

[email protected]

Abstract

Lowe’s SIFT (Scale Invariant Feature Transform) [5] detects similarity-invariant features in the Gaussian scale space of images, and it has been applied successfully in many computer vision problems [3, 7]. By exploiting the data-parallel computing capability of the GPU, the scale invariant feature transform can run much faster on the GPU than on the CPU. This report discusses the implementation details of the different stages of SIFT and shows their results.

1 Introduction

The parallel computing nature and the programmable pipeline make the GPU a powerful tool for data-parallel computation problems, and it has been widely used for general-purpose computation [1]. This project utilizes the computing power of the GPU to run the scale invariant feature transform quickly.

SIFT detects the local maxima and minima of the difference of Gaussian (DOG) in the Gaussian scale space. Local dominant gradient orientations are then computed for each feature point, and sub-pixel localization is applied. Descriptors are then generated from the scale- and orientation-normalized image patch of each feature.

The first part, scale space computation, can be cast as a pixel-parallel computation: Gaussian filters are run on the input images to get each pixel of the new filtered images, and the GPU can use a fragment shader to compute multiple pixels simultaneously. The second part (localization, orientation computation, and descriptor generation) can be seen as a feature-parallel computation: each feature can also be mapped to a pixel to run in parallel on the GPU.

The rest of this report is organized as follows: Section 2 discusses some existing SIFT implementations and their features. Section 3 then explains the implementation details of the different stages of SIFT. Conclusions and future work are given at the end.

2 Related SIFT Implementations

Besides the original binary given by Lowe, there have been several existing CPU and GPU implementations. The original version, dated 2005, works well, but it is only a binary without much that you can change. With the strong interest in SIFT from many researchers, several SIFT implementations have since appeared in different programming languages, including C#, Matlab, and C++.

Sift++ [8], a nice C++ version developed by Andrea Vedaldi, gives users a lot of flexibility. With this implementation it is easy to change many parameters of SIFT, for example the number of octaves, the number of DOG levels, the edge threshold, etc. This kind of flexibility is also a goal of our GPU implementation.

Sudipta Sinha was the first to implement SIFT on the GPU [6]. This version used OpenGL with Cg as the shader language and achieved a high speedup over the CPU. Due to hardware limitations and OpenGL limitations at that time, several important steps did not run on the GPU, and the data transfers between GPU and CPU took a fair portion of the time.

Recently, [4] demonstrated another SIFT implementation on the GPU. This version has a smart idea for achieving high performance in scale space generation: packing 2x2 squares of pixels into a single texture pixel to reduce the number of texture fetches. However, it is not clear how the results are finally downloaded from the GPU, and allocating a 128D vector for every pixel instead of every feature is a waste of memory and bandwidth. The paper also does not claim support for multiple orientations of one keypoint.

The goal of this project is to combine the flexibility and generality of Sift++ with a free, open-source library of SIFT on the GPU. Unlike the Sift++ library, both of the above GPU versions are commercial and cannot be redistributed, and it is also unclear how flexible they are.


3 SIFT on GPU

This section discusses the details of implementing SIFT using GPU shaders. First the overall design of the library is discussed, and then the details of each stage of SIFT.

3.1 Shader Language

Here the traditional GPU shaders are chosen as the implementation tool instead of CUDA, considering that images map easily to textures on the GPU. Initially this work used GLSL (OpenGL Shading Language); a Cg version of the shaders was developed later, and a parameter is provided to switch between them. Both versions work on both nVidia and ATI graphics cards.

3.2 Storage Design

The level images of the scale space are stored intuitively as pixel-by-pixel mapped textures. As shown in Fig. 1, the four color channels RGBA are used to store the intensity, the difference of Gaussian, the gradient magnitude, and the gradient orientation, respectively.

Figure 1. Storage of scale space as texture (R: Intensity, G: DOG, B: Gradient Magnitude, A: Gradient Orientation).
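In the actual implementation the gradient channels are filled during the first pass of keypoint detection (Section 3.4); purely as a sketch of this RGBA layout, a GLSL fragment along the following lines (the sampler name tex and the texel-size uniform are assumptions, not names from the original code) could read the intensity and DOG and fill in the gradient channels:

    // Sketch only: fills the B and A channels of the Fig. 1 layout from
    // central differences of the intensity stored in the R channel.
    uniform sampler2D tex;   // assumed: current level, (intensity, DOG, -, -)
    uniform vec2 texel;      // assumed: 1.0 / texture dimensions

    void main()
    {
        vec2 uv  = gl_TexCoord[0].st;
        vec2 cur = texture2D(tex, uv).rg;                  // intensity, DOG
        float dx = texture2D(tex, uv + vec2(texel.x, 0.0)).r
                 - texture2D(tex, uv - vec2(texel.x, 0.0)).r;
        float dy = texture2D(tex, uv + vec2(0.0, texel.y)).r
                 - texture2D(tex, uv - vec2(0.0, texel.y)).r;
        float mag = 0.5 * sqrt(dx * dx + dy * dy);         // gradient magnitude
        float ori = atan(dy, dx);                          // gradient orientation
        gl_FragColor = vec4(cur, mag, ori);                // R, G, B, A of Fig. 1
    }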

To save memory, feature lists are used in this implementation. The feature lists are also stored as textures, as shown in Fig. 2. Features of different levels are stored separately, so the scale information does not need to be stored. After the first stage of feature detection, a feature list texture that saves each feature's location and orientation count is used, and the feature list texture is then reshaped to make a list of features with separate orientations. A point worth mentioning is that all the feature list generation and reshaping is implemented on the GPU, which differs from Sinha's download/upload approach.

Descriptor generation is currently not finished, but I am planning to use the method in [4]. In this method two textures are used to store the descriptor data, and 32 pixels (16 pixels from each texture) are used to save the 128D (32*4 = 128) feature vector.

3.3 Scale Space Generation

Similar to Sinha's work, separable Gaussian filtering is used to filter horizontally and vertically in separate passes.

Figure 2. Storage of feature list as textures (left: one pixel per feature storing X, Y, and the number of orientations, plus up to 4 orientations; right: the reshaped list with one pixel per feature-orientation pair storing X, Y, and a single orientation).

This is necessary for good performance, because the Gaussian kernel needs to be very large for a large σ. 6σ is used as the filter width; when the number of DOG levels is 3, the largest Gaussian filter σ is 3.0, which requires a 19x19 kernel. Using separable filters saves a lot of texture fetches and also reduces the shader code size.

Gaussian filter shaders are generated on the fly according to the parameters the user inputs, each with a different size and kernel. The multiple texture coordinate feature of OpenGL is used, and when the number of coordinates exceeds 8, the extra ones are computed in the shaders.
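For illustration, a generated horizontal pass for a small kernel could look like the sketch below; the 5-tap binomial weights and the uniform names are made up for the example, while the real shaders bake in the kernel size and Gaussian weights implied by the user's parameters:

    // Sketch of a generated horizontal Gaussian pass (5 taps shown).
    uniform sampler2D tex;   // assumed: source level texture
    uniform float texelX;    // assumed: 1.0 / texture width

    void main()
    {
        vec2 uv = gl_TexCoord[0].st;
        // Normalized 1D weights, baked in at shader generation time: (1,4,6,4,1)/16.
        float w0 = 0.375, w1 = 0.25, w2 = 0.0625;
        float sum = w0 * texture2D(tex, uv).r
                  + w1 * texture2D(tex, uv + vec2(texelX, 0.0)).r
                  + w1 * texture2D(tex, uv - vec2(texelX, 0.0)).r
                  + w2 * texture2D(tex, uv + vec2(2.0 * texelX, 0.0)).r
                  + w2 * texture2D(tex, uv - vec2(2.0 * texelX, 0.0)).r;
        gl_FragColor = vec4(sum);  // temporary result; vertical pass and DOG follow (Fig. 3)
    }

A vertical pass of the same form completes the 2D filter; separability turns the 19x19 case above into two passes of 19 fetches each instead of 361 fetches per pixel.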

Figure 3. Two passes of the Gaussian filter using the destination texture: horizontal filtering writes a temporary channel next to the input intensity (Intensity1, Temp); the second pass performs the vertical filtering, copies the result, and subtracts to produce the new intensity and the DOG in the output texture (Intensity2, DOG, Temp).

Fig. 3 demonstrates the two-pass Gaussian filter. By carefully writing the temporary intensity back to the original color channel, the second pass can read from and write to the same texture. My experiments show that reading and writing the same texture is faster than ping-pong rendering [2]; an explanation is that ping-pong requires more switching of texture caches. The difference of Gaussian is also computed in this second pass.

After one octave is computed, sub-sampling is used to get the first several level images of the next octave. For example, when the level range is from −1 to s + 2, the scale doubles every s levels: levels s − 1, s, and s + 1 of one octave are one full doubling above levels −1, 0, and 1, so the highest levels of an octave can be sub-sampled to generate the first 3 levels of the next octave. One restriction is that the filter size cannot be truncated for the higher levels; otherwise the Gaussian will be too inaccurate for sub-sampling. Since more than one level is obtained by sub-sampling, both the intensity and the DOG can be generated this way, which saves some time on filtering. This trick has not been seen in other SIFT implementations.
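To make the bookkeeping concrete (this assumes the standard SIFT scale convention, which the parameters above imply rather than state explicitly): level i of octave o has

    σ(o, i) = σ0 * 2^(o + i/s),

so level i + s has exactly twice the σ of level i. Sub-sampling an image by 2 halves its σ measured in pixels; hence levels s − 1, s, and s + 1 of one octave, sub-sampled by 2, have the same per-pixel σ as levels −1, 0, and 1 of the next octave. For s = 3, levels 2, 3, and 4 become the new levels −1, 0, and 1.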

Figure 4. Gaussian scale space pyramid and DOG pyramid (absolute values).

Fig. 4 shows the Gaussian scale space and the absolute values of the DOG. Images with the same dimensions are the different levels of the same octave.

3.4 Keypoint Detection

Keypoint detection needs to compare the DOG value of a pixel with its 26 neighbours in the scale space. To save texture fetches, this step is split into intra-level suppression and inter-level suppression. As shown in Fig. 5, the first pass compares the DOG value of a pixel with its 8 neighbours in the same level, and saves whether the point is a local minimum or a local maximum to an auxiliary texture. The maximum and minimum of the 9 pixels are also stored in the auxiliary texture. Gradient magnitude and orientation are computed in this pass as well, and edge elimination is applied here to delete the features that lie on edges.

In the second pass, early-z culling is first applied to exclude the pixels already filtered out in the first pass; then each pixel is compared with the maximum or minimum values of its 2 neighbouring levels in the scale space. A point is the maximum of the 3x3x3 cube only when it is identified as an intra-level local maximum and it is larger than the maximum values of its two neighbouring levels. The same applies to minima.

Figure 5. Keypoint detection: intra-level suppression with the 8 in-level neighbours computes the gradient and writes (IsKey, DOG, Maximum, Minimum) from the scale space texture to an auxiliary texture; inter-level suppression with the 2 neighbouring levels then runs on the auxiliary texture under early-z.
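A minimal GLSL sketch of the intra-level pass follows (the texture names are assumptions; the edge elimination and the gradient write-back of the real pass are left out):

    // Sketch: compare the DOG (G channel) of each pixel against its 8
    // in-level neighbours and record the extremum status together with
    // the 3x3 maximum and minimum, as in the auxiliary texture of Fig. 5.
    uniform sampler2D tex;   // assumed: scale space texture, DOG in .g
    uniform vec2 texel;      // assumed: 1.0 / texture dimensions

    void main()
    {
        vec2 uv  = gl_TexCoord[0].st;
        float c  = texture2D(tex, uv).g;
        float hi = -1e10, lo = 1e10;    // extremes over the 8 neighbours
        for (int j = -1; j <= 1; ++j)
            for (int i = -1; i <= 1; ++i)
                if (i != 0 || j != 0) {
                    float v = texture2D(tex, uv + vec2(float(i), float(j)) * texel).g;
                    hi = max(hi, v);
                    lo = min(lo, v);
                }
        // IsKey: 1 for an intra-level maximum, -1 for a minimum, 0 otherwise.
        float isKey = (c > hi) ? 1.0 : ((c < lo) ? -1.0 : 0.0);
        gl_FragColor = vec4(isKey, c, max(hi, c), min(lo, c));
    }

The inter-level pass then needs only 2 fetches per surviving pixel: the stored maximum (or minimum) of each of the two neighbouring levels.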

3.5 Feature List Generation

The method in [9] is used here to generate the feature lists on the GPU. Our implementation uses all 4 color channels to build the histogram pyramid, which can be seen as pointer textures, and the feature list is generated by traversing down the histogram pyramid. For every image, only one pixel at the top of the histogram pyramid needs to be read back, and the number of features is the sum of its four channels. This method avoids the read-back of whole textures, and also avoids uploading the feature list. The left part of Fig. 2 shows the layout of the final results.
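As a sketch of one reduction pass of the histogram pyramid build-up of [9] (names assumed; the top-down traversal that actually emits the list is omitted):

    // Sketch: each output pixel gathers the four children of the finer
    // level and stores one child's total count per color channel, so the
    // texture acts as a pointer for the later top-down traversal.
    uniform sampler2D tex;   // assumed: finer pyramid level
    uniform vec2 texel;      // assumed: 1.0 / finer level dimensions

    void main()
    {
        // Output pixel (x, y) covers the 2x2 block at (2x, 2y) below.
        vec2 uv = (floor(gl_FragCoord.xy) * 2.0 + 0.5) * texel;
        vec4 a = texture2D(tex, uv);
        vec4 b = texture2D(tex, uv + vec2(texel.x, 0.0));
        vec4 c = texture2D(tex, uv + vec2(0.0, texel.y));
        vec4 d = texture2D(tex, uv + texel);
        vec4 ones = vec4(1.0);
        gl_FragColor = vec4(dot(a, ones), dot(b, ones),
                            dot(c, ones), dot(d, ones));
    }

At the top of the pyramid, the four channels of the single remaining pixel sum to the total feature count, which is the only value read back.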

3.6 Orientation Computation

This step computes the orientation candidates for each feature. It first obtains a weighted orientation histogram over a circular window of radius 3σ, then applies smoothing to the histogram, and finally outputs the angles whose votes are larger than 0.8 times the maximum. The 36 angles of the orientation histogram are implemented as 9 float4/vec4 values. Since GPU arrays do not support dynamic indexing, a binary search is used to locate the expected 4-angle bin. This bin is then incremented with a voting vector as follows: bin += weight * float4( fmod(idx,4) == float4(0,1,2,3) ).
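A GLSL sketch of this voting step (with assumed names): the 36 bins live in nine vec4 variables, and since arrays cannot be indexed dynamically, nested comparisons on the bin index play the role of the binary search:

    // Sketch: add one weighted vote to the 36-bin histogram held in
    // nine vec4 variables (assumed to be zero-initialized in main).
    vec4 h0, h1, h2, h3, h4, h5, h6, h7, h8;

    void vote(float ori, float weight)   // ori in (-pi, pi]
    {
        float idx = min(floor(36.0 * (ori + 3.14159265) / 6.28318531), 35.0);
        // Voting vector: weight in the lane matching idx mod 4.
        vec4 v = weight * vec4(equal(vec4(mod(idx, 4.0)), vec4(0.0, 1.0, 2.0, 3.0)));
        float q = floor(idx / 4.0);      // which vec4 receives the vote
        if (q < 4.0) {
            if (q < 2.0) { if (q < 1.0) h0 += v; else h1 += v; }
            else         { if (q < 3.0) h2 += v; else h3 += v; }
        } else {
            if (q < 6.0) { if (q < 5.0) h4 += v; else h5 += v; }
            else         { if (q < 7.0) h6 += v; else if (q < 8.0) h7 += v; else h8 += v; }
        }
    }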


With this kind of 4-angle bins, smoothing can easily be applied with a larger window. The smoothing in sift++ runs a (1, 1, 1)/3 filter 6 times; since convolving (1, 1, 1)/3 with itself three times yields (1, 3, 6, 7, 6, 3, 1)/27, and four values are stored in one bin anyway, the same result can be obtained by running the (1, 3, 6, 7, 6, 3, 1)/27 filter twice.

Finally, the orientations are written to the orientation texture, and the numbers of orientations are written to the original feature texture. The point list generation method from the feature list generation step is then used again to reshape the feature list; this step is shown in Fig. 2. Instead of assigning different point locations in the last step, different feature orientations are assigned to the different feature candidates.

3.7 Descriptor Generation

Descriptor generation is currently not finished.

3.8 Display List Generation

This section shows that the display list can also be generated on the GPU without reading back the features. SIFT features are displayed here as scaled and rotated squares. A texture with 4 times the space is allocated for saving the output vertices, and a shader computes the feature index of each point as well as its corner sub-index within the rectangle. The point can then be rotated and translated according to the feature orientation and scale. Fig. 6 shows this vertex generation. The vertex results can then be copied to a Vertex Buffer Object to display the SIFT features. Fig. 7 shows an example of the result.
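A sketch of the vertex generation pass (in GLSL with assumed names; for the sketch, the feature texture is assumed to hold (x, y, scale, orientation) per pixel, and the output texture is 4 times as wide, one square corner per output pixel):

    // Sketch: output pixel x = 4*i + c holds corner c of feature i's
    // square, rotated by the feature orientation and scaled by its scale.
    uniform sampler2D featureTex;  // assumed: (x, y, scale, orientation)
    uniform float featureWidth;    // assumed: width of the feature texture

    void main()
    {
        float x      = floor(gl_FragCoord.x);
        float corner = mod(x, 4.0);                       // corner sub-index 0..3
        vec2 fuv     = vec2((floor(x / 4.0) + 0.5) / featureWidth,
                            gl_TexCoord[0].t);            // assumed: same row as output
        vec4 f       = texture2D(featureTex, fuv);

        // Corners of the unit square, visited in quad order.
        vec2 d = vec2((corner == 1.0 || corner == 2.0) ? 1.0 : -1.0,
                      (corner >= 2.0) ? 1.0 : -1.0);
        float cs = cos(f.w), sn = sin(f.w);
        vec2 p = f.xy + f.z * vec2(cs * d.x - sn * d.y,
                                   sn * d.x + cs * d.y);
        gl_FragColor = vec4(p, 0.0, 1.0);                 // one display vertex
    }

The 4 resulting vertices per feature can then be drawn as a quad or line loop once the texture has been copied into the Vertex Buffer Object.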

Figure 6. Display vertex generation.

4 Result

The current implementation runs at about 6-7 Hz on an nVidia 7900 GTX on average. Due to the limited time, only one sample result from the nVidia 7900 GTX is presented here. The size of the test image is 800x600. In the first round, it spends 0.157 seconds on pyramid generation, 0.265 seconds on the first pass of keypoint detection, 0.031 seconds on the second pass, 0.094 seconds on generating the feature list on the GPU, 0.266 seconds on orientation computation, and 0.031 seconds on feature list reshaping. The first run of SIFT is slower because of start-up overhead, but the timing becomes stable from the second round on. After the first round, the subsequent processing can finish all the stages in about 0.157 seconds.

Figure 7. Keypoint detection result (the shown rectangle size is 2σ instead of 6σ).

This version of SIFT can also handle very large images: it normally takes 0.9 seconds on a 1600x1200 image, and 2.8 seconds on a 2048x1365 image.

5 Conclusions and Future Work

The project has currently finished scale space generation, keypoint detection, feature list generation, orientation computation, and visualization on the GPU. The implementation also provides the flexibility of changing many parameters by generating shaders on the fly. Almost all the parameters in sift++ are ported.

In the following days, I will first finish the descriptor generation and sub-pixel localization, and then SIFT matching on the GPU. The packed image format of [4] may also be worth a try. The code also needs to be optimized to make it a good library.

6 Acknowledgements

I thank Sudipta Sinha for many helpful discussions before I started this project and during the development. Thanks also to Florian Erik Muecke for giving me some helpful tips and sharing his work.

References

[1] http://gpgpu.org/.

[2] http://www.gpgpu.org/w/index.php/glossary.

[3] M. Brown and D. G. Lowe. Recognising panoramas. In International Conference on Computer Vision, pages 1218-1225, 2003.

[4] S. Heymann, K. Müller, A. Smolic, B. Froehlich, and T. Wiegand. SIFT implementation and optimization for general-purpose GPU. In 15th International Conference in Central Europe on Computer Graphics, Visualization and Computer Vision (WSCG'07), January 2007.

[5] D. G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60, November 2004. http://www.cs.ubc.ca/~lowe/keypoints/.

[6] S. N. Sinha, J.-M. Frahm, M. Pollefeys, and Y. Genc. Feature tracking and matching in video using programmable graphics hardware. Machine Vision and Applications, March 2007.

[7] N. Snavely, S. M. Seitz, and R. Szeliski. Photo tourism: exploring photo collections in 3D. In SIGGRAPH '06: ACM SIGGRAPH 2006 Papers, pages 835-846, 2006.

[8] A. Vedaldi. sift++. http://vision.ucla.edu/~vedaldi/code/siftpp/siftpp.html.

[9] G. Ziegler, A. Tevs, C. Theobalt, and H.-P. Seidel. GPU point list generation through histogram pyramids. Technical report, June 2006.
