
Comput Visual Sci (2011) 14:181–186
DOI 10.1007/s00791-012-0171-2

Inertia based filtering of high resolution images using a GPU cluster

Daniel Jungblut · Gillian Queisser · Gabriel Wittum

Received: 6 October 2010 / Accepted: 5 October 2011 / Published online: 26 January 2012
© Springer-Verlag 2012

Abstract The scheme of inertia based anisotropic diffusion is a very powerful noise reducing and structure preserving image processing operator. This paper presents an implementation of this time consuming filter process on a cluster of Nvidia Tesla high performance computing processors, which can be applied to very large amounts of data in only a few minutes. Applying the inertia based diffusion filter to high resolution image stacks of neuron cells provides fully automatic geometric reconstructions of these images on a scale of <1 µm. Such a high throughput and automatic image processing tool has great impact on various research areas, in particular the fast growing field of computational neuroscience, where one encounters increasing amounts of microscopy data that need to be processed.

Keywords Inertia based · Anisotropic · Diffusion · Filtering · Reconstruction · High resolution images · GPU · CUDA

1 Introduction

Linking the different scientific disciplines of biology, neuroscience, mathematics and scientific computing is a tough challenge. Realistic 3D-morphologies connect biological data with scientific computing and therefore form a convenient bridge between these disciplines. The automatic neuron reconstruction algorithm NeuRA [2] is a good approach to achieve such a connection. In order to apply the powerful reconstruction algorithm to high resolution microscopy images, a fast implementation of this software is required. The scheme of inertia based anisotropic diffusion filtering (see Sect. 2 for details), introduced as nonlinear anisotropic diffusion in [3], is successfully used as a preprocessing step in the NeuRA algorithm for reconstructing the morphology of neuron cells [3] and nuclei from hippocampal neurons [19] from 2-photon or confocal microscopy images. Among other methods, such as median filtering [8], wavelet methods [24], nonseparable filterbanks [20] or multiscale enhancement techniques [6], inertia based anisotropic filtering massively reduces the noise of these images while preserving existing structures [3,10,11,19]. The inertia based structure detection also turned out to be more powerful than the anisotropic diffusion filtering schemes of Perona-Malik [18] and Weickert [23], as shown in [19]. However, the filtering process is the most time consuming step of NeuRA, as described in [3] and [19]. Using highly parallelized graphics hardware enables the filter to operate on huge data sets in a reasonable time. State of the art Nvidia Tesla graphics processing units (GPUs) achieve a peak performance of one Teraflop in single precision calculations [16], which is sufficiently accurate for image processing purposes like inertia based anisotropic diffusion filtering. The Nvidia Tesla processors, as well as other Nvidia GPUs, can be programmed using the Nvidia CUDA (Compute Unified Device Architecture) technology [14]. Dividing large images into small subcubes and distributing them among many Tesla processors enables the filter to process images of almost arbitrary size within a few minutes (Sect. 3). Large microscopy data sets imply a very high resolution of the recorded image, which requires an optimization of the structure detection input parameter of the filter, discussed in Sect. 4. The concluding discussion gives an outlook on how the other steps (segmentation and geometry generation) of the fully automatic morphology reconstruction algorithm NeuRA can be redesigned to run on a Tesla-based cluster to perform those reconstructions of high resolution microscopy data. Compared to semi-automatic reconstruction algorithms [21], this is another great advantage of NeuRA.

Alongside other high performance computing solutions like AMD’s Firestream [1], BrookGPU [4], Intel’s Ct [5] and the upcoming hardware independent standard OpenCL [13], the Nvidia CUDA technology is one possibility to use the massive power of current graphics processing units (GPUs). For this purpose, Nvidia offers a complete toolkit, containing a C compiler for Nvidia GPUs together with a runtime driver, a runtime library, a debugger and, amongst others, the higher-level libraries cuBLAS and cuFFT. Even though more advanced GPUs have become available in the meantime, the Tesla C1060 [16] technology was used for the research presented in this paper.

Communicated by Martin Rumpf.

D. Jungblut (B) · G. Queisser · G. Wittum
Goethe-Center for Scientific Computing (G-CSC), Goethe-University, Kettenhofweg 139, 60325 Frankfurt am Main, Germany
e-mail: [email protected]

G. Queisser
e-mail: [email protected]

G. Wittum
e-mail: [email protected]

2 Inertia based anisotropic diffusion filtering

As described in [3,10,19], the inertia based anisotropic diffusion filter allows diffusion along structures, but prohibits diffusion perpendicular to them. The filter enhances the connectivity of vital structures, while preserving parameters such as size or diameter. The filter scheme is defined by the initial boundary value problem

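$$\partial_t u = \nabla \cdot \left( D(u) \nabla u \right) \quad \text{on } \mathbb{R}^+ \times \Omega \tag{1}$$

$$u(x, 0) = u_0 \quad \text{on } \bar{\Omega} \tag{2}$$

$$\left( D(u) \nabla u \right) \cdot n = 0 \quad \text{on } \mathbb{R}^+ \times \partial\Omega \tag{3}$$

where the spatial domain Ω is an open, bounded subset of R^3 and u denotes the continuous image function defined on Ω.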

2.1 Structure detection

The diffusivity tensor D(u) in voxel v is calculated by using the physical moments of inertia for a discrete mass distribution [22]

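$$D_I := \frac{1}{M} \sum_i \left( u(i) \begin{pmatrix} y_i^2 + z_i^2 & -x_i y_i & -x_i z_i \\ -y_i x_i & x_i^2 + z_i^2 & -y_i z_i \\ -z_i x_i & -z_i y_i & x_i^2 + y_i^2 \end{pmatrix} \right) \tag{4}$$

with

$$M := \sum_i u(i) \tag{5}$$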

where u(i) denotes the gray value of the voxel v_i and the sum is taken over all voxels v_i with ‖v − v_i‖∞ ≤ ρ. The size ρ of the integration region has to be chosen carefully, since it has a significant impact on the quality of the filtered image (Sect. 4).
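The integration in Eqs. (4) and (5) can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the implementation used in the paper: the function name `inertia_tensor` and the nested-list image layout are hypothetical, the coordinates (x_i, y_i, z_i) are assumed to be offsets of v_i relative to the voxel v, and the integration cube is simply truncated at the image border.

```python
def inertia_tensor(image, center, rho):
    """Moment-of-inertia tensor D_I of Eq. (4) around the voxel `center`.

    image  : nested list of gray values, indexed as image[z][y][x]
    center : (x, y, z) index of the voxel v
    rho    : half-width of the cubic integration region (max-norm ball)
    Assumption: (x_i, y_i, z_i) are offsets of v_i relative to v.
    """
    cx, cy, cz = center
    nz, ny, nx = len(image), len(image[0]), len(image[0][0])
    t = [[0.0] * 3 for _ in range(3)]
    mass = 0.0  # M of Eq. (5)
    for z in range(max(0, cz - rho), min(nz, cz + rho + 1)):
        for y in range(max(0, cy - rho), min(ny, cy + rho + 1)):
            for x in range(max(0, cx - rho), min(nx, cx + rho + 1)):
                u = image[z][y][x]
                dx, dy, dz = x - cx, y - cy, z - cz
                mass += u
                t[0][0] += u * (dy * dy + dz * dz)
                t[1][1] += u * (dx * dx + dz * dz)
                t[2][2] += u * (dx * dx + dy * dy)
                t[0][1] -= u * dx * dy
                t[0][2] -= u * dx * dz
                t[1][2] -= u * dy * dz
    # The tensor is symmetric, so mirror the upper triangle.
    t[1][0], t[2][0], t[2][1] = t[0][1], t[0][2], t[1][2]
    if mass == 0.0:
        return t  # empty neighbourhood: return the zero tensor
    return [[value / mass for value in row] for row in t]


# Idealised, noise-free test image: a bright line along the x axis.
line_image = [[[0.0] * 5 for _ in range(5)] for _ in range(5)]
for x in range(5):
    line_image[2][2][x] = 1.0

D = inertia_tensor(line_image, center=(2, 2, 2), rho=2)
# D is diag(0, 2, 2): the smallest moment belongs to the x axis,
# i.e. to the direction of the line.
```

In this idealised example the moment about the line axis vanishes; with background noise present, all three eigenvalues are positive and the smallest one still marks the direction of the structure.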

2.2 Directing the diffusion

The moments of inertia D_I are real, symmetric and positive definite matrices. Hence, they have real positive eigenvalues λ1 ≤ λ2 ≤ λ3 and an orthonormal system of associated eigenvectors µ1, µ2, µ3. The diffusivity tensor D(u) is then defined as

$$D(u) = S \begin{pmatrix} \eta_1 & 0 & 0 \\ 0 & \eta_2 & 0 \\ 0 & 0 & \eta_3 \end{pmatrix} S^T \tag{6}$$

with the matrix

$$S = (\mu_1, \mu_2, \mu_3) \tag{7}$$

composed of the eigenvectors and

η1 = 1, η2 = η3 = ε (8)

for linear structures,

η1 = η2 = 1, η3 = ε (9)

for planar structures and

η1 = η2 = η3 = 1 (10)

for isotropic structures [3,10,19]. An approach for dynamically directing the diffusion can be found in [11].
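Because the eigenvectors are orthonormal, Eq. (6) can be expanded into a sum of rank-one projections, which makes the effect of the three cases explicit. The following lines are a reader's aid derived from Eqs. (6)–(10) alone; the symbol I for the 3 × 3 identity matrix is introduced here and does not appear in the original text:

```latex
\begin{align*}
D(u) &= \eta_1\,\mu_1\mu_1^T + \eta_2\,\mu_2\mu_2^T + \eta_3\,\mu_3\mu_3^T,
  \qquad \mu_1\mu_1^T + \mu_2\mu_2^T + \mu_3\mu_3^T = I \\
\text{linear structures (8):}\quad D(u) &= \varepsilon I + (1-\varepsilon)\,\mu_1\mu_1^T \\
\text{planar structures (9):}\quad D(u) &= I - (1-\varepsilon)\,\mu_3\mu_3^T \\
\text{isotropic structures (10):}\quad D(u) &= I
\end{align*}
```

In the linear case, diffusion acts with full strength only along µ1, the axis of smallest moment of inertia, and is scaled by ε in the two perpendicular directions. In the planar case, only the direction µ3 is scaled by ε.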

2.3 Solving the partial differential equation

As proposed in [3], the partial differential equation (1) is solved by a finite volume spatial discretization and a semi-implicit backward Euler time discretization. The time discretization yields two crucial parameters:

– τ: length of one timestep
– nt: number of timesteps
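For orientation, a semi-implicit backward Euler step is commonly written as shown below. This is the textbook form under stated assumptions, not a transcription of the scheme in [3]: L(u^n) is a hypothetical symbol for the finite volume discretization of the operator ∇ · (D(u^n)∇ ·) on a grid with unit voxel volume, and the diffusivity tensor is evaluated at the old time level, so that every timestep requires the solution of one linear system:

```latex
\begin{align*}
\frac{u^{n+1} - u^{n}}{\tau} &= L(u^{n})\, u^{n+1},
  \qquad n = 0, \dots, n_t - 1 \\
\Longleftrightarrow \quad \bigl(I - \tau\, L(u^{n})\bigr)\, u^{n+1} &= u^{n}
\end{align*}
```

Read this way, each timestep has the shape of a linear system with system matrix I − τ L(u^n), unknown u^{n+1} and right hand side u^n.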

An optimal choice of these parameters for applying the filter to 2-photon or confocal microscopy images is τ = 1 and nt = 4, as discussed in [3,10,19]. The resulting sparse linear system of equations is solved by a highly parallelized Jacobi method [10]. Matrix A of the resulting linear system of equations contains 27 non-zero elements per row [3]. Together with one vector for the unknown x and one vector for the right hand side b, this amounts to 29 single precision values, i.e. 116 bytes, per voxel. Assuming one byte of storage per voxel, the amount of memory needed to store all this information is therefore at least 116 times larger than the size of the data being processed. Current 2-photon and confocal microscopy provides images of single neuron cells at a resolution of 2,048 × 2,048 × 368 voxels, stored in files with a size of about 1.5 GB, for which the filter would require almost 200 GB of memory. One possibility to process such large amounts of data is to assemble the matrix elements on the fly, whenever they are needed. However, this approach leads to an unjustifiable computational effort, because calculating the diffusivity tensors and assembling the matrix is very time consuming [3,10]. A better way to handle these huge data sets is to divide them into small subcubes. If these subcubes are chosen with an overlap of 16 voxels in every space direction, which exceeds the usually chosen structure detection parameter ρ (Sect. 4), and a linear interpolation is applied in the overlap regions, this approach does not affect the result of the image processing operator, since the filtering scheme only uses the local smoothing behaviour of the underlying heat equation. In addition, this method allows a highly parallelized implementation of the filter by distributing the subcubes among many devices.
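The memory estimate above can be reproduced with a short calculation. The sketch below is illustrative only; the function name `solver_memory_bytes` is hypothetical, and the count covers just the matrix rows and the two vectors, so it is a lower bound that ignores any further working storage.

```python
def solver_memory_bytes(nx, ny, nz, nonzeros_per_row=27, bytes_per_value=4):
    """Lower bound for the storage of A, x and b in single precision."""
    voxels = nx * ny * nz
    values_per_voxel = nonzeros_per_row + 2  # one matrix row, plus x and b
    return voxels * values_per_voxel * bytes_per_value


image_bytes = 2048 * 2048 * 368  # one byte per voxel
solver_bytes = solver_memory_bytes(2048, 2048, 368)
ratio = solver_bytes // image_bytes

# image_bytes  = 1,543,503,872   (about 1.5 GB)
# solver_bytes = 179,046,449,152 (about 179 GB, as a lower bound)
# ratio        = 116
```

The lower bound of roughly 179 GB is consistent with the figure of almost 200 GB quoted in the text.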

3 GPU implementation

The input image is divided into blocks of dimension 8 × 4 × 4, each of which is processed by one thread block. Every single thread operates on one voxel of the image. Calculating the diffusivity tensors and creating the system matrix can be done in parallel without interaction between the single threads. The original image data is accessed via the texture memory cache during tensor integration, speeding up the algorithm significantly. For the Jacobi solver, a careful use of the shared memory inside the thread blocks yields another remarkable speed-up. First, the runtimes of the CPU implementation and the GPU implementation are compared. Both implementations are highly optimized and are therefore much faster than the original NeuRA implementation [2].
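The mapping between voxels and threads described above can be modelled in a few lines. The sketch is written in Python rather than CUDA so that it can be run anywhere; it mirrors the usual CUDA convention of block index times block dimension plus thread index. The names `launch_grid` and `voxel_of_thread` are hypothetical, and assigning the block extent 8 to the x axis is an assumption, since the text does not state the axis order.

```python
BLOCK_DIM = (8, 4, 4)  # threads per block in x, y, z (axis order assumed)


def launch_grid(nx, ny, nz, block=BLOCK_DIM):
    """Number of thread blocks per axis, rounding up at the image border."""
    return tuple((n + b - 1) // b for n, b in zip((nx, ny, nz), block))


def voxel_of_thread(block_idx, thread_idx, block=BLOCK_DIM):
    """Voxel index handled by one thread: block_idx * block_dim + thread_idx."""
    return tuple(bi * b + ti for bi, ti, b in zip(block_idx, thread_idx, block))


threads_per_block = BLOCK_DIM[0] * BLOCK_DIM[1] * BLOCK_DIM[2]  # 128
grid = launch_grid(256, 256, 216)                               # (32, 64, 54)
voxel = voxel_of_thread((1, 2, 3), (7, 3, 3))                   # (15, 11, 15)
```

For the 256 × 256 × 216 data set used in the runtime measurements, this layout gives 32 × 64 × 54 thread blocks of 128 threads each, one thread per voxel.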

3.1 Runtime comparison between CPU and GPU implementation

Filtering the data set of dimension 256 × 256 × 216 processed in [3] with default parameters ρ = 5, τ = 1, nt = 4 takes 113 min on one core of an Intel Xeon 3 GHz processor, whereas one Nvidia Tesla C1060 GPU only needs 70 s, which is almost 100 times faster.

3.2 Using the texture memory for tensor integration

Global memory is not cached and has an access latency of 400–600 clock cycles [14]. The read-only texture memory, however, has an access latency of only one clock cycle if the required data is already cached [14]. Adjacent threads need to access the same data while integrating the moments of inertia. Using the texture memory for this part of the algorithm decreases the runtime, especially for large integration regions (Table 1): for ρ = 10 the runtime drops from 306 s to 260 s. Since the overall runtime is dominated by the solver for small integration regions, the usage of the texture memory for tensor integration only leads to marginal speedups in these cases.

Table 1 Comparison of runtimes when filtering a dataset of dimension 256 × 256 × 216 [3], using one Tesla C1060 with parameters τ = 1, nt = 4, varying integration size ρ and accessing the input data via texture memory and global memory

ρ     Texture mem. (s)    Global mem. (s)
3     45                  47
5     70                  76
8     156                 180
10    260                 306

3.3 The Jacobi solver

Every call of the Jacobi kernel performs α Jacobi iterations [7] in which only data inside each thread block is synchronized via shared memory. The system of equations is considered to be solved as soon as the condition

$$\left\| A x^k - b \right\|_2 < \varepsilon_d \tag{11}$$

is satisfied, where x^k denotes the vector of unknowns after k Jacobi iterations. The Jacobi kernel consists of three phases:

– In the data fetch phase, the data on which the current thread block operates, plus one voxel in every space direction, is loaded into shared memory. At the end of this phase all threads of the block are synchronized.

– In the iteration phase, α Jacobi iterations are performed on the data stored in shared memory.

– In the writeback phase, the processed data is stored back to global memory.

In the data fetch phase of the next Jacobi kernel call, the updated data from all single threads is loaded into shared memory, which effects a synchronization between the threads of different thread blocks. The measured runtimes for different choices of the synchronization parameter α are listed in Table 2. At the end of the solving process, four additional, fully synchronized Jacobi iterations are performed to eliminate rounding errors at the boundaries of the thread blocks. Table 2 also confirms that computations on a GPU are much cheaper than accesses to global memory [14]. However, choosing α > 4 causes rounding errors at the boundaries of the thread blocks, damaging the result of the filter.
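The three phases and the role of α can be illustrated on a small model problem. The sketch below is a CPU toy in plain Python and not the CUDA kernel of the paper: it assumes a one-dimensional system with the matrix tridiag(−τ, 1 + 2τ, −τ) and zero values outside the domain instead of the 27-point system with the boundary condition (3), and the names `jacobi_kernel`, `residual_norm` and `solve` are hypothetical. Each block copies its values plus one halo value per side, performs α sweeps with the halo frozen, and writes its values back; the next call sees the updated neighbours.

```python
def residual_norm(x, b, tau):
    """Euclidean norm of A x - b for A = tridiag(-tau, 1 + 2*tau, -tau)."""
    n = len(x)
    total = 0.0
    for i in range(n):
        left = x[i - 1] if i > 0 else 0.0
        right = x[i + 1] if i < n - 1 else 0.0
        r = (1 + 2 * tau) * x[i] - tau * (left + right) - b[i]
        total += r * r
    return total ** 0.5


def jacobi_kernel(x, b, tau, block_size, alpha):
    """One kernel call: alpha Jacobi sweeps per block with a frozen halo."""
    n = len(x)
    new_x = list(x)
    for start in range(0, n, block_size):
        stop = min(start + block_size, n)
        # Data fetch phase: block values plus one halo value on each side.
        lo_halo = x[start - 1] if start > 0 else 0.0
        hi_halo = x[stop] if stop < n else 0.0
        local = [lo_halo] + x[start:stop] + [hi_halo]
        # Iteration phase: only the block interior changes.
        for _ in range(alpha):
            previous = list(local)
            for j in range(1, len(local) - 1):
                neighbours = previous[j - 1] + previous[j + 1]
                local[j] = (b[start + j - 1] + tau * neighbours) / (1 + 2 * tau)
        # Writeback phase: store the block interior to "global memory".
        new_x[start:stop] = local[1:-1]
    return new_x


def solve(b, tau=1.0, block_size=8, alpha=4, eps=1e-5, max_calls=10000):
    x = [0.0] * len(b)
    calls = 0
    while residual_norm(x, b, tau) >= eps and calls < max_calls:
        x = jacobi_kernel(x, b, tau, block_size, alpha)
        calls += 1
    # Four fully synchronized sweeps: one block that spans all unknowns.
    for _ in range(4):
        x = jacobi_kernel(x, b, tau, len(b), 1)
    return x, calls


b = [1.0] * 32
x, calls = solve(b)
```

Raising `alpha` trades accesses to global memory for additional arithmetic on block-local data, which is the effect measured in Table 2. The price is that halo values become increasingly stale within one call.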



Table 2 Time to solve a system of equations with 256 × 256 × 216 = 14,155,776 unknowns with ε_d = 10^−5 on one Tesla C1060 processor with varying synchronization parameter α

α     Runtime (s)
1     16
2     8.2
4     4.3
8     2.4

Table 3 Runtimes for filtering the data set of dimensions 2,048 × 2,048 × 368 with the common filter parameters τ = 1, nt = 4 and varying integration size ρ using a cluster of 96 Tesla C1060 processors

ρ     Runtime (s)
3     133
5     166
8     288
10    434

Fig. 1 Scalability: speedup factor (vertical axis, 0 to 100) plotted against the number of Tesla processors (horizontal axis, 10 to 90). Blue: optimal speedup. Green: weak scaling. Red: strong scaling

3.4 Filtering huge data sets using a cluster of Tesla processors

The data of dimension 2,048 × 2,048 × 368 mentioned above is split into 96 subcubes of dimension 360 × 272 × 200, including the overlap at the boundaries. Each subcube is filtered by one Tesla C1060 processor, consuming about two-thirds of the available device memory of each GPU. The total runtime of the filter, including loading and storing, as well as distributing and gathering the data among the cluster and interpolating inside the boundary regions of the subcubes, is listed in Table 3.
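How overlapping pieces can be recombined by linear interpolation is sketched below for a single axis. The paper does not specify how the overlap is distributed between neighbours or which interpolation weights are used, so everything here is an assumption made for illustration: the function names `split_with_overlap` and `blend_weights` are hypothetical, the overlap is centred on the cut between two pieces, and the weights ramp linearly so that they sum to one at every voxel.

```python
def split_with_overlap(length, parts, overlap):
    """Return (start, stop) index pairs; neighbours share `overlap` voxels."""
    core = -(-length // parts)  # ceiling division
    half = overlap // 2
    pieces = []
    for p in range(parts):
        start = max(0, p * core - half)
        stop = min(length, (p + 1) * core + (overlap - half))
        pieces.append((start, stop))
    return pieces


def blend_weights(pieces, length):
    """Per-piece weights that ramp linearly across each overlap region."""
    weights = [[0.0] * length for _ in pieces]
    for p, (start, stop) in enumerate(pieces):
        for i in range(start, stop):
            w = 1.0
            if p > 0 and i < pieces[p - 1][1]:
                # Overlap with the left neighbour: ramp up.
                w = (i - start + 1) / (pieces[p - 1][1] - start + 1)
            if p < len(pieces) - 1 and i >= pieces[p + 1][0]:
                # Overlap with the right neighbour: ramp down.
                w = (stop - i) / (stop - pieces[p + 1][0] + 1)
            weights[p][i] = w
    return weights


pieces = split_with_overlap(2048, 8, 16)
weights = blend_weights(pieces, 2048)
totals = [sum(w[i] for w in weights) for i in range(2048)]
# pieces[0] is (0, 264) and pieces[1] is (248, 520); every entry of
# `totals` equals 1 up to floating point rounding.
```

A blended result along this axis would then be the weighted sum of the filtered pieces, with each piece contributing only inside its own index range. The sketch assumes that the overlap is smaller than the core size of a piece, so that at most two pieces cover any voxel.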

3.5 Scalability

Figure 1 shows the scalability of the proposed implementation. For large data sets filtered on 96 Tesla processors, a speedup of 75 is achieved in the weak scaling case (green graph), whereas a speedup of 62 is achieved in the strong scaling case (red graph).
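A common way to read these numbers is the parallel efficiency, i.e. the speedup S divided by the number of processors p. The symbols E, S and p are introduced here for illustration and are not used in the original text:

```latex
\begin{align*}
E = \frac{S}{p}, \qquad
E_{\text{weak}} = \frac{75}{96} \approx 0.78, \qquad
E_{\text{strong}} = \frac{62}{96} \approx 0.65
\end{align*}
```

On 96 processors the implementation thus reaches roughly 78 % of the optimal speedup for weak scaling and roughly 65 % for strong scaling.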

4 Optimizing the tensor integration region size for high resolution microscopy data

Large data sets of single neuron cells imply a high resolution of the recording microscope, requiring the adaptation of the integration region size ρ to properly detect small and delicate structures on the dendrites, called spines [25]. Choosing ρ too small causes problems in distinguishing spines from background noise, since too little noise is suppressed by the filter. Choosing ρ too big causes the structure detection to consider the spines as noise and eliminate them. ρ = 5 is an adequate choice to distinguish spines from background noise while not destroying them (Fig. 2). Optimizing the segmentation and reconstruction methods of NeuRA for the application to 2-photon or confocal microscopy images of neuron cells, and especially reconstructing spines properly, will be part of future research. A first result of successfully reconstructed spines near a cell body is shown in Fig. 3.

5 Discussion

State of the art GPU-based high performance computing solutions make it possible to apply the time and memory intensive scheme of inertia based anisotropic filtering to huge data sets. To exploit the computing power of the GPU efficiently, it is important to use the complex GPU memory architecture as well as possible. Suggestions on how to utilize the texture memory and the shared memory were given in Sect. 3. Future research will improve the applied segmentation schemes and transfer the segmentation and geometry generation steps of the NeuRA algorithm [2] to Nvidia CUDA technology to achieve morphology reconstructions of the available high resolution microscopy images quickly. The GPU implementation of standard segmentation methods, like Otsu’s [17] statistical method, is straightforward and showed speedups of about a factor of 50 in a first test implementation. Parallelizing mesh generation methods, like the Marching Cubes algorithm [12], will be more difficult, but a first approach to a possible GPU implementation can be found in [15]. The runtime-improved implementation of the inertia based anisotropic image filter is the first step of an intended reconstruction software package, which will allow biologists and neuroscientists to reconstruct their microscopy data fully automatically. Since the recording process of high resolution microscopy images takes several hours, it will be suitable to run the reconstruction algorithm on an affordable workstation with up to four Nvidia Tesla processors with an estimated runtime proportional to microscopy recording times. Compared to the semi-automatic reconstruction tools presented in [21], which allow the reconstruction of images with a resolution similar to that of the images presented here, NeuRA needs no user interaction during the reconstruction process. By avoiding model-based segmentation methods, NeuRA is not limited to reconstructions of linear dendritic structures [9,19].

Fig. 2 To resolve small spinal extensions of the dendrites (marked) properly, the integration size parameter ρ has to be chosen carefully. a Dendrite in original microscopy data quality. b Filtered with ρ = 3: the spines cannot be distinguished from noise. c Filtered with ρ = 5: spines can clearly be distinguished from noise. d Filtered with ρ = 8: some spines are considered as noise and therefore eliminated

Fig. 3 a Part of the original microscope image (volumetric projection). b Filtered with parameters ρ = 5, τ = 1 and nt = 4. c Reconstruction of the filtered image with NeuRA. The correct reconstruction of spines (marked) is clearly visible

Acknowledgments We thank H. Monyer and J. v. Engelhardt from IZN Heidelberg for providing raw microscopy data and for stimulating discussions.



References

1. AMD: AMD Stream Computing User Guide. http://developer.amd.com/gpu_assets/ATI_Stream_SDK_CAL_Programming_Guide_v2.0.pdf. AMD (2010)

2. Broser, P.J., Eberhard, S., Heumann, H., Heusel, A., Jungblut, D., Queisser, G., Schulte, R., Vossen, C., Wittum, G.: The Neuron Reconstruction Algorithm. http://www.neura.org

3. Broser, P.J., Schulte, R., Roth, A., Helmchen, F., Waters, J., Lang, S., Sakmann, B., Wittum, G.: Nonlinear Anisotropic Diffusion Filtering of Three-Dimensional Image Data from 2-Photon Microscopy. Heidelberg University, Heidelberg, J. Biomed. Opt. 9(6):1253–1264 (2004)

4. Buck, I., Foley, T., Horn, D., Sugerman, J., Fatahalian, K., Houston, M., Hanrahan, P.: Brook for GPUs: Stream Computing on Graphics Hardware. SIGGRAPH, Stanford University (2004)

5. Ghuloum, A., Sprangle, E., Fang, J., Wu, G., Zhou, X.: Ct: A Flexible Parallel Programming Model for Tera-scale Architectures. Intel (2007)

6. Frangi, A.F., Niessen, W.J., Vincken, K.L., Viergever, M.A.: Multiscale vessel enhancement filtering. Lecture Notes in Computer Sciences, vol. 1496, pp. 130–137. Springer, Berlin (1998)

7. Hackbusch, W.: Iterative Solution of Large Sparse Systems of Equations. Springer, Berlin (1993)

8. Jaehne, B.: Digital Image Processing. Springer, Berlin (2005)

9. Jungblut, D., Karl, S., Mara, H., Krömker, S., Wittum, G.: Surface Morphology Reconstruction of Volume Data for Archaeology. In: Proceedings of conference. Scientific Computing and Cultural Heritage. Springer, Berlin (2010)

10. Jungblut, D.: Trägheitsbasiertes Filtern mikroskopischer Messdaten unter Verwendung moderner Grafikhardware. Diploma thesis, Heidelberg University (2007)

11. Lenzen, F.: 3D-Rekonstruktion von DNA-Strukturen. Diploma thesis, Bonn University (2001)

12. Lorensen, W.E., Cline, H.E.: Marching cubes: a high resolution 3D surface construction algorithm. Comput. Graph. 21, 163–169 (1987)

13. Munshi, A.: The OpenCL Specification. Khronos OpenCL Working Group (2008)

14. Nvidia: Nvidia Cuda Programming Guide. Version 2.0 (2008)

15. Nvidia: Nvidia Cuda Software Development Kit. Version 2.0 (2008)

16. Nvidia: NVidia Tesla C1060 Computing Processor Board. NVidia (2010). www.nvidia.com/docs/IO/43395/BD-04111-001_v06.pdf

17. Otsu, N.: A threshold selection method from gray-level histograms. IEEE Trans. Syst. Man Cybern. 9, 62–66 (1979)

18. Perona, P., Malik, J.: Scale-space and edge detection using anisotropic diffusion. IEEE Trans. Pattern Anal. Mach. Intell. 12(7):629–639 (1990)

19. Queisser, G., Bading, H., Wittmann, M., Wittum, G.: Filtering, reconstruction, and measurement of the geometry of nuclei from hippocampal neurons based on confocal microscopy data. J. Biomed. Opt. 13(1), 014009 (2008)

20. Santamaria-Pang, A., Bildea, T.S., Tan, S., Kakadiaris, I.A.: Denoising for 3-D photon-limited imaging data using nonseparable filterbanks. IEEE Trans. Image Process. 17(12) (2008)

21. Schmitt, S., Evers, J.F., Duch, C., Scholz, M., Obermayer, K.: New methods for the computer-assisted 3D reconstruction of neurons from confocal image stacks. NeuroImage 23, 1283–1298 (2004)

22. Taylor, J.R.: Classical Mechanics. University Science Books (2004)

23. Weickert, J.: Anisotropic Diffusion in Image Processing. Teubner, Stuttgart (1998)

24. Xu, Y., Weaver, J.B., Healy, Jr., D.M., Lu, J.: Wavelet transform domain filters: A spatially selective noise filtration technique. IEEE Trans. Image Process. 3, 747–758 (1994)

25. Yuste, R., Denk, W.: Dendritic spines as basic functional units of neuronal integration. Nature 375, 682–684 (1995)
