On an Improved FPGA Implementation of CNN-Based Gabor-Type Filters



IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—II: EXPRESS BRIEFS, VOL. 59, NO. 11, NOVEMBER 2012 815


Evren Cesur, Member, IEEE, Nerhun Yildiz, Member, IEEE, and Vedat Tavsanoglu, Senior Member, IEEE

Abstract—In this brief, the details of the architecture of a previously introduced improved field-programmable gate array implementation of the cellular neural network (CNN)-based 2-D Gabor-type filter are given, and the implementation results are discussed. The proposed architecture is suitable for real-time applications with high pixel rates. The prototype is capable of processing video streams up to a pixel rate of 373.2 megapixels per second (MP/s), including full-high-definition (HD) 1080p@60 (1080 × 1920 resolution, 60-Hz frame rate, and 124.4-MP/s visible pixel rate). This brief also contains convergence rate analysis results, along with some discussions on FIR and CNN-based implementation methods.

Index Terms—Cellular neural networks (CNNs), field-programmable gate arrays (FPGAs), Gabor filters, real-time systems, reconfigurable architectures.

I. INTRODUCTION

A 2-D Gabor filter is a spatial bandpass filter whose frequency- and orientation-selectivity properties are similar to those of the human visual system. In the literature, FIR [1]–[4] and cellular neural network (CNN)-based [5] digital implementation methods for Gabor filters are available, where [1] and [4] demonstrated graphics processing unit (GPU) and application-specific integrated circuit implementations, respectively, whereas [2], [3], and [5] demonstrated implementations on field-programmable gate array (FPGA) devices, with all having their weaknesses and strengths.

Previously, in search of an optimized design, two Gabor-type filter (GTF) architectures based on the CNN structure were proposed [6], [7], of which the latter was implemented on an FPGA device, where input/output, RAM, control, and serial interface structures of a real-time CNN emulator (RTCNNP-v2) [8], [9] were used.

In this brief, a mathematical overview, the fixed-point arithmetic and convergence rate analysis results, and the architectural details of the GTF are given; the implementation issues of the FIR and CNN-based structures are discussed; and the realized prototype is compared with other digital implementations of Gabor filters.

Manuscript received April 27, 2012; revised July 1, 2012; accepted August 26, 2012. Date of publication October 26, 2012; date of current version January 4, 2013. This work was supported by The Scientific and Technological Research Council of Turkey (TÜBITAK) under Project 108E023. This brief was recommended by Associate Editor B.-D. Liu.

The authors are with the Electronics and Communications Engineering Department, Yildiz Technical University, 34220 Istanbul, Turkey (e-mail: [email protected]; [email protected]; [email protected]).

Digital Object Identifier 10.1109/TCSII.2012.2218471

II. MATHEMATICAL OVERVIEW

A one-neighborhood space-invariant continuous-time linear CNN with $K \times L$ cells is completely described in [10] by the cell state equation

$$\dot{x}_{ij}(t) = \sum_{k=-1}^{1} \sum_{l=-1}^{1} \left( a_{kl}\, x_{i+k,\,j+l}(t) + b_{kl}\, u_{i+k,\,j+l} \right) \tag{1}$$

where $(i, j)$, $i \in \{1, 2, \ldots, K\}$, $j \in \{1, 2, \ldots, L\}$ denotes the spatial Cartesian coordinates of cell $C(i, j)$, $x_{ij}(t)$ is the cell state at time $t$, $u_{ij}$ is the constant-valued cell input, and $a_{kl}$ and $b_{kl}$, $k, l \in \{-1, 0, 1\}$ are the constant-valued feedback and input coefficients, respectively. Equation (1) can be written as

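$$\dot{x}_{ij}(t) = \mathbf{A} \odot \mathbf{X}_{ij}(t) + \mathbf{B} \odot \mathbf{U}_{ij} \tag{2}$$

where $\odot$ is the template dot product operator and

$$\mathbf{A} = \begin{bmatrix} a_{-1,-1} & a_{-1,0} & a_{-1,1} \\ a_{0,-1} & a_{0,0} & a_{0,1} \\ a_{1,-1} & a_{1,0} & a_{1,1} \end{bmatrix} \qquad \mathbf{X}_{ij}(t) = \begin{bmatrix} x_{i-1,j-1}(t) & x_{i-1,j}(t) & x_{i-1,j+1}(t) \\ x_{i,j-1}(t) & x_{ij}(t) & x_{i,j+1}(t) \\ x_{i+1,j-1}(t) & x_{i+1,j}(t) & x_{i+1,j+1}(t) \end{bmatrix}$$

$$\mathbf{B} = \begin{bmatrix} b_{-1,-1} & b_{-1,0} & b_{-1,1} \\ b_{0,-1} & b_{0,0} & b_{0,1} \\ b_{1,-1} & b_{1,0} & b_{1,1} \end{bmatrix} \qquad \mathbf{U}_{ij} = \begin{bmatrix} u_{i-1,j-1} & u_{i-1,j} & u_{i-1,j+1} \\ u_{i,j-1} & u_{ij} & u_{i,j+1} \\ u_{i+1,j-1} & u_{i+1,j} & u_{i+1,j+1} \end{bmatrix}.$$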

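The templates of a GTF are defined by Shi [11] as

$$\mathbf{A} = \begin{bmatrix} 0 & e^{-j\omega_{y0}} & 0 \\ e^{j\omega_{x0}} & -(4 + \lambda^{2}) & e^{-j\omega_{x0}} \\ 0 & e^{j\omega_{y0}} & 0 \end{bmatrix} \qquad \mathbf{B} = \begin{bmatrix} 0 & 0 & 0 \\ 0 & \lambda^{2} & 0 \\ 0 & 0 & 0 \end{bmatrix}$$

where $\lambda$ is the half bandwidth of the corresponding bandpass filter, which is tuned to the frequency $\omega_0 = (\omega_{x0}^{2} + \omega_{y0}^{2})^{1/2}$.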

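Now, applying $x_{ij}(t) = x_{ij}(nT_s) \triangleq x_{ij}(n)$, $n \in \mathbb{N}$, to (2) with the optimum sampling interval $T_s = |a_{00}|^{-1} = (4 + \lambda^{2})^{-1}$ as proposed in [12], along with the forward Euler approximation $\dot{x}_{ij}(n) \simeq T_s^{-1}\left(x_{ij}(n+1) - x_{ij}(n)\right)$, the cell state equation of a discrete-time CNN (DT-CNN) cell is obtained as

$$x_{ij}(n+1) = T_s \left( \tilde{\mathbf{A}} \odot \mathbf{X}_{ij}(n) + \mathbf{B} \odot \mathbf{U}_{ij} \right) \tag{3}$$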


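where the surround template $\tilde{\mathbf{A}}$ is

$$\tilde{\mathbf{A}} = \begin{bmatrix} 0 & e^{-j\omega_{y0}} & 0 \\ e^{j\omega_{x0}} & 0 & e^{-j\omega_{x0}} \\ 0 & e^{j\omega_{y0}} & 0 \end{bmatrix}.$$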

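Equation (3) can be rewritten as the coupled real and imaginary cell state equations

$$x^{R}_{ij}(n+1) = T_s \left( \tilde{\mathbf{A}}^{R} \odot \mathbf{X}^{R}_{ij}(n) - \tilde{\mathbf{A}}^{I} \odot \mathbf{X}^{I}_{ij}(n) + \mathbf{B} \odot \mathbf{U}_{ij} \right) \tag{4a}$$

$$x^{I}_{ij}(n+1) = T_s \left( \tilde{\mathbf{A}}^{R} \odot \mathbf{X}^{I}_{ij}(n) + \tilde{\mathbf{A}}^{I} \odot \mathbf{X}^{R}_{ij}(n) \right) \tag{4b}$$

where

$$\tilde{\mathbf{A}}^{R} = \begin{bmatrix} 0 & \cos\omega_{y0} & 0 \\ \cos\omega_{x0} & 0 & \cos\omega_{x0} \\ 0 & \cos\omega_{y0} & 0 \end{bmatrix} \qquad \tilde{\mathbf{A}}^{I} = \begin{bmatrix} 0 & -\sin\omega_{y0} & 0 \\ \sin\omega_{x0} & 0 & -\sin\omega_{x0} \\ 0 & \sin\omega_{y0} & 0 \end{bmatrix}.$$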
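As an illustration of how (4a) and (4b) relate to (3), the following minimal Python sketch performs one iteration in both forms on a small grid and lets the two be compared. The function names, the zero-valued treatment of cells outside the grid, and the test values are assumptions made for this example only; they are not taken from the implementation described in this brief.

```python
import cmath
import math

# Offsets (k, l) of the four nonzero surround-template entries:
# k is the row offset and l the column offset, as in x_{i+k, j+l}.
OFFSETS = [(-1, 0), (1, 0), (0, -1), (0, 1)]


def step_complex(x, u, lam, wx0, wy0):
    """One iteration of (3) with complex states (assumed zero boundary)."""
    rows, cols = len(x), len(x[0])
    ts = 1.0 / (4.0 + lam * lam)
    a = {(-1, 0): cmath.exp(-1j * wy0), (1, 0): cmath.exp(1j * wy0),
         (0, -1): cmath.exp(1j * wx0), (0, 1): cmath.exp(-1j * wx0)}
    nxt = [[0j] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            acc = lam * lam * u[i][j]  # B template: only b_00 is nonzero
            for k, l in OFFSETS:
                if 0 <= i + k < rows and 0 <= j + l < cols:
                    acc += a[(k, l)] * x[i + k][j + l]
            nxt[i][j] = ts * acc
    return nxt


def step_split(xr, xi, u, lam, wx0, wy0):
    """The same iteration written as the coupled real equations (4a), (4b)."""
    rows, cols = len(xr), len(xr[0])
    ts = 1.0 / (4.0 + lam * lam)
    ar = {(-1, 0): math.cos(wy0), (1, 0): math.cos(wy0),
          (0, -1): math.cos(wx0), (0, 1): math.cos(wx0)}
    ai = {(-1, 0): -math.sin(wy0), (1, 0): math.sin(wy0),
          (0, -1): math.sin(wx0), (0, 1): -math.sin(wx0)}
    nr = [[0.0] * cols for _ in range(rows)]
    ni = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            acc_r = lam * lam * u[i][j]
            acc_i = 0.0
            for k, l in OFFSETS:
                if 0 <= i + k < rows and 0 <= j + l < cols:
                    r, m = xr[i + k][j + l], xi[i + k][j + l]
                    acc_r += ar[(k, l)] * r - ai[(k, l)] * m  # terms of (4a)
                    acc_i += ar[(k, l)] * m + ai[(k, l)] * r  # terms of (4b)
            nr[i][j] = ts * acc_r
            ni[i][j] = ts * acc_i
    return nr, ni
```

Calling both functions repeatedly with the same input image and initial states should give matching results up to floating-point rounding, since (4a) and (4b) are only the real and imaginary parts of (3).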

A. Dynamic Range and Convergence Rate Issues

In order to avoid overflows in the implementations of the fixed-point arithmetic, the dynamic range of the GTF should be determined and taken into account in the design process. On the other hand, the number of Euler iterations should be constrained for the implementation. In this section, two lemmas are given to this end.

Lemma 1: If all inputs and Euclidean norms of all initial states are in the range of $[-1, 1]$, then the values of all states also remain in the same range during the computation, i.e., $\forall i, j, n$, if $|u_{ij}| \le 1$ and $\|\mathbf{x}_{ij}(0)\| \le 1$, where $\mathbf{x}_{ij}(n) = [x^{R}_{ij}(n)\ \ x^{I}_{ij}(n)]^{T}$ and $\|\cdot\|$ denotes the Euclidean norm, then

$$\left|x^{R}_{ij}(n)\right| \le 1, \qquad \left|x^{I}_{ij}(n)\right| \le 1. \tag{5}$$

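Proof: Let us rewrite (4) into the form

$$\mathbf{x}_{ij}(n+1) = T_s \left\{ \mathbf{G}_x \mathbf{x}_{i,j-1}(n) + \mathbf{G}_x^{T} \mathbf{x}_{i,j+1}(n) + \mathbf{G}_y^{T} \mathbf{x}_{i-1,j}(n) + \mathbf{G}_y \mathbf{x}_{i+1,j}(n) + \lambda^{2} \mathbf{u}_{ij} \right\} \tag{6}$$

where

$$\mathbf{G}_\kappa = \begin{bmatrix} \cos\omega_{\kappa 0} & -\sin\omega_{\kappa 0} \\ \sin\omega_{\kappa 0} & \cos\omega_{\kappa 0} \end{bmatrix} \qquad \mathbf{x}_{ij}(n) = \begin{bmatrix} x^{R}_{ij}(n) \\ x^{I}_{ij}(n) \end{bmatrix} \qquad \mathbf{u}_{ij} = \begin{bmatrix} u_{ij} \\ 0 \end{bmatrix}$$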

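with $\kappa \in \{x, y\}$. From (6), we obtain

$$\begin{aligned} \|\mathbf{x}_{ij}(n+1)\| &= (4 + \lambda^{2})^{-1} \left\| \mathbf{G}_x \mathbf{x}_{i,j-1}(n) + \mathbf{G}_x^{T} \mathbf{x}_{i,j+1}(n) + \mathbf{G}_y^{T} \mathbf{x}_{i-1,j}(n) + \mathbf{G}_y \mathbf{x}_{i+1,j}(n) + \lambda^{2} \mathbf{u}_{ij} \right\| \\ &\le (4 + \lambda^{2})^{-1} \left\{ \|\mathbf{G}_x \mathbf{x}_{i,j-1}(n)\| + \|\mathbf{G}_x^{T} \mathbf{x}_{i,j+1}(n)\| + \|\mathbf{G}_y^{T} \mathbf{x}_{i-1,j}(n)\| + \|\mathbf{G}_y \mathbf{x}_{i+1,j}(n)\| + \lambda^{2} \|\mathbf{u}_{ij}\| \right\}. \end{aligned}$$

By assuming that $|u_{ij}| \le 1$ and $\|\mathbf{x}_{ij}(0)\| \le 1$ and considering that $\|\mathbf{u}_{ij}\| = |u_{ij}|$ and, since $\mathbf{G}_\kappa$ is a rotation matrix, $\|\mathbf{G}_\kappa \mathbf{x}_{ij}(n)\| = \|\mathbf{G}_\kappa^{T} \mathbf{x}_{ij}(n)\| = \|\mathbf{x}_{ij}(n)\|$, we have, by induction on $n$,

$$\begin{aligned} \|\mathbf{x}_{ij}(n+1)\| &\le (4 + \lambda^{2})^{-1} \left\{ \|\mathbf{x}_{i,j-1}(n)\| + \|\mathbf{x}_{i,j+1}(n)\| + \|\mathbf{x}_{i-1,j}(n)\| + \|\mathbf{x}_{i+1,j}(n)\| + \lambda^{2} |u_{ij}| \right\} \\ &\le (4 + \lambda^{2})^{-1} (4 + \lambda^{2}) = 1 \end{aligned}$$

which means that $\forall i, j, n$, $|x^{R}_{ij}(n)| \le 1$ and $|x^{I}_{ij}(n)| \le 1$. ∎

Fig. 1. Signal-flow diagram of the discrete-time GTF.

Lemma 2: The number of Euler iterations required for the cell states to reach their steady-state values is inversely proportional to the bandwidth of the GTF.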

Proof: The vector-matrix form of the cell state equation of the DT-CNN-based GTF is obtained by packing (3) as $\mathbf{x}(n+1) = \mathbf{A}\mathbf{x}(n) + \mathbf{B}\mathbf{u}$, where $\mathbf{u}$ and $\mathbf{x}$ are the packed input and state vectors while $\mathbf{B}$ and $\mathbf{A}$ are the corresponding input and feedback matrices, respectively [13]. Note that all nonzero entries of the matrix $\mathbf{A}$ are the entries $a_{0,-1}$, $a_{0,1}$, $a_{-1,0}$, and $a_{1,0}$ of the template $\mathbf{A}$, multiplied by $T_s$, and each entry appears on a row at most once. Consequently, the maximum absolute row sum norm of the matrix $\mathbf{A}$ is $\|\mathbf{A}\|_\infty = \max_\xi \sum_{\zeta=1}^{KL} |a_{\xi\zeta}| = 4/(4 + \lambda^{2})$ and, according to the Gershgorin circle theorem [14], $|\sigma_\rho|_{\max} \le 4/(4 + \lambda^{2})$, where $|\sigma_\rho|_{\max}$ is the spectral radius of $\mathbf{A}$ whose eigenvalues are given as $\sigma_\rho$, $\rho \in \{1, 2, \ldots, KL\}$. Since this bound is smaller than 1 for any $\lambda \ne 0$, all entries of $\mathbf{x}(n)$ converge to fixed values after a certain number of iterations. From the above, it is also easily seen that the larger the bandwidth $\lambda$, the smaller the spectral radius and, hence, the faster the convergence rate, and vice versa. ∎
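The bound used in the proof of Lemma 2 can be turned into a rough worst-case iteration estimate. The sketch below assumes that the error with respect to the steady state shrinks by at least the factor $4/(4+\lambda^{2})$ per iteration in the maximum norm, which follows from the row-sum norm above; the tolerance `eps` and the function names are hypothetical and not part of the original analysis. Being a worst-case bound, it may be considerably more pessimistic than what is observed on actual images.

```python
import math


def spectral_radius_bound(lam):
    """Row-sum (Gershgorin) bound 4 / (4 + lam^2) from the proof of Lemma 2."""
    return 4.0 / (4.0 + lam * lam)


def iterations_for_tolerance(lam, eps):
    """Smallest n with bound**n <= eps, for lam != 0 and 0 < eps < 1.

    eps is a hypothetical relative error tolerance chosen by the user.
    """
    rho = spectral_radius_bound(lam)
    return math.ceil(math.log(eps) / math.log(rho))
```

For a fixed tolerance, the estimate grows as $\lambda$ shrinks, which is the qualitative behavior stated in the lemma.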

III. PROPOSED ARCHITECTURE

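A DT-CNN-based GTF structure was proposed in [7] as an implementation of (6), whose signal-flow diagram is shown in Fig. 1, where

$$\alpha_x = \frac{\cos\omega_{x0}}{4 + \lambda^{2}} \qquad \beta_x = \frac{\cos\omega_{y0}}{4 + \lambda^{2}} \qquad \alpha_y = \frac{\sin\omega_{x0}}{4 + \lambda^{2}} \qquad \beta_y = \frac{\sin\omega_{y0}}{4 + \lambda^{2}} \qquad b = \frac{\lambda^{2}}{4 + \lambda^{2}}.$$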
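To make the role of these coefficients concrete, the sketch below evaluates (6) for a single cell using only real arithmetic. It is one possible arrangement, written under the assumption that neighbor sums and differences are formed before multiplication; it is not claimed to reproduce the exact structure of Fig. 1. The function names and the tuple layout of the arguments are hypothetical.

```python
import math


def gtf_coefficients(lam, wx0, wy0):
    """Coefficients alpha_x, alpha_y, beta_x, beta_y, and b as defined above."""
    d = 4.0 + lam * lam
    return {
        "alpha_x": math.cos(wx0) / d, "alpha_y": math.sin(wx0) / d,
        "beta_x": math.cos(wy0) / d, "beta_y": math.sin(wy0) / d,
        "b": lam * lam / d,
    }


def cell_update(c, west, east, north, south, u):
    """Evaluate (6) for one cell.

    west, east, north, south are (real, imag) states of cells (i, j-1),
    (i, j+1), (i-1, j), and (i+1, j); u is the real-valued cell input.
    """
    (wr, wi), (er, ei), (nr, ni), (sr, si) = west, east, north, south
    # Four multiplications for the real part (plus the precomputable b * u).
    real = (c["alpha_x"] * (wr + er) - c["alpha_y"] * (wi - ei)
            + c["beta_x"] * (nr + sr) - c["beta_y"] * (si - ni)
            + c["b"] * u)
    # Four multiplications for the imaginary part.
    imag = (c["alpha_y"] * (wr - er) + c["alpha_x"] * (wi + ei)
            + c["beta_y"] * (sr - nr) + c["beta_x"] * (ni + si))
    return real, imag
```

In this arrangement, each state update needs eight coefficient multiplications, four for the real part and four for the imaginary part, while $bu_{ij}$ can be computed once per pixel.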

Two different approaches are used in the logical realization of this diagram: direct implementation and resource sharing. In the former, a computation tree is implemented directly and, in the latter, considering the similarity of the two diagrams, one computation tree is realized and multiplexed for the computation of real and imaginary parts (Fig. 2). The multiplexing halves the number of multiplier and adder resources; however, twice the pixel frequency is required for computation. The input $bu_{ij}$ is also multiplexed with zero, as it is only required for the calculation of the real part. All adders and multipliers are registered to form a pipeline. The latency of the processor is three pixel clock cycles, as there are six pipeline stages clocked at twice the pixel frequency. Finally, the real and imaginary parts of the resulting values are demultiplexed to their respective output registers.

Fig. 2. Logical block diagram of one CNN-based GTF computation unit.

Fig. 3. Simplified block diagram of the GTF processor array. One Gabor processor is dedicated to each iteration to form a full pipeline.

In order to use fewer resources, fixed-point arithmetic is used at all levels of the implementation. In addition to the assumption that $|u_{ij}| \le 1$ in Lemma 1, it is also clear that $b = \lambda^{2}/(4 + \lambda^{2}) \le 1$; hence, $bu_{ij}$ remains in the range $[-1, 1]$. Using this result along with (5), we can conclude that 1 bit is sufficient to represent the integer parts of each cell state and $bu_{ij}$, while the bit widths of the fractional parts are designed to be presynthesis configurable.
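A fixed-point format with a single integer bit can be modeled in software as follows. This is a minimal sketch that assumes a two's-complement representation with saturation and round-to-nearest; the brief does not specify the rounding or overflow behavior of the hardware, so these choices, like the function name, are illustrative assumptions.

```python
def quantize(value, frac_bits):
    """Map value to a signed fixed-point grid with 1 integer (sign) bit.

    Assumed format: two's complement, so representable values lie in
    [-1, 1 - 2**-frac_bits] with a step of 2**-frac_bits. Out-of-range
    values are saturated rather than wrapped (an assumption).
    """
    scale = 1 << frac_bits
    code = round(value * scale)               # nearest integer code
    code = max(-scale, min(scale - 1, code))  # saturate to the code range
    return code / scale
```

Under this assumed format, the value $+1$ itself is not representable and saturates to $1 - 2^{-\text{frac\_bits}}$, which is a property of two's-complement fractions in general rather than a statement about the prototype.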

The simplified block diagram of the GTF is shown in Fig. 3. The input image is used to generate the initial values of the cell states, as well as to calculate the $bu_{ij}$ values. The initial values can be programmed to be equal to either their corresponding cell inputs or any constant value. Each iteration is implemented with a dedicated processor, forming a fully pipelined processor chain. The state output of the $N$th processor is the final output, where $N$ is the total number of implemented iterations. As the $bu_{ij}$ values remain constant through the Euler iterations, they are calculated only once for each input pixel and then buffered, used, and passed through by each processor in the chain.

The processor chain is designed to work with progressive video signals. As the upper and lower neighbors of each state are required for a template dot product operation, each processor requires at least two lines of its input states to be buffered in its dedicated line buffers [8], [9]. Consequently, the latency of the processor chain is approximately equal to the time required to scan $2N$ progressive video lines. Also, a single line of $bu_{ij}$ constants should be buffered for synchronization.

The bit widths, resource optimization level, depth of line buffers, and number of processors are presynthesis configurable and cannot be changed during runtime. On the other hand, the values of $\alpha$, $\beta$, and $\lambda$, the boundary conditions, and the initial values are runtime programmable through a serial interface. Consequently, the proposed architecture is highly flexible, reconfigurable, and reusable.

IV. IMPLEMENTATION ISSUES OF THE FIR GABOR FILTER AND DT-CNN-BASED GTF

The hardware multipliers and internal RAM blocks are the most valuable resources of an FPGA device for the implementation of Gabor filters. On the other hand, any hardware implementation is desired to be flexible, modular, reconfigurable, and reusable. In this section, the FPGA implementation issues of the FIR Gabor filter and the DT-CNN-based GTF structures will be discussed from the viewpoint of these facts.

The $W \times W$, $W \in \mathbb{N}$, convolution kernel of a FIR Gabor filter is obtained by sampling and windowing a continuous-space 2-D Gabor function. For real-valued input images, computation of each complex-valued output pixel requires $W^{2}$ complex or $2W^{2}$ real multiplication operations. It is reported in [15] that $W$ should be chosen as $W \ge 6\lambda^{-1} + 1$ in order to eliminate the effect of windowing; hence, the kernel should be larger for narrow bandwidths, which, in turn, means that more multiplication operations are required. According to our simulation results, the kernel size may be loosened to $W \ge 4\lambda^{-1} + 1 = W_{\min}$, and still, more than 99% of the total energy of the original function is carried by the windowed function. The converse is also true: If a FIR filter is realized with a $W \times W$ convolution kernel, the inequality $\lambda \ge 4(W - 1)^{-1}$ should be satisfied. For example, when using a FIR implementation with a $21 \times 21$ kernel, it is not possible to obtain a bandwidth smaller than 0.2.
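The kernel-size relations above are simple enough to check numerically. The following sketch encodes them directly; the function names and the `factor` parameter (4 for the loosened bound, 6 for the bound of [15]) are naming choices made for this example.

```python
import math


def min_kernel_width(lam, factor=4.0):
    """Smallest integer W satisfying W >= factor / lam + 1."""
    return math.ceil(factor / lam + 1.0)


def min_bandwidth(width, factor=4.0):
    """Smallest lam allowed by a width x width kernel: lam >= factor / (W - 1)."""
    return factor / (width - 1)


def real_multiplications_per_pixel(width):
    """2 * W^2 real multiplications per complex output pixel (real input)."""
    return 2 * width * width
```

For instance, `min_bandwidth(21)` evaluates to 0.2, matching the $21 \times 21$ example in the text, and such a kernel needs `real_multiplications_per_pixel(21)`, i.e., 882, real multiplications per output pixel.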

Lemma 2 states that the number of Euler iterations required for the computation of a DT-CNN-based GTF is inversely proportional to the bandwidth; hence, like FIR filters, DT-CNN-based implementations also require more resources for smaller bandwidths. Furthermore, the convergence rate depends also on the texture of the input image and the initial values of the cell states. In order to investigate these issues, a PC simulation is run with various inputs, initial states, and coefficients, where the cell state values at each Euler iteration are compared with the steady-state values of the cell states using the Structural SIMilarity (SSIM) index [16]. Here, the steady-state values of the cell states are calculated by $x_{ij}(\infty) = \mathcal{F}^{-1}\{-(\mathcal{F}\{b_{ij}\}/\mathcal{F}\{a_{ij}\})\,\mathcal{F}\{u_{ij}\}\}$, where $\mathcal{F}\{\cdot\}$ and $\mathcal{F}^{-1}\{\cdot\}$ are the spatial Fourier and inverse Fourier transform operators, respectively [10]. The simulation results are shown in Fig. 4, where both the input and the initial cell state images are chosen as the same sinusoidal zone plate of size 1080 × 1920. The curves marked as 1 to 6 are plotted for $(\lambda, \omega_0)$ pairs of $(0.02, \pi/9)$, $(0.02, \pi/3)$, $(0.1, \pi/9)$, $(0.1, \pi/3)$, $(0.3, \pi/9)$, and $(0.3, \pi/3)$, respectively. It is observed that an SSIM index value of 0.96 is enough for an Euler iteration result to be sufficiently close to the steady-state image.

Fig. 4. SSIM values versus the number of Euler iterations.

TABLE I. RESOURCE USAGE OF A FIR GABOR FILTER AND THE PROPOSED GTF
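The steady-state expression can be worked out for the GTF templates of Section II. The derivation below is a sketch under the assumption of an unbounded (or periodic) image, so that the spatial Fourier transform diagonalizes the template operations; the symbols $H$, $\Delta_x$, and $\Delta_y$ are introduced here for illustration, with $\Delta_x$ and $\Delta_y$ denoting the offsets of the input spatial frequency from the frequency to which the filter is tuned along each axis.

```latex
\begin{align}
0 &= \mathcal{F}\{a_{ij}\}\,\mathcal{F}\{x_{ij}(\infty)\}
   + \mathcal{F}\{b_{ij}\}\,\mathcal{F}\{u_{ij}\}
   && \text{(steady state of (2), } \dot{x}_{ij} = 0\text{)} \\
\mathcal{F}\{b_{ij}\} &= \lambda^{2}, \qquad
\mathcal{F}\{a_{ij}\} = -(4 + \lambda^{2}) + 2\cos\Delta_x + 2\cos\Delta_y \\
H(\Delta_x, \Delta_y) &= -\frac{\mathcal{F}\{b_{ij}\}}{\mathcal{F}\{a_{ij}\}}
   = \frac{\lambda^{2}}{\lambda^{2} + 4 - 2\cos\Delta_x - 2\cos\Delta_y}
   \approx \frac{\lambda^{2}}{\lambda^{2} + \Delta_x^{2} + \Delta_y^{2}}
\end{align}
```

The approximation uses $2 - 2\cos\Delta \approx \Delta^{2}$ for small $\Delta$. Under these assumptions, the gain is 1 at the tuned frequency and falls to about 1/2 at a radial offset of $\lambda$, which is consistent with $\lambda$ being described as the half bandwidth.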

Some resource usage statistics for the FIR Gabor filters and the DT-CNN-based GTF implementations are given in Table I. These results show that almost the same numbers of multipliers are required for both designs for a bandwidth constraint of $\lambda \ge 0.2$; however, multiplier usage of a FIR filter increases drastically as $\lambda$ decreases. On the other hand, GTF implementations are much more RAM intensive than FIR filters. The 9-kbit block RAM usage statistics given in Table I are calculated for a full-HD input, where the bit widths of $u_{ij}$, $x^{R}_{ij}(n+1)$, $x^{I}_{ij}(n+1)$, and $bu_{ij}$ are chosen as 8, 12, 12, and 16, respectively.

Another issue is the level of difficulty of physical placement and routing on an FPGA. A GTF implementation contains small pipelined processors, e.g., containing eight multipliers each, and only neighboring processors are connected to each other; hence, the placement and routing tasks of a GTF are trivial for an FPGA design tool. On the other hand, a FIR filter is essentially a single huge processor, containing at least hundreds of multipliers, which means that the same task is considerably harder: the FPGA design tool not only has to configure the scattered multiplier and RAM resources on the chip but also needs to synchronize them with a single clock.

As a result of the previous discussions, FIR filters may be preferred for $\lambda \ge 0.2$, whereas the DT-CNN-based GTF structures are suitable for any bandwidth, provided that a sufficiently large FPGA is used. The GTF is by far the more flexible, modular, reconfigurable, and reusable of the two, as its implementation on an FPGA for a smaller bandwidth means using more processors, which is simply accomplished by changing a single number in the very-high-speed integrated circuits (VHSIC) hardware description language (VHDL) source code and resynthesizing the design.

Fig. 5. PSNR values versus fixed-point bit widths.

V. IMPLEMENTATION RESULTS AND COMPARISONS

The proposed architecture is implemented on an Altera Stratix IV GX 230 FPGA device. The input and output of the system are chosen to be Digital Visual Interface (DVI) or High-Definition Multimedia Interface (HDMI). The processing starts with the capture of a progressive video signal from a DVI/HDMI source, like a PC graphics card, a digital video camera, or a media player. Then, the captured data are converted to a gray-scale image and processed with the implemented GTF. Finally, the output is relayed to a DVI/HDMI video sink, e.g., a PC monitor or a high-definition television.

The prototype is tested for various resolutions from 480 × 640 to 1080 × 1920, frame rates from 50 to 85 Hz, and numbers of iterations from 1 to 100. Due to input/output limitations, the maximum throughput tested is 124.4 megapixels per second (MP/s) for full-HD 1080p@60 (1080 × 1920 resolution at a 60-Hz frame rate). However, the processor chain is capable of processing at 373.2 MP/s when resource optimization is disabled. For full-HD 1080p@60 and 50 iterations, there are 2200 pixels on a scan line and the pixel clock frequency is 148.5 MHz; hence, the latency of the system is approximately $2 \times 50 \times 2200 / 148.5\ \text{MHz} = 1.48$ ms.
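The throughput and latency figures quoted above follow from simple arithmetic, which the sketch below reproduces. The function and parameter names are hypothetical; the default of two buffered lines per processor is taken from the description of the line buffers in Section III.

```python
def visible_pixel_rate(rows, cols, frame_rate_hz):
    """Visible pixels per second for a progressive video format."""
    return rows * cols * frame_rate_hz


def chain_latency_seconds(iterations, pixels_per_line, pixel_clock_hz,
                          lines_per_processor=2):
    """Approximate latency of the processor chain.

    Each processor delays the stream by about lines_per_processor scan
    lines, and one processor is used per Euler iteration.
    """
    total_pixels = lines_per_processor * iterations * pixels_per_line
    return total_pixels / pixel_clock_hz
```

With these definitions, `visible_pixel_rate(1080, 1920, 60)` gives 124 416 000 pixels per second (124.4 MP/s), and `chain_latency_seconds(50, 2200, 148.5e6)` gives approximately 1.48 ms.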

The rounding error of the system depends on the bit widths of the states and $bu_{ij}$. We use peak signal-to-noise ratio (PSNR) as a measure of this error, which is plotted against the bit width values in Fig. 5, where the curves marked as 1 to 3 correspond to $(\lambda, \omega_0)$ pairs of $(0.02, \pi/9)$, $(0.02, \pi/3)$, and $(0.1, \pi/9)$ and “R” and “I” denote the real and imaginary parts of the outputs, respectively. The input image is chosen as a full-HD sinusoidal zone plate with 8-bit intensity. The initial values of the states are set to be equal to the corresponding values of the input and 50 Euler iterations are computed. First, a reference output is calculated, where double-precision floating-point numbers are used. Second, a fixed-point PC simulation of the GTF is run, where the bit widths of $x^{R}_{ij}(n)$ and $x^{I}_{ij}(n)$ are swept while $\alpha$, $\beta$, and $b$ are fixed to 16 bits and $bu_{ij}$ is fixed to 24 bits [Fig. 5(a)]. Third, $x^{R}_{ij}(n)$ and $x^{I}_{ij}(n)$ are fixed to 24 bits while $bu_{ij}$ is swept [Fig. 5(b)]. The analysis results show that choosing 16 bits for both states and $bu_{ij}$ is sufficient to achieve more than 60-dB PSNR values.

TABLE II. COMPARISON OF GABOR FILTER IMPLEMENTATIONS
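For reference, a PSNR comparison between a floating-point output and a fixed-point output can be sketched as below. The peak value used in the brief is not stated; the default `peak=1.0` is an assumption based on the state range of Lemma 1, and the function name and flat-list data layout are likewise illustrative.

```python
import math


def psnr_db(reference, test, peak=1.0):
    """PSNR in dB between two equally long sequences of samples.

    peak is the assumed maximum signal amplitude (hypothetical default).
    """
    mse = sum((r - t) ** 2 for r, t in zip(reference, test)) / len(reference)
    if mse == 0.0:
        return math.inf  # identical signals
    return 10.0 * math.log10(peak * peak / mse)
```

As a sanity check of the scale, a uniform error of $10^{-3}$ on every sample corresponds to 60 dB when the peak is 1.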

The proposed architecture is compared with other implementations reported in the literature (Table II). It is rather difficult to compare various architectures implemented on different technologies, running at different processing frequencies. Therefore, we use throughput per hertz (throughput divided by frequency) for scaling. The results show that, when resource optimization is disabled, our GTF implementation is two times faster than the GPU implementation reported by Wang and Shi [1] (including input/output delay), which is the fastest 2-D Gabor filter implementation reported to date. The proposed GTF is also approximately 4.2 times faster than that given in [5] and 2.27, 4.94, and 42 times faster than the implementations of 2-D FIR Gabor filters reported in [2]–[4], respectively. Enabling resource optimization halves these speedup values, but our GTF implementation is still as fast as the one given in [1] and faster than all others.

VI. CONCLUSION

In this brief, first, the GTF architecture proposed in [7] has been described in more detail, where fixed-point arithmetic and convergence rate analysis results have also been given. Second, the implementation issues of the FIR Gabor filter and DT-CNN-based GTF implementations have been discussed. It is shown that smaller bandwidths require more multiplier and block RAM resources for both designs, and placement and routing of the FIR implementations are rather difficult for the FPGA design tools. Third, the proposed GTF has been compared with other implementations of Gabor filters. The prototype was tested on full-HD 1080p@60 (124.4 MP/s) video streams, which is the highest resolution and fastest throughput per hertz reported to date. The maximum throughput of the implementation is more than 373.2 MP/s with a suitable interface.

REFERENCES

[1] X. Wang and B. Shi, “GPU implemention of fast Gabor filters,” in Proc. IEEE ISCAS, May 30–Jun. 2, 2010, pp. 373–376.

[2] E. Norouznezhad, A. Bigdeli, A. Postula, and B. Lovell, “Robust object tracking using local oriented energy features and its hardware/software implementation,” in Proc. 11th ICARCV, Dec. 2010, pp. 2060–2066.

[3] Y. Cho, S. Bae, Y. Jin, K. Irick, and V. Narayanan, “Exploring Gabor filter implementations for visual cortex modeling on FPGA,” in Proc. Int. Conf. FPL, Sep. 2011, pp. 311–316.

[4] J. Liu, S. Wang, Y. Li, J. Han, and X. Zeng, “Configurable pipelined Gabor filter implementation for fingerprint image enhancement,” in Proc. 10th IEEE ICSICT, Nov. 2010, pp. 584–586.

[5] O. Cheung, P. Leong, E. Tsang, and B. Shi, “A scalable FPGA implementation of cellular neural networks for Gabor-type filtering,” in Proc. IJCNN, 2006, pp. 15–20.

[6] E. Saatci, E. Cesur, V. Tavsanoglu, and I. Kale, “An FPGA implementation of 2-D CNN Gabor-type filter,” in Proc. 18th ECCTD, Aug. 2007, pp. 280–283.

[7] E. Cesur, N. Yildiz, and V. Tavsanoglu, “An improved FPGA implementation of CNN Gabor-type filters,” in Proc. IEEE ISCAS, May 2011, pp. 881–884.

[8] N. Yildiz, E. Cesur, and V. Tavsanoglu, “A new control structure for the pipelined CNN processor arrays,” in Proc. 12th Int. Workshop CNNA, Feb. 2010, pp. 1–4.

[9] E. Cesur, N. Yildiz, and V. Tavsanoglu, “Architecture of the next generation real time CNN processor: RTCNNP-v2,” in Proc. Int. Symp. NOLTA, Krakow, Poland, Sep. 2010.

[10] L. Chua and T. Roska, Cellular Neural Networks and Visual Computing: Foundation and Applications. Cambridge, U.K.: Cambridge Univ. Press, 2002.

[11] B. Shi, “Gabor-type filtering in space and time with cellular neural networks,” IEEE Trans. Circuits Syst. I, Fundam. Theory Appl., vol. 45, no. 2, pp. 121–132, Feb. 1998.

[12] E. Saatci and V. Tavsanoglu, “On the optimal choice of integration time-step for raster simulation of a CNN for gray level image processing,” in Proc. IEEE Int. Symp. ISCAS, 2002, vol. 1, pp. I-625–I-628.

[13] V. Tavsanoglu, “Jacobi’s iterative method for solving linear equations and the simulation of linear CNN,” in Proc. 10th Int. Workshop CNNA, Aug. 2006, pp. 1–5.

[14] R. Horn and C. Johnson, Matrix Analysis. Cambridge, U.K.: Cambridge Univ. Press, 1990.

[15] D. Dunn and W. Higgins, “Optimal Gabor filters for texture segmentation,” IEEE Trans. Image Process., vol. 4, no. 7, pp. 947–964, Jul. 1995.

[16] Z. Wang, A. Bovik, H. Sheikh, and E. Simoncelli, “Image quality assessment: From error visibility to structural similarity,” IEEE Trans. Image Process., vol. 13, no. 4, pp. 600–612, Apr. 2004.