
CENTER FOR MACHINE PERCEPTION
CZECH TECHNICAL UNIVERSITY IN PRAGUE

RESEARCH REPORT
ISSN 1213-2365

FPGA-accelerated sliding window classifier with structured features

Ondřej Sychrovský, Martin Matoušek, Radim Šára

{sychro1,xmatousm,sara}@cmp.felk.cvut.cz

CTU–CMP–2013–16

June 2013

Available at ftp://cmp.felk.cvut.cz/pub/cmp/articles/sychrovsky/Sychrovsky-TR-2013-16.pdf

This work was supported by the European Commission under interactIVe, a large scale integrated project, part of the FP7-ICT-246587 for Safety and Energy Efficiency in Mobility. The authors would like to thank all partners within interactIVe for their support.

Research Reports of CMP, Czech Technical University in Prague, No. 16, 2013

Published by

Center for Machine Perception, Department of Cybernetics
Faculty of Electrical Engineering, Czech Technical University
Technická 2, 166 27 Prague 6, Czech Republic
fax +420 2 2435 7385, phone +420 2 2435 7637, www: http://cmp.felk.cvut.cz


FPGA-accelerated sliding window classifier with structured features

Ondřej Sychrovský, Martin Matoušek, Radim Šára

June 2013

Abstract

There are certain classification tasks in computer vision that require the classifier response to be computed in every pixel of an image. When combined with large, complex features, it becomes very challenging to implement such a classifier on a standard PC architecture and achieve real-time performance.

We present an implementation of a car wheel classifier response computation pipeline on an FPGA, built as an instantiation of a generic classification system. An interesting optimization problem concerning processing time and classification performance is addressed. Our implementation runs in real time as a part of a more complex system based on car detection in video data.

This is a long version of our paper presented at the 23rd International Conference on Field Programmable Logic and Applications, 2013, Porto, Portugal.

1 Introduction

A typical visual object detection task requires "computing something everywhere in the image" prior to higher-level visual recognition and object detection processes. One of the simplest forms of this pre-processing uses a set of classifiers, one per primitive object, to compute response maps in all image pixels. The pre-processing then works as an operator transforming the image to another image, which is more suitable for building the object detector in a straightforward way.

For instance, we may want to detect passenger cars in videos of crossing traffic based on a structured model that includes the wheels, the wheelbase, the A, B, C pillars, etc. Given a wheel detector, it is possible to build a part-based model that constructs car hypotheses from wheel pairs and verifies those hypotheses in the input image by "looking at model-predicted locations" [1]. The wheel classifier and detector, Fig. 1, is therefore the part that has to be run everywhere in the image. When evaluating the detection likelihood ratio for the rest of the model, the raw input image is needed, but it is not necessary to access all its pixels. Although this is not the only way to build such a detector, we concentrate on this approach in this paper. Our goal is to pre-compute wheel likelihood maps and use them in a car detector in the way we have just sketched.

In this paper we present an FPGA implementation of an object classifier. We believe our solution is generic and can be used in most localized object detection tasks. In addition, to boost performance under diverse illuminations, our procedure includes learnable adaptive image re-quantization, which is based on local image contrast.

We refer the reader to [1] for a more complete picture of the whole car detection subsystem, which was integrated in a passenger car collision mitigation system and extensively field-tested.



In recent years, the need for real-time applications of object detection has risen, and many hardware systems have been proposed for accelerating object classification and detection.

Most common is the AdaBoost-based classifier adapted from the classification scheme by Viola and Jones for face detection [2]. The majority of the papers use Haar-like features and focus on FPGA implementation of the integral image feature extraction and cascaded Haar feature classification. Gao et al. [3] have implemented a 40-stage Haar classifier cascade on an FPGA, processing 16 features per stage simultaneously. Hiromoto et al. [4] compute features in the first few stages in parallel in order to speed up the initial decision process; the succeeding stages are computed sequentially.

He et al. [5] also use Haar features, but they build a cascade of Artificial Neural Network classifiers. Another popular machine learning and detection tool is the Support Vector Machine, also suitable for FPGA acceleration, as in [6], where again a cascading approach is used. Due to the huge amount of data that must be processed in traditional sliding window approaches, there have also been attempts to reduce the search; in [7] a hardware edge detector has been added that determines whether a sample carries enough information to be passed to the actual classification cascade.

We utilize the AdaBoost-based framework for detection. This learning scheme has an advantage (over, e.g., SVM) in that a criterion for selecting efficient features can be simply employed. In addition to the standard training error, the utilization cost of a computing resource can be taken into account, as sketched below.
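As an illustration of such a criterion (our training procedure is not reproduced here in detail; the function name and the trade-off weight lam below are our assumptions), a resource-aware feature selection step might look like this:

```python
import numpy as np

# Hypothetical sketch: pick the weak classifier that trades weighted
# training error against the utilization cost of a computing resource
# (e.g., how many slices its kernel occupies). 'lam' sets the trade-off.
def select_weak_classifier(weighted_errors, resource_costs, lam=0.05):
    weighted_errors = np.asarray(weighted_errors, dtype=float)
    resource_costs = np.asarray(resource_costs, dtype=float)
    score = weighted_errors + lam * resource_costs  # smaller is better
    return int(np.argmin(score))                    # index of best feature
```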

We place emphasis on using more structured features than simple Haar-like ones: complex, general kernels that require a full convolution to be computed. The properties of the bank of kernels are given by the application and can represent knowledge about the detected objects that would be hard to learn automatically. In our application, we use rotationally invariant templates for wheel detection, as will be discussed next.

We do not employ a classification cascade, for two reasons. First, we use fewer (but more complicated) features. Second, we need the classifier response map (not the decision) for the entire image in order to perform post-processing (spatial aggregation).

2 Classifier with Structured Features

We use a linear classifier that classifies fixed-size image patches $S_{x,y}$ surrounding a particular pixel $(x, y)$. The classifier learned by AdaBoost is composed of a set of weak classifiers, each defined as a triplet $(M_i, t_i, g_i)$, where $M_i$ is its associated kernel (a $w \times h$ matrix, either real, $\mathbb{R}$, or complex, $\mathbb{C}$), $t_i$ is a threshold, and $g_i = \pm 1$ is a sign. A kernel has the same size as an image patch.

Figure 1: Motivation: the wheel detector. (a) The input image. (b) An example of a positive sample, a 25 × 29 wheel patch. (c) The FPGA output: the classifier response map. (d) The map post-processed by spatial aggregation; wheels can be detected as local maxima.



Figure 2: The selected kernels obtained from the AdaBoost learning using a large kernel bank. All fan-like kernels are complex; their real parts are shown. The αi coefficient of the given kernel is encoded by color, from blue (low) to red (high). The kernel values are encoded as −1 = black, 0 = darker, +1 = brighter.

In contrast to traditional approaches using simple Haar-like features, the kernels in our approach can be arbitrarily complicated to suit the needs of any application. We need a classifier of car wheels; see Fig. 1b for an example of a positive sample. Rotational invariance is a natural requirement here: it avoids the need to represent kernels and learn a classifier for all possible wheel rotations, and thus risk overfitting. We construct some of the kernels as complex sinusoids with unit L2 norm. The modulus of the dot product with such a complex kernel is then a rotationally invariant feature.

Specifically, for an image patch $S_{x,y}$, the dot product with the weak classifier's kernel, $d_i = \mathrm{vec}(S_{x,y})^\top \mathrm{vec}(M_i)$, is compared with the threshold to obtain the weak classifier's decision $y_i$:

$$
v_i = \begin{cases} d_i & \text{if } M_i \in \mathbb{R}^{w \times h}, \\ |d_i| & \text{if } M_i \in \mathbb{C}^{w \times h}, \end{cases}
\qquad
y_i = \begin{cases} +1 & \text{if } g_i v_i > t_i, \\ -1 & \text{if } g_i v_i \le t_i. \end{cases}
\tag{1}
$$

The weak classifiers are aggregated using weighting coefficients $\alpha_i$ learned for each of them to form the overall classifier response

$$
r(S_{x,y}) = \sum_{i=1}^{n} \alpha_i y_i. \tag{2}
$$

A boosted classifier does not pose any restrictions on the kernels. On the other hand, implementation on an FPGA does. Therefore the kernel values have been quantized to $\{-1, 0, +1\}$, allowing the multiplications in the dot product to be replaced by simpler logical operations and the L2 norm in (1) by the L1 norm,

$$
v_i = |\Re(d_i)| + |\Im(d_i)| \quad \text{if } M_i \in \mathbb{C}^{w \times h}. \tag{3}
$$

This, however, violates the perfect rotational invariance of the complex kernels. Using a reference Matlab implementation, we compared a classifier with quantized kernels and the L1 norm (used both for learning and classification) to a classifier with kernels quantized to 8 bits and the L2 norm, and verified that this modification has very little effect on the classification error.
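The following minimal NumPy sketch mirrors the computation (1)–(3) with quantized kernels and the L1 modulus. It is our illustrative model (the data layout and names are assumptions), not the Matlab reference implementation mentioned above:

```python
import numpy as np

# Reference sketch of the classifier response, eqs. (1)-(3).
# S: (h, w) image patch; each weak classifier is a dict with
# 'M' (a {-1,0,+1} kernel, or an (Mre, Mim) pair for complex kernels),
# threshold 't', sign 'g' (+1/-1) and weight 'alpha'.
def classifier_response(S, weak_classifiers):
    S = S.astype(np.int32)
    r = 0.0
    for wc in weak_classifiers:
        M = wc['M']
        if isinstance(M, tuple):              # complex kernel
            d_re = int(np.sum(S * M[0]))
            d_im = int(np.sum(S * M[1]))
            v = abs(d_re) + abs(d_im)         # L1 modulus, eq. (3)
        else:                                 # real kernel
            v = int(np.sum(S * M))            # d_i from eq. (1)
        y = +1 if wc['g'] * v > wc['t'] else -1   # decision, eq. (1)
        r += wc['alpha'] * y                  # aggregation, eq. (2)
    return r
```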

Choosing the image patch size of 25 × 29 pixels, we have learned a classifier of wheels consisting of 150 weak classifiers and 55 unique kernels (some weak classifiers share the same kernel). Both limits were fixed for the learning phase. See Fig. 2 for the list of selected kernels, including non-rotational ones.



Figure 3: Overview of the framework. The data source and sink consist of general FIFO IP cores. The signals have the following meaning: data transfers data between blocks, and data valid is a flag stating when the data are valid; these two signals are used between all blocks in the pipeline. The enable signal, going in the opposite direction, signals that the data sink block is ready to receive new data. This signal goes directly to the data source and stops new data. The pipeline has a non-zero latency, so at that moment the pipeline might still contain valid data even after the enable signal has been deasserted; for that reason, there is a FIFO at the end of the pipeline. Also, at a certain point in the pipeline we need to throttle the rate at which new data come in; for that, we also use the enable signal leading to the data source.

Figure 4: The processing pipeline. For more information on each block see its respective section in the text. The values in gray rectangles display the latency in clock cycles of each particular block. Along with the data wires, there are also valid signals that mark valid data between blocks.

3 FPGA Implementation

The classifier from Sec. 2 works exhaustively on all image pixels, producing a dense response map. To achieve real-time performance, we have implemented it on an FPGA.

We have used the Xilinx ML605 evaluation board connected to the host PC over PCI Express. To hide the complexity of the PCIe data transmission, we have used the Xillybus wrapper [8], which provides a simple interface on both the FPGA and host ends.

On the host side, we can open the FPGA as a generic file to write to and read from; on the FPGA side, standard FIFO IP cores serve as both data source and data sink. See Figure 3 for more details.

The PCIe wrapper guarantees delivery of the data that have been sent to or from the FPGA. The pipeline itself has been tested during synthesis to meet the timing constraints on data paths, to ensure that data will not get lost due to signal hazards or collisions.

The whole processing pipeline is summarized in Fig. 4. The data source provides the image and the coefficients for intensity normalization. The classifier consists of the following blocks.

1. Linecache – a sliding window generator.
2. Slice Selector and Batches – cuts the image patch into several slices and outputs only one slice at a time.
3. Normalizer – adjusts the pixel intensities in the image slice and re-quantizes them.
4. Dot product-computing block.
5. Modulus block – the L1 norm (3) if a kernel is complex.
6. Weak Classifiers and Adder Tree – the comparison in (1), multiplication by αi and the final summation in (2).



Figure 5: The Linecache block overview (drawn for an image patch of 4 × 6 pixels for simplicity). An input pixel arrives with each rising clock edge, shifting the previously received pixels by one, acting as a sliding window. The block uses internal row and column counters to determine whether the output data are valid. For the 25 × 29 image patch, the cache needs to store 28 complete lines of the image, plus 25 pixels from the current line. The cache consists of 25 × 29 latches and 28 BlockRAMs.


The blocks implement the general functionality of a classifier. The specific classifier (of wheels in our case) is created by supplying numeric constants (patch dimensions, kernels, slicing, etc.), by instantiating the required number of parallel blocks, and by specifying the routing of some signals. Using Matlab, we semi-automatically generated the appropriate part of the VHDL source.

The individual blocks of the pipeline are described in more detail in the following subsections.

3.1 Linecache

This block works as a sliding window buffer for the input image. The data arrive at a rate of one pixel per clock cycle, each pixel consisting of an 8-bit intensity value and an 8-bit normalization coefficient. The pixels are ordered by rows, i.e., first comes the first image row, then the second, etc. The linecache block buffers up the incoming image pixels and outputs the whole image patch required for further processing.

The output has the same size as the dot product kernels (25 × 29 pixels). The block is implemented as a shift register, consisting of both general flip-flops and BRAMs; see Figure 5 for an illustration. The block has internal row and column counters to know whether the output positions in the register indeed contain a valid image patch. It needs to wait at the beginning of each frame until all 28 lines have buffered up, and on each new line until the first 25 pixels have buffered up, before pronouncing the data valid.

This block has one more important role in the pipeline. The following blocks need more than one clock cycle to process one pixel of data, so the linecache throttles down the rate at which it outputs new data to make sure that there is enough time to perform the computation.
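A behavioural Python model may clarify the buffering logic (a sketch under our own naming; the real block is a VHDL shift register with BRAM line buffers, and the throttling is omitted here):

```python
import numpy as np
from collections import deque

# Behavioural model of the Linecache: pixels arrive one per "clock",
# row by row. Once 28 full lines plus 25 pixels of the current line
# have buffered up, a valid 29 x 25 patch is emitted every cycle.
def linecache(pixel_stream, img_w, patch_w=25, patch_h=29):
    rows = deque(maxlen=patch_h - 1)   # models the 28 BRAM line buffers
    current = []                       # pixels of the current line
    for px in pixel_stream:            # one pixel per clock cycle
        current.append(px)
        col = len(current)
        # data valid: enough complete lines and enough columns buffered
        if len(rows) == patch_h - 1 and col >= patch_w:
            window = [r[col - patch_w:col] for r in rows]
            window.append(current[col - patch_w:col])
            yield np.array(window)     # a patch_h x patch_w patch
        if col == img_w:               # a complete image line received
            rows.append(current)
            current = []
```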

3.2 Slice Selector and Batches

This section describes generic algorithmic optimizations that are needed for real-time performance. This, together with Sec. 4, is the main contribution of this paper.

There are two issues concerning the amount of data to be processed. First, the number of kernels is usually larger than the number of dot product-computing blocks (DPCBs, Sec. 3.4).



Figure 6: Both the image patch and all kernels have been divided into four slices (with borders shown as thick red lines). (a) A 25 × 29 image patch divided into four slices (1–4). (b) An example kernel (black pixels mean nonzero values) occupying slices 2 and 3. (c) The Kernel-Slice Incidence Table. A single row in the table represents a single kernel; a '•' mark means that the kernel has some non-zero values in the slice determined by the column. The example kernel (b) occupies slices 2 and 3.

The wheel classifier uses 55 kernels; the complex ones are doubled (real and imaginary parts), so in total 77 dot products must be computed, but in our case we are able to place only 30 DPCBs in the FPGA fabric.

Second, processing a 25 × 29 image patch at once is hard to implement efficiently. There are several generic reasons for that:

• Too many parallel routes (25 × 29 × n bits) would make achieving the timing closure very difficult.
• There are not enough dedicated resources on the FPGA, namely the BRAMs that are used to store the kernels. More specifically, the number of BRAM data pins is insufficient for us to have all kernel elements visible at once.
• A non-negligible number of kernels cover a significantly smaller sub-region of the image patch. Processing speed can benefit from some way of discarding the pixels not covered by a kernel.

We solved the first issue by computing the kernel dot products in batches, each batch computing several dot products in parallel, with the batches evaluated in sequence. To address the second issue, we propose splitting the image/kernel window into several slices and processing them sequentially within a particular batch.

Since the whole image patch has already been cached, the slice selector may select pixels completely arbitrarily. The slice pixels do not need to be adjacent, nor is there any other limitation on which pixels may belong to which slice. The slices may also overlap, and their union need not cover the whole patch if the kernels do not require it.

We have chosen a division into four slices, each containing 182 pixels. Since our wheel classifier contains mostly circular kernels, we have set the slices to be circular, with the origin aligned to the origin of the kernel pattern; see Fig. 6. Of course, this is an application-specific choice.

The kernel distribution into the batches is discussed later in Sec. 4. Obviously, when none of the kernels in a single batch needs some of the slices, those slices can be discarded in that batch.

The slice selector block is a multiplexer that outputs one of the four slices at each clock cycle, in a predefined (but arbitrary) order. This sequential code is specific for the selected set of kernels and slice shapes. The blocks following this one know the order as well and use it to select the correct slice of the correct kernel. Our specific wheel classifier uses the slice sequence (1 | 1, 2 | 1, 2, 3, 4), where '|' separates the batches.
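The circular slices and the selector sequence can be prototyped as follows. This is a sketch only: the mask construction approximates the slices of Fig. 6, not the exact 182-pixel shapes used on the chip, and all names are ours:

```python
import numpy as np

# Four roughly equal-sized concentric ring masks over the 29 x 25 patch,
# centered on the kernel origin, approximating the slices of Fig. 6.
def make_circular_slices(patch_h=29, patch_w=25, n_slices=4):
    yy, xx = np.mgrid[0:patch_h, 0:patch_w]
    r = np.hypot(yy - (patch_h - 1) / 2, xx - (patch_w - 1) / 2)
    bounds = np.quantile(r, np.linspace(0.0, 1.0, n_slices + 1))
    bounds[-1] += 1e-9                       # include the outermost pixel
    return [(r >= lo) & (r < hi) for lo, hi in zip(bounds[:-1], bounds[1:])]

# The slice selector: a multiplexer emitting one slice per clock in the
# predefined batch order (1 | 1,2 | 1,2,3,4), written zero-based here.
def slice_selector(patch, slices, sequence=(0, 0, 1, 0, 1, 2, 3)):
    for s in sequence:
        yield s, patch[slices[s]]            # slice id and its pixel vector
```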



Figure 7: The dot product-computing block overview. The input is a single slice of an image patch. The input pixels are multiplied with the corresponding kernel coefficients and then added together in the adder tree. The output of the adder tree is the dot product of a single slice. The n-adder sequentially receives the dot products of all the slices and sums them up; the result is therefore the dot product of the whole image patch with the given kernel.

3.3 Normalizer

In order to reduce the sensitivity of the classifier to the local scene illumination level (a multiplicative effect), an adaptive image patch normalization is performed. The coefficients for each image patch are computed in advance on the host PC, based on the mean intensity of the patch, using a real-time integral image technique. Then, on the FPGA, each pixel in the image slice (8 bits) is multiplied by the given value (8 bits) and the result is re-quantized to 4 bits. The re-quantization helps with routing and reduces the amount of data to be processed.

All 182 of these multiplications are done at the same time, and they all use dedicated DSP blocks on the FPGA.
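In NumPy terms, the per-slice arithmetic is the following (a sketch; the exact fixed-point scaling and rounding used on the chip are not published here, so keeping the top bits of the product is our assumption):

```python
import numpy as np

# Normalizer arithmetic: an 8-bit pixel times an 8-bit normalization
# coefficient gives a 16-bit product, which is re-quantized to 4 bits
# by keeping its top bits. On the FPGA all 182 slice pixels pass
# through dedicated DSP blocks in parallel; NumPy vectorizes the same.
def normalize_slice(pixels_u8, coeff_u8, out_bits=4):
    prod = pixels_u8.astype(np.uint32) * np.uint32(coeff_u8)
    return (prod >> (16 - out_bits)).astype(np.uint8)   # values 0..15
```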

3.4 Dot Product-Computing Block (DPCB)

This is the main part of the pipeline. Its input is one slice of the image patch at a time. The output is the dot product of the entire image patch with an entire kernel. The kernel is stored internally in a BRAM. See Fig. 7 for an overview of the block.

The first "multiplication" sub-block multiplies the image pixels with the kernel pixels. Because the kernel values are limited to {−1, 0, +1}, we do not perform an actual multiplication. Instead, we only change the sign, set the pixel value to zero, or forward it unchanged.

The Adder tree sums the results of the "multiplications" to obtain the sub dot product, i.e., the dot product of a single slice with its corresponding kernel slice. The adder tree is constructed in such a way that in each clock cycle all pairs of values on level n are added and stored in the higher level n + 1 of the tree. There are as many levels as needed to have a single result at the top of the tree. The valid flag is pushed up the tree as well, to know when the final value is valid. See Figure 8 for an illustration.

The last sub-block is the N-adder block. It sequentially adds the sub dot products to compute the final dot product of the current image patch with one complete kernel.
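Behaviourally, one DPCB reduces to a sign select followed by the two summations (a Python sketch with our own naming, not the VHDL):

```python
import numpy as np

# One dot product-computing block: the "multiplication" by a kernel in
# {-1, 0, +1} is a sign select, the adder tree reduces one slice per
# clock, and the n-adder accumulates the sub dot products.
def dpcb(slice_stream, kernel_slices):
    """slice_stream: pixel vectors, one slice per clock;
       kernel_slices: the matching {-1,0,+1} kernel vectors."""
    acc = 0                                         # the n-adder register
    for pixels, k in zip(slice_stream, kernel_slices):
        p = pixels.astype(np.int32)
        selected = np.where(k > 0, p, np.where(k < 0, -p, 0))
        acc += int(selected.sum())                  # adder tree result
    return acc            # dot product of the whole patch with the kernel
```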

The DPCB has been placed into the FPGA fabric 30 times in our case. All the blocks receive the same input (the same slice of the image patch), but each block has a different set of kernels stored in its BRAM, so each block computes a different dot product.

In theory, only 26 blocks should be sufficient (26 · 3 = 78 > 77). However, the particular set of kernels introduces a certain inefficiency when combining them to minimize the number of clock cycles per pixel. It is therefore necessary to use more DPCBs than would be ideal. See Section 4 for more details.

3.5 Modulus Block

The modulus block computes the L1 norm of a complex number as defined by equation (3). This block has one input for the real and one for the imaginary part. We need to assign complex kernels to the DPCBs so that the modulus block always receives the real and imaginary parts of the same dot product at the same time.



Figure 8: The Adder tree (an example with only 13 numbers to add). On each level, all pairs are added together and stored in the next level. If there is an odd number of elements on any given level, one of them is carried up without any change. The valid input is stored and carried up the tree along its respective data values, and signals when the final result is computed. The idea is that we want to supply new input values to the lowest level every clock cycle, so the old data must be processed by then, but we do not need to add them all (and have the result ready) in exactly one clock cycle; that would lead to an enormous use of logic cells. The whole addition process takes n clock cycles, where n is the number of levels in the adder tree.
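A register-level model of this pipelining may help (our sketch; 'None' stands in for a deasserted valid flag, and the register naming is an assumption):

```python
import math

# Pipelined adder tree: every clock, each level adds the pairs stored
# in the level below and latches the sums, so a fresh input vector can
# enter every cycle and its sum emerges n cycles later (n = levels).
def make_tree(width):
    n_levels = math.ceil(math.log2(width))
    return [None] * (n_levels + 1)        # regs[0] holds the fresh inputs

def tree_clock(regs, new_inputs):
    for lvl in range(len(regs) - 1, 0, -1):        # shift top to bottom
        below = regs[lvl - 1]
        if below is None:
            regs[lvl] = None                       # propagate "not valid"
            continue
        summed = [below[i] + below[i + 1]
                  for i in range(0, len(below) - 1, 2)]
        if len(below) % 2:                         # odd element carried up
            summed.append(below[-1])
        regs[lvl] = summed
    regs[0] = list(new_inputs) if new_inputs is not None else None
    top = regs[-1]
    return top[0] if top else None                 # valid final sum or None
```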

For kernels that contain only real values, the modulus block is not used and the signed value from the DPCB is forwarded directly (with a delay equal to that of the modulus block) to the weak classifier.

3.6 Weak Classifier and Adder Tree

The weak classifier, Sec. 2, is defined as a triplet of a sign gi, a threshold ti and a corresponding kernel Mi; there is also a weighting coefficient αi attached to each one. Recall that in our wheel classifier we have 150 weak classifiers, each one having a different sign and threshold. There are 55 unique kernels in total; several weak classifiers share the same kernel.

The i-th weak classifier block waits for the batch processing its kernel Mi to finish. It applies the sign gi and compares the value to the threshold ti. Based on the comparison result, the sign of αi is set and this value is sent to the output.

The wait-time, sign, threshold and αi values are hardwired into each block during synthesis. We have put all 150 weak classifier blocks in the FPGA fabric. Each of these blocks is connected either to one modulus block, or directly (via a delay) to one DPCB, depending on the kernel it uses.

Once all the weak classifiers' responses have been determined, they are summed up using the final adder tree.

4 Kernel Slicing Scheme

In this section we describe the key component of a formal performance optimization of the design. The Kernel-Slice Incidence Table (Fig. 6c) defines for each kernel which slices must be used in the computation. When a kernel contains zero values in all pixels of a particular slice, this slice can be discarded from the computation, resulting in faster processing. Given a set of kernels and the shapes of all slices, the derivation of the Incidence Table is straightforward.

As described in Sec. 3.2, since the number of kernels is larger than the number of DPCBs (30 < 77 in our case), the kernels must be separated into batches that are run in sequence.



Batch 1 (slices processed: 1)

  kernel                           slice: 1  2  3  4
  1–8 (complex, two rows each)            •
  23, 24, 25, 26, 27, 29, 39              •
  7 unused rows                           -
  req. FPs total: 23

Batch 2 (slices processed: 1, 2)

  kernel                           slice: 1  2  3  4
  9–17 (complex, two rows each)           •  •
  30, 31, 32, 33, 34                      •  •
  41                                      -  •
  6 unused rows                           -  -
  req. FPs total: 24

Batch 3 (slices processed: 1, 2, 3, 4)

  kernel                           slice: 1  2  3  4
  18–22 (complex, two rows each)          •  •  •  -
  28                                      •  •  •  •
  35                                      •  •  •  -
  36                                      -  •  •  -
  37                                      -  •  •  •
  38                                      •  •  •  -
  40                                      -  •  •  •
  42, 43                                  -  •  •  -
  44                                      •  •  •  -
  45                                      -  -  •  -
  46, 47                                  •  •  •  -
  48, 49                                  •  •  •  •
  50                                      -  •  •  •
  51                                      -  -  •  •
  52                                      -  •  •  •
  53                                      •  •  •  •
  54, 55                                  -  •  •  •
  req. FPs total: 30

Table 1: A possible batch assignment constructed from the Incidence Table for the set of kernels in Fig. 2 and 30 DPCBs. The '•' marks slices with nonzero entries, '-' marks slices containing all zeros that are processed anyway, and empty cells mark slices containing all zeros that are not processed. The slices that must be processed sequentially are (1 | 1, 2 | 1, 2, 3, 4). In total, the 30 DPCBs evaluate 7 · 30 = 210 slices, while only 158 of them (the number of '•' marks) contain nonzero data, i.e., the utilization is 75%. Note that both the real and the imaginary parts of a complex-valued kernel must be processed at the same time; the doubled rows must not be separated.

All kernels processed in a single batch have to use the same slices. This means that $N^B_i$, the set of slices processed in a particular batch $i$, is the union of the slices required by all kernels grouped in that batch. The total number of slices $N = \sum_i |N^B_i|$ that must be processed in sequence determines the overall processing time of an image patch. It is thus advantageous to group together kernels that require the same or a similar set of slices, so that $N$ is minimized. See Tab. 1 for an example of grouping the kernels from Fig. 2.
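As an illustration of this optimization (we used a semi-manual exhaustive search; the greedy heuristic below is our illustrative substitute, not the actual procedure), one can pack kernels into batches and measure the resulting N:

```python
# Greedy sketch: pack kernel rows into batches of at most n_dpcb DPCB
# rows so that the sum of per-batch slice unions N stays small. Complex
# kernels contribute two rows that must stay in the same batch, so we
# count rows per kernel rather than splitting them.
def greedy_batches(incidence, rows_per_kernel, n_dpcb=30):
    """incidence: one frozenset of slice ids per kernel;
       rows_per_kernel: 2 for complex kernels, 1 for real ones."""
    order = sorted(range(len(incidence)),
                   key=lambda k: (len(incidence[k]), sorted(incidence[k])))
    batches, cur, used = [], [], 0
    for k in order:                        # fewest-slice kernels first
        if used + rows_per_kernel[k] > n_dpcb:
            batches.append(cur)
            cur, used = [], 0
        cur.append(k)
        used += rows_per_kernel[k]
    if cur:
        batches.append(cur)
    # total slice count N = sum over batches of |union of slice sets|
    N = sum(len(frozenset().union(*(incidence[k] for k in b)))
            for b in batches)
    return batches, N
```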

To summarize, the Kernel Slicing Scheme is defined by
1. the number and shape of the slices,
2. the Kernel-Slice Incidence Table, and
3. the distribution of kernels to batches.

In the current implementation we have learned the classifier independently, manually created the slice shapes, and used a semi-manual exhaustive search for distributing the kernels into batches. In Tab. 1 it is clearly visible that the results are not optimal. One could replace the kernels (probably exceeding the preset number of 55 kernels) to make the overall code length N shorter, speeding up the processing while keeping the classifier performance. Alternatively, one could add more kernels (and weak features) to use the free DPCBs, keeping the processing time but improving the classifier performance. This could easily be incorporated into the weak feature selection method during AdaBoost learning.

In general, all parts of the Kernel Slicing Scheme, as well as the classifier itself, should be optimised with respect to the overall system performance (detection rates) and processing time, given the kernel bank, the training data for the classifier, and the hardware constraints. This is a challenging, yet unsolved, problem that needs further research.



num. blocks   clk cycles    LUTs    BRAMs   DSPs
     30            7       73202     371    364
     24            8       61086     305    364
     22            9       57820     283    364
     20           10       53445     261    364
     17           11       47976     228    364
     16           12       42924     217    364
     15           13       43252     206    364
     14           14       43280     195    364
     13           15       39577     184    364
     12           16       37887     173    364
     11           17       34492     162    364
     10           19       32456     151    364

Table 2: Resource consumption for different numbers of DPCBs, for an image patch of size 25 × 29. Each block adds 11 BRAMs and about 2000 LUTs. When decreasing the number of blocks, the time required to process one pixel increases. For larger FPGAs, it would be possible to place even more than 30 DPCBs.

                      overall                 per DPCB
patch size      LUTs    BRAMs   DSPs      LUTs    BRAMs
  9 × 15       16803      67      68       369       1
 15 × 19       26506     111     144       731       2
 19 × 25       38506     177     238      1297       7
 25 × 29       53445     261     364      1952      11
 29 × 35       76541     347     508      2720      15

Table 3: Resource consumption for different image patch sizes, with a fixed number of 20 DPCBs.

5 Performance, Scaling and Latency

The processing time depends on the size of the input image. Since we spend N = 7 clock cycles processing a single pixel, for a 951 × 400 pixel image we currently have a cycle time of 21.3 milliseconds per image (951 × 400 · N / 125 MHz = 21.3 ms with N = 7). The latency of the pipeline for a single pixel (the FPGA part only, not including the DMA transfer to the host) is shown in Fig. 4. For the clock frequency of 125 MHz used, the total latency is 240 ns. This is negligible compared to the cycle time, making the effective latency equal to the cycle time.

Our proposed architecture is scalable: it is possible to adjust the number of DPCBs to suit the size of a given FPGA (see Table 2). It might also be desirable to change the size of the image patch, which also greatly influences the design; see Table 3 for some examples. On the other hand, the size of the input image has almost no influence on the chip layout.

6 Design Challenges

During the development process we had to address several difficulties of designing on an FPGA. In order to reach the required high degree of parallelism, we utilized 70% of the hardware slices of the FPGA chip. We also needed to transfer a lot of data around, so the routing turned out to be difficult. Together, this made achieving the timing closure a hard task.

It proved useful to use two slice selectors instead of one, each outputting a different set of slices, and to distribute the DPCBs between them (there would be two slicing areas, as seen in Figure 4).




We have tested this modification and achieved slightly faster processing times using slightly fewer hardware resources: the Slicer 1 code is (1 | 1 | 1, 2, 3, 4) and Slicer 2 outputs the slices (1 | 1, 2 | 1, 2, 3), reducing the sequence length (and proportionally the processing time) from 7 to 6 slices. This modification brings yet another level of complexity to the optimization problem mentioned in Sec. 4.

7 Conclusions

In this paper we have proposed an approach for implementing a dense linear classifier on an FPGA. The top-level scheme is quite generic; the specific instantiation is then tuned to a particular application. We have also discussed an interesting optimization problem that would allow an efficient trade-off to be chosen between processing speed and classification performance.

The architecture we have proposed is scalable; it is up to the designer how many blocks are needed and can be placed in the FPGA fabric. Also, the generic top-level scheme allows the designer to change the classification task simply by providing different kernels.

The performance evaluation on a Xilinx Virtex-6 ML605 board showed that the design is able to run fast, making it well suited for real-time applications. The proposed wheel classifier response map computation is used in a car detector running in an intelligent vehicle as a part of a more complex collision mitigation system [1], which requires a processing cycle of 20–30 fps and a maximum latency of 200 ms.

Acknowledgement

This work was supported by the European Commission under interactIVe, a large scale integrated project, part of the FP7-ICT-246587 for Safety and Energy Efficiency in Mobility. The authors would like to thank all partners within interactIVe for their support. Special thanks go to Stefan Wonneberger at Volkswagen Group Research.

References

[1] P. Heck, J. Bellin, M. Matousek, S. Wonneberger, O. Sychrovsky, R. Sara, and M. Maurer, "Collision mitigation for crossing traffic in urban scenarios," in Proc. IEEE Intelligent Vehicles Symposium, 2013.

[2] P. Viola and M. J. Jones, "Robust real-time face detection," Int. J. of Computer Vision, vol. 57, no. 2, pp. 137–154, 2004.

[3] C. Gao and S.-L. Lu, "Novel FPGA based Haar classifier face detection algorithm acceleration," in Proc. Int. Conf. on Field-Programmable Logic and Applications, 2008, pp. 373–378.

[4] M. Hiromoto, H. Sugano, and R. Miyamoto, "Partially parallel architecture for AdaBoost-based detection with Haar-like features," IEEE Transactions on Circuits and Systems for Video Technology, vol. 19, no. 1, pp. 41–52, 2009.

[5] C. He, A. Papakonstantinou, and D. Chen, "A novel SoC architecture on FPGA for ultra fast face detection," in Proc. IEEE International Conference on Computer Design, 2009, pp. 412–418.

[6] M. Papadonikolakis and C. Bouganis, "Novel cascade FPGA accelerator for support vector machines classification," IEEE Transactions on Neural Networks and Learning Systems, vol. 23, no. 7, pp. 1040–1052, July 2012.



[7] C. Kyrkou, C. Ttofis, and T. Theocharides, "FPGA-accelerated object detection using edge information," in Proc. Int. Conf. on Field Programmable Logic and Applications, 2011, pp. 167–170.

[8] Xillybus, “FPGA IP core for easy DMA over PCIe,” http://www.xillybus.com/.
