

Detection and Handling of Occlusion in an Object Detection System

R.M.G. Op het Veld^a, R.G.J. Wijnhoven^b, Y. Bondarau^c and Peter H.N. de With^d

^a,b ViNotion B.V., Horsten 1, 5612 AX, Eindhoven, The Netherlands; ^a,c,d Eindhoven University of Technology, Den Dolech 2, 5612 AZ, Eindhoven, The Netherlands

ABSTRACT

Object detection is an important technique for video surveillance applications. Although different detection algorithms have been proposed, they all have problems in detecting occluded objects. In this paper, we propose a novel system for occlusion handling and integrate it in a sliding-window detection framework using HOG features and linear classification. The occlusion handling is obtained by applying multiple classifiers, each covering a different level of occlusion and focusing on the non-occluded object parts. Experiments show that our approach based on 17 classifiers obtains an increase of 8% in detection performance. To limit computational complexity, we propose a cascaded implementation that increases the computational cost by only 3.4%. Although the paper presents results for pedestrian detection, our approach is not limited to this object class. Finally, our system does not need an additional training dataset covering all possible types of occlusions.

Keywords: Object Detection, Occlusion Handling, Histogram of Oriented Gradients (HOG)

1. INTRODUCTION

In this paper we focus on object detection, in particular but not exclusively, the detection of humans in the domain of video surveillance. Detection of objects is a challenging task because of large variations in lighting (sun, shadows), object position and size, object deformations (shape) and large intra-class variations in object and background. Although the quality of detection algorithms is constantly improving and partially solves the previous challenges, state-of-the-art methods still struggle to detect objects that are occluded or appear in unusual poses.1 Occlusion is a particular problem that differs from the previous challenges, since it takes away part of the object information. The variation and amount of occlusion form a problem of their own, which has not been broadly studied, so we specifically investigate the handling of occlusions in this paper. Some typical occlusions are visualized in Figure 1.

Popular object detection algorithms use a sliding-window detection stage, where a sliding classification window is evaluated at different positions in the image. At each search position, the local image region is classified into object/background. To remove variations in contrast and lighting conditions, the raw intensity values of the image pixels are typically first transformed into an invariant feature space. A popular feature descriptor for object characterization is the Histogram of Oriented Gradients (HOG).2 The obtained feature description is then classified by a linear Support Vector Machine (SVM),3 which is selected for its simplicity and good performance.

A well-known dataset for occlusion experiments is the Caltech Pedestrian dataset, which focuses on pedestrian detection in an urban environment. Here, over 70% of all pedestrians are occluded in at least one video frame. Statistics on these occlusions show that 95% of all occlusions in this dataset occur from the bottom, the right and the left of the pedestrians.4,5 This aspect will be specifically addressed later in this paper.

Our work concentrates on improving an existing real-time sliding-window object detection system that uses linear classification (SVM). To this end, we explore the detection of occluded regions and compare this with the detection of regions without occlusion. The first approach focuses on the detection of occlusions using the classification score, whereas the second approach focuses on the detection of non-occluded regions using multiple classifiers in parallel, each dealing with a different partial occlusion. We will evaluate both approaches and show that the latter approach is better suited.

The remainder of the paper is organized as follows. We introduce related work in Section 2. Our existing object detection system is described in Section 3. Then, Section 4 describes our implementations of the two evaluated approaches. Section 5 outlines the applied datasets and experimental results. Experimental results are discussed in Section 6, followed by the conclusions in Section 7.


Figure 1. Typical occlusions in crowded scenes with pedestrians.

2. RELATED WORK AND DETAILED PROBLEM STATEMENT

Approaches from the literature for handling such occlusions are mainly divided into two groups. The first group focuses on the detection of occlusions, whereas the second group concentrates on the detection of non-occluded regions.

Detection of occlusions is proposed in the following studies. Pixel-level segmentation methods6–10 obtain a good detection performance, but result in high computational cost. Such methods distinguish different objects based on the segmentation outcome and use the pixel data of the segmented areas to classify the objects. To reduce computational cost, the segmentation resolution can be reduced from pixel to e.g. cell-level segmentation.11 These techniques require pixel-level annotated data for training, which is preferably avoided. In a more detailed approach, Monroy and Ommer7 propose a model-driven method to learn object shapes without requiring segmented training data. Object models for detection are learned by explicit representations of object shapes and their segregation from the background. However, all learned shape models have to be matched with every detection window, which is not feasible in real time.

Wang et al.12 propose a technique that exploits classification scores per cell in a sliding-window detection system. They integrate the concept of object segmentation within the detection window and ignore the influence of occluded areas in the classification score for this window. Based on the localization of the occluded region, either an upper-body or a lower-body classifier is evaluated on the non-occluded regions. This method is feasible for real-time requirements and only requires bounding-box annotated training data, instead of pixel-segmented training data. This work will be addressed further in Section 4.1. Marín et al.13 expand upon the work of Wang et al.12 Instead of using a partial classifier for ambiguous detections, these detections are evaluated using an ensemble of random subspace classifiers. A validation set is used to select the best subset of classifiers, which forms a considerable drawback, as this dataset needs to cover all possible types of occlusions.

The detection of non-occluded regions forms the second type of approach. An evident first step is to extend the image description with more informative features. As an example, HOG has already been extended with Local Binary Patterns (LBP),12 Color Self-Similarity (CSS)14 and Histograms Of Flow (HOF).14,15 Although this approach is useful, it emphasizes the object itself rather than the occlusions, so that it has limited or no impact on handling the occlusions. As an alternative, part-based detectors16–20 are partially robust against occlusions, because non-occluded parts are detected normally and occluded parts are missed. Tang et al.21 propose to train multiple occlusion-aware classifiers for pairs of specific combinations of occluding and occluded objects (e.g. person-to-person occlusion). Although this approach provides good results for pairs of pedestrians, different classifiers have to be trained for every other possible type of occlusion, which is not feasible for a real-world application. Mathias et al.22 also use multiple classifiers that are evaluated for different sizes and positions of occlusions.

Summarizing, we have adopted the work of Wang et al.12 as a starting point, because it is based on the generic sliding-window detection approach and requires only limited additional complexity. To evaluate the suitability, we have also designed and implemented a system that focuses on the detection of non-occluded regions using multiple classifiers, which is in line with Mathias et al.22 This second method is chosen to have an alternative approach that is based on the same detection architecture, but uses a conceptually different solution.


[Figure 2: block diagram. Training stage (offline): positive and negative images → compute features (Block A) → train classifier (Block B). Detection stage (online): compute features (Block C) → sliding window (Block D) → classification → threshold t_class (Block E) → occlusion handling / merge detections (Block F) → threshold t_final → final detections (Block G).]

Figure 2. Schematic overview of our object detection system. Training is performed offline in the top-left part (Blocks A and B) and normal (online) detection is depicted in the lower part (Blocks C–G).

3. BASELINE OBJECT DETECTION SYSTEM

The baseline object detection system is based on Histogram of Oriented Gradients (HOG) by Dalal and Triggs.2

The input image is first transformed into an invariant feature space that models object shape using orientation histograms. Object detection is performed by sliding a detection window over the image and classifying the feature description of this window into object/background. To detect objects of different sizes, the detection process is repeated for scaled versions of the input image. Finally, the window-level detections belonging to the same object are merged into final detections. The total system is depicted in Figure 2.

Prior to object detection, the system is trained using example images of objects (positive) and background (negative). The images are transformed into the HOG feature space (Block A) and the resulting feature descriptors are used to train a classifier (Block B). After training (normal system operation), input images are first converted into the HOG feature space (Block C). The trained classifier is then used to detect objects in the sliding-window detection stage (Block D), which evaluates the image features. At each image search position, the classifier returns a confidence score, which is thresholded (t_class) to obtain the detections (Block E). Because an object will be detected at several search positions and at several scales, all detections are merged by spatial clustering (Block F). Finally, these merged detections are also thresholded (t_final) to obtain the final detections (Block G).
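To make the online stage concrete, the following Python sketch shows a minimal sliding-window detection loop over a pre-computed image pyramid. The function names, the pyramid representation and the window size are our own illustrative assumptions, not the authors' implementation.

```python
def sliding_window_detect(pyramid, classify_window, win=(128, 64),
                          stride=8, t_class=0.0):
    """Sliding-window detection over an image pyramid (sketch).

    pyramid         : list of (scale, image) pairs, where each image is a
                      2-D array and scale maps back to original coordinates
    classify_window : maps a win-sized crop to a confidence score
    Returns detections as (x1, y1, x2, y2, score) in original coordinates.
    """
    detections = []
    for scale, img in pyramid:
        for y in range(0, img.shape[0] - win[0] + 1, stride):
            for x in range(0, img.shape[1] - win[1] + 1, stride):
                score = classify_window(img[y:y + win[0], x:x + win[1]])
                if score > t_class:  # threshold t_class (Block E)
                    detections.append((x * scale, y * scale,
                                       (x + win[1]) * scale,
                                       (y + win[0]) * scale, score))
    return detections
```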

To describe the input image, we use the HOG feature transform.2 The image is divided into a spatial grid of cells of size 8 × 8 pixels. Gradient orientation information is calculated for each pixel and combined in a histogram for each cell. Orientation information is quantized into 9 bins and weighted by the gradient magnitude. Pixel gradients are calculated by filtering with [1, 0, −1] filters in both spatial dimensions. To normalize for image contrast, the histograms are normalized using L2 energy normalization. Each cell histogram is normalized multiple times, using the energy of all blocks of size 2 × 2 cells of which the cell is part. The feature vectors for all blocks belonging to the same detection window are concatenated and form the feature descriptor for this window. A visual representation of the complete process is shown in Figure 3.

[Figure 3: intensity values → gradient calculation → orientations → combined orientations → per-cell histogram h_i,j → concatenated feature vector x_1, x_2, ..., x_N.]

Figure 3. Visual illustration of the HOG feature calculation process.
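As an illustration of the described computation, the NumPy sketch below follows the steps above: [1, 0, −1] gradients, 9-bin magnitude-weighted cell histograms and L2 normalization over 2 × 2-cell blocks. It is our own simplified version (hard binning without interpolation, no gamma correction), not the authors' code.

```python
import numpy as np

def hog_descriptor(gray, cell=8, bins=9, eps=1e-6):
    """Simplified HOG: [1,0,-1] gradients, 9 unsigned orientation bins,
    8x8-pixel cells, L2 normalization over 2x2-cell blocks."""
    gray = gray.astype(np.float64)
    # Pixel gradients with [1, 0, -1] filters in both spatial dimensions.
    gx = np.zeros_like(gray); gy = np.zeros_like(gray)
    gx[:, 1:-1] = gray[:, 2:] - gray[:, :-2]
    gy[1:-1, :] = gray[2:, :] - gray[:-2, :]
    mag = np.hypot(gx, gy)
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180.0        # unsigned orientation

    h, w = gray.shape
    cy, cx = h // cell, w // cell
    hist = np.zeros((cy, cx, bins))
    bin_idx = np.minimum((ang / (180.0 / bins)).astype(int), bins - 1)
    for i in range(cy):                                  # magnitude-weighted voting
        for j in range(cx):
            b = bin_idx[i*cell:(i+1)*cell, j*cell:(j+1)*cell].ravel()
            m = mag[i*cell:(i+1)*cell, j*cell:(j+1)*cell].ravel()
            hist[i, j] = np.bincount(b, weights=m, minlength=bins)

    # Each 2x2-cell block is L2-normalized; a cell is thus normalized
    # multiple times, once per block it belongs to.
    blocks = []
    for i in range(cy - 1):
        for j in range(cx - 1):
            v = hist[i:i+2, j:j+2].ravel()
            blocks.append(v / np.sqrt(np.sum(v**2) + eps**2))
    return np.concatenate(blocks)                        # window feature vector

# Example: a 64x128 window yields 7*15 blocks of 36 values = 3780 dimensions.
x = hog_descriptor(np.random.randint(0, 255, (128, 64)).astype(np.uint8))
print(x.shape)  # (3780,)
```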

To obtain object detections, the feature space is searched in a sliding fashion and the feature vector at each position is classified into object/background. During training, the linear classification decision boundary is created using a Support Vector Machine (SVM) classifier. After training, the resulting linear classifier is used for object detection, where the linear classification function is f(x) = ω^T · x + b, with b the bias, ω the weight vector (normal of the hyperplane) and x the concatenation of all features for the detection window (feature vector).
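A minimal sketch of training and applying such a linear classifier, assuming scikit-learn's LinearSVC and randomly generated stand-in descriptors (the C value and threshold are placeholders):

```python
import numpy as np
from sklearn.svm import LinearSVC

# Hypothetical training data: HOG descriptors of object (1) and background (0) crops.
X_train = np.random.randn(200, 3780)
y_train = np.array([1] * 100 + [0] * 100)

svm = LinearSVC(C=0.01).fit(X_train, y_train)
w, b = svm.coef_.ravel(), float(svm.intercept_[0])   # hyperplane normal and bias

def classify_window(x, t_class=0.0):
    """Linear classification function f(x) = w^T x + b, thresholded by t_class."""
    score = float(np.dot(w, x)) + b
    return score, score > t_class

score, is_object = classify_window(np.random.randn(3780))
```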


4. APPROACH

The previously described sliding-window-based detection system is now extended with occlusion handling, where we focus on two different approaches. The first occlusion-handling approach detects occlusions using the cell scores of a full-window classifier. The second approach introduces a novel algorithm that focuses on designing a more occlusion-robust system by combining detections from multiple different classifiers.

4.1 Approach 1: occlusion detection

The first approach is based on the work of Wang et al.12 In case the score of the full-window classifier is ambiguous (t_lower < score < t_higher), an occlusion may be present, and the system uses the scores of the individual cells to segment the detection window into positive and negative regions. Negative regions typically do not contain an object and are ignored. Positive regions, possibly containing an object, are evaluated using a second, partial classifier. Based on the score of the partial classifier, either the score of the full-window classifier (s_glob), the score of the partial classifier (s_part), or a combination of both is used. If the score of the partial classifier is not sufficiently reliable (< t_conf), the global and partial classifiers are combined by a weighted combination of the scores, according to score = w_part · s_part + w_glob · s_glob.
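The decision logic of this approach can be summarized as follows. This is a sketch in our own notation; the segmentation and partial-classifier callables are placeholders, and the threshold defaults are the values used later in Section 5.4.

```python
def combined_score(s_glob, partial_scores, segment_positive,
                   t_lower=-1.0, t_higher=1.0, t_conf=1.5,
                   w_part=0.3, w_glob=0.7):
    """Score-combination logic of Approach 1 (simplified sketch).

    s_glob           : score of the full-window classifier
    partial_scores   : callable returning the partial-classifier score
                       for the positive (non-occluded) region
    segment_positive : callable segmenting per-cell scores into regions
    """
    if not (t_lower < s_glob < t_higher):
        return s_glob                     # unambiguous: keep the global score
    region = segment_positive()           # Mean-shift segmentation of cell scores
    s_part = partial_scores(region)       # evaluate partial classifier(s)
    if s_part >= t_conf:
        return s_part                     # reliable partial classifier wins
    return w_part * s_part + w_glob * s_glob   # otherwise, weighted combination
```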

Negative cells decrease the score for the complete window, which can result in a misclassified sample. To separate negative cells from positive cells, the score per cell first needs to be computed. The score for a window is computed using the linear classification function from Section 3. The weight vector ω consists of weights for each individual cell. This vector is multiplied with the feature vector for the detection window, resulting in a score per cell. In order to obtain the full-window classification score, a bias b is added, either to the full window or per cell.12 The computed cell bias values are shown in Figure 4(b). Using these bias values per cell, the classification scores per cell can be calculated. Based on these scores, cells can be merged into different positive or negative regions. Negative regions most likely indicate occlusion or clutter, and it is expected that ignoring these regions will improve the detection score. The merging of the noisy per-cell classification scores is implemented with a Mean-shift23 segmentation algorithm. For this purpose, two kernels are employed: a Gaussian spatial kernel and a linear kernel, as specified by

$$ G(d) = e^{-d^2 / (2\sigma^2)}, \qquad G(s_1, s_2) = \begin{cases} 0.5 + w_{ms} \cdot s_2 & \text{if } s_1 > 0, \\ 0.5 - w_{ms} \cdot s_2 & \text{if } s_1 \le 0. \end{cases} \tag{1} $$

Here, d is the Euclidean distance between two cells, σ = 1, and s_1 and s_2 are the scores of the compared cells.
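A sketch of how the two kernels of Equation (1) could be applied when smoothing the per-cell scores is given below. The window radius and the value of w_ms are our own assumptions, and the single smoothing pass is a simplification of the full Mean-shift iteration.

```python
import numpy as np

def spatial_kernel(d, sigma=1.0):
    """Gaussian spatial kernel G(d) = exp(-d^2 / (2 sigma^2))."""
    return np.exp(-d**2 / (2.0 * sigma**2))

def score_kernel(s1, s2, w_ms=0.1):
    """Linear score kernel of Eq. (1): weighs cell s2 by the sign of s1."""
    return 0.5 + w_ms * s2 if s1 > 0 else 0.5 - w_ms * s2

def smooth_cell_scores(scores, radius=2):
    """One weighted smoothing pass over a grid of per-cell scores (sketch)."""
    rows, cols = scores.shape
    out = np.zeros_like(scores)
    for i in range(rows):
        for j in range(cols):
            num = den = 0.0
            for u in range(max(0, i - radius), min(rows, i + radius + 1)):
                for v in range(max(0, j - radius), min(cols, j + radius + 1)):
                    d = np.hypot(i - u, j - v)
                    w = spatial_kernel(d) * score_kernel(scores[i, j], scores[u, v])
                    num += w * scores[u, v]; den += w
            out[i, j] = num / den if den else scores[i, j]
    return out
```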

4.2 Approach 2: multiple classifiers for non-occluded regions

In this approach, we pursue the novel concept of assigning a classifier to each different non-occluded region. Each classifier is trained for a certain occlusion pattern, as shown in Figure 6(a). During detection, all classifiers are evaluated at each sliding-window search position. We assume that in case of occlusion, there will be at least one classifier in the total set that matches the current type of occlusion, so that a feasible detection is computed. It is important that the occlusion patterns and the related classifiers are representative for typical occlusion cases. We have designed 29 different classifiers, as shown in Figure 6(a).

It seems straightforward to always choose the classifier that covers the smallest object region (largest occlusion region), but we will show in our experiments that such a classifier, despite being more robust to occlusion, always decreases the detection performance. For this reason, once individual classifiers are designed and selected, we combine individual classifiers to create more robust classifiers for large occlusions. Since each classifier is trained independently, the margins of the linear classifiers differ. Classifiers covering a larger object region incorporate more object information, so that they have better-defined margins and perform better.

For combining multiple classifiers, several metrics can be used. To calibrate each individual classifier, classifiers can be normalized using the maximum achievable detection score, as described by Mathias et al.,22 which is rather dataset dependent. Therefore, we propose an alternative: normalizing each classifier by scaling the weight vector ω (and its bias) of the linear classifier to unit energy. Furthermore, we calibrate t_class for all classifiers using a fixed false-positive rate, to statistically equalize the number of false detections. This normalization is not only attractive for individual classifier calibration, but also paves the way for combining classifiers at a later stage.


Apart from individual normalization, the influence of each individual classifier on the combined result can be varied. This can be performed by imposing a weight of unity minus the occlusion level22 for each classifier. As in Mathias et al.,22 we assume that a small-object-region classifier always performs worse than a large-object-region classifier, so that the weight is adapted accordingly.

For initial experiments, we have compared the four possible combinations of selectively applying individual classifier normalization and applying weighting by the occlusion level when combining the individual classifiers. These initial experiments have revealed that it is always preferred to apply both measures simultaneously.
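The proposed calibration reduces to two small operations, sketched below: scaling each classifier's weight vector and bias to unit energy, and weighting its contribution by unity minus its occlusion level. The function names are our own.

```python
import numpy as np

def normalize_classifier(w, b):
    """Scale the weight vector (and its bias) to unit energy, so that
    scores of independently trained classifiers become comparable."""
    energy = np.linalg.norm(w)
    return w / energy, b / energy

def occlusion_weight(occlusion_level):
    """Weight of a classifier in the combination: unity minus occlusion
    level, so small-object-region classifiers contribute less."""
    return 1.0 - occlusion_level

# Example: a half-body classifier (occlusion level 0.5) gets weight 0.5.
w, b = normalize_classifier(np.random.randn(3780), -0.8)
score = occlusion_weight(0.5) * (np.dot(w, np.random.randn(3780)) + b)
```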

4.3 Merging of detections

After we have obtained multiple detections per object during the sliding-window detection stage, these detections have to be merged. Dalal24 has proposed an elegant but computationally expensive Mean-shift procedure. A relatively simple method is proposed by Mathias et al.,22 which we refer to as NMS+Merging. We have implemented this two-stage approach because of its low computational requirements.

First, all detections are sorted by their scores. Then, Non-Maximum Suppression (NMS) is applied to all detections belonging to the same classifier and the remaining detections are merged together. The criteria applied by Dollar5 are used for non-maximum suppression (see Equation (2), left). If the overlap score is larger than the threshold t_NMS, the detections are merged by discarding the detection with the lowest score. In the second step, all remaining overlapping detections that satisfy the second overlap criterion are merged (see Equation (2), right). The NMS and overlap criteria are defined, respectively, as

$$ \frac{\mathrm{area}(B_a \cap B_b)}{\min(\mathrm{area}(B_a), \mathrm{area}(B_b))} > t_{NMS}, \qquad \frac{\mathrm{area}(B_a \cap B_b)}{\mathrm{area}(B_a \cup B_b)} > t_{merging}. \tag{2} $$

If the overlap score is larger than the threshold t_merging, the detections are merged and their detection scores are accumulated. We have empirically determined the threshold values as t_NMS = 0.7 and t_merging = 0.5.
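A compact sketch of the two-stage NMS+Merging procedure with the criteria of Equation (2); the detection record layout is an assumption made for illustration.

```python
def area(b):
    return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])

def inter(a, b):
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return max(0.0, w) * max(0.0, h)

def nms_merging(dets, t_nms=0.7, t_merging=0.5):
    """Two-stage NMS+Merging (sketch). Each detection is a dict with keys
    'box' (x1, y1, x2, y2), 'score' and 'clf' (classifier id)."""
    dets = sorted(dets, key=lambda d: d['score'], reverse=True)
    # Stage 1: per-classifier NMS with the overlap/min-area criterion.
    kept = []
    for d in dets:
        if all(d['clf'] != k['clf'] or
               inter(d['box'], k['box']) /
               min(area(d['box']), area(k['box'])) <= t_nms
               for k in kept):
            kept.append(d)
    # Stage 2: merge remaining detections with the intersection-over-union
    # criterion, accumulating their scores.
    merged = []
    for d in kept:
        for m in merged:
            iou = inter(d['box'], m['box']) / \
                  (area(d['box']) + area(m['box']) - inter(d['box'], m['box']))
            if iou > t_merging:
                m['score'] += d['score']   # accumulate detection scores
                break
        else:
            merged.append(dict(d))
    return merged
```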

5. EVALUATION

We will evaluate the performance and efficiency of the two implemented approaches: the occlusion detection system from Wang et al.12 and our system based on multiple classifiers, where each classifier detects different non-occluded regions. More specifically, we evaluate (1) whether it is possible to use per-cell scores to detect occlusions, and (2) whether combining multiple, different classifiers can increase the detection performance. We first introduce the evaluation measures in Section 5.1 and then present the two datasets for our experiments in Section 5.2. For a better understanding of occlusion aspects related to the actual object class, we consider the importance of the region of occlusion in Section 5.3. The occlusion detection system is evaluated in Section 5.4 and our proposed multiple-classifier system is extensively discussed and evaluated in Section 5.5.

5.1 Evaluation measures

To quantify detection performance, we plot Detection Error Trade-off (DET)24 curves on a log-log scale, i.e., the miss-rate (1 − Recall, or FalseNeg / (TruePos + FalseNeg)) versus the number of False Positives Per Window (FPPW). Evidently, low values for the miss-rate are desirable. The chosen parameters present the same information as the Receiver Operating Characteristics (ROC), but allow small probabilities to be distinguished more easily. We will use miss-rates at 1 × 10^-4 FPPW as a reference point for comparison of different results, since this is a realistic point of operation for a detection system. This implies that 1 out of 10,000 negative windows is misclassified as an object. For the INRIA dataset, the images contain on average 50,000 windows per image when processing scales 1.0 and higher, using a scale factor of 1.05. To measure the performance of the complete system, we use ROC curves, in which we plot the miss-rate versus the number of False Positives Per Image (FPPI) and compare results using the Area Under the Curve (AUC) measure, where a lower AUC indicates better detector performance. We have followed the same evaluation details as described by Dollar et al.,4,5 in order to obtain comparable results.
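For reference, the miss-rate at a fixed FPPW operating point can be computed from window-level scores as in the following sketch; the synthetic score distributions are used only to make the example runnable.

```python
import numpy as np

def miss_rate_at_fppw(pos_scores, neg_scores, fppw=1e-4):
    """Miss-rate = FN / (TP + FN) at the threshold where the fraction of
    misclassified negative windows equals the target FPPW."""
    neg_sorted = np.sort(neg_scores)[::-1]
    k = max(int(fppw * len(neg_scores)), 1)
    threshold = neg_sorted[k - 1]              # k-th highest negative score
    misses = np.sum(pos_scores <= threshold)
    return misses / len(pos_scores)

# Example with synthetic scores: positives centered at +1, negatives at -1.
rng = np.random.default_rng(0)
print(miss_rate_at_fppw(rng.normal(1, 1, 1000), rng.normal(-1, 1, 100000)))
```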


5.2 Datasets

We determine our results on the publicly available INRIA2 dataset, but have also employed our own, more crowded Dancefestival dataset. The INRIA dataset contains 288 images with 1,132 person crops of size 64 × 128 pixels. The negative test set contains 453 images. The training set contains 1,208 cropped object images (doubled by horizontal mirroring) and 1,218 negative images. All results are reported on the 288 positive test images, while detecting pedestrians of 96 pixels or more in height. The negative images are used for the DET curves.

Additionally, our Dancefestival dataset consists of 864 annotated persons in 38 images of resolution 1,280 × 960 pixels, of which 323 persons are occluded. This dataset is added to evaluate performance in a crowded real-world scenario with many occlusions. Evaluation results are reported on pedestrians of 48 pixels in height and up.

For all experiments, we have trained our classifiers on the INRIA Person training set. Since the bounding-box annotations have different aspect ratios, we normalize all boxes to a width of 0.41 times the height, as in Dollar et al.5

We search up to 3 cells outside the image coordinates to find people at the image borders.

5.3 Importance of region of occlusion

In a first synthetic experiment, we evaluate the effect of occlusions at different object positions on the detection performance. We occlude each object image with a black rectangle at eight different vertical positions (height: 2 cells / 16 pixels, width: image width / 64 pixels), as shown in Figure 4(a).

(a) Synthetic occlusion patterns 0–7.

(b) Classification performance for occlusions vs. cell canceling (miss-rates at 1 × 10^-4 FPPW, lower is better). The average image of all test images is depicted in (b1); negative bias values per cell in (b2).

Figure 4. Different synthetic image occlusions in (a) and the evaluated classification performance in (b).

The images are evaluated using both our baseline system without occlusion handling and the perfect occlusion-handling method that cancels the contribution of the occluded cells in the final classification result. During canceling, the corresponding feature dimensions are set to zero, while compensating for the bias of these cells (see Section 4.1) by subtracting the corresponding values (visualized in Figure 4(b)). Note that more negative cell bias values represent more important cells. The occluded area belonging to Position 1 is indicated by the striped pattern.

From the results shown in Figure 4(b), we observe that the detection performance is position-dependent and deteriorates most when adding occlusions at Positions 1 (head) and 5 (knees), which indicates that these regions have the highest importance. In general, the miss-rates are higher than the performance on the non-occluded images (15%). When adding occlusion to the bottom image part (Position 7, the area below the feet), the performance without occlusion handling increases compared to the baseline, which is caused by the absence of gradient information in this region. We can conclude that in all cases, occlusion decreases the performance of the baseline system. Furthermore, even in the case of perfect occlusion detection, cell cancellation still results in a decreased performance.
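The cell canceling used in this synthetic experiment amounts to zeroing the occluded feature dimensions and subtracting the per-cell bias shares, as sketched below. The flat per-cell feature layout is a simplifying assumption of ours; the real HOG layout interleaves cells across overlapping blocks.

```python
import numpy as np

def cancel_occluded_cells(x, w, cell_bias, occluded, dims_per_cell=36):
    """Classification score with occluded cells canceled (sketch).

    x, w      : feature vector and weight vector, laid out as consecutive
                per-cell segments of `dims_per_cell` values (an assumption)
    cell_bias : per-cell share of the classifier bias
    occluded  : indices of cells known (here: synthetically) to be occluded
    """
    x = x.copy()
    score = 0.0
    for c in occluded:
        x[c*dims_per_cell:(c+1)*dims_per_cell] = 0.0   # cancel feature dims
        score -= cell_bias[c]                          # compensate cell bias
    return score + float(np.dot(w, x)) + np.sum(cell_bias)
```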


5.4 Approach 1: occlusion detection

We will now evaluate the effect of detecting occlusions at the HOG cell level, as proposed by Wang et al.12 and introduced in Section 4.1. After an extensive study of this approach, we have found that the output of the cell-based occlusion detection (after Mean-shift segmentation) is only used to activate the two partial classifiers (upper/lower body), provided that sufficient positive object information is present. The occlusion handling in this case is only covered by the fact that a partial classifier is activated on the image part that is not occluded. Therefore, we try to answer the following three questions. (1) What is the contribution of the segmentation compared to always applying the partial classifiers? (2) More specifically, is the occlusion detection stage necessary, or is it sufficient to always evaluate the partial classifiers? (3) How is the decision made to activate the partial classifiers? To answer these questions, we evaluate the influence of the segmentation and the amount of non-occluded information (positive cells) required to activate the partial classifiers.

Two partial classifiers of size 8 × 8 cells are used, one for the upper body and one for the lower body. Both are trained using 50% of the original annotations and with one round of bootstrapping. We use the following parameters: t_lower = −1, t_higher = 1, t_conf = 1.5, w_part = 0.3 and w_glob = 0.7, which are similar to the settings of Wang et al.12 We evaluate the effect of the weighting function from Section 4.1 by comparing with w_part = 1.0 and w_glob = 0.0. We use the number of positive cells to enable the partial classifiers and experimentally sweep this number from 0 to 64, in steps of 4 cells. Note that a minimum of 0 cells is equal to always applying the partial classifiers and a minimum of 64 cells is equal to never applying them. When both partial classifiers are activated, s_part is equal to the maximum of both partial classifier scores. None of the classifiers are normalized. The obtained results are depicted in Figure 5, using window-level classification results. Unfortunately, we have not been able to reproduce any results published by Wang et al.12 and expect that they have performed additional processing steps not described in their paper.

Figure 5. Minimum number of positive cells in each region vs. miss-rate (at 1 × 10^-4 FPPW). Lower values are better.

The use of only one full-object classifier (minimum of 64 positive cells) is always outperformed by the addition of partial classifiers. The best results without Mean-shift are obtained by activating the partial classifiers when at least 24 positive cells are found in each region. When enabling Mean-shift, the optimum is obtained when at least 4 positive cells are found. However, this miss-rate is equal to the lowest miss-rate obtained without Mean-shift. Overall, adding Mean-shift performs worse, or the improvement in detection performance is negligible (0.3%). Weighting (w_part = 0.3, w_glob = 0.7) is always required to compensate for the high number of false detections from the partial classifiers. Disabling weighting (w_part = 1.0, w_glob = 0.0) always results in miss-rates well above 15%; these results are therefore not shown. We have also experimented with Mean-shift on the binary cell scores, with comparable results. Although this measure shows an increase in window-level classification performance, integration in the complete system results in a decrease in performance when measuring False Positives Per Image (FPPI) (including merging).

Summarizing, using segmentation information from occlusion detection results in a negligible increase in performance compared to always applying the partial classifiers. Moreover, the performance is even lower when the method is embedded within the complete detection system (including merging). Using the segmentation information improves performance most when only few positive cells are required to enable the partial classifiers, showing that the noisy cell-based classification output cannot be employed directly. From the above, we therefore conclude that occlusion detection is not necessary and that the addition of partial classifiers is the main source of the performance improvement.


5.5 Approach 2: multiple classifiers for non-occluded regions

In this section, we will evaluate the effect of detecting the non-occluded object regions. In the previous approach, we have already found that the main performance improvement originates from the application of two partial object classifiers, rather than from detecting the occluded region. We will now evaluate the concept of applying multiple classifiers that cover different non-occluded regions in more detail. First, we evaluate which classifiers should be combined to obtain optimal results. Second, we examine a method to combine multiple classifiers, while calibrating each individual classifier. Finally, we propose a cascaded implementation to lower the computational cost for a real-time implementation.

5.5.1 Classifier design and evaluation of effective object region

We already know that occlusions typically occur at certain object regions (bottom, right and left).5 In the previous experiment in Section 5.3, we have found that for persons, the most informative visual information is concentrated around the head area. We use these statistics and our findings to manually design 29 different classifiers, shown in Figure 6(a). The region that models the hypothesized occlusion area is ignored by the classifier and is drawn in black.

To demonstrate the importance of the effective object region covered by the classifier (the classifier size), we compare 9 differently-sized classifiers, where the size of the occluded region is gradually increased (Classifiers 0–7 and 28 from Figure 6(a)). Each classifier is independently trained on the INRIA set and the performance is evaluated by measuring the AUC. The 9 different classifiers and their detection performances are shown in Figure 6(b). Overall, we conclude that classifiers covering a larger region perform better. Classifier 1 obtains the lowest overall miss-rate, which we expect to be caused by noise in the lower part of the region, which this classifier ignores. This finding is supported by our observations from Section 5.3. We have obtained comparable results for right-to-left and left-to-right occlusions.

(a) Occlusion patterns of the 29 designed classifiers. (b) Different classifiers with detection performances (AUC, lower is better).

Figure 6. Multiple classifiers with different occlusion patterns in (a) and the detection performance of different classifiers with increasingly smaller object area in (b). Black areas represent occlusions, where classifier weights have value zero.

5.5.2 Selection of classifier combinations

When applying multiple classifiers using different effective object regions, each classifier will focus on different visual properties. We assume that the combination of these different classifiers will result in an improved detection performance. However, it is difficult to predict which combination of classifiers performs best. Therefore, we evaluate all 29 classifiers from Figure 6(a) and measure their effective detection performance. First, all classifiers are applied independently to the images and the detections (after merging) from all classifiers are evaluated to measure the contribution of each individual classifier. The number of occurrences of each classifier is then used to select a combination of classifiers. A classifier is counted when it has the largest contribution (highest detection score) to a correct detection (ground truth). Classifiers are not weighted by the occlusion level, so that large-region classifiers are not prioritized. Note that weighting is enabled when the selected classifiers are applied in the final detection system. The classifiers are evaluated on both the positive INRIA test set and the Dancefestival dataset. We compare normalized classifiers and set the thresholds for each classifier at a false-positive rate of 1 × 10^-4.
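The occurrence-based selection can be sketched as a simple counting procedure; the data layout below is hypothetical.

```python
from collections import Counter

def select_classifiers(matched_detections, k):
    """Occurrence-based classifier selection (sketch).

    matched_detections: one entry per correctly detected ground-truth
    object, listing (classifier_id, score) of all classifiers that found it.
    """
    counts = Counter()
    for candidates in matched_detections:
        best_clf, _ = max(candidates, key=lambda c: c[1])
        counts[best_clf] += 1            # credit the highest-scoring classifier
    return [clf for clf, _ in counts.most_common(k)]

# Example: classifier 1 wins two objects, classifier 9 one -> pick [1, 9].
hits = [[(1, 2.3), (9, 1.1)], [(1, 0.9), (3, 0.5)], [(9, 1.7)]]
print(select_classifiers(hits, k=2))
```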


Table 1. Different combinations of classifiers. The columns give the number of combined classifiers; the Manual row lists the classifiers added at each combination size, while the Auto rows list the classifiers in order of selection.

  Dataset        Selection | 1  | 2 | 4      | 7        | 17
  #              Manual    | 28 | 7 | 11, 15 | 3, 9, 13 | 0-15, 28
  INRIA          Auto      | 1, 3, 14, 2, 10, 12, 27, 7, 8, 9, 13, 4, 26, 11, 5, 15, 19
  Dancefestival  Auto      | 9, 13, 27, 12, 14, 3, 15, 26, 1, 2, 10, 8, 4, 11, 5, 7, 6

The relative number of detections for each classifier is shown in Figure 7(a). Note that a low contribution of a classifier means that another classifier gives higher detection scores on the same detections, making the non-contributing classifier inferior. From Figure 7(a), it can be seen that Classifiers 16–25 are inferior for both datasets. The Dancefestival dataset contains more occluded persons, leading to a higher preference for small-region classifiers and left/right occlusions. We now combine a selection of classifiers based on this occurrence histogram and evaluate the performance of the classifier combination on the INRIA dataset. To assess the quality of the automated selection process, we also evaluate the performance of a manual selection of classifiers. The classifier numbers of the selections are listed in Table 1. The first row shows the manually selected classifiers, while rows two and three depict the selections generated from the INRIA and Dancefestival datasets, respectively. Note that the classifier numbers correspond to the numbers in Figure 6(b).

Increasing the number of combined classifiers increases the detection performance, which converges at around 11 classifiers. Beyond this point, adding more classifiers decreases the performance. In all three considered cases, a clear optimum occurs and from this point onwards, the performance always decreases when adding more classifiers. This implies that there is an optimal set of classifiers: additional classifiers only add already-found detections (redundancy) and therefore only generate false detections. With automated selection, the best results are obtained with 11 classifiers selected from the Dancefestival dataset. However, an even better performance is obtained with the combinations of 7 and 17 manually selected classifiers. Note that the selection of the first classifier is most critical. This is clearly seen in the Dancefestival selection, where Classifier 9, modeling a significant amount of occlusion, is selected as the first classifier.

(a) Influence of individual classifiers on the total detections (percentage of total detections per classifier number, for the INRIA and Dancefestival datasets).

(b) Detection performance (AUC) versus the number of classifiers, for the INRIA auto, Dancefestival auto and manually selected combinations.

Figure 7. Combining multiple classifiers and measuring the influence of individual detectors in (a) and evaluating the detection performance of several classifier combinations in (b).

(a) Detection using 1 classifier. (b) Detection using 17 classifiers.

Figure 8. Example detections when applying 1 vs. 17 classifiers. Note that multiple classifiers enable the detection of significantly occluded persons.


5.5.3 Real-time implementation

Although this advanced occlusion handling increases the detection performance, adding more classifiers increases the computational cost linearly with the number of classifiers. To reduce this cost, we propose a cascaded implementation that limits the number of comparisons at each sliding-window search position. At each position, the largest-region classifier is evaluated first and only when its classification score is above a threshold are all other classifiers evaluated. This already discards many search positions after applying the first classifier. We have evaluated both the computational complexity and the detection performance for systems with 1, 2, 4, 7 and 17 classifiers, operated with different threshold values (t_class).
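A sketch of the cascaded evaluation at a single search position; the classifier list layout is our own assumption, and the default gate threshold of −0.2 is the initial threshold adopted below.

```python
import numpy as np

def cascaded_scores(x, classifiers, t_class=-0.2):
    """Cascaded multi-classifier evaluation at one window position (sketch).

    classifiers: list of (w, b) pairs, ordered so that the largest-object-
    region classifier comes first; t_class gates the remaining classifiers.
    """
    w0, b0 = classifiers[0]
    first = float(np.dot(w0, x)) + b0
    if first <= t_class:            # most positions are discarded here
        return [first]
    return [first] + [float(np.dot(w, x)) + b for w, b in classifiers[1:]]
```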

The results for the manually selected classifiers are depicted in Figure 9. This figure visualizes both the computational cost and the detection performance, both relative to the baseline system with one classifier. The computational costs are shown by the bars and are linked to the left vertical axis; the value 100% represents the computational cost of the single-classifier baseline system. The detection performances of the different systems are shown by the lines and are linked to the right vertical axis; here, the value 100% represents the performance of the single-classifier baseline system.

When combining up to 17 classifiers, the performance always increases. However, combining more than 7 classifiers does not improve the performance significantly. Using 7 classifiers, the computational cost increases by 1.3%, while the associated detection performance increases by 7.6%. A suitable trade-off between performance and cost is to adopt 17 classifiers with an initial threshold of −0.2, leading to an improvement of 8% in detection performance at 3.4% higher cost. This combination of classifiers detects more occluded objects, as shown in the example in Figure 8.

Figure 9. Performance and computational costs for combinations of different numbers of classifiers. Detection performance is indicated by the lines (right axis) and computational costs are indicated by the bars (left axis). All classifier combinations are manually selected, as in Table 1.


6. DISCUSSION

We have evaluated two conceptual approaches for occlusion handling: detecting occluded object regions and detecting non-occluded regions. Although our experiments have shown that the detection of occlusions is not preferable, these results may improve when using multiple image features. We have found that although Wang et al.12 describe the use of only HOG features for occlusion detection, they actually combine HOG with LBP.∗ Marín et al.13 show that extending HOG with LBP results in a significant improvement of the detection performance. Furthermore, our method for automated classifier selection is not optimal. A more elaborate selection procedure should select mutually exclusive classifiers to improve the combined detection performance.

We also want to place a critical note on the evaluation measures and parameters. Not all publications apply the same evaluation criteria, and individual algorithms are often based on different parameters (such as scales, dataset details and other algorithmic settings). In order to obtain comparable evaluation results, Dollar et al.4,5 created a framework for the objective evaluation of detection results. Unfortunately, comparing different implementations of the same algorithm is still influenced by the applied algorithmic parameters.

The state-of-the-art detection performance on the INRIA dataset is obtained by Mathias et al.,22 who obtain a miss-rate of 16.62%. By adding occlusion handling, their miss-rate decreases to 13.70%, while the computational cost increases by 330%. In our system with 17 classifiers, the miss-rate is lowered from 33.88% to 31.05%, while only increasing the computation by 3.4%. This performance difference is caused by the difference between our simple HOG features and the more discriminative features of Mathias et al.22 However, those features are specific to the object class and introduce additional computational complexity. Finally, we want to remark that we only discovered the work of Mathias et al.22 in the literature when our own work was nearly complete. Although this resulted in several similarities, it also shows that the concept of applying multiple classifiers to detect non-occluded object regions provides a suitable solution that is now supported by two relatively independent investigations.

7. CONCLUSION

In this paper, we have proposed a novel system for occlusion handling and integrated it in a sliding-window detection framework using simple HOG features and linear classification. Occlusion handling is obtained by the combination of multiple classifiers, each covering a different level of occlusion. For real-time detection, our approach with 17 classifiers obtains an increase of 8% in detection performance with respect to the baseline system. We have proposed a cascaded implementation that increases the computational cost by only 3.4%. Although we only present results for pedestrian detection, our approach is not limited specifically to this object class. Moreover, the fixed HOG feature transformation allows for an extension towards other object classes without additional class-specific feature calculation. Pre-defining the types of occlusions prior to training creates the advantage that we do not need an additional training dataset covering all possible types of occlusions.

We have revealed that the effect of occlusion on the detection performance is position-dependent: for pedestrian detection, the performance deteriorates most for occlusions around the head and the knees. After implementing and evaluating the method by Wang et al.,12 we conclude that the largest contribution of the proposed occlusion handling is not caused by the cell-based occlusion detection and region merging, but originates from the addition of partial classifiers (upper/lower body).

We have found that simply applying small-region classifiers that cover only a part of the object (e.g. a head-only detector), and can therefore handle more occlusion, strongly decreases the detection performance. Combining multiple classifiers increases the detection performance up to a certain optimal number of classifiers. Adopting more classifiers beyond this point only adds already-found detections (redundancy) and generates false detections, thereby effectively decreasing the detection performance. We have proposed an automated selection method for classifiers using statistics based on the occurrences of occlusions. Although this method performs better in some cases, the automatic selection is strongly dependent on the dataset and is regularly outperformed by the manual selection of classifiers. Moreover, the selection of the first classifier has been found to be most critical for the final system operation. Automated selection can be further improved by an iterative classifier selection method that removes intermediate detections. The combination of multiple classifiers enables the detection of strongly occluded persons.

∗Personal communication


REFERENCES

[1] Hoiem, D., Chodpathumwan, Y., and Dai, Q., "Diagnosing error in object detectors," in [Proc. European Conference on Computer Vision (ECCV)], 340–353, Springer (2012).

[2] Dalal, N. and Triggs, B., "Histograms of oriented gradients for human detection," in [Proc. Conference on Computer Vision and Pattern Recognition (CVPR)], 1, 886–893 (2005).

[3] Cortes, C. and Vapnik, V., "Support-vector networks," Machine Learning 20(3), 273–297 (1995).

[4] Dollar, P., Wojek, C., Schiele, B., and Perona, P., "Pedestrian detection: A benchmark," in [Proc. Conference on Computer Vision and Pattern Recognition (CVPR)], 304–311, IEEE (2009).

[5] Dollar, P., Wojek, C., Schiele, B., and Perona, P., "Pedestrian detection: An evaluation of the state of the art," Trans. Pattern Analysis and Machine Intelligence (PAMI) 34(4), 743–761 (2012).

[6] Winn, J. and Shotton, J., "The layout consistent random field for recognizing and segmenting partially occluded objects," in [Proc. Conference on Computer Vision and Pattern Recognition (CVPR)], 1, 37–44, IEEE (2006).

[7] Monroy, A. and Ommer, B., "Beyond bounding-boxes: Learning object shape by model-driven grouping," in [Proc. European Conference on Computer Vision (ECCV)], 580–593, Springer (2012).

[8] Yang, Y., Hallman, S., Ramanan, D., and Fowlkes, C., "Layered object detection for multi-class segmentation," in [Proc. Conference on Computer Vision and Pattern Recognition (CVPR)], 3113–3120, IEEE (2010).

[9] Gould, S., Fulton, R., and Koller, D., "Decomposing a scene into geometric and semantically consistent regions," in [Proc. International Conference on Computer Vision (ICCV)], 1–8, IEEE (2009).

[10] Gould, S., Gao, T., and Koller, D., "Region-based segmentation and object detection," in [Advances in Neural Information Processing Systems], 655–663 (2009).

[11] Gao, T., Packer, B., and Koller, D., "A segmentation-aware object detection model with occlusion handling," in [Proc. Conference on Computer Vision and Pattern Recognition (CVPR)], 1361–1368, IEEE (2011).

[12] Wang, X., Han, T., and Yan, S., "An HOG-LBP human detector with partial occlusion handling," in [Proc. International Conference on Computer Vision (ICCV)], 32–39 (2009).

[13] Marín, J., Vazquez, D., Lopez, A., Amores, J., and Kuncheva, L., "Occlusion handling via random subspace classifiers for human detection," Transactions on Systems, Man, and Cybernetics (Part B) 44(3), 342–345 (2013).

[14] Walk, S., Majer, N., Schindler, K., and Schiele, B., "New features and insights for pedestrian detection," in [Proc. Conference on Computer Vision and Pattern Recognition (CVPR)], 1030–1037, IEEE (2010).

[15] Dalal, N., Triggs, B., and Schmid, C., "Human detection using oriented histograms of flow and appearance," in [Proc. European Conference on Computer Vision (ECCV)], 428–441, Springer (2006).

[16] Felzenszwalb, P., Girshick, R., McAllester, D., and Ramanan, D., "Object detection with discriminatively trained part-based models," Trans. Pattern Analysis and Machine Intelligence (PAMI) 32(9), 1627–1645 (2010).

[17] Leibe, B., Seemann, E., and Schiele, B., "Pedestrian detection in crowded scenes," in [Proc. Conference on Computer Vision and Pattern Recognition (CVPR)], 1, 878–885, IEEE (2005).

[18] Mikolajczyk, K., Schmid, C., and Zisserman, A., "Human detection based on a probabilistic assembly of robust part detectors," in [Proc. European Conference on Computer Vision (ECCV)], 69–82, Springer (2004).

[19] Wu, B. and Nevatia, R., "Detection of multiple, partially occluded humans in a single image by Bayesian combination of edgelet part detectors," in [Proc. International Conference on Computer Vision (ICCV)], 1, 90–97, IEEE (2005).

[20] Fergus, R., Perona, P., and Zisserman, A., "Object class recognition by unsupervised scale-invariant learning," in [Proc. Conference on Computer Vision and Pattern Recognition (CVPR)], 2, II–264, IEEE (2003).

[21] Tang, S., Andriluka, M., and Schiele, B., "Detection and tracking of occluded people," International Journal of Computer Vision, 1–12 (2012).

[22] Mathias, M., Benenson, R., Timofte, R., and Van Gool, L., "Handling occlusions with Franken-classifiers," in [Proc. International Conference on Computer Vision (ICCV)] (2013).

[23] Cheng, Y., "Mean shift, mode seeking, and clustering," Trans. Pattern Analysis and Machine Intelligence (PAMI) 17(8), 790–799 (1995).

[24] Dalal, N., Finding People in Images and Videos, PhD thesis, Institut National Polytechnique de Grenoble (INPG) (2006).