efﬁcient model evaluation with bilinear separation model · using a bilinear separation model...

Efficient Model Evaluation with Bilinear Separation Model

Fanyi Xiao Martial HebertThe Robotics Institute, Carnegie Mellon University

Abstract

In this paper, we investigate the issue of evaluating ef-ficiently a large set of models on an input image in detec-tion and classification tasks. We show that by formulatingthe visual task as a large matrix multiplication problem,something that is possible for a broad set of modern de-tectors and classifiers, we are able to dramatically reducethe rate of growth of computation as the number of modelsincreases. The approach, based on a bilinear separationmodel, combines standard matrix factorization with a task-dependent term which ensures that the resulting smaller sizeproblem maintains performance on the original task. Ex-periments show that we are able to maintain, or even ex-ceed, the level of performance compared to the default ap-proach of using all the models directly, in both detectionand classification tasks. This approach is complementaryto other efforts in the literature on speeding up computa-tion through GPU implementation, fast matrix operations,or quantization, in that any of these optimizations can beincorporated.

1. IntroductionAs techniques for object detection and image classifica-

tion tasks become increasingly powerful, one can contem-plate using them in applications in which a large numberof models are used. However, as the number of modelsincreases, so does the computation effort involved, to thepoint that the computational cost may become prohibitive.In this paper, we investigate a general approach to effi-ciently run a broad class of detection and classification al-gorithms by estimating their responses on an input imageusing a bilinear separation model which is much more com-putationally economical than evaluating every model. Thestarting point of our approach is the well-known observationthat, conceptually, many common detection and classifica-tion algorithms can be summarized in a rather general form[28, 18, 22, 27, 26]:

R = WTX,W∈RD×M , X∈RD×N , (1)

where M is the number of models (e.g., HOG templates inobject detection or weights for bag of words feature in im-

U V

Testing Phase

Bicycle Models WLow-rank U,V

OutputTraining Phase

Figure 1: The illustration of the training and testing phaseof the proposed bilinear separation model (BSM). The leftpicture illustrates the training phase of BSM: the low-rankmatrices U and V are learned to mimic the behaviour of theoriginal model W in terms of the prediction performance.Whereas the right picture illustrates the testing phase ofBSM for efficient model evaluation: the HOG patch is firstprojected to the low-rank basis U (red) and then evaluatedwith V (blue), which contains the predictors for differentmodels.

age classification problems), N is the number of data pointsto evaluate, e.g., the number of windows in a detection task,and D is the dimension of the feature. The way the modelsW are learned and the features used in X , of course, de-pends on the specific approach but the fundamental opera-tion applied to an input image reduces to (1). Our approachrelies on approximating this operation in order to reducethe computation spent on this matrix-matrix multiplication,which is, in many cases, the most computational intensiveoperation in vision algorithms.

The general form of (1) applies to many standard ap-proaches both in detection and classification. For exam-ple, the popular object detection paradigm, i.e., sliding win-dow search with linear classifiers over HOG representation[3, 10, 18, 14], can be easily put in the form (1). In thissetting, the matrix W∈RD×M consists of the stacked pa-rameters wi, i = 1, 2, ...,M∈RD corresponding to eachmodel, i.e., the linear classifier trained over the HOG fea-ture. Similarly, each column of the matrix X is the HOGfeature vector extracted from a specific sliding window in

the HOG pyramid of the image. Furthermore, as for themore complicated part-based models, the form (1) couldalso be applied to get the part responses (as well as theroot response) followed by a distance transform to get thefinal scores. In the realm of image classification, the Im-ageNet dataset and the associated challenge has becomea benchmark for large-scale image classification problem[5]. Among various state-of-the-art approaches on this chal-lenge, many different features and encoding methods areemployed [17, 15]. However, most share the same way ofapplying the classifiers to an input image, that is to evalu-ate all the models, represented by a matrix W in which thevectors of weights of all the models are stacked, on all theimages X: R = WTX as in (1).

As implied by (1), the computational complexity of eval-uatingM models onN data points withD dimensional fea-ture is O(MDN), which would easily become prohibitiveas all these three parameters scale up quickly [18, 25]. In or-der to apply (1) in a more efficient way, a natural approachwould be to use a low-rank approximation for the param-eter matrix W = UV T , U∈RD×L, V ∈RM×L, thus trans-forming the problem into the form R = WTX = V UTX ,which is much more efficient if L, the constrained rank,is small enough compared with D. An intuitive way ofaccomplishing this low-rank approximation is to apply anSVD-based matrix decomposition to W , which implicitlyminimizes the reconstruction error ||W − UV T ||2 subjectto the rank constraint. However, this “obvious” approach isnot appropriate. First, the objective minimized by this ap-proach is not directly related to the prediction performanceof model evaluation. Second, this factorization does nottake into account any data for the specific task. We pro-pose in this paper a novel approach that addresses these twoproblems by using a bilinear separation model, which hasbeen successfully applied in many different computer visiontasks [28, 22, 12, 31]. We propose to learn a bilinear modelby optimizing the criteria that is directly related to the pre-diction performance for specific tasks. As we will showin Section 2, the proposed factorization can be computedwith a coordinate-descend method based on its bi-convexityproperty [13, 22].

Many previous works have been published to addressthe computation issues in vision tasks. The work most re-lated to ours are from Ramanan et al. [21] and Song et al.[27, 26], which are both based on the idea of decompos-ing classifiers to efficiently evaluate (1). However, there aresignificant differences with our work. In [21], they decom-pose the weight parameters for each single classifier whichcorresponds to one column of W in equation (1). Whereasour idea is to find out the factors shared among all classi-fiers. [27, 26] use sparse coding to learn an overcompletebasis whereas we directly learn a low-rank compact basiswhich preserves the predicting capability. Also, previous

works directly use their approximation as the detection re-sponses, instead we show that we can use a re-evaluationstep to recover the detection performance, which relaxes theneed for the approximation accuracy while maintaining thehigh performance on the specific tasks. Furthermore, ourwork differs from the prior works in that it has the capa-bility of exploiting pre-trained models: instead of trainingfrom scratch, our method is able to treat pre-trained modelsas black boxes and mimic their behaviour while being muchmore efficient. Finally, we show that our method is able toachieve performance almost identical to directly evaluating(1) which is hard for both [21] and [27, 26].

Other works attempt to reduce either the number of win-dows evaluated, or the number of models evaluated. Thecascaded methods are applied to select testing windowsthat are with high probability containing objects throughcheaper operations [29, 9, 20]. More recent approachespropose that, instead of modelling the probability of con-taining certain objects (meaning with supervision), they di-rectly model the interestingness of image windows with-out any supervision [19, 1, 7]. A complementary line ofresearch aims at reducing the number of models M . Forexample, Pirsiavash et al. propose to use the shared steer-able filters as the basis for constructing a large number ofpart templates [21]. Also aiming at reducing the number ofmodels, Dean et al. take a different approach using locality-sensitive hashing for model selection at every window [4]in order to scale to 100000 categories1. In addition to thesetwo complementary lines of research, [24] accelerates themodel evaluation directly by vector quantization. Instead ofonly selecting models or windows, we propose to estimatethe responses of every model on every window by a learnedbilinear separation model, then re-evaluate a small fractionof the top-ranked “model-window” pairs, according to theestimated response values, to recover the detection resultsas if we have evaluated all the “model-window” pairs. Notethat even though we also select “model-window” pairs forre-evaluation as [24] did, we take a very different approachthat follows the model evaluation procedure of many pub-licly available object detector packages so that it is fairlyeasy to integrate our approach.

Other approaches attempt to reduce computation by op-timizing the computations involved in (1) directly, ratherthan reducing the complexity of the model representation.They include GPU implementations [2, 27], fast matrix op-erations [6], or quantization [24]. It is important to note thatour approach is complementary to these efforts in that anyof these optimization can be incorporated in our approachto achieve further speed-up.

Our contributions are threefold: (1) We present a method

1Girshick et al. [11] demonstrated that large matrix multiplication,which is what we study in this paper, is much more efficient than hash-ing technique.

that treats the pre-trained models as black boxes and mimictheir behaviours which enables us to exploit the largeamount of pre-trained models available nowadays. (2) Wepresent both theoretical and empirical results on the compu-tation to show that our method scales gracefully as the num-ber of models increases. (3) Our method is able to achievealmost identical performance to the exhaustive model eval-uation as we demonstrated on both object detection and im-age classification tasks.

2. Learning a Bilinear Model for EfficientModel Evaluation

In this section, our goal is to learn a low-rank approxi-mation W≈UV T such that R = V UTX gives us a goodestimation of R = WTX in terms of the task-specific pre-diction performance. The general idea of the training andtesting phase of efficient model evaluation through low-rankapproximation is shown in Figure 1.

2.1. SVD on Model Parameters W

Before describing our approach, let us first look at anintuitive way of accomplishing the low-rank approximationand point out its shortcomings to motivate our approach.

The correlation among different models has been ex-ploited by many approaches [27, 26]. The key observationis that since models wi, i = 1, 2, ...,M are, in most cases,highly related to each other, e.g., detectors of visually sim-ilar objects, it makes sense to model this correlation to pro-duce more compact representations for these models, and toevaluate them more efficiently. One of the most intuitive ap-proach to model the correlations is to apply Singular ValueDecomposition (SVD) over the stacked model parameters,and obtain a low-rank approximationW≈UDV T , in whichU serves as a low-rank projection basis whereas each rowof V as the combination coefficients on this basis for eachmodel. Thus, the model evaluation could be approximatedas:

R = (V D)UTX = V UTX,U∈RD×L, V ∈RM×L, (2)

where V = V D. This computation could therefore be doneby first computing Xp = UTX , which is a low-rank pro-jection of the data, and then R = V Xp.

There are two problems with this seemingly reasonableapproach. First, the SVD-decomposition is implicitly opti-mizing the objective ||UV T −W ||2 subject to the low-rankconstraint. However, there is no guarantee on the perfor-mance of the specific tasks by optimizing the above objec-tive. For example, it is possible to have a pair of U1 andV1 which has lower value according to the above objectivefunction, but has poorer prediction performance on the de-sired task than another pair U2 and V2. Second, the form||W −UV T ||2 does not take into account any data, whereas

in the form (1), WT is applied to the data matrix X . Di-rectly optimizing ||W−UV T ||2 is equivalent to making theassumption that each dimension of our data (visual featuresextracted from images) has the same magnitude in expecta-tion, which is clearly not the case. To overcome these twoproblems, we propose to learn a low-rank Bilinear Separa-tion Model (BSM) which we can still apply as shown in (2),by optimizing an objective that is data-dependent and di-rectly related to the prediction performance. It is called abilinear form because it is linear with regard to each of thetwo variables over which we aim to optimize given anotheris fixed.

2.2. Bilinear Separation Model

To justify the objective we propose to optimize, we takethe object detection task as an example for illustrative pur-pose. The similar reasoning holds for the image classifi-cation task. We first introduce some notations to help usformalize the derivation of our model: Xi and Xj denotethe ith column and jth row of matrix X , respectively. Con-sequently, Xj

i refers to the element on the ith column andjth row of X.

The evaluation of (1) provides scores for objects andtheir locations. Usually a threshold is then applied on thisresponse matrix R to obtain the candidate detections, i.e.,the “model-window” pairs. Therefore, what we really careabout for an estimate of R, or the quality of an estimate ofR, is that for those “model-window” pairs that are abovethe threshold {Rij |(i, j)∈Sc(t)}2, they remain above thethreshold in the approximation R whereas for those pairsbelow the threshold, it is desirable they remain below thethreshold as well. More formally, we want an approxima-tion R such that the following conditions are satisfied:

Rij≥t,∀(i, j)∈Sc(t)Rij < t,∀(i, j) 6∈Sc(t).

(3)

In order to enforce the separation constraint in (3), onewould like the estimation R to be penalized if it estimatesa point to be above the threshold that is in fact belowthe threshold and vice versa. We show that this could beachieved using the following loss function:

L1(Rij , yij) = max(0,−yij ·Rij), (4)

where yij∈{−1, 1} is a variable indicating whether the cor-responding value in the ground-truth response matrix Rij isabove the threshold:

yij = 1{Rij≥t}, (5)

where the function 1{·} returns 1 while the statement istrue, -1 otherwise. Although it represents the constraint

2Sc(t) is used to denote the set of “model-window” pairs(model index,window index) that are above the threshold t.

shown in (3), the loss function in (4) is a non-smooth func-tion. To make use of many solvers designed for smoothfunctions, we choose to optimize the squared version of (4):

L2(Rij , yij) = max(0,−yij ·Rij)2. (6)

Note that this squaring operation is similar to theL2-loss forSVM [16]. Having the loss function representing the con-straint (3), which is directly related to the prediction per-formance, we can then apply this loss function (6) to thelow-rank approximation R = V UTX , which leads to thefollowing objective function3:

J =∑M

i=1

∑Ni

j=1 L2(V iUTX(i)j , y

(i)j ) + γ||W − UV T ||2,

U∈RD×L, V ∈RM×L.(7)

Recall that M is the number of models, X(i)∈RD×Ni

is the data matrix for model i, Ni is the number of datapoints for the ith model and X(i)

j is the jth column of X(i).

Similarly, y(i) is a vector of variables for model i, y(i)j is the

indicating variable for the jth data point X(i)j . V i is the ith

row of matrix V .The first term in (7) enforces the squared separation loss

function over the low-rank approximation R = V UTX .Thus, the learned bilinear model U and V are well-tunedto predict whether or not a “model-window” pair would bea candidate detection. Also, the data X(i)

j in the first termmakes it a data-dependent term compared with direct SVDover W . The data-dependency enforces adaptation to thedistribution of the training data. The second term γ||W −UV T ||2 is exactly the objective of the SVD approach inSection 2.1 multiplied by a scalar weight γ, which serves toregularize the optimization.

Note that our objective function is closely related to otherforms of matrix factorization in machine learning, e.g.,max-margin matrix factorization [23]. However, there arealso important differences between those approaches andours. Recall that M and D are the number of models andthe dimension of features. First, suppose that there is norank constraint on U and V , which means that the sizes ofU and V are [D×min(M,D)] and [M ×min(M,D)], re-spectively. Thus, the optimum of the objective (7) is exactlythe original model parameter W , since both the first and thesecond term of (7) would be zero. That is, our formulationwill get closer and closer to the original parameter matrixW as the rank constraint is relaxed. On the contrary, themax-margin matrix factorization enforces a large-marginloss function, i.e., an SVM-like hinge loss max(0, 1−y·x),

3In practice, both the original model wi and our learned bilinear modelincorporate a bias term bi, here for the simplicity of the formulation, weassume that data is pre-processed to zero-mean, which does not affect anyconclusion made here.

which prevents it from recovering the original model pa-rameters W even when we relax the rank constraint.

Although we used the object detection task to explain ourapproach, the proposed optimality objective also applies tothe image classification problem: instead of being imagepatches as in detection, X in (1) becomes feature from theglobal image in the image classification task.

2.3. Optimization

The objective shown in (7) is quadratic and thereforeconvex, with respect to the optimizing variable V when Uis fixed, and vice versa. This property is called bi-convexityand has been extensively studied [13]. Because of the bi-convexity, (7) could be optimized using a coordinate de-scend method alternating between optimizing V and U .

Initialization: Since the problem is not jointly convexin U and V , a careful initialization is required. With a givenrank L, we choose to initialize the low-rank projection basisU and the coefficients V with the SVD over W , which isequivalent to the second term in the objective (7).

Updating V: When U is fixed, the updating of V isstraightforward. The matrix V can be updated by sequen-tially updating each row of V , which correspond to the co-efficients on the basis U for different models. The sub-problem of optimizing each row of V can be formulatedas the following convex program and thus be solved usingthe Quasi-Newton method.

minV i

Ni∑j=1

L2(V iUTX(i)j , y

(i)j ) + γ||Wi − UV iT ||2. (8)

Updating U: The updating step of the low-rank pro-jection basis U is not as obvious as updating V . How-ever, it is still a convex problem with respect to U whenV is fixed. The optimization over U can be done by up-dating one column Uk once. Let us first look at the firstterm of (7), which is the loss function over the approx-imation. For a data point X(i)

j , the loss function over

it, as shown in (7), is in the form L2(V iUTX(i)j , y

(i)j ),

which could be re-written as L2(tr(V iUTX(i)j ), y

(i)j ) =

L2(tr(UTX(i)j V i), y

(i)j ). By fixing all columns of U ex-

cept for the kth one, the form could be further re-writtenas: L2(tr(UTX

(i)j V i), y

(i)j ) = L2(UT

k X(i)j V i

k +bkij , y(i)j ).

Where bkij =∑

l 6=k UTl X

(i)j V i

l , which is a constant whenall columns of U are fixed except for the kth one. Thegradient ∂L2

∂Ukcan be easily computed as 2(UT

k X(i)j V i

k +

bkij)·X(i)j V i

k when −y(i)j (UTk X

(i)j V i

k + bkij) > 0, whereasbeing 0 otherwise. Similarly, the regularization termin (7) could also be decomposed as ||W − (UkV

Tk +∑

l 6=k UlVTl )||2 = ||W (k) − UkV

Tk ||2, where W (k) =

W −∑

l 6=k UlVTl . Therefore, we get the following sub-

problem, which we also optimize with the Quasi-Newton

method:

minUk

M∑i=1

Ni∑j=1

L2(UTk X

(i)j V i

k+bkij , y(i)j )+γ||W (k)−UkV

Tk ||2,

(9)Note that all the computations described in this section areperformed once and for all during training BSM model.

2.4. Computational Analysis

We analyze the computation cost of applying (2) com-pared with directly applying the original form (1). Thecomputational complexity of evaluating (1) is O(MDN).However, for (2), the low-rank projection Xp = UTX isfirst computed, which has complexityO(LDN), whereL isthe constrained rank. Then, the coefficient matrix V is mul-tiplied with Xp which has complexity of O(MLN). Theo-retically, the ratio of computation complexity of evaluating(2) and (1) is LDN+MLN

MDN = L(D+M)MD . As M , which is the

number of models, becomes very large, the above term be-comes L/D. This means that, as we have more and moremodels, the maximum speed-up we can achieve is D/L,which is the ratio between the original feature dimensionand the low-rank projected feature dimension. As we willshow in Section 3, this value could be very large withoutsacrificing prediction performance, in both object detectionand image classification. One thing to mention is that fea-ture computation, together with model evaluation, consti-tute the actual computation time in our evaluation, thereforethere is discrepancy between the empirical speed-up in ex-periments and the theoretical analysis. We also note thatall the sophisticated acceleration from the hardware side(e.g., GPU, MKL) could also be used combining with ourapproach, which could further accelerates the operation.

2.5. BSM for Object DetectionIn order to apply our method to any object detection al-

gorithm, we replace the model evaluation in (1) with thelow-rank form (2) to get R, an estimate of R, which con-tains the actual responses of all models on all windows.Then, a small fraction of the “model-window” pairs (i, j)with the highest estimated responses in R are re-evaluatedusing the actual modelwi and featureXj (high-dimensionalHOG template which is typically in the order of thousandsof dimensions). Ideally, we would like all the elements inRthat are above the threshold to remain above the thresholdin R. After re-evaluating the “model-window” pairs, stan-dard post-processing steps like Non-maximum Suppression(NMS) are applied to generate the final detections. In termsof the parameters, we then have 3 important parameters inthis pipeline: 1) the constrained rank L in (2), and 2) the ra-tio of the re-evaluated “model-window” pairs 0 < p < 1, aswell as 3) the BSM specific parameter γ which is the weightfor the regularizer in (7).

2.6. BSM for Image Classification

For image classification, each column of X is the fea-ture vector of one image, whereas each column of W is theclassifier for one image category. And UV T is the low-rankapproximation of matrix W . For testing, we simply replacethe form (1) with (2) while keeping all the other compo-nents of the image classification system unchanged. Theparameters in the image classification setting are: 1) L, theconstrained rank, and 2) γ, the regularization weight. Notethat p is not involved in the image classification task sincewe do not apply any re-evaluation in this setting.

3. Experiments3.1. Object Detection

Setup: For the object detection experiments, we com-pare the detection performance using the learned BSM tothe performance achieved by directly applying (1), as wellas the performance of the baseline method which also usesthe approximation form (2). For quantitative measure ofthe performance, we use the standard mean average pre-cision as well as the retrieval rate, which is defined asNretrieved/Ntotal. Nretrieved is the number of the candi-date4 “model-window” pairs occurring in the top p of theestimated responses in matrix R, whereas Ntotal is the totalnumber of the candidate “model-window” pairs in the fullmatrix R. Ideally we want the retrieval rate to be 1 so thatthe BSM-approximated model R mimics the full matrix R.This is an important performance indicator for BSM sincethe closer to 1 the retrieval rate is, the closer the perfor-mance of the BSM-approximated detection system is to theperformance of the full system.

We use the PASCAL VOC 2007 as the dataset for ob-ject detection experiments [8]. As discussed in Section2.4, a large model collection (i.e., large M ) is neces-sary to demonstrate the benefits of our approach. There-fore we choose to use Malisiewicz et al.’s Exemplar SVM(ESVM)[18] method since it has a large number of pre-trained models immediately available, which enables to usea published baseline rather than retraining our own. Specif-ically, there are 12608 ESVM models for exemplars across20 categories. The ESVM models are trained using a sin-gle positive data point with a large number of negative datapoints (both are HOG feature vectors) to serve as the repre-sentation of a particular exemplar. We use 12×12 templatesthroughout our experiments so that D, the dimension of thefeature, is 12×12×31 = 4464 (the templates of size smallerthan 12×12 are padded to be 12×12 by placing them to theupper-left corner of the enlarged templates). For a typicalimage in the PASCAL VOC 2007 testing set, the number ofwindows N , is within 15000-25000.

4By candidate “model-window” pairs, we refer to those pairs that haveresponses above the threshold.

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

aero

plan

e

bicy

cle bird

boat

bottl

e

bus

car

cat

chai

r

cow

dini

ngta

ble

dog

hors

e

mot

orbi

ke

pers

on

potte

dpla

nt

shee

p

sofa

train

tvm

onito

r

Retri

eval

Rat

e

Retrieval Rate for all categories, p=0.005

SVDBSM

0.005 0.01 0.015 0.020

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

p

Retri

eval

Rat

e

Retrieval Rate for different p

SVDBSM

Figure 3: The left graph shows the retrieval rate of bothSVD and BSM on all 20 categories of PASCAL VOC 2007when p is fixed to 0.005. Whereas the right graph shows theretrieval rate of both methods with varying p, the results areaveraged over 20 categories. It is clear that our method ismuch better than the baseline in terms of the retrieval ratemeasure over different categories and different p.

We note that even though we use ESVM to test our ap-proach, we can use any detector that can be formulated as(1). In particular, any window scanning detector would fitour approach. We choose ESVM as a convenient way of us-ing a large number of models to demonstrate the benefits ofour approach. In particular, we do not advocate that ESVMis the best choice of detection algorithm on this dataset.

Data for Training the Bilinear Separation Model: Inorder to obtain data for optimizing the objective (7), wesample 500 positive images and 500 negative images foreach category from the trainval set of PASCAL VOC 2007.With these 1000 images per category, the detectors are runto gather the windows with scores above the threshold. Togather the windows which have scores below the threshold,we directly sample 10 random windows per image to forma pool.

Efficient Model Evaluation Using Bilinear SeparationModel: We follow the standard evaluation protocol of PAS-CAL VOC 2007 for ESVM, the SVD baseline (SVD) aswell as the proposed bilinear separation model (BSM). Forthe SVD baseline and BSM, the exemplar models W arereplaced by the rank-L matrices U and V . Then, after ob-taining an approximation R as shown in (2), a small fraction0 < p < 1 of the “model-window” pairs with the highestestimated responses are re-evaluated. Then, NMS and theco-occurrence matrix are applied to get the final detectionresults as in [18].

The detection performance of 6 out 20 categories on thePASCAL VOC 2007, measured by the mean average pre-cision, is shown in Figure 25. As we can see, BSM outper-forms the baseline method by a large margin and has almostthe same performance as the exhaustive model evaluationused in ESVM (measured by mean AP 22.3% vs 22.6%).Also, the simple SVD approach (Section 2.1) does not workwell (mean AP 20.2% vs 22.6%) due to the shortcomingsdiscussed in Section 2.1.

5For the full results on all 20 categories, please see our supplementarymaterial.

cls:aeroplane, HOG glyph cls:aeroplane, invert HOG cls:bicycle, HOG glyph cls:bicycle, invert HOG cls:boat, HOG glyph cls:boat, invert HOG

cls:bottle, HOG glyph cls:bottle, invert HOG cls:car, HOG glyph cls:car, invert HOG cls:cat, HOG glyph cls:cat, invert HOG

cls:person, HOG glyph cls:person, invert HOG cls:sofa, HOG glyph cls:sofa, invert HOG cls:tvmonitor, HOG glyph cls:tvmonitor, invert HOG

Figure 4: The visualization of the column of U , which isthe low-rank projection basis, learned from optimizing (7).We show 6 entries from basis of 6 different categories. Thevisualization is in both HOG glyph[3] and inverted HOG[30]. The spatial invariance is clearly encoded in the tv-monitor category, in which one can see the contours of TVmonitors with different scales.

Another important performance measure that affects thefinal mean AP is the retrieval rate, as defined above. Thecomparison of the retrieval rate of SVD baseline and BSMare shown in Figure 3. As we can see, the retrieval rateof the proposed BSM method largely outperforms the base-line method on all 20 categories, which is consistent withthe superior performance in terms of the mean AP measure.Also, the retrieval rate with varying p (while L is fixed), theratio of the re-evaluated models, is shown in Figure 3. Theresults in Figure 3 are averaged over all 20 categories.

To better understand what has been learned by optimiz-ing the BSM objective (7), we visualize some entries of thelow-rank projection basis, i.e., the columns of matrix U , forvarious categories in Figure 4 using both the HOG glyph[3] and iHOG [30]. There are several interesting points wewould like to note in Figure 4: 1) For categories that havemore coherent appearances across large number of exem-plars, the basis learned through BSM is visually meaning-ful, for example, the categories aeroplane, bicycle, bottle,car, person, etc. 2) For categories that have large intra-category appearance variations, for example the cat cate-gory, the learned basis does not seem to be visually mean-ingful. This is also consistent with the poor detection per-formance on that category. 3) By looking at the basis of thetvmonitor category, we can clearly see the scale invarianceencoded in the basis by the contours of the TV monitorswith different scales.

For all the results in Figure 2, L, which is the rank of Uand V , is set to 0.01×D = 45, whereas p, the ratio of the re-evaluated “model-window” pairs, is set to 0.005. Anotherparameter specifically for BSM, γ in (7), is set to 0.1.

Parameter Sensitivity: In order to understand the ef-fects of different parameters, L, p and γ. We select 5 out of20 categories, namely aeroplane, bicycle, boat, horse andtrain, to perform a parameter search over the grid L = D×{0.004, 0.006, 0.008, 0.01}, p = {0.005, 0.01, 0.015, 0.02}

0 0.2 0.4 0.6 0.80

0.2

0.4

0.6

0.8

1

cls:aeroplane

ESVM:0.207

SVD:0.162

BSM:0.191

0 0.2 0.4 0.6 0.80

0.2

0.4

0.6

0.8

1

cls:bicycle

ESVM:0.480

SVD:0.443

BSM:0.480

0 0.2 0.4 0.6 0.80

0.2

0.4

0.6

0.8

1

cls:cow

ESVM:0.186

SVD:0.147

BSM:0.195

0 0.2 0.4 0.6 0.80

0.2

0.4

0.6

0.8

1

cls:diningtable

ESVM:0.111

SVD:0.096

BSM:0.108

0 0.2 0.4 0.6 0.80

0.2

0.4

0.6

0.8

1

cls:sheep

ESVM:0.226

SVD:0.195

BSM:0.222

0 0.2 0.4 0.6 0.80

0.2

0.4

0.6

0.8

1

cls:train

ESVM:0.367

SVD:0.320

BSM:0.358

Figure 2: The Precision-Recall curve of 6 out 20 categories in PASCAL VOC 2007 with three methods: the standard ESVM(cyan), the SVD baseline (blue) and the proposed BSM (red). As we can see, on most categories, the performance of BSMoutperforms the SVD baseline with a large margin, and is almost the same with ESVM, which evaluates all “model-window”pairs exhaustively.

0.004 0.006 0.008 0.010

0.05

0.1

0.15

0.2

0.25

0.3

L

mA

P

BSM

SVD

0.004 0.006 0.008 0.010

0.1

0.2

0.3

0.4

0.5

0.6

L

Re

trie

va

l R

ate

BSM

SVD

0.005 0.01 0.015 0.020

0.05

0.1

0.15

0.2

0.25

0.3

p

mA

P

BSM

SVD

0.005 0.01 0.015 0.020

0.2

0.4

0.6

0.8

p

Re

trie

va

l R

ate

BSM

SVD

−2 −1 0 1 20.319

0.32

0.321

0.322

0.323

0.324

0.325

log(γ)

mA

P

−2 −1 0 1 2

0.65

0.7

0.75

log(γ)

Re

trie

va

l R

ate

2000 2500 3000 3500 4000 45000

5

10

15

20

25

# of Models

Mo

de

l E

va

lua

tio

n T

ime

(s)

Standard ESVM

BSM

Figure 5: The left 3 columns show the effects of varyingthe parameters L, p and γ. Two out of three parametersL = D×0.01, p = 0.005 and γ = 0.1 are fixed to seethe effect of another parameter. The first row shows the ef-fects of different parameters on the mAP measure whereasthe second row shows the effects on the retrieval rate. Inall plots except for those of γ, which is not a parameter inthe SVD baseline, both the SVD baseline and the proposedBSM are plotted. The graph on the right-end is the com-parison of the computation of model evaluation between di-rectly applying (1) (ESVM) and applying low-rank approx-imation (BSM).

and γ = {0.01, 0.1, 1, 10, 100}. To show the effects ofdifferent parameters, we plot both the mAP measure andretrieval rate in Figure 5. Two of the three values L =D×0.01, p = 0.005 and γ = 0.1 are fixed to show theeffect of another parameter. As we can see from Figure 5,1) The performance increases as we increase L, which isthe rank-constraint. 2) BSM outperforms SVD with a largemargin on all sets of parameters. 3) Also, the performancemeasured by mAP is not sensitive to p, this means that wedo not need a large p, which incurs more computation, toget a good detection performance. 4) BSM prefers a smallγ which emphasizes the importance of the data-dependentterm.

Computation Efficiency: Since the main reason forusing low-rank approximation to the parameter matrix iscomputation efficiency, we perform experiments to studythe computational benefits in object detection with BSMcompared with the standard ESVM detection. Since weare interested in the effect of using low-rank approxima-

tion for model evaluation, we do not consider the time forall pre- and post-processing like computing the HOG andNon Maximum Suppression. We conduct experiments onthe person category of PASCAL VOC 2007 since it is thecategory with the largest number of ESVM models (4690models). We fix the rank L to 0.01×D = 45 where wecan achieve almost the identical performance to standardESVM evaluation and show in Figure 5, the computationas the function of M ranging from 2000 to 4500 models,for both the standard ESVM evaluation and the low-rankapproximation. As we can see, as we increase M , the com-putation of the standard ESVM grows much faster than thatof BSM, which is consistent with our theoretical analysisin Section 2.4. Note that the speed-up of BSM over thestandard ESVM is not as large as the theoretical analy-sis because in practice, the actual computation time con-sists of both feature computation and model evaluation. Wecan achieve around 3 folds speed-up with considering allthe pre- and post- processing with un-optimized MATLABcode.

Finally, as we noted earlier, our aim is to reduce thecomplexity/number of the “model-window” pairs prior tocomputing the scores, not to optimize the matrix productcomputation itself. As a result, other approaches that at-tempt to reduce computation by optimizing the computa-tions involved such as GPU implementations, fast matrixoperations, or quantization can be readily incorporated inthe BSM implementation to achieve further speed-up.

3.2. Image ClassificationFor the image classification task, we use the ImageNet

ILSVRC 2010 dataset which is the only version of ILSVRCthat has released both its testing images and the ground truthlabels for them6. To best use the existing pipeline, we adoptthe state-of-the-art feature generated by Krizhevsky et al.’snetwork [15] and trained our own 1000-way softmax regres-sor W on ILSVRC 2010 training set. We use U and Vlearned from both SVD and BSM, as we described in Sec-

6Note that both the number of categories and images are almost identi-cal to the more recent ILSVRC 2012.

tion 2.6, to obtain an approximation ofR as R7. The perfor-mance of the standard softmax classifierW , and the approx-imations with SVD and BSM is shown in Table 1. First, aswe increase the rank L, we get better and better classifica-tion performance as we expected. Also, our method couldachieve performance that is not only much better than thatof the SVD baseline, but also better than the original soft-max regression classifier when using only around 4% rankrelative to D. We suspect this is because we reduce theoverfitting by projecting down the feature dimension.L/D 0.005 0.01 0.015 0.02 0.04 0.06 0.08 0.1 SOFTMAXSVD 0.191 0.366 0.470 0.534 0.635 0.672 0.690 0.703 0.728BSM 0.444 0.634 0.690 0.709 0.735 0.745 0.750 0.750

Table 1: Comparison of the classification accuracy betweenSVD and BSM, as well as the standard softmax regressionclassifier. The results are shown with different L/D, whichis the ratio between L, the constrained rank, and D, thedimension of the feature (In this case 4096).

4. ConclusionIn this paper, we have investigated the problem of effi-

cient model evaluation with a bilinear separation model. Byoptimizing an objective function which is data-dependentand directly related to the prediction performance for thetarget task, we are able to achieve the same or even bet-ter performance for both object detection (PASCAL VOC2007) and image classification (ImageNet ILSVRC 2010)tasks with much less computation, i.e., more precisely, withgrowth in computation much slower than the growth in thenumber of models, which is consistent with our theoreticalanalysis on computation efficiency.

References[1] J. Carreira and C. Sminchisescu. Constrained parametric min-cuts

for automatic object segmentation. In CVPR 2010, pages 3241–3248.IEEE. 2

[2] A. Coates, P. Baumstarck, Q. Le, and A. Y. Ng. Scalable learningfor object detection with gpu hardware. In IROS 2009, pages 4287–4293. IEEE. 2

[3] N. Dalal and B. Triggs. Histograms of oriented gradients for humandetection. In CVPR 2005, pages 886–893. IEEE. 1, 6

[4] T. Dean, M. A. Ruzon, M. Segal, J. Shlens, S. Vijayanarasimhan,and J. Yagnik. Fast, accurate detection of 100,000 object classes ona single machine. In CVPR 2013, pages 1814–1821. IEEE. 2

[5] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Ima-geNet: A Large-Scale Hierarchical Image Database. In CVPR 2009.2

[6] C. Dubout and F. Fleuret. Exact acceleration of linear object detec-tors. In ECCV 2012, pages 301–311. Springer. 2

[7] I. Endres and D. Hoiem. Category independent object proposals. InECCV 2010, pages 575–588. Springer. 2

[8] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisser-man. The pascal visual object classes (voc) challenge. Internationaljournal of computer vision, 88(2):303–338, 2010. 5

7We use the same data from ILSVRC 2010 training set to train theBSM.

[9] P. F. Felzenszwalb, R. B. Girshick, and D. McAllester. Cascade ob-ject detection with deformable part models. In CVPR 2010, pages2241–2248. IEEE. 2

[10] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan.Object detection with discriminatively trained part-based models.TPAMI 2010, 32(9):1627–1645. 1

[11] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hier-archies for accurate object detection and semantic segmentation. InCVPR 2014. 2

[12] Y. Gong, S. Kumar, H. A. Rowley, and S. Lazebnik. Learning binarycodes for high-dimensional data using bilinear projections. In CVPR2013, pages 484–491. IEEE. 2

[13] J. Gorski, F. Pfeuffer, and K. Klamroth. Biconvex sets and optimiza-tion with biconvex functions: a survey and extensions. MathematicalMethods of Operations Research, 66(3):373–407, 2007. 2, 4

[14] B. Hariharan, J. Malik, and D. Ramanan. Discriminative decorrela-tion for clustering and classification. In ECCV 2012, pages 459–472.Springer. 1

[15] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classifica-tion with deep convolutional neural networks. In NIPS 2012, vol-ume 1, page 4. 2, 7

[16] C.-P. Lee and C.-J. Lin. A study on l2-loss (squared hinge-loss) mul-ticlass svm. Neural computation, 25(5):1302–1323, 2013. 4

[17] Y. Lin, F. Lv, S. Zhu, M. Yang, T. Cour, K. Yu, L. Cao, and T. Huang.Large-scale image classification: fast feature extraction and svmtraining. In CVPR 2011, pages 1689–1696. IEEE. 2

[18] T. Malisiewicz, A. Gupta, and A. A. Efros. Ensemble of exemplar-svms for object detection and beyond. In ICCV 2011, pages 89–96.IEEE. 1, 2, 5, 6

[19] S. Manen, M. Guillaumin, L. Van Gool, and K. Leuven. Prime objectproposals with randomized prims algorithm. In ICCV 2013. 2

[20] M. Pedersoli, A. Vedaldi, and J. Gonzalez. A coarse-to-fine approachfor fast deformable object detection. In CVPR 2011, pages 1353–1360. IEEE. 2

[21] H. Pirsiavash and D. Ramanan. Steerable part models. In CVPR2012, pages 3226–3233. IEEE. 2

[22] H. Pirsiavash, D. Ramanan, and C. Fowlkes. Bilinear classifiers forvisual recognition. In NIPS 2009, pages 1482–1490. 1, 2

[23] J. D. Rennie and N. Srebro. Fast maximum margin matrix factor-ization for collaborative prediction. In ICML 2005, pages 713–719.ACM. 4

[24] M. A. Sadeghi and D. Forsyth. Fast template evaluation with vectorquantization. In NIPS 2013, pages 2949–2957. 2

[25] J. Sanchez, F. Perronnin, T. Mensink, and J. Verbeek. Image clas-sification with the fisher vector: Theory and practice. Internationaljournal of computer vision, 105(3):222–245, 2013. 2

[26] H. O. Song, T. Darrell, and R. B. Girshick. Discriminatively activatedsparselets. In ICML 2013, pages 196–204. 1, 2, 3

[27] H. O. Song, S. Zickler, T. Althoff, R. Girshick, M. Fritz, C. Geyer,P. Felzenszwalb, and T. Darrell. Sparselet models for efficient mul-ticlass object detection. In ECCV 2012, pages 802–815. Springer. 1,2, 3

[28] J. B. Tenenbaum and W. T. Freeman. Separating style and contentwith bilinear models. Neural computation, 12(6):1247–1283, 2000.1, 2

[29] P. Viola and M. Jones. Robust real-time object detection. Interna-tional Journal of Computer Vision, 4:34–47, 2001. 2

[30] C. Vondrick, A. Khosla, T. Malisiewicz, and A. Torralba. Invert-ing and visualizing features for object detection. arXiv preprintarXiv:1212.2278, 2012. 6

[31] J. Ye, R. Janardan, Q. Li, et al. Two-dimensional linear discriminantanalysis. In NIPS 2004, volume 4, page 4. 2

efﬁcient model evaluation with bilinear separation model · using a bilinear separation model...

Documents