A Possible Reason for why Data-Driven Beats Theory-Driven Computer Vision

John K. Tsotsos, Iuliia Kotseruba
York University

tsotsos, yulia [email protected]

Alexander Andreopoulos
IBM Research

[email protected]

Yulong Wu
Aion Foundation

[email protected]

Abstract

Why do some continue to wonder about the success and dominance of deep learning methods in computer vision and AI? Is it not enough that these methods provide practical solutions to many problems? Well no, it is not enough, at least for those who feel there should be a science that underpins all of this and that we should have a clear understanding of how this success was achieved. Here, this paper proposes that the dominance we are witnessing would not have been possible by the methods of deep learning alone: the tacit change has been the evolution of empirical practice in computer vision and AI over the past decades. We demonstrate this by examining the distribution of sensor settings in vision datasets and the performance of both classic and deep learning algorithms under various camera settings. This reveals a strong mismatch between optimal performance ranges of classical theory-driven algorithms and sensor setting distributions in the common vision datasets, while data-driven models were trained for those datasets. The head-to-head comparisons between data-driven and theory-driven models were therefore unknowingly biased against the theory-driven models.

1. Introduction

There are many classic volumes that define the field of computer vision (e.g., Rosenfeld 1976 [34], Marr 1982 [27], Ballard & Brown 1982 [4], Horn 1986 [18], Koenderink 1990 [21], Faugeras 1993 [11], Szeliski 2010 [42] and more). There, the theoretical foundations of image and video processing, analysis, and perception are developed both theoretically and practically. In part, these represent what we call here theory-driven computer vision models.

A geometrical and physical understanding of the image formation process, from illuminant to camera optics to image creation, as well as the material properties of the surfaces that interact with incident light and the motions of objects in a scene, was mathematically modeled in the hope that, when those equations were simulated by a computer, they would reveal the kinds of structures that human vision also perceives. An excellent example of this development process appears in [52], which details the importance of physical and mathematical constraints for the theory development process, with many examples from early work on edges, textures and shape analysis. It is difficult to deny the theoretical validity of those approaches, and from the earliest days of computer vision the performance of these theory solutions had always appeared promising, with a large literature supporting this (see the several articles in [39] on all aspects of computer vision for reviews of early work) as well as many commercial successes.

However, during most of the history of computer vision, the discipline suffered from two main problems (see [2]). Firstly, computational power and memory were too meagre to deal with the requirements of vision (theoretically shown in [46, 45]). Secondly, the availability of large sets of test data that could be shared and could permit replication of results was limited. An appropriate empirical methodology and tradition to guide such testing and replication was also missing.

The first problem gradually improved as Moore's Law played out. Especially important was the advent of GPUs in the late 1990s [30], with their general availability a few years later. Major progress was made on the second problem with the introduction of what might be the first large collection of images for computer vision testing, namely MNIST [23]. During the preceding decade, advances in machine learning began to have an important impact on computer vision (e.g., the Turk & Pentland Eigenface system [47], based on learning low-dimensional representations for faces, following Sirovich & Kirby [40]).

Whereas the scarcity of data precluded extensive use of learning methods in the early days, the emergence of large image sets encouraged exploration of how systems might learn regularities, say in object appearance, over large sets of exemplars. Earlier papers pointed to the utility of images of handwritten digits in testing recognition and learning methods (e.g., work by Hinton et al. [17]), so the creation of the MNIST set was both timely and impactful. As a result, the community witnessed the emergence of data-driven computer vision models created by extracting statistical regularities from a large number of image samples.

The MNIST set was soon joined by others. The PASCAL Visual Object Classes (VOC) Challenge began in 2005 [10], ImageNet in 2010 [37], and more. The contribution of these datasets and challenges towards the acceleration of developments in computer vision is undeniable.

It is widely accepted that the AlexNet paper [22] was a turning point in the discipline. It outperformed theory-driven approaches by a wide margin and thus began the surrender of those methods to data-driven ones. But how did that success come about? Was it simply the fact that AlexNet was a great feat of engineering? Likely this is true; but here, we suggest that there may be a bit more to it. Empirical methodology matters. How it has evolved within the community also played a role, perhaps a key role, so that the theory-driven computer vision paradigms never had a chance. In the rush to celebrate and exploit this success, no one noticed.

The next sections lay out our path to supporting this assertion. It is interesting to note that what led us to this was work towards understanding how active setting of camera parameters affects certain computer vision algorithms. We summarize the argument here.

2. Effect of Sensor Settings for Interest Point and Saliency Algorithms

Previous work explored the use of interest point/saliency algorithms in an active sensing context and investigated how they perform with varying camera parameters, in order to develop a method for dynamically setting those parameters depending on task and context. The experiments revealed a very strong dependence on settings for a range of algorithms. The patterns seemed orderly, as if determined by some physical law; the data exhibited a strong and clear structure. A brief summary is provided here, with more detail in Andreopoulos & Tsotsos [1].

The authors evaluated the effects of camera shutter speed and voltage gain, under simultaneous changes in illumination, and demonstrated significant differences in the sensitivities of popular vision algorithms under those variations. They began by creating a dataset that reflected different cameras, camera settings, and illumination levels. They used one CCD sensor (Bumblebee stereo camera, Point Grey Research) and one CMOS sensor (FireflyMV camera, Point Grey Research) to obtain the datasets. The permissible shutter exposure time and gain were uniformly quantized with 8 samples in each dimension. There were an equal number of samples for each combination of sensor settings.
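
To make the sampling scheme concrete, here is a minimal sketch of such a capture grid; the numeric shutter and gain ranges below are placeholders, not the actual limits of the cameras used in [1].

```python
import numpy as np
from itertools import product

# Placeholder permissible ranges; the real camera limits in [1] differ.
SHUTTER_RANGE_MS = (0.1, 66.0)   # exposure time in milliseconds (assumed)
GAIN_RANGE_DB = (0.0, 24.0)      # voltage gain in dB (assumed)
N_LEVELS = 8                     # 8 samples per dimension, as in [1]

# Uniformly quantize each dimension.
shutter_levels = np.linspace(*SHUTTER_RANGE_MS, N_LEVELS)
gain_levels = np.linspace(*GAIN_RANGE_DB, N_LEVELS)

# Every shutter/gain combination receives the same number of images,
# so the dataset is balanced over sensor settings.
capture_grid = list(product(shutter_levels, gain_levels))
assert len(capture_grid) == N_LEVELS * N_LEVELS  # 64 combinations
```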

Then, the resulting images were processed by several algorithms, popular before the time of writing: the Harris-Affine and Hessian-Affine region detectors [29], the Scale Saliency interest-point detector [20], the Maximally Stable Extremal Regions algorithm (MSER) [28], and the Speeded-Up Robust Features extraction algorithm (SURF) [5].

Figure 1: Adapted from Andreopoulos & Tsotsos [1], showing the performance of various descriptors in terms of precision-recall values for each combination of sensor settings (collapsed across all illumination conditions tested). The descriptors tested are (a) Harris-Affine; (b) Hessian-Affine; (c) MSER; (d) SURF; (e) Scale Saliency. In all plots shutter speed increases from top to bottom and gain increases from left to right. In [1] precision and recall values are thresholded at 0.5, so we set the appropriate bins to 0.5 in the images above.

Performance was determined by first selecting a target image for each scenario (camera setting and illumination combination); this became the image with respect to which the change in the detected features under different shutter/gain values and different scene illumination is evaluated for every other image. The target image is identical regardless of the interest-point or saliency algorithm being tested for a given dataset, so that the results acquired with different algorithms are comparable (the scene is unchanging). Precision and recall are then determined based on how well an algorithm detects those target features as the scenario varies for the other images in the dataset. This is fully detailed in [1].
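
As a rough illustration of this kind of scoring, the sketch below matches detected interest-point locations against the target image's features within a hypothetical pixel tolerance; the actual criterion in [1] also accounts for region scale and shape, so this is only an approximation.

```python
import numpy as np
from scipy.spatial import cKDTree

def precision_recall(detected_xy, target_xy, tol_px=5.0):
    """Score detections in a test image against the features found in the
    target image of the same scene, using a simple distance threshold."""
    detected = np.asarray(detected_xy, dtype=float)
    target = np.asarray(target_xy, dtype=float)
    if len(detected) == 0 or len(target) == 0:
        return 0.0, 0.0
    # Precision: fraction of detections that land near some target feature.
    d_to_t, _ = cKDTree(target).query(detected)
    precision = float(np.mean(d_to_t <= tol_px))
    # Recall: fraction of target features recovered by some detection.
    t_to_d, _ = cKDTree(detected).query(target)
    recall = float(np.mean(t_to_d <= tol_px))
    return precision, recall
```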

The results, summarized in Figure 1, show a strong structure defining the good performance ranges of each algorithm. They all have different shapes and point out that the sensor and illumination settings that lead to good performance are particular to each algorithm. This suggests that common datasets used to evaluate vision algorithms may suffer from a significant sensor-specific bias, which can make many of the experimental methodologies used to evaluate vision algorithms unable to provide results that generalize in less controlled environments. Simply put, if one wished to use one of these specific algorithms for a particular application, then it is necessary to ensure that the images processed are acquired using the sensor setting ranges that yield good performance (see Figure 1). Such considerations are rarely observed.

Also tested in [1], but not shown here, are the saliency algorithms of Itti-Koch-Niebur [19] and the Bruce-Tsotsos AIM algorithm [6]. Those tests also revealed that good performance depends on different illumination conditions for different sensor settings. This demonstrates the inability of a single constant shutter/gain value to optimally discern the most salient image regions under modest changes in the illumination, or as the surface reflectance properties change. Some of their worst performance appeared when using the auto-gain and auto-exposure mechanism. Such sensor mechanisms, which rely on the mean image intensity, are very commonly used in practice, yet lead to extremely poor and inconsistent results when those images are processed using any of these algorithms.

3. Effect of Sensor Settings on Object Detection Algorithms

The same test for more recent recognition algorithms, both classic and deep learning methods, was performed by Wu & Tsotsos [49]. The authors created a dataset containing 2240 images in total, by viewing 5 different objects (bicycle, bottle, chair, potted plant and tv monitor), at 7 levels of illumination and with 64 camera configurations (8 shutter speeds, 8 voltage gains). As in the previous experiment, there were an equal number of samples for each combination of sensor settings. To accurately measure the illumination of the scene, a Yoctopuce light sensor was used. Also, intensity-controllable light bulbs were used to achieve different light conditions: 50lx, 200lx, 400lx, 800lx, 1600lx and 3200lx. The digital camera was a Point Grey Flea3. The allowed ranges of shutter speed and voltage gain were uniformly sampled into 8 distinct values in each dimension.

(a) R-CNN    (b) SPP-net    (c) DPM    (d) BoW

Figure 2: Results of the evaluation of object detection algorithms on the custom dataset for different sets of shutter speed and gain values, adapted from Wu & Tsotsos [49]. In all plots shutter speed increases from top to bottom and gain increases from left to right. As in Figure 1, mAP values (as defined in [38]) are shown for each sensor setting combination and a particular illumination level. Here we show the results for the high illumination conditions, which most closely reflect what is found in the common vision datasets.

Four popular object detection algorithms were selected for evaluation: the Deformable Part Models (DPM) [12], the Bag-of-Words model with Spatial Pyramid Matching (BoW) [48], the Regions with Convolutional Neural Networks (R-CNN) [14], and the Spatial Pyramid Pooling in Deep Convolutional Networks (SPP-net) [16].

The results are shown in Figure 2. This time, mean average precision (mAP) values were not thresholded and are plotted intact. The original work presented these plots for each of the tested illumination levels and demonstrated that performance depends significantly on illumination level as well as sensor settings and does not easily generalize across these variables. As before, if one wished to use one of these specific algorithms for a particular application, then it is necessary to ensure that the images processed are acquired using the sensor setting ranges that yield good performance (see Figure 2).

In general, it can be seen that there is less orderly structure when compared to the previous set of tests (thus making any characterization of ‘good performing’ sensor settings more difficult), and the authors wondered about the reason underlying this difference. Could the difference be due to an uneven distribution of training samples along those dimensions? And if so, could overall performance be influenced by such bias?

4. Distributions of Sensor Parameters in Common Computer Vision Datasets

As mentioned, the two studies above made us curious about the reasons behind the uneven and unexpected performance patterns across algorithms. After thorough verification of the methods employed, we concluded that some imbalance in data distribution across sensing parameters might be the cause.

Surprisingly, among works on various biases in vision datasets, few acknowledge the existence of sensor bias (or capture bias [43]) and none provide quantitative statistics. Hence, we set out to test this. We selected two common datasets: Common Objects in Context (COCO) [25] and VOC2007, the dataset used in the PASCAL Visual Object Classes Challenge in 2007 [9].

Since both datasets consist of images gathered from Flickr, we used the Flickr API to recover EXIF data (a set of tags for camera settings provided by the camera vendor) for each image. In the COCO dataset, 59% and 58% of train and validation data respectively had EXIF data available. We use the trainval35K split commonly used for training object detection algorithms (e.g., SSD [26], YOLOv3 [31], and RetinaNet [24]).
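
For illustration, a minimal sketch of this kind of EXIF recovery, using the pycocotools and flickrapi packages; the Flickr photo id is parsed from the flickr_url field stored with each COCO image record, and the credentials, file path, and the exact field names of the returned EXIF structure are assumptions rather than the authors' code.

```python
import re
import flickrapi
from pycocotools.coco import COCO

API_KEY, API_SECRET = "YOUR_KEY", "YOUR_SECRET"   # placeholder credentials
flickr = flickrapi.FlickrAPI(API_KEY, API_SECRET, format="parsed-json")

coco = COCO("annotations/instances_val2014.json")  # assumed annotation file

def exif_for_image(img_record):
    # COCO image records carry the original Flickr static URL; its file name
    # begins with the numeric photo id, e.g. .../8119368305_4e622c8349_z.jpg
    match = re.search(r"/(\d+)_", img_record["flickr_url"])
    if match is None:
        return None
    try:
        resp = flickr.photos.getExif(photo_id=match.group(1))
    except Exception:
        return None  # EXIF not public, photo removed, or API error
    # Field names below are assumed from the Flickr API documentation.
    tags = {e["tag"]: e["raw"]["_content"] for e in resp["photo"]["exif"]}
    return {"ExposureTime": tags.get("ExposureTime"),
            "FNumber": tags.get("FNumber"),
            "ISO": tags.get("ISO")}

sample = exif_for_image(coco.loadImgs(coco.getImgIds()[:1])[0])
print(sample)
```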

In the VOC2007 dataset, 31% of train and test data had EXIF data available; however, as Figure 3 shows, the distributions of exposure times and ISO settings for photographs in the COCO and VOC2007 datasets are similar. Note that most images are taken with automatic ISO settings such as 100, 200, 400 and 800 (recall the earlier test results regarding automatic camera settings). Most exposure times are short (below 0.1s) and spike around 1/60s. This agrees with the large-scale statistical analysis of millions of images found online in [51].

Using shutter speed, f-number and ISO we can compute the exposure value (EV) using the following formula, as in [50]:

2^{EV} = \frac{f^2}{t \cdot \frac{ISO_{setting}}{ISO_{100}}},

where EV is the exposure value, f is the f-number, t is the exposure time and ISO_{setting} is the ISO used to take the photograph. From EV we can derive the illumination level (assuming that the photograph is properly exposed) as in [51]. We define low illumination between -4 and 7 EV (up to 320lx), mid-level illumination between 8 and 10 EV (640 to 2560lx) and high-level illumination above 11 EV (more than 5120lx), which approximately matches the setup in [49]. Figure 4 gives the distributions of exposure times (shutter speeds) in the COCO and VOC2007 datasets.
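
To make the mapping concrete, here is a small worked sketch of the EV computation and illumination binning defined above; the example camera settings are hypothetical.

```python
import math

def exposure_value(f_number, exposure_time_s, iso_setting):
    """EV normalized to ISO 100, assuming a properly exposed photograph."""
    return math.log2(f_number ** 2 / (exposure_time_s * iso_setting / 100.0))

def illumination_level(ev):
    # Bins used in the text: low up to 7 EV, mid 8-10 EV, high from 11 EV.
    if ev <= 7:
        return "low"
    if ev <= 10:
        return "mid"
    return "high"

# Hypothetical example: f/4, 1/60 s, ISO 200.
ev = exposure_value(4.0, 1 / 60.0, 200)
print(round(ev, 1), illumination_level(ev))  # roughly 8.9 -> "mid"
```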

Figure 3: Distribution of exposure time and ISO in the training (a) and validation (b) sets in terms of % of the total images in the COCO set (with EXIF data). Distribution of exposure time and ISO in the training (c) and test (d) sets of the VOC2007 dataset (with available EXIF data). Note the uneven distribution of images in the bins. Some of the bins are empty since there are no images in the dataset obtained with those camera settings.

Table 1 shows the image counts at each illumination level; not surprisingly, nearly 90% of all images were acquired under high to medium illumination conditions.

         Illumination Level
         High            Med             Low
Train    42,045 (61%)    18,456 (27%)    8,054 (12%)
Val      1,832 (62%)     771 (27%)       354 (12%)

Table 1: Number of samples taken in high/medium/low illumination conditions in the training and validation splits of the COCO dataset.

If our hypothesis was correct that some imbalance in data distribution across sensing parameters might be the cause of the uneven performance, then this should be revealed by examining algorithm performance across these sensor setting ranges.

5. Object Detection on Images With Different Sensor Parameters from COCO Dataset

We next investigated how different sensor parameters represented in those datasets affect the performance of object detection algorithms.

Figure 4: Distributions of exposure times (shutter speeds) in (a) COCO and (b) VOC2007 datasets. Distributions of ISO values in (c) COCO and (d) VOC2007 datasets. Train and validation data splits are shown in green and blue.

We used several state-of-the-art object detection algorithms, specifically Faster R-CNN [32], Mask R-CNN [15], YOLOv3 [31] and RetinaNet [24]. All are trained on the COCO trainval35K set. Figures 3a and 3b show the percentages of images for a range of the shutter speed and ISO settings in the COCO train and validation sets. The bin edges of the heatmaps approximately match the ranges reported in [1] and [49]. Since the shutter speeds of the cameras used in the previous works were limited to 1s, in our setup all images with exposure time > 1s fall into the last bin, and exposure time values between 0 and 1s are split into 8 equal intervals. Both [1] and [49] report gain, which is not available on most consumer cameras; therefore we use ISO values as a proxy. The following ISO bin ranges [0, 100, 200, 400, 800, 1600, 3200, 6400, 10000] approximately correspond to the gain values used in the previous works.
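
A minimal sketch of the binning just described, assuming EXIF exposure time in seconds and a numeric ISO value; how values exactly on a boundary are assigned is a detail chosen arbitrarily here.

```python
import numpy as np

# Bin edges mirroring the description above: exposure time 0-1 s split into
# 8 equal intervals (anything above 1 s falls into the last bin), and ISO
# edges standing in for the gain levels of [1] and [49].
EXPOSURE_EDGES = np.linspace(0.0, 1.0, 9)          # 8 bins over [0, 1] s
ISO_EDGES = [0, 100, 200, 400, 800, 1600, 3200, 6400, 10000]

def bin_indices(exposure_time_s, iso):
    """Return (row, col) of the heatmap cell for one image's EXIF values."""
    row = min(int(np.digitize(exposure_time_s, EXPOSURE_EDGES[1:-1])), 7)
    col = min(int(np.digitize(iso, ISO_EDGES[1:])), 7)
    return row, col

# Hypothetical EXIF values: 1/60 s at ISO 400 lands in the first exposure
# row and the column whose left edge is ISO 400.
print(bin_indices(1 / 60.0, 400))  # (0, 3)
```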

Figure 5 shows evaluation results in terms of mean average precision (mAP) for state-of-the-art object detection algorithms trained on the COCO dataset and evaluated on the portion of the COCO 5K minival set with available EXIF data, presented in the same style as the previous tests. It is difficult to compare our results with the results of the previous works directly, because of the differences in the evaluation datasets, algorithms (interest point detection vs object detection), camera parameters (gain vs ISO), the inability to precisely establish illumination level in common vision datasets, and possible inconsistencies in computing average precision in each case. Nevertheless, a great deal can be observed.

(a) YOLOv3    (b) Faster R-CNN    (c) Mask R-CNN    (d) RetinaNet

Figure 5: Results of the evaluation of modern object detection algorithms on the minival subset of COCO for different sets of shutter speed and ISO values. Mean average precision for bounding boxes at 0.5 IoU (mAP IoU=.50) is computed for each bin using the COCO API.

Note that nearly 90% of the training and validation data in COCO is concentrated in the top row of the diagram (very short exposure times and ISO values of up to 800). Figure 5 reveals very similar results from all 4 algorithms that are trained on this dataset, suggesting possible training bias. It is also apparent that the mAP values in the top row are consistent with the reported performance of the algorithms but fluctuate wildly in bins that contain less representative camera parameter ranges. It is hard to attribute this fluctuation entirely to the sensor bias, as other factors may be at play (e.g., types and number of objects, the small number of images in the underrepresented bins). This is something that should be investigated further. It is never a useful property for an algorithm to display such significant sensitivity to small parameter changes. For imaging, one might expect a small shift in shutter speed, for example, to lead to only small changes in subsequent detection performance; this experiment shows this is not the case.
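
For concreteness, per-bin scoring of the kind shown in Figure 5 could be computed with pycocotools along the following lines; the file paths and the bin_to_img_ids mapping (built from the recovered EXIF data, e.g. with the binning sketch above) are placeholders, not the authors' code.

```python
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

coco_gt = COCO("annotations/instances_minival2014.json")  # assumed path
coco_dt = coco_gt.loadRes("detections_retinanet.json")    # assumed path

def map50_for_bin(img_ids):
    """mAP for bounding boxes at IoU=0.50, restricted to one bin's images."""
    ev = COCOeval(coco_gt, coco_dt, iouType="bbox")
    ev.params.imgIds = list(img_ids)
    ev.evaluate()
    ev.accumulate()
    ev.summarize()
    return ev.stats[1]  # index 1 of COCOeval.stats is AP at IoU=0.50

# Placeholder: map each (shutter-speed row, ISO column) bin to the COCO image
# ids whose recovered EXIF values fall in that bin.
bin_to_img_ids = {}
heatmap = {b: map50_for_bin(ids) for b, ids in bin_to_img_ids.items() if ids}
```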

6. Discussion

The above experiments revealed several key points. First, theory-driven algorithms seem to have an orderly pattern of performance with respect to the sensor settings we examined. This may be due to their analytic definitions; they were not designed to be parameterized for the full range of sensor settings. One can easily conclude that if good performance is sought from any of these algorithms, they should be employed with cameras set to the algorithm's inherent optimal ranges. Of course, this limited test needs to be greatly expanded before finalizing this conclusion. However, in an engineering context this makes good sense: commercial products are commonly wrapped in detailed instructions about how to use and not use a product to ensure expected performance. In retrospect, knowledge of algorithm performance under varying sensor settings would have enhanced their design.

Second, the same test on modern algorithms reveals a haphazard performance pattern. It might be that some of the large variations are due to biases in the data, maybe some due to the particular objects in question, and others may be due to the properties of the network architectures (e.g., [35]). This also needs much deeper analysis. Nevertheless, the performance pattern of Figure 5 is not entirely inconsistent with the training sample distributions of Figures 3 and 4. The often-seen claims of generalizable behavior for such algorithms would lead one to think that they might not be so sensitive to biases or other variations, but they clearly are. Further, even though the performance shown in Figure 5 effectively uses only about 60% of the dataset as tests, the original system was trained on the full dataset.

Third, examination of two popular image sets, VOC2007 and COCO, shows that image metadata (sensor settings, camera pose, illumination, etc.) is often not available. This means that for any given ‘in the wild’ set of images, the performance of data-driven methods may be predicted by how well the distribution of images along the dimensions of sensor setting and illumination parameters in a test set matches the distribution resulting from the training set. This requires further verification.
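
One simple way to quantify such a train/test match, offered here only as an illustration and not taken from the paper, is histogram intersection over the sensor-setting bins.

```python
import numpy as np

def histogram_intersection(train_counts, test_counts):
    """Similarity in [0, 1] between two sensor-setting histograms
    (e.g. 8x8 shutter/ISO bin counts), compared as distributions."""
    p = np.asarray(train_counts, dtype=float).ravel()
    q = np.asarray(test_counts, dtype=float).ravel()
    p /= p.sum()
    q /= q.sum()
    return float(np.minimum(p, q).sum())

# Hypothetical 2x2 toy histograms: a heavily top-row-weighted training set
# versus a test set spread more evenly over sensor settings.
print(histogram_intersection([[90, 5], [3, 2]], [[40, 20], [25, 15]]))  # 0.5
```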

Fourth, and perhaps most revealing, we see that a comparison of the high-performing regions in Figure 1 with the distribution of sensor settings in Figure 4 reveals only a tiny overlap: on those datasets, those algorithms would all perform terribly. This is not due to their bad design. It is due partly to their use outside of their design specifications, in this case, empirically obtained camera sensor settings. More complete consideration of empirical methodology, where all aspects of the data are recorded and used to ensure reproducibility, might have led to different results in a head-to-head comparison.

This may also suggest that generalization of performance across image sets may not be a sensible expectation. In fact, the direct comparison of theory-driven vs data-driven methods was inappropriate because the theory-driven methods never had a chance due to the test set mismatches just described. Any of the interest point algorithms in our first test could be applied to any image of either the COCO or VOC2007 datasets and would perform terribly, only because the test ranges are not matched to their inherent operating ranges.

In some sense, the community has largely ignored sensor settings and their relevance as described here. This may be due to the belief that since humans can be shown images from any source, representing any sort of object or scene, familiar or unfamiliar, and provide reasonable responses as to their content, computer vision systems should also. This goal is “baked in” to the overall research enterprise, and variations are considered either as nuisances or as good tests of generalization ability. As an example of the latter, there are a number of studies investigating the effects of various image degradations on the performance of deep-learning-based algorithms. These include both artificially distorted images [33, 13, 36, 7] as well as more realistic transformations, for example, viewing the image through a cheap web camera [44].

However, the proposition itself is not well-formed. It is simply untrue that humans, when shown an image out of the vast range of possibilities, perform in the same manner with respect to time to respond, accuracy of response, manner of response, and internal processing strategy. The behavioral literature solidly proves this. Even the computational literature points to this. While humans have been shown to be more robust than state-of-the-art deep networks to nearly all image degradations examined, human performance still gradually drops off as the levels of distortions increase [8, 13].

As we mentioned earlier, most of the experiments investigating the robustness of deep networks to image distortions do not tie those distortions to the changes resulting from camera settings. Perhaps changing ISO from 100 to 200, and the resulting increase in noise, would not be easily noticeable to the human eye, but it may have a measurable effect on the performance of CNNs, which are known to be affected by very small changes in the images (e.g., small affine transformations [3], various types of noise [13]). As [41] shows, changing a single pixel in the right place is sometimes all it takes. It should be clear that the “baked in” belief needs much more nuance and a significantly deeper analysis, if not outright rejection.

Finally, as can be seen in Figures 3 and 5, the variability required to train is not even available in the large datasets we considered. The distribution of images across these parameters was very uneven, so training algorithms are impeded with respect to learning the variations. It might be good practice to require similar demonstrations of image distributions across the relevant parameters in order to ensure that not only training, but also comparative evaluations, are properly performed.

7. Conclusions

With all due respect to the terrific advances made in computer vision by the application of deep learning methods (here termed data-driven models), we propose here, and provide some justification, that the empirical methodology that led to the turning point in the discipline was based on an oversight that none of us noticed at the time. Sensor settings matter, and each algorithm, perhaps most especially the theory-driven ones, has ranges within which one might expect good performance and ranges where one should not expect it. Testing outside the ranges is not only unfair but completely inappropriate. Theory-driven algorithms were compared against data-driven algorithms on datasets unrepresentative of the theory-driven algorithms' operating ranges but on which the data-driven algorithms were trained.

If sensor settings (and perhaps also illumination levels or other imaging aspects) had played a role in the large-scale testing of the classic theory-driven algorithms, then perhaps they would have performed at high levels. The community did not do this. In comparing the data-driven with theory-driven algorithms, the distribution of camera settings (unknown to everyone) favored the data-driven algorithms, because they were trained on such a random distribution while the theory-driven algorithms were tested on data for which they were not designed to operate. But no one realized this at the time. Thus the empirical strategy favored data-driven models.

If the empirical strategy specifically included a fuller specification of optimal operating ranges for all algorithms, and each algorithm were tested accordingly, what would their performance rankings be? The theory-driven algorithms would have been tested only on images from camera settings for which they were designed, and perhaps the performance gap would have been smaller. Such greater specificity in experimental design seems more common in other disciplines. An empirical method involves the use of objective, quantitative observation in a systematically controlled, replicable situation, in order to test or refine a theory. The key features of the experiment are control over variables (independent, dependent and extraneous), careful objective measurement, and establishing cause and effect relationships. At the very least, a discussion on how to firm up empiricism in computer vision needs to take place.

The impact of this is great. Knowledge of the sensor setting ranges that lead to poor performance would allow one to ensure the failure of a particular algorithm. This is as true for any head-to-head technical competition as it is for an adversary. Non-linear performance of any algorithm along one or more parameter settings is a highly undesirable characteristic. It may be that a blend of the approaches, data-driven and theory-driven, would permit better specification towards providing linear performance across the relevant parameters, but within a framework where learning can occur using datasets that better represent the target population.

Acknowledgements

This research was funded by the Natural Sciences and Engineering Research Council of Canada, the Canada Research Chairs Program and the Air Force Office of Scientific Research. Part of this work (Sections 2 and 3) was performed when AA and YW were graduate students in the Tsotsos Lab. The authors thank Konstantine Tsotsos, who provided useful comments and suggestions.

References

[1] A. Andreopoulos and J. K. Tsotsos. On sensor bias in experimental methods for comparing interest-point, saliency, and recognition algorithms. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(1), 2011.
[2] A. Andreopoulos and J. K. Tsotsos. 50 years of object recognition: Directions forward. Computer Vision and Image Understanding, 117(8), 2013.
[3] A. Azulay and Y. Weiss. Why do deep convolutional networks generalize so poorly to small image transformations? arXiv preprint arXiv:1805.12177, 2018.
[4] D. H. Ballard and C. M. Brown. Computer Vision. Prentice-Hall, 1982.
[5] H. Bay, T. Tuytelaars, and L. Van Gool. SURF: Speeded Up Robust Features. In Proceedings of the European Conference on Computer Vision (ECCV), 2006.
[6] N. D. Bruce and J. K. Tsotsos. Saliency, attention, and visual search: An information theoretic approach. Journal of Vision, 9(3), 2009.
[7] Z. Che, A. Borji, G. Zhai, X. Min, G. Guo, and P. Le Callet. GazeGAN: A generative adversarial saliency model based on invariance analysis of human gaze during scene free viewing. arXiv preprint arXiv:1905.06803, 2019.
[8] S. Dodge and L. Karam. A study and comparison of human and deep learning recognition performance under visual distortions. In Proceedings of the 26th International Conference on Computer Communication and Networks (ICCCN), 2017.
[9] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2007 (VOC2007) Results. http://www.pascal-network.org/challenges/VOC/voc2007/workshop/index.html.
[10] M. Everingham, A. Zisserman, C. K. Williams, L. Van Gool, M. Allan, C. M. Bishop, O. Chapelle, N. Dalal, T. Deselaers, G. Dorko, et al. The 2005 PASCAL Visual Object Classes Challenge. In Machine Learning Challenges Workshop, 2005.
[11] O. Faugeras. Three-Dimensional Computer Vision: A Geometric Viewpoint. MIT Press, 1993.
[12] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(9), 2009.
[13] R. Geirhos, C. R. Temme, J. Rauber, H. H. Schutt, M. Bethge, and F. A. Wichmann. Generalisation in humans and deep neural networks. In Advances in Neural Information Processing Systems (NeurIPS), 2018.
[14] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
[15] K. He, G. Gkioxari, P. Dollar, and R. Girshick. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017.
[16] K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(9), 2015.
[17] G. E. Hinton, C. K. Williams, and M. D. Revow. Adaptive elastic models for hand-printed character recognition. In Advances in Neural Information Processing Systems, 1992.
[18] B. K. P. Horn. Robot Vision. MIT Press, 1986.
[19] L. Itti, C. Koch, and E. Niebur. A model of saliency-based visual attention for rapid scene analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, (11), 1998.
[20] T. Kadir and M. Brady. Saliency, scale and image description. International Journal of Computer Vision, 45(2), 2001.
[21] J. J. Koenderink. Solid Shape. MIT Press, 1990.
[22] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, 2012.
[23] Y. LeCun. The MNIST Database of Handwritten Digits. http://yann.lecun.com/exdb/mnist/, 1998.
[24] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollar. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017.
[25] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollar, and C. L. Zitnick. Microsoft COCO: Common Objects in Context. In Proceedings of the European Conference on Computer Vision (ECCV), 2014.
[26] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. SSD: Single Shot MultiBox Detector. In Proceedings of the European Conference on Computer Vision (ECCV), 2016.
[27] D. Marr. Vision: A Computational Investigation into the Human Representation and Processing of Visual Information. MIT Press, 1982.
[28] J. Matas, O. Chum, M. Urban, and T. Pajdla. Robust wide-baseline stereo from maximally stable extremal regions. Image and Vision Computing, 22(10), 2004.
[29] K. Mikolajczyk and C. Schmid. Scale & affine invariant interest point detectors. International Journal of Computer Vision, 60(1), 2004.
[30] J. Nickolls and W. J. Dally. The GPU computing era. IEEE Micro, 30(2), 2010.
[31] J. Redmon and A. Farhadi. YOLOv3: An incremental improvement. arXiv preprint arXiv:1804.02767, 2018.
[32] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, 2015.
[33] E. Rodner, M. Simon, R. B. Fisher, and J. Denzler. Fine-grained recognition in the noisy wild: Sensitivity analysis of convolutional neural networks approaches. arXiv preprint arXiv:1610.06756, 2016.
[34] A. Rosenfeld. Digital Picture Processing. Academic Press, 1976.
[35] A. Rosenfeld, R. Zemel, and J. K. Tsotsos. The elephant in the room. arXiv preprint arXiv:1808.03305, 2018.
[36] P. Roy, S. Ghosh, S. Bhattacharya, and U. Pal. Effects of degradations on deep neural network architectures. arXiv preprint arXiv:1807.10108, 2018.
[37] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision, 115(3), 2015.
[38] G. Salton. Introduction to Modern Information Retrieval. McGraw-Hill, 1983.
[39] S. Shapiro, editor. The Encyclopedia of Artificial Intelligence. John Wiley & Sons, New York, 2nd edition, 1992.
[40] L. Sirovich and M. Kirby. Low-dimensional procedure for the characterization of human faces. Journal of the Optical Society of America A, 4(3), 1987.
[41] J. Su, D. V. Vargas, and K. Sakurai. One pixel attack for fooling deep neural networks. IEEE Transactions on Evolutionary Computation, 2019.
[42] R. Szeliski. Computer Vision: Algorithms and Applications. Springer, 2010.
[43] T. Tommasi, N. Patricia, B. Caputo, and T. Tuytelaars. A deeper look at dataset bias. In Domain Adaptation in Computer Vision Applications. Springer, 2017.
[44] A. W. Tow, S. Shirazi, J. Leitner, N. Sunderhauf, M. Milford, and B. Upcroft. A robustness analysis of deep Q networks. In Proceedings of the Australasian Conference on Robotics and Automation, 2016.
[45] J. K. Tsotsos. A complexity level analysis of vision. International Journal of Computer Vision, 1, 1987.
[46] J. K. Tsotsos. The complexity of perceptual search tasks. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), volume 89, 1989.
[47] M. A. Turk and A. P. Pentland. Face recognition using eigenfaces. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 1991.
[48] J. R. Uijlings, K. E. Van De Sande, T. Gevers, and A. W. Smeulders. Selective search for object recognition. International Journal of Computer Vision, 104(2), 2013.
[49] Y. Wu and J. Tsotsos. Active control of camera parameters for object detection algorithms. arXiv preprint arXiv:1705.05685, 2017.
[50] D. Wueller and R. Fageth. Statistic analysis of millions of digital photos. In Digital Photography IV, volume 6817, 2008.
[51] D. Wueller and R. Fageth. Statistic analysis of millions of digital photos 2017. Electronic Imaging, 2018(5), 2018.
[52] S. Zucker. Vision, early. In S. Shapiro, editor, The Encyclopedia of Artificial Intelligence. John Wiley & Sons, New York, 2nd edition, 1987.