cvpr learning scene-specific ... - cv-foundation.org[1]vishnu naresh boddeti, takeo kanade, and bvk...

1
Learning Scene-Specific Pedestrian Detectors without Real Data Hironori Hattori 1 , Vishnu Naresh Boddeti 2 , Kris Kitani 2 , Takeo Kanade 2 1 Sony Corporation. 2 Carnegie Mellon University. Joint Classifier Ensemble Learning Scene Geometry Pedestrian Renderings Scene Simulation Location Specific Patches Pedestrian Detectors Detection Result Figure 1: Overview: For every grid location, geometrically correct renderings of pedestrian are synthetically generated using known scene information such as camera calibration parameters, obstacles (red), walls (blue) and walkable areas (green). All location-specific pedestrian detectors are trained jointly to learn a smoothly varying appearance model. Multiple scene-and-location-specific detectors are run in parallel at every grid location. Consider the scenario in which a new surveillance system is installed in a novel location and an image-based pedestrian detector must be trained without access to real scene-specific pedestrian data. A similar situation may arise when a new imaging system (i.e., a custom camera with unique lens distortion) has been designed and must be able to detect pedestrians without the expensive process of collecting data with the new imaging de- vice. One can use a generic pedestrian detection algorithm trained over co- pious amounts of real data to work robustly across many scenes. However, generic models are not always best-suited for detection in specific scenes. In many surveillance scenarios, it is more important to have a customized pedestrian detection model that is optimized for a single scene. Optimiz- ing for a single scene however often requires a labor intensive process of collecting labeled data – drawing bounding boxes of pedestrians taken with a particular camera in a specific scene. The process also takes time, as recorded video data must be manually mined for various instances of cloth- ing, size, pose and location to build a robust pedestrian appearance model. Creating a situation-specific pedestrian detector enables better performance but it is often costly to train. Our goal is to develop a method for training a ‘data-free’ scene-specific pedestrian detector which outperforms generic pedestrian detection algorithms (i.e., HOG-SVM[2], DPM[3]). The above scenarios are ill posed zero-instance learning problems, where an image-based pedestrian detector must be created without having access to real data. Fortunately, in these scenarios we have access to two im- portant pieces of information: (1) the camera’s calibration parameters, and (2) scene geometry (in the form of static location of objects, ground plane and walls). In this work, we show that with this information, it is possi- ble to generate geometrically accurate simulations of pedestrian appearance as training data (computer generated pedestrians) to act as a proxy for the real data. This allows us to learn a highly accurate scene-specific pedes- trian detector. Moreover, we show that by using this ‘data-free’ technique (i.e.,, does not require real pedestrian data), we are still able to train a scene- specific pedestrian that outperforms several baseline techniques. They key idea of our approach is to maximize the geometric informa- tion about the scene to compensate for the lack of real training data. A geometrically consistent method for synthetic data generation has the fol- lowing advantages. (1) An image-based pedestrian detector can be trained on a wider range of pedestrian appearance. Instead of waiting and collect- ing real data, it is now possible to generate large amounts of simulated data over a wide range of pedestrian appearance (e.g., clothing, height, weight, gender) on demand. (2) Pedestrian data can be generated for any location in the scene. Taken to the extreme, a synthetic data-generation framework can be used learn a customized pedestrian appearance model for every possible location (pixel) in the scene. (3) The location of static objects in the scene can be incorporated into data synthesis to preemptively train for occlusion. In our proposed approach, we simultaneously learn hundreds of pedes- trian detectors for a single scene using millions of synthetic pedestrian im- ages. Since our approach is purely dependent on synthetic data, the al- gorithm requires no real-world data. The main contributions of our work are as follows: (1) the first work to learn a scene-specific location-specific geometry-aware pedestrian detection model using purely synthetic data and (2) an efficient and scalable algorithm for discriminatively learning a large number of scene-specific location-specific pedestrian detectors. Our algo- rithmic framework makes use of highly-efficient correlation filters[1] as our basic detection unit and globally optimizes each model by taking into the account the appearance of a pedestrian over a small spatial region. Our experiments showed that our model outperforms several base- line approaches–classical pedestrian detection models, hybrid synthetic-real models and a baseline which leverages the scene geometry and camera cali- bration parameters at the inference stage–both in terms of image plane local- ization as well as localization in 3D for the task of scene-specific pedestrian detection. More importantly our results also yield a surprising result, that our method using purely synthetic data is able to outperform models trained on real scene-specific data when data is limited. [1] Vishnu Naresh Boddeti, Takeo Kanade, and BVK Kumar. Correlation filters for object alignment. In CVPR, 2013. [2] Navneet Dalal and Bill Triggs. Histograms of oriented gradients for human detection. In CVPR, 2005. [3] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part based models. PAMI, 32(9):1627–1645, 2010. This is an extended abstract. The full paper is available at the Computer Vision Foundation webpage.

Upload: others

Post on 05-Aug-2020

0 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: CVPR Learning Scene-Specific ... - cv-foundation.org[1]Vishnu Naresh Boddeti, Takeo Kanade, and BVK Kumar. Correlation filters for object alignment. In CVPR, 2013. [2]Navneet Dalal

Learning Scene-Specific Pedestrian Detectors without Real Data

Hironori Hattori1, Vishnu Naresh Boddeti2, Kris Kitani2, Takeo Kanade2

1Sony Corporation. 2Carnegie Mellon University.

108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161

162163164165166167168169170171172173174175176177178179180181182183184185186187188189190191192193194195196197198199200201202203204205206207208209210211212213214215

CVPR#168

CVPR#168

CVPR 2015 Submission #168. CONFIDENTIAL REVIEW COPY. DO NOT DISTRIBUTE.

JointClassifierEnsembleLearning

Scene Geometry

Pedestrian Renderings

Scene Simulation

Location Specific Patches Pedestrian Detectors

Detection Result

Figure 2. Overview. For every grid location, geometrically correct renderings of pedestrian are synthetically generated using known sceneinformation such as camera calibration parameters, obstacles (red), walls (blue) and walkable areas (green). All pedestrian detectors aretrained jointly to learn a smoothly varying appearance model. Multiple scene-specific detectors are run in parallel at every grid location.

single scene. Optimizing for a single scene however of-ten requires a labor intensive process of collecting labeleddata – drawing bounding boxes of pedestrians taken witha particular camera in a specific scene. The process alsotakes time, as recorded video data must be manually minedfor various instances of clothing, size, pose and locationto build a robust pedestrian appearance model. Creatinga situation-specific pedestrian detector enables better per-formance but it is often costly to train. In this work, weprovide an alternative technique for learning high-qualityscene-specific pedestrian detector without the need for realpedestrian data. Instead we leverage the geometric informa-tion of the scene (i.e. static location of objects, ground planeand walls) and the parameters of the camera to generate ge-ometrically accurate simulations of pedestrian appearancethrough the use of computer generated pedestrians. In thisway, we are able to synthesize data as a proxy to the realdata, allowing us to learn a highly accurate scene-specificpedestrian detector.

They key idea of our approach is to maximize the ge-ometric information about the scene to compensate forthe lack of real training data. A geometrically consistentmethod for synthetic data generation has the following ad-vantages. (1) An image-based pedestrian detector can betrained on a wider range of pedestrian appearance. Insteadof waiting and collecting real data, it is now possible to gen-erate large amounts of simulated data over a wide range ofpedestrian appearance (e.g. clothing, height, weight, gen-der) on demand. (2) Pedestrian data can be generated forany location in the scene. Taken to the extreme, a syn-thetic data-generation framework can be used learn a cus-tomized pedestrian appearance model for every possible lo-cation (pixel) in the scene. (3) The location of static objectsin the scene can be incorporated into data synthesis to pre-emptively train for occlusion.

In our proposed approach, we simultaneously learn hun-dreds of pedestrian detectors for a single scene using mil-lions of synthetic pedestrian images. Since our approachis purely dependent on synthetic data, the algorithm re-quires no real-world data. To learn the set of scene-specificlocation-specific pedestrian detectors, we propose an effi-cient and scalable appearance learning framework. Our al-gorithmic framework makes use of highly-efficient corre-lation filters as our basic detection unit and globally opti-mizes each model by taking into the account the appearanceof a pedestrian over a small spatial region. We compareour approach to several generically trained baseline mod-els and show that our proposed approach generates a betterperforming scene-specific pedestrian detector. More impor-tantly, our experimental results over multiple data sets showthat our ‘data-free’ approach actually outperforms modelsthat are trained on real scene-specific pedestrian data whendata is limited.Contributions. The contribution of our work is as follows:(1) the first work to learn a scene-specific location-specificgeometry-aware pedestrian detection model using purelysynthetic data and (2) an efficient and scalable algorithm forlearning a large number of scene-specific location-specificpedestrian detectors.

2. Related Work3D Models for Detection. The idea of using 3D syntheticdata for 2D object detection is not new. Brooks [8] usedcomputerized 3D primitives to describe 2D images. Dhomeet al. used computer generated models have been usedto recognize articulated objects from a single image [10].3D computer graphics models have been used for mod-eling human shape [16, 7], body-part gradient templates[1], full-body gradient templates [21] and hand appearance[26, 2, 27]. In addition to modeling people, 3D simulation

2

Figure 1: Overview: For every grid location, geometrically correct renderings of pedestrian are synthetically generated using known scene informationsuch as camera calibration parameters, obstacles (red), walls (blue) and walkable areas (green). All location-specific pedestrian detectors are trainedjointly to learn a smoothly varying appearance model. Multiple scene-and-location-specific detectors are run in parallel at every grid location.

Consider the scenario in which a new surveillance system is installedin a novel location and an image-based pedestrian detector must be trainedwithout access to real scene-specific pedestrian data. A similar situationmay arise when a new imaging system (i.e., a custom camera with uniquelens distortion) has been designed and must be able to detect pedestrianswithout the expensive process of collecting data with the new imaging de-vice. One can use a generic pedestrian detection algorithm trained over co-pious amounts of real data to work robustly across many scenes. However,generic models are not always best-suited for detection in specific scenes.In many surveillance scenarios, it is more important to have a customizedpedestrian detection model that is optimized for a single scene. Optimiz-ing for a single scene however often requires a labor intensive process ofcollecting labeled data – drawing bounding boxes of pedestrians taken witha particular camera in a specific scene. The process also takes time, asrecorded video data must be manually mined for various instances of cloth-ing, size, pose and location to build a robust pedestrian appearance model.Creating a situation-specific pedestrian detector enables better performancebut it is often costly to train. Our goal is to develop a method for traininga ‘data-free’ scene-specific pedestrian detector which outperforms genericpedestrian detection algorithms (i.e., HOG-SVM[2], DPM[3]).

The above scenarios are ill posed zero-instance learning problems,where an image-based pedestrian detector must be created without havingaccess to real data. Fortunately, in these scenarios we have access to two im-portant pieces of information: (1) the camera’s calibration parameters, and(2) scene geometry (in the form of static location of objects, ground planeand walls). In this work, we show that with this information, it is possi-ble to generate geometrically accurate simulations of pedestrian appearanceas training data (computer generated pedestrians) to act as a proxy for thereal data. This allows us to learn a highly accurate scene-specific pedes-trian detector. Moreover, we show that by using this ‘data-free’ technique(i.e.,, does not require real pedestrian data), we are still able to train a scene-specific pedestrian that outperforms several baseline techniques.

They key idea of our approach is to maximize the geometric informa-tion about the scene to compensate for the lack of real training data. Ageometrically consistent method for synthetic data generation has the fol-lowing advantages. (1) An image-based pedestrian detector can be trained

on a wider range of pedestrian appearance. Instead of waiting and collect-ing real data, it is now possible to generate large amounts of simulated dataover a wide range of pedestrian appearance (e.g., clothing, height, weight,gender) on demand. (2) Pedestrian data can be generated for any location inthe scene. Taken to the extreme, a synthetic data-generation framework canbe used learn a customized pedestrian appearance model for every possiblelocation (pixel) in the scene. (3) The location of static objects in the scenecan be incorporated into data synthesis to preemptively train for occlusion.

In our proposed approach, we simultaneously learn hundreds of pedes-trian detectors for a single scene using millions of synthetic pedestrian im-ages. Since our approach is purely dependent on synthetic data, the al-gorithm requires no real-world data. The main contributions of our workare as follows: (1) the first work to learn a scene-specific location-specificgeometry-aware pedestrian detection model using purely synthetic data and(2) an efficient and scalable algorithm for discriminatively learning a largenumber of scene-specific location-specific pedestrian detectors. Our algo-rithmic framework makes use of highly-efficient correlation filters[1] as ourbasic detection unit and globally optimizes each model by taking into theaccount the appearance of a pedestrian over a small spatial region.

Our experiments showed that our model outperforms several base-line approaches–classical pedestrian detection models, hybrid synthetic-realmodels and a baseline which leverages the scene geometry and camera cali-bration parameters at the inference stage–both in terms of image plane local-ization as well as localization in 3D for the task of scene-specific pedestriandetection. More importantly our results also yield a surprising result, thatour method using purely synthetic data is able to outperform models trainedon real scene-specific data when data is limited.

[1] Vishnu Naresh Boddeti, Takeo Kanade, and BVK Kumar. Correlationfilters for object alignment. In CVPR, 2013.

[2] Navneet Dalal and Bill Triggs. Histograms of oriented gradients forhuman detection. In CVPR, 2005.

[3] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan.Object detection with discriminatively trained part based models. PAMI,32(9):1627–1645, 2010.

This is an extended abstract. The full paper is available at the Computer Vision Foundation webpage.