preprint on arxiv 1 ptb-tir: a thermal infrared ...tracking system. the experimental findings are...

10
PREPRINT ON ARXIV 1 PTB-TIR: A Thermal Infrared Pedestrian Tracking Benchmark Qiao Liu, Student Member, IEEE, Zhenyu He, Senior Member, IEEE, Abstract—Thermal infrared (TIR) pedestrian tracking is one of the most important components in numerous applications of computer vision, which has a major advantage: it can track the pedestrians in total darkness. How to evaluate the TIR pedestrian tracker fairly on a benchmark dataset is significant for the development of this field. However, there is no a benchmark dataset. In this paper, we develop a TIR pedestrian tracking dataset for the TIR pedestrian tracker evaluation. The dataset includes 60 thermal sequences with manual annotations. Each sequence has nine attribute labels for the attribute based eval- uation. In addition to the dataset, we carry out the large-scale evaluation experiments on our benchmark dataset using nine public available trackers. The experimental results help us to understand the strength and weakness of these trackers. What’s more, in order to get insight into the TIR pedestrian tracker more sufficiently, we divide a tracker into three components: feature extractor, motion model, and observation model. Then, we conduct three comparison experiments on our benchmark dataset to validate how each component affects the tracker’s performance. The findings of these experiments provide some guidelines for future research. Index Terms—thermal infrared, pedestrian tracking, bench- mark, dataset I. I NTRODUCTION A Long with the development of thermal imaging technol- ogy, both quality and resolution of the thermal images have been improved to a great extent. That enables a series of computer vision tasks based on thermal images can be applied to the civilian field. TIR pedestrian tracking is one of the most important vision tasks which has received much attention in recent years. It has wide applications such as surveillance, driving assistance, and rescue at night [1]. Numerous TIR pedestrian trackers [2]–[8] have been proposed over the past decade. Despite much progress has been achieved, TIR pedestrian tracking still faces many challenges, e.g., occlu- sion, background clutter, motion blur, low resolution, thermal crossover, etc. Evaluating and comparing the TIR pedestrian tracker fairly is crucial for the development of this filed. Usually, there is not a single tracker that can handle all challenges. Therefore, it is necessary to compare and analyze the strength and weakness of each tracker, thereby giving some guidelines to the future study for a better tracker. In order to do that, it is critical to collect a representative dataset. Several TIR datasets are used, such as OSU Color-Thermal [9], Terravic Motion IR [10], PDT-ATV [11], and BU-TIV [12] databases. However, it can Q. Liu, Z. He (Corresponding author) are with the School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, China. e-mail: ([email protected]). not fair compare the TIR pedestrian tracker on these datasets due to the following reasons. First, these datasets just have a few sequences that are not sufficient for fair evaluation. Second, the thermal images just are captured from a thermal camera that lacks diversity. Third, the thermal images have low resolution and static background that not suit to various real- world scenarios. Fourth, these datasets do not provide a unified ground-truth and attribute annotations. All of these reasons show that this field lacks a benchmark dataset to fair evaluate the TIR pedestrian tracker. Modern TIR pedestrian tracker is a complicated system which is consisted of several separate components. Each component can affect its performance. Evaluating a TIR pedestrian tracker as a whole helps us to understand its overall performance, but cannot know the effectiveness of each component. To understand the TIR pedestrian tracker more sufficiently, it is necessary to compare each component sep- arately. Similar to [13], we divide the TIR pedestrian tracker into three components: feature extractor, motion model, and observation model. After dividing, some interesting questions will be raised involuntarily. Which feature is more suitable for TIR pedestrian tracking? How different motion models and observation models affect the tracking performance? The answers to these questions provide some references to the direction of future research. In this paper, to fair evaluate and compare TIR pedestrian trackers, we first collect a TIR pedestrian dataset 1 for short- term tracking task. The dataset includes 60 video sequences with manual annotations. The entire dataset is divided into nine different attribute subsets. On each subset, we can evaluate the tracker’s abiliaty of handling the corresponding challenge. Then, we carry out a large-scale fair performance evaluation on several public available trackers. Finally, in order to get insight into the TIR pedestrian tracking, we further conduct three validation experiments on three compoents of the tracker. First, we compare several different features on a baseline tracker to analyze which feature is more suitable for TIR pedestrian tracking. Second, we compare several different motion models on a baseline tracker to analyze how they affect the tracking results. Third, to explore how different observation models affect the tracking performance, we compare several trackers with different observation models. The findings of these three experiments make us understand the TIR pedestrian tracker more sufficiently. The contributions of the paper are two-fold: 1 The PTB-TIR dataset and code library can be accessed at http://www. hezhenyu.cn/PTB-TIR.html arXiv:1801.05944v1 [cs.CV] 18 Jan 2018

Upload: others

Post on 06-Apr-2020

0 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: PREPRINT ON ARXIV 1 PTB-TIR: A Thermal Infrared ...tracking system. The experimental findings are analyized to given some guidelines for future research. The rest of the paper is

PREPRINT ON ARXIV 1

PTB-TIR: A Thermal Infrared Pedestrian TrackingBenchmark

Qiao Liu, Student Member, IEEE, Zhenyu He, Senior Member, IEEE,

Abstract—Thermal infrared (TIR) pedestrian tracking is oneof the most important components in numerous applications ofcomputer vision, which has a major advantage: it can track thepedestrians in total darkness. How to evaluate the TIR pedestriantracker fairly on a benchmark dataset is significant for thedevelopment of this field. However, there is no a benchmarkdataset. In this paper, we develop a TIR pedestrian trackingdataset for the TIR pedestrian tracker evaluation. The datasetincludes 60 thermal sequences with manual annotations. Eachsequence has nine attribute labels for the attribute based eval-uation. In addition to the dataset, we carry out the large-scaleevaluation experiments on our benchmark dataset using ninepublic available trackers. The experimental results help us tounderstand the strength and weakness of these trackers. What’smore, in order to get insight into the TIR pedestrian trackermore sufficiently, we divide a tracker into three components:feature extractor, motion model, and observation model. Then,we conduct three comparison experiments on our benchmarkdataset to validate how each component affects the tracker’sperformance. The findings of these experiments provide someguidelines for future research.

Index Terms—thermal infrared, pedestrian tracking, bench-mark, dataset

I. INTRODUCTION

ALong with the development of thermal imaging technol-ogy, both quality and resolution of the thermal images

have been improved to a great extent. That enables a series ofcomputer vision tasks based on thermal images can be appliedto the civilian field. TIR pedestrian tracking is one of the mostimportant vision tasks which has received much attention inrecent years. It has wide applications such as surveillance,driving assistance, and rescue at night [1]. Numerous TIRpedestrian trackers [2]–[8] have been proposed over thepast decade. Despite much progress has been achieved, TIRpedestrian tracking still faces many challenges, e.g., occlu-sion, background clutter, motion blur, low resolution, thermalcrossover, etc.

Evaluating and comparing the TIR pedestrian tracker fairlyis crucial for the development of this filed. Usually, there is nota single tracker that can handle all challenges. Therefore, it isnecessary to compare and analyze the strength and weaknessof each tracker, thereby giving some guidelines to the futurestudy for a better tracker. In order to do that, it is critical tocollect a representative dataset. Several TIR datasets are used,such as OSU Color-Thermal [9], Terravic Motion IR [10],PDT-ATV [11], and BU-TIV [12] databases. However, it can

Q. Liu, Z. He (Corresponding author) are with the School of ComputerScience and Technology, Harbin Institute of Technology, Shenzhen, China.e-mail: ([email protected]).

not fair compare the TIR pedestrian tracker on these datasetsdue to the following reasons. First, these datasets just havea few sequences that are not sufficient for fair evaluation.Second, the thermal images just are captured from a thermalcamera that lacks diversity. Third, the thermal images have lowresolution and static background that not suit to various real-world scenarios. Fourth, these datasets do not provide a unifiedground-truth and attribute annotations. All of these reasonsshow that this field lacks a benchmark dataset to fair evaluatethe TIR pedestrian tracker.

Modern TIR pedestrian tracker is a complicated systemwhich is consisted of several separate components. Eachcomponent can affect its performance. Evaluating a TIRpedestrian tracker as a whole helps us to understand itsoverall performance, but cannot know the effectiveness of eachcomponent. To understand the TIR pedestrian tracker moresufficiently, it is necessary to compare each component sep-arately. Similar to [13], we divide the TIR pedestrian trackerinto three components: feature extractor, motion model, andobservation model. After dividing, some interesting questionswill be raised involuntarily. Which feature is more suitablefor TIR pedestrian tracking? How different motion modelsand observation models affect the tracking performance? Theanswers to these questions provide some references to thedirection of future research.

In this paper, to fair evaluate and compare TIR pedestriantrackers, we first collect a TIR pedestrian dataset1 for short-term tracking task. The dataset includes 60 video sequenceswith manual annotations. The entire dataset is divided into ninedifferent attribute subsets. On each subset, we can evaluatethe tracker’s abiliaty of handling the corresponding challenge.Then, we carry out a large-scale fair performance evaluation onseveral public available trackers. Finally, in order to get insightinto the TIR pedestrian tracking, we further conduct threevalidation experiments on three compoents of the tracker. First,we compare several different features on a baseline trackerto analyze which feature is more suitable for TIR pedestriantracking. Second, we compare several different motion modelson a baseline tracker to analyze how they affect the trackingresults. Third, to explore how different observation modelsaffect the tracking performance, we compare several trackerswith different observation models. The findings of these threeexperiments make us understand the TIR pedestrian trackermore sufficiently.

The contributions of the paper are two-fold:

1The PTB-TIR dataset and code library can be accessed at http://www.hezhenyu.cn/PTB-TIR.html

arX

iv:1

801.

0594

4v1

[cs

.CV

] 1

8 Ja

n 20

18

Page 2: PREPRINT ON ARXIV 1 PTB-TIR: A Thermal Infrared ...tracking system. The experimental findings are analyized to given some guidelines for future research. The rest of the paper is

PREPRINT ON ARXIV 2

TABLE I: Comparison of our dataset with other datasets.

OSU Color-Thermal [9] Terravic Motion IR [10] PDT-ATV [11] BU-TIV [12] PTB-TIR (Ours)

Device Type Raytheon PalmIR 250D Raytheon Thermal-Eye 2000AS FLIR Tau 320 FLIR SC8000 More than 8 sources

Resolution 320×240 320×240 324×256 up to 1024×640 up to 1280×720

Bit Depth 8 8 8 16 8

Sequences Number 3 11 8 5 60

Mean Length 2848 500 486 4530 502

Total Frames 8544 5500 3888 22654 30128

• We construct a TIR pedestrian tracking benchmarkdataset with 60 annotated sequences for the TIR pedes-trian tracker evaluation. A large-scale performance eval-uation is implemented on our benchmark with severalpublic available trackers.

• Several validation experiments on the proposed bench-mark are carried out to get insight into the TIR pedestriantracking system. The experimental findings are analyizedto given some guidelines for future research.

The rest of the paper is organized as follows. We firstintroduce TIR pedestrian tracking datasets and methods brieflyin Section II. Then, we present the contents of the proposedbenchmark in Section III. Subsequently, the extensive perfor-mance evaluation and validation experiments are carried out inSection IV and Section V respectively. Finally, we draw a shortconclusion and describe some future works in Section VI.

II. RELATED WORK

In this section, we first introduce existing datasets which canbe used for TIR pedestrian tracking in Section II-A. Then, wediscuss the progress of the TIR pedestrian tracking methodsin Section II-B.

A. TIR Pedestrian Tracking Datasets

There is no a standard and specialized TIR pedestriantracking dataset but several datasets can be used to simplytest TIR pedestrian tracker.

OSU Color-Thermal. The original purpose of this dataset [9]is to do object detection using RGB video and thermal video.The dataset contains 3 TIR pedestrian videos are captured froma fixed thermal sensor: Raytheon PalmIR 250D. These videoshave a low resolution 320× 240 and their backgrounds arestatic.

Terravic Motion IR. This dataset [10] is designed for de-tection and tracking task in the thermal video. It has 18TIR sequences captured from a thermal camera Raytheon L-3Thermal-Eye 2000AS. Among these sequences, 11 pedestriansequences are suitable for tracking task. These sequences areall in the same wild scene with a static background and a lowresolution 320×240.

PDT-ATV. PDT-ATV [11] is a TIR pedestrian tracking anddetection dataset is captured from a simulative unmannedaerial vehicle (UAV). The dataset contains 8 sequences withthe same resolution 324×256. The dataset provides manually

ground-truths of the object but not annotates the attributes ofeach sequence.

BU-TIV. BU-TIV [12] is a large-scale dataset for several vi-sual analysis tasks in TIR videos. It contains 5 TIR pedestrianvideos that can be used for tracking task. The resolution ofthese videos is ranging from 512× 512 to 1024× 640. Thedataset provides the annotations of the object but not offersany evaluation codes for tracking task.

Table I compares the above-mentioned datasets with ourdataset. Although these datasets can be used in the TIRpedestrian tracking to test a tracker, they are not suitablefor fair comparison and evaluation. In this paper, in order tofair compare and evaluate a tracker, we collect a large-scaleTIR pedestrian tracking benchmark dataset with 60 annotatedsequences.

B. TIR Pedestrian Tracking Methods

In the past decade, numerous TIR pedestrian trackingmethods have been proposed to solve the various challenges.General speaking, there are two categories of TIR pedestriantrackers: generative and discriminative. Generative TIR pedes-trian trackers focus on the modeling of pedestrian’s appearanceand search the most similar candidates in the next frame.Some representative methods are template matching [14],[15], sparse representation [5], [16]. Unlike the generativemethods, discriminative TIR pedestrian tracking methods castthe tracking problem as a binary classification problem whichdistinguishes the object target from its backgrounds. Thanksthe advances of the machine learning, a series of classifier canbe used in the TIR pedestrian tracking, such as random for-est [4], mean shift [17], support vector machine (SVM) [18].Here, we discuss these methods according to three components: feature extractor, motion model, and observation model.

Feature Extractor. Obtaining a powerful representation ofthe object target is a crucial step in TIR pedestrian tracking.Different from the visual pedestrian tracking that often usesthe color feature, intensity feature is widely used in the TIRpedestrian tracking due to a basic assumption that the objecttarget is warmer than its background in the thermal images.However, a tracker only using the intensity feature oftenfails when two similar pedestrians are crossing each other.In order to get more discriminative feature representation,several TIR pedestrian trackers exploit multiple features fusionstrategy. For example, Wang et al. [19] combine the intensityand edge cues via an adaptive multi-cue integration scheme.

Page 3: PREPRINT ON ARXIV 1 PTB-TIR: A Thermal Infrared ...tracking system. The experimental findings are analyized to given some guidelines for future research. The rest of the paper is

PREPRINT ON ARXIV 33

crowd3, 0.824OCC,SV,BC

jacket, 0.802OCC,SV,BC

stranger3, 0.781OCC,BC,MB

sidewalk2, 0.751OCC,SV,BC,MB

birds, 0.725OCC,SV,BC,IV

stranger2, 0.724OCC,BC,LR,MB

park1, 0.716OCC,BC,MB,TC

classroom3, 0.710OCC,SV,BC,OV

phone2, 0.709OCC,SV,BC,FM...

sidewalk3, 0.709BC,MB

room2, 0.680OCC,SV,BC,OV...

stranger1, 0.658BC,MB

crouching, 0.634OCC,CV,BC

crowd4, 0.615OCC,BC,OV

meetion3, 0.611OCC,SV,BC,LR...

fighting, 0.603BC,FM,MB,TC

distractor1, 0.603OCC,BC,MB,TC

park5, 0.600OCC,SV,BC,LR...

hiding, 0.549OCC,SV

street1, 0.534SV

phone3, 0.531SV,FM

distractor2, 0.528SV,BC,IV,TC

road3, 0.513OCC,SV,BC,MB

circle2, 0.478LR,MB

sandbeach, 0.473OCC,BC,IV

crossroad2, 0.468BC,LR

crossing, 0.465SV,BC

crowd1, 0.464OCC,BC

conversation,0.460 BC,MB,TC

patrol2, 0.441SV,LR,MB

airplane, 0.438OCC,SV,BC,MB...

meetion4, 0.433SV,BC,LR,MB...

street4,0.425OCC,BC,SV,FM

campus1, 0.406SV,BC,LR

sidewalk1, 0.406SV,BC,LR,FM...

classroom1, 0.400SV,BC,OV

saturated, 0.400OCC,BC,TC

crossroad1, 0.366OCC,BC,LR,TC

patrol1, 0.346SV,BC,TC

circle1, 0.336BC,MB

street5, 0.331OCC,BC,SV,TC

park3, 0.328SV,MB,OV

road1, 0.318OCC,BC

street3, 0.316SV,BC

phone1, 0.299OCC,SV,BC,MB...

school2, 0.298OCC,BC,FM

soldier, 0.295OCC,BC,TC

walking, 0.271MB

Fig. 1: The target’s ground-truth bounding box in the first frame of some selected sequences from our dataset. The sequencesare ranked according to its difficult degree which is the average TRE AUC score of nine trackers on each sequence. Thesequence name, difficult degree, and corresponding attributes are shown below the image respectively. Bule font denotes oldsequences used in the previous studies, while red font denotes for newly collected sequences .

Motion Model. How to quickly and accurately generatea series of target candidates in the next frame is anotherimportant components in the TIR pedestrian tracking. Usu-ally, two categories methods are used to generate the targetcandidates: probabilistic estimation and exhaustive search. Forthe probabilistic estimation method, Kalman filter and particlefilter are often exploited in TIR pedestrian tracking. Forinstance, Xu et al. [17] combine the Kalman filter and meanshift to track the pedestrian by a vehicle-mounted thermalcamera. Wang et al. [19] propose a multiple features fusionTIR pedestrian tracker based on the particle filter framework.Several other TIR pedestrian tracking methods [2]–[4] are alsouse particle filter. For the exhaustive search method, there aretwo kind of methods are often used in visual tracking: slidingwindow [13] and radius sliding window [21], they also suit

for TIR pedestrian tracking.Observation Model. In the discriminative TIR pedestrian

tracking method, the observation models are used to obtain thetracked target choosing from a series of candidates. Usually,we train a binary classifier to complete this task. Thanks tothe developments of machine learning, numerous classificationmethods can be used in this filed. For example, Xu et al. [17]use mean shift to get the final tracked target from a seriestarget candidates generated from Kalman filter. Ko et al. [4]exploit random forest to obtain the tracked target from the can-didates. Some other classification method such as SVM [18],boosting [22], multiple instance learning (MIL) [23] are alsosuit for TIR pedestrian tracking

These three components of the TIR pedestrian tracker effectthe tracking performance with different ways. However, how

Fig. 1: Some annotated sequences of our proposed dataset. The target’s ground-truth bounding box in the first frame is shown.The sequences are ranked according to its difficult degree which is the average TRE score of the tested trackers on eachsequence (see the supplemental material). The sequence name, difficult degree, and corresponding attributes are shown belowthe image respectively. Bule font denotes old sequences used in the previous studies, while red font denotes for newly collectedsequences.

In [3], the authors also fuse the intensity and edge informationusing a relative discriminative coefficient. In [4], the authorscoalesce local intensity distribution (LID) and oriented centersymmetric local binary patterns (OCS-LBP) to represent theTIR pedestrian. What’s more, some other features are oftenused in TIR pedestrian tracking, such as regions of interest(ROI) histogram [2], speeded up robust features (SURF) [15],and the histogram of oriented gradient (HOG) [20].

Motion Model. How to quickly and accurately generatea series of target candidates in the next frame is anotherimportant component in the TIR pedestrian tracking. Usually,two kinds of methods are used to generate the target candi-dates: probabilistic estimation and exhaustive search. For theprobabilistic estimation methods, Kalman filter and particle

filter are often exploited in TIR pedestrian tracking. Forinstance, Xu et al. [17] combine the Kalman filter and meanshift to track the pedestrian by a vehicle-mounted thermalcamera. Wang et al. [19] propose a multiple features fusionTIR pedestrian tracker based on the particle filter framework.Several other TIR pedestrian tracking methods [2]–[4] also usethe particle filter. For the exhaustive search method, there aretwo kinds of method are often used in visual tracking: slidingwindow [13] and radius sliding window [21], they also suitfor TIR pedestrian tracking.

Observation Model. In the discriminative TIR pedestriantracking methods, the observation models are used to obtainthe tracked target from a series of candidates. Usually, wetrain a binary classifier to complete this task. Thanks to the

Page 4: PREPRINT ON ARXIV 1 PTB-TIR: A Thermal Infrared ...tracking system. The experimental findings are analyized to given some guidelines for future research. The rest of the paper is

PREPRINT ON ARXIV 4

developments of machine learning, numerous classificationmethods are used. For example, Xu et al. [17] use mean shift toget the final tracked target from a series of target candidatesgenerated from Kalman filter. Ko et al. [4] exploit randomforest to obtain the tracked target from the candidates. Someother classification method, such as SVM [18], boosting [22],multiple instance learning (MIL) [23], are also suitable forTIR pedestrian tracking

These three components of the TIR pedestrian trackeraffect the tracking performance in different ways. How eachcomponent affects the final tracking results? In this paper,we conduct several validation experiments on our proposedbenchmark to answer this question.

III. TIR PEDESTRIAN TRACKING BENCHMARK

In this section, we first introduce our TIR pedestrian track-ing dataset in Section III-A, and then, present the evaluationmethodology for the TIR pedestrian tracker in Section III-B.These two parts constitute our TIR pedestrian tracking bench-mark.

A. Dataset

In the past decade, several datasets [9]–[12] are used toevaluate the TIR pedestrian tracker. However, these datasetsdo not provide uniform ground-truths for fair evaluation. Toimplement fair performance evaluation and comparison for theTIR pedestrian tracker, we collect 60 thermal sequences, andthen annotate them manually. These sequences come from thedifferent devices, scenes, and shooting time. For each shootingproperty, we collect a series of corresponding thermal videos.More basic properties can be found in Table II. These differentproperties ensure the diversity of the dataset.

TABLE II: Basic properties of the shooting videos and thecorresponding video number.

Property Name Property value: video number

Device Categories Surveillance camera: 29; Hand-held camera: 20;Vehicle-mounted camera: 8; Drone camera: 3

Scene Types Outdoor: 52; Indoor: 8

Shooting Time Night: 42; Day: 18

Camera Motion Static: 43; Dynamic: 17

Camera Views Down: 31; Level: 29

Object Distance Far: 21; Middle: 23; Near: 16

Object Size Big: 12; Middle: 34; Small: 14

Sources. Our datasets are collected from the existing com-monly used thermal sequences and Internet. Four thermalsequences are obtained from OSU Color-Thermal dataset [9].Six sequences are adopted from Terravic Motion IR [10] andnine sequences are chosen from BU-TIV [12]. Two sequencescame from LITIV2012 [24] and five sequences came fromdetection dataset CVC-09 [25] and CVC-14 [26]. In addition,we select seven pedestrian videos from the TIR object trackingbenchmark: VOT-TIR2016 [27]. The rest of the sequencesare collected from the Internet, such as INO dataset [28] andYouTube [29].

Annotations. We use an external rectangle bounding box ofthe target as its ground-truth. The first frame annotations ofseveral sequences are shown in Fig. 1. The left corner point,width, and height of the target’s bounding box are recorded torepresent the ground-truth.

Attributes of a Sequence. Evaluating a TIR pedestrian tackeris usually difficult because several attributes can affect itsperformance. In order to better evaluate the strength and weak-ness of a TIR pedestrian tacker, we summarize nine commonattributes in the sequences, as shown in Table III. For eachattribute, we construct a corresponding subset for evaluation.The performance of a tracker on an attribute subset shows itsability for handling the corresponding challenge. The attributesdistribution of the entire dataset and an attribute subset areshown in Fig. 2. The other subset’s attributes distribution areshown in the supplemental material. We can see that thebackground clutter has a high ratio because the pedestriansoften have similar texture and intensity in the thermal images.In addition, the occlusion and scale variation also often occursin the real-world scenarios. The intensity variation subset onlyincludes three sequences due to the pedestrians tend to haveinvariable temperature in a short time.

TABLE III: Attributes of a thermal sequence.

Chall Description

OCC Occlusion: the target is partially or fully occluded.

SV Scale variation: the ratios of the target’s size in the first frameand current frame is out of the range [1/2, 2].

BC Background clutter: the background near the target has similartexture or intensity.

LR Low resolution: the target’s size is lower than 600 pixels.

FM Fast motion: the distances of target in the consecutive frameare larger than 20 pixels.

MB Motion blur: the target region is blurred due to the target orcamera motion.

OV Out-of-view: the target is partially out of the image region.

IV Intensity variation: the intensity of the target region haschanged due to the temperature variation of the target.

TC Thermal crossover: Two targets with similar intensity crosseach other.

0

20

40

60

OCC SV BC LR FM MB OV IV TC

Attribute distribution- dataset

0

10

20

30

40

OCC SV BC LR FM MB OV IV TC

Attribute distribution- scale variation

Fig. 2: The attributes distribution of the entire dataset and ascale variation subset.

B. Evaluation Methodology

There are two commonly used evaluation metrics in visualobject tracking: center location error (CLE) and overlap ratio.

Page 5: PREPRINT ON ARXIV 1 PTB-TIR: A Thermal Infrared ...tracking system. The experimental findings are analyized to given some guidelines for future research. The rest of the paper is

PREPRINT ON ARXIV 5

In this paper, we also adopt these two metrics to evaluate TIRpedestrian tracker. Follow to OTB [30], we use the precisionand success rate for quantitative analysis.

Precision Plot. CLE is the average Euclidean distance betweenthe tracked target and target’s ground-truth. If a CLE is withina given threshold (20 pixels), we say the tracking is successfulin this frame. The precision plot measures the percentage ofthe successful frames in the entire dataset.

Success Plot. Overlap score measures the overlap ratio be-tween the bounding box area of tracked target and ground-truth. If the score is larger than a given threshold, we say thetracking is successful in this frame. The success plot showsthe ratios of successful frames at the threshold varying from0 to 1. we use the area under curve (AUC) of each successplot to rank the trackers.

Robustness Evaluation. Usually, a tracker is sensitive to theinitialization, and one-pass evaluation (OPE) does not providethe robutness evaluation. In order to measure a tracker’srobustness to the different initialization, we use the temporalrobustness evaluation (TRE) and spatial robustness evaluation(SRE). TRE runs 20 times with different initial frame in asequence, and then is to calculate the average precision andsuccess rate. SRE runs 12 times with different initial boundingbox by shifting or scaling.

Speed Evaluation. Speed is the other important aspect of atracker. We run all trackers on the same hardware device andcalculate their average frames per second (fps) in the entiredataset.

IV. EVALUATION EXPERIMENTS

In this section, we implement a large-scale fair evaluationexperiments on the proposed benchmark. First, an overallperformance evaluation of nine trackers is conducted in Sec-tion IV-A. Second, we evaluate the performance of thesetrackers on the attribute subset in Section IV-B. Third, a speedcomparison experiment is presented in Section IV-C.

A. Overall Performance Evaluation

Evaluated Trackers. Nine public available trackers are eval-uated on our benchmark. Since the TIR pedestrian trackershave no public available codes, we choose some commonlyused visual trackers and TIR trackers for evaluation. Thesetrackers can be used for the TIR pedestrian tracking and canbe generally divided into four categories:

• Three correlation filter based trackers, kernelized cor-relation filters (KCF [31]), scale correlation filter(DSST [32]), spatially regularized correlation filters(SRDCF [33]).

• Two deep learning based trackers, HDT [34],MCFTS [35].

• Two regression based trackers, Ridge Regression(RR [13]), Gaussian regression (TGPR [36]).

• Two other trackers, SVM [13], sparse representation(L1APG [37]).

0 5 10 15 20 25 30 35 40 45 50

Location error threshold

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

Pre

cisi

on

Precision plots of OPE

SRDCF [0.775]DSST [0.701]TGPR [0.689]MCFTS [0.673]HDT [0.669]RR [0.659]SVM [0.626]L1APG [0.595]KCF [0.589]

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

Overlap threshold

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

Suc

cess

rat

e

Success plots of OPE

SRDCF [0.569]DSST [0.515]MCFTS [0.480]TGPR [0.476]HDT [0.445]RR [0.427]L1APG [0.416]SVM [0.404]KCF [0.395]

0 5 10 15 20 25 30 35 40 45 50

Location error threshold

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

Pre

cisi

on

Precision plots of TRE

SRDCF [0.756]MCFTS [0.720]TGPR [0.713]DSST [0.711]HDT [0.685]RR [0.674]KCF [0.650]SVM [0.640]L1APG [0.638]

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

Overlap threshold

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

Suc

cess

rat

e

Success plots of TRE

SRDCF [0.573]DSST [0.540]MCFTS [0.518]TGPR [0.514]HDT [0.478]L1APG [0.473]KCF [0.454]RR [0.443]SVM [0.412]

0 5 10 15 20 25 30 35 40 45 50

Location error threshold

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

Pre

cisi

on

Precision plots of SRE

SRDCF [0.748]DSST [0.673]MCFTS [0.669]TGPR [0.661]HDT [0.630]RR [0.619]KCF [0.580]SVM [0.570]L1APG [0.540]

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

Overlap threshold

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

Suc

cess

rat

e

Success plots of SRE

SRDCF [0.523]DSST [0.465]MCFTS [0.452]TGPR [0.448]HDT [0.406]RR [0.401]KCF [0.378]L1APG [0.367]SVM [0.358]

Fig. 3: The overall performance of the trackers using OPE,TRE, and SRE plots. The performance score is shown in thelegend of the figures. The trackers are ranked in descendingorder according to their performance.

Most of these trackers have the state-of-the-art performance invisual tracking. For all of these trackers, we use the originalparameters of them for evaluation.

Evaluation Results. Precision plots and success plots are usedto show the overall performance of the tracker, and the resultsare shown in Fig. 3. OPE is used to show the precision andsuccess rate of the tracker, while TRE and SRE are used toshow the robustness of the tracker.

Analysis. The trackers are ranked according to their perfor-mance score. From Fig. 3, we can see that the ranking of thetrackers are slightly different in the precision plots and successplots. This is because the evaluation metrics are different inthese two plots. The precision is ranked according to thepercentage of successful frames where center location error iswithin a fixed threshold (20 pixels), while the success rate isranked according to AUC scores. We suggest that the successplots are more accurate than the precision plot for ranking.

Page 6: PREPRINT ON ARXIV 1 PTB-TIR: A Thermal Infrared ...tracking system. The experimental findings are analyized to given some guidelines for future research. The rest of the paper is

PREPRINT ON ARXIV 6PREPRINT ON ARXIV 6

0 5 10 15 20 25 30 35 40 45 50

Location error threshold

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

Pre

cisi

on

Precision plots of TRE - occlusion (39)

SRDCF [0.701]MCFTS [0.687]TGPR [0.656]HDT [0.647]DSST [0.639]RR [0.633]KCF [0.598]SVM [0.569]L1APG [0.559]

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

Overlap threshold

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

Suc

cess

rat

e

Success plots - occlusion (39)

SRDCF [0.537]MCFTS [0.501]DSST [0.485]TGPR [0.475]HDT [0.460]KCF [0.425]L1APG [0.417]RR [0.415]SVM [0.366]

0 5 10 15 20 25 30 35 40 45 50

Location error threshold

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

Pre

cisi

on

Precision plots of TRE - scale variation (35)

SRDCF [0.722]MCFTS [0.680]DSST [0.669]TGPR [0.662]RR [0.645]HDT [0.643]SVM [0.619]L1APG [0.608]KCF [0.586]

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

Overlap threshold

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

Suc

cess

rat

e

Success plots - scale variation (35)

SRDCF [0.571]DSST [0.525]MCFTS [0.493]TGPR [0.479]L1APG [0.457]RR [0.451]HDT [0.440]SVM [0.415]KCF [0.404]

0 5 10 15 20 25 30 35 40 45 50

Location error threshold

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

Pre

cisi

on

Precision plots of TRE - background clutter (52)

SRDCF [0.730]TGPR [0.694]MCFTS [0.690]DSST [0.686]HDT [0.655]RR [0.647]KCF [0.626]L1APG [0.613]SVM [0.611]

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

Overlap threshold

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9S

ucce

ss r

ate

Success plots - background clutter (52)

SRDCF [0.565]DSST [0.532]MCFTS [0.508]TGPR [0.507]HDT [0.466]L1APG [0.459]KCF [0.448]RR [0.429]SVM [0.398]

0 5 10 15 20 25 30 35 40 45 50

Location error threshold

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

Pre

cisi

on

Precision plots of TRE - low resolution (14)

SRDCF [0.885]L1APG [0.879]DSST [0.866]TGPR [0.862]MCFTS [0.838]HDT [0.836]KCF [0.771]SVM [0.750]RR [0.744]

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

Overlap threshold

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

Suc

cess

rat

e

Success plots - low resolution (14)

SRDCF [0.601]DSST [0.588]L1APG [0.579]TGPR [0.554]MCFTS [0.501]HDT [0.487]KCF [0.444]SVM [0.408]RR [0.404]

0 5 10 15 20 25 30 35 40 45 50

Location error threshold

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

Pre

cisi

on

Precision plots of TRE - fast motion (8)

SRDCF [0.871]DSST [0.801]MCFTS [0.796]RR [0.789]TGPR [0.772]HDT [0.759]SVM [0.755]KCF [0.723]L1APG [0.678]

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

Overlap threshold

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

Suc

cess

rat

e

Success plots - fast motion (8)

SRDCF [0.671]DSST [0.620]MCFTS [0.568]TGPR [0.542]RR [0.533]HDT [0.528]KCF [0.518]SVM [0.491]L1APG [0.485]

0 5 10 15 20 25 30 35 40 45 50

Location error threshold

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

Pre

cisi

on

Precision plots of TRE - motion blur (22)

SRDCF [0.759]TGPR [0.745]DSST [0.736]MCFTS [0.727]RR [0.709]L1APG [0.708]HDT [0.701]SVM [0.676]KCF [0.671]

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

Overlap threshold

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

Suc

cess

rat

e

Success plots - motion blur (22)

SRDCF [0.537]DSST [0.512]TGPR [0.498]L1APG [0.493]MCFTS [0.489]HDT [0.458]KCF [0.437]RR [0.436]SVM [0.406]

0 5 10 15 20 25 30 35 40 45 50

Location error threshold

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

Pre

cisi

on

Precision plots of TRE - out of view (10)

MCFTS [0.593]SRDCF [0.579]RR [0.556]HDT [0.554]TGPR [0.546]DSST [0.499]KCF [0.498]SVM [0.442]L1APG [0.410]

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

Overlap threshold

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

Suc

cess

rat

e

Success plots - out of view (10)

SRDCF [0.535]MCFTS [0.531]TGPR [0.470]DSST [0.457]RR [0.456]HDT [0.455]KCF [0.419]L1APG [0.378]SVM [0.370]

0 5 10 15 20 25 30 35 40 45 50

Location error threshold

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

Pre

cisi

on

Precision plots of TRE - thermal crossover (22)

SRDCF [0.795]TGPR [0.772]MCFTS [0.762]DSST [0.752]L1APG [0.742]HDT [0.735]KCF [0.720]RR [0.719]SVM [0.661]

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

Overlap threshold

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

Suc

cess

rat

e

Success plots - thermal crossover (22)

SRDCF [0.598]DSST [0.559]TGPR [0.545]MCFTS [0.544]L1APG [0.535]HDT [0.501]KCF [0.493]RR [0.451]SVM [0.398]

Fig. 4: TRE plots on eight attribute subsets. The title of the figure is the attribute name and corresponding sequences number.

Therefore, in the following, we mainly use the success plotsto analyze the tracker’s ranking and use the precision plots asan auxiliary.

The tracker’s performance in TRE is higher than in OPEbecause the trackers tend to perform well in a shorter sequence.The OPE result is just one trial of TRE. On the contrary, thetracker’s performance in SRE is lower than in OPE since theinitialization errors tend to cause the drift of the trackers.

In the success plots, we can see that SRDCF ranks the firstone. That means its success rate and robustness are the bestin all of these trackers. It is noteworthy that, SRDCF is muchbetter than others. In the OPE and SRE, the performance ofSRDCF is higher than the second best DSST more than 5%.

In the TRE, it is also higher than DSST about 3%. We suggestthat its superior performance benefits from the consideration ofmore background information in tracking. On the other hand,it is easy to see that the performance of DSST is higher thanKCF more than 10%. This shows that the scale estimationstrategy is very important in the TIR pedestrian tracking.

Deep feature based TIR tracker MCFTS achieves a promis-ing result despite the deep features are learned from the visibleimages. We suggest that the deep feature based trackers havepotential to achieve better performance if there are enoughthermal images for training. On the other hand, MCFTS justuses a simple combintion method between the deep networkand KCF. We consider a deeper level integration between of

Fig. 4: TRE plots on eight attribute subsets. The title of the figure is the attribute name and corresponding sequences number.

Therefore, in the following, we mainly use the success plotsto analyze the tracker’s ranking and use the precision plots asan auxiliary.

The tracker’s performance in TRE is higher than in OPEbecause the trackers tend to perform well in a shorter sequence.The OPE result is just one trial of TRE. On the contrary, thetracker’s performance in SRE is lower than in OPE since theinitialization errors tend to cause the drift of the trackers.

In the success plots, we can see that SRDCF ranks the firstone. That means its success rate and robustness are the bestin all of these trackers. It is noteworthy that, SRDCF is muchbetter than others. In the OPE and SRE, the performance ofSRDCF is higher than the second best DSST more than 5%.In the TRE, it is also higher than DSST about 3%. We suggestthat its superior performance benefits from the consideration of

more background information in tracking. On the other hand,it is easy to see that the performance of DSST is higher thanKCF more than 10%. This shows that the scale estimationstrategy is very important in the TIR pedestrian tracking.

Deep feature based TIR tracker MCFTS achieves a promis-ing result despite the deep features are learned from the visibleimages. We suggest that the deep feature based trackers havepotential to achieve better performance if there are enoughthermal images for training. On the other hand, MCFTS justuses a simple combintion method between the deep networkand KCF. We consider a deeper level integration between ofthem can obtain more robust tracking results. In addition, wecan see that the performance of Gaussian regression trackerTGPR is higher than ridge regression tracker RR about 5%.This is a prominent improvement. Therefore, we think a more

Page 7: PREPRINT ON ARXIV 1 PTB-TIR: A Thermal Infrared ...tracking system. The experimental findings are analyized to given some guidelines for future research. The rest of the paper is

PREPRINT ON ARXIV 7

complexity kernel function is better than a simple one in theregression-based TIR pedestrian tracker.

B. Attribute-based Evaluation

Results. In order to understand the tracker’s performance fordifferent challenges, we evaluate nine trackers on the attributesubsets, the results are shown in Fig. 4. Here, we just showthe TRE plots on these subsets due to space limitation. OPEand SRE plots are presented in the supplemental material.

Analysis. The tracker’s performance on an attribute subsetshows its ability to handle this challenge. As shown inFig. 4, we can see that SRDCF achieves the best performanceon almost all attribute subsets because it effective solvesthe boundary effect brought from the cyclic shift. MCFTSachieves a better performance on the occlusion and out-of-view subset than DSST, despite DSST has the higher overallperformance than MCFTS (see TRE of Fig. 3). That shows thatthe deep learning based trackers have a promising performancein the TIR pedestrian tracking. KCF and HDT have theworst performance on the scale variation subset since theylack the scale estimation strategy, while DSST obtains thesecond-best performance on the scale variation subset andthe entire dataset since it deals with the scale variation. Thisdemonstrates the scale estimation strategy is very useful toimprove the tracking performance. On the fast motion subset,we can see that RR achieves a much higher success ratethan its overall performance (see Fig. 3). We suggest that thestochastic particle filter search framework plays a major role.These results are helpful to us to understand the strength andweakness of the trackers.

C. Speed Comparison

We carry out the experiments on the same PC with an IntelI7-6700K CPU, 32G RAM, and a GeForce GTX 1080 GPUcard. The speed of the tracker is the average fps on the TREresults. The comparison of the tracker’s speed is shown inTable IV.

Analysis. Table IV shows that KCF has a high speed whileDSST also exceeds the real-time speed. The high efficiencyof these two trackers benefits from the computation in theFourier domain. What’s more, we can see SRDCF obtains ahalf of the real-time speed when it achieves the best precisionand success rate. Two deep learning based trackers, HDT andMCFTS just get a low frame rate due to the high cost ofthe feature extraction. Several classifier based trackers such asSVM and RR also have a low speed because they base on atime-consuming particle filter search framework.

V. VALIDATION EXPERIMENTS

In this section, we conduct three comparison experiments tovalidate each component’s effectiveness to the TIR pedestriantracking performance. First, we carry out a comparison exper-iment on a baseline tracker with six different features extractorin Section V-A. Then, a comparison experiment on a baselinetracker with three different motion models is implemented in

0 5 10 15 20 25 30 35 40 45 50

Location error threshold

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

Pre

cisi

on

Precision plots of OPE

RidgeRegression-hog [0.659]RidgeRegression-lbp [0.609]RidgeRegression-haar [0.571]RidgeRegression-gabor [0.488]RidgeRegression-sift [0.464]RidgeRegression-gray [0.321]

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

Overlap threshold

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

Suc

cess

rat

e

Success plots of OPE

RidgeRegression-hog [0.427]RidgeRegression-lbp [0.384]RidgeRegression-haar [0.375]RidgeRegression-gabor [0.344]RidgeRegression-sift [0.314]RidgeRegression-gray [0.233]

Fig. 5: OPE plots of the baseline tracker using six differentfeature extractors.

Section V-B. Finally, we compare several different observationmodels to validate how they affect the tracking performancein Section V-C.

A. Feature Extractor.

To understand the effect of the feature extractor to the TIRpedestrian tracking performance, we test six different featureson a baseline tracker. These features are commonly used inthe object detection and tracking.

• Gray. It simply resizes the image into a fixed size, andthen uses the pixel values as features.

• SIFT [38]. It is a local feature descriptor and robust tothe scale variation. Here, we use its fast version: denseSIFT using VLFeat toolkit [39].

• Haar-like [40]. It reflects the gray variation of the image.We use the simplest form, rectangular Haar-like features.

• Gabor [41]. It can capture the texture information of theimage using a series of different direction’s Gabor filters.

• LBP [42]. It is a simple yet efficient local texture de-scriptor which has two advantages: rotation invarianceand gray invariance.

• HOG [43]. It is used to capture the local gradientdirection and gradient intensity distribution of the image.

We use Ridge Regression [13] as the baseline tracker, and theresults are shown in Fig. 5.

Analysis. Fig. 5 shows that the baseline tracker using HOGfeature achieves the best performance which is higher thanthe one using gray feature about 20%. It demonstrates thatlocal gradient features are more helpful for TIR pedestriantracking. Although the gray features are commonly used inthe previous studies, it has a low discriminative ability forthe TIR pedestrian object. The assumption that the object iswarmer than its background is no longer suitable for TIRpedestrian tracking, because the TIR pedestrians often havesimilar intensity to their backgrounds. What’s more, we cansee that the local texture feature LBP obtains second-bestperformance and it has a small gap to the best performance.This illustrates that along with the improvement of thermalimage’s quality and resolution, texture features are useful forthe TIR pedestrian tracking. Here, we not consider the deep

Page 8: PREPRINT ON ARXIV 1 PTB-TIR: A Thermal Infrared ...tracking system. The experimental findings are analyized to given some guidelines for future research. The rest of the paper is

PREPRINT ON ARXIV 8

TABLE IV: Speed comparison of the trackers.

KCF [31] DSST [32] SVM [13] SRDCF [33] HDT [34] RR [13] MCFTS [35] L1APG [37] TGPR [36]

FPS 393.40 96.30 13.40 12.29 10.60 6.64 4.73 3.66 1.77

feature extracted from the pre-trained network (e.g., VGG-Net [44]) since it is very slowly in this framework. It is notsuitable for real-time tracking applications.

Findings. Feature extractor is the most important componentin the TIR pedestrian tracker. It has a major effect to the TIRpedestrian tracker’s performance. Choosing or developing astrong feature can dramatically enhance the tracking perfor-mance.

B. Motion Model.

To understand the effect of the motion model to the TIRpedestrian tracking performance, we test three commonlyused motion models on a baseline tracker with two differentfeatures.

• Particle Filter [45]. It is a sequential Bayesian impor-tance sampling technique which belongs to the stochasticsearch method.

• Sliding Window. Sliding window is a kind of exhaustivesearch method. It simply considers all candidates withina square neighborhood.

• Radius Sliding Window [21]. Radius sliding window isan improvement version of the sliding window. It justconsiders all candidates within a circular region.

We also use Ridge Regression [13] as the baseline tracker, andthe results are shown in Fig. 6.

Analysis. The particle filter has several advantages such as itcan solve the scale variation problem, and recover from thetracking failure. However, from Fig. 6, it is easy to see thatthe particle filter is much worse than the sliding window whenwe use the weak feature (Gray) in the baseline tracker. Wesuggest that this is mainly because the gray feature lacks thediscriminative power leading to the drift of the particle filter.On the contrary, when we use the strong feature (HOG) inthe baseline tracker, three motion models perform the samegood. It is interesting that the particle filter performs no anysuperiority to the tracking performance while it keeps a highcomputation complexity.

Findings. Different motion models have a minor effect tothe tracker’s performance when the feature is strong enough.Therefore, a faster search strategy (e.g., sliding window) ismore helpful for the TIR pedestrian tracking.

C. Observation Model.

Observation models are widely studied in the tracking filedsince the rapid development of the machine learning method.To understand the effect of the observation model to theTIR pedestrian tracking performance, we test four observationmodels using two different features.

0 5 10 15 20 25 30 35 40 45 50

Location error threshold

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

Pre

cisi

on

Precision plots of OPE

RidgeRegression-gray-sw [0.514]RidgeRegression-gray-rsw [0.477]RidgeRegression-gray-pf [0.321]

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

Overlap threshold

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

Suc

cess

rat

e

Success plots of OPE

RidgeRegression-gray-sw [0.355]RidgeRegression-gray-rsw [0.344]RidgeRegression-gray-pf [0.233]

0 5 10 15 20 25 30 35 40 45 50

Location error threshold

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

Pre

cisi

on

Precision plots of OPE

RidgeRegression-hog-sw [0.665]RidgeRegression-hog-rsw [0.664]RidgeRegression-hog-pf [0.659]

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

Overlap threshold

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

Suc

cess

rat

e

Success plots of OPE

RidgeRegression-hog-pf [0.427]RidgeRegression-hog-rsw [0.426]RidgeRegression-hog-sw [0.423]

Fig. 6: OPE plots of the baseline tracker using three differentmotion models with two different feature extractors. Theabbreviations pf, sw, rsw denotes the particle filter, slidingwindow, and radius sliding window respectively.

• Logistic Regression. It is a kind of linear regression with`2 regularization. We use the gradient descent to updatethe model online.

• Ridge Regression. Least squares regression with `2 reg-ularization is used. The method comes from [13].

• SVM. It is a standard SVM with hinge loss and `2regularization.

• Structured Output SVM (SOSVM). This method is anenhanced version of SVM and comes from [21].

These four observation models are ranked according to theirclassification ability. The results are shown in Fig. 7.

Analysis. Fig. 7 shows that a strong observation modelSOSVM achieves the best performance when the weak fea-ture (Gray) is used. It exceeds the weak observation modelRidge Regression more than 10%. However, when we usethe strong feature HOG, the weak observation model RidgeRegression obtains the best performance against the strongerSVM and SOSVM. What’s more, it is easy to see that theseobservation models have similar performance when the strongfeature is used. Similar observations are reported in the visualtracking [13].

Findings. Strong observation model can obtain higher trackingperformance when the feature is weakly. However, when the

Page 9: PREPRINT ON ARXIV 1 PTB-TIR: A Thermal Infrared ...tracking system. The experimental findings are analyized to given some guidelines for future research. The rest of the paper is

PREPRINT ON ARXIV 9

0 5 10 15 20 25 30 35 40 45 50

Location error threshold

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

Pre

cisi

on

Precision plots of OPE

LogisticRegression-gray [0.459]SOSVM-gray [0.458]SVM-gray [0.456]RidgeRegression-gray [0.321]

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

Overlap threshold

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

Suc

cess

rat

e

Success plots of OPE

SOSVM-gray [0.345]LogisticRegression-gray [0.337]SVM-gray [0.329]RidgeRegression-gray [0.233]

0 5 10 15 20 25 30 35 40 45 50

Location error threshold

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

Pre

cisi

on

Precision plots of OPE

RidgeRegression-hog [0.659]SOSVM-hog [0.625]LogisticRegression-hog [0.619]SVM-hog [0.609]

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

Overlap threshold

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

Suc

cess

rat

e

Success plots of OPE

RidgeRegression-hog [0.427]SOSVM-hog [0.422]LogisticRegression-hog [0.421]SVM-hog [0.386]

Fig. 7: OPE plots of the four tracker using different observa-tion models with two different feature extractors.

feature is strong enough, different observation models have aminor gap in the tracking performance.

VI. CONCLUSION AND FUTURE WORK

In this paper, we collect a TIR pedestrian tracking bench-mark dataset for the TIR pedestrian tracker evaluation. A large-scale evaluation experiment is carried out on our benchmarkwith nine public available trackers. Based on our evaluationresults and analysis, several observations are highlighted forunderstanding the TIR pedestrian tracker. First, the scaleestimation strategy is very important for TIR pedestrian track-ing, and it can improve the tracking performance to a largeextent. Second, the background information is crucial for adiscriminative tracker, and it can enhance the discriminativepower of the model. Third, deep learning based trackers havepotential to obtain better performance in the TIR pedestriantracking.

In addition, in order to understand the TIR pedestrian track-ing more sufficiently, we conduct three validation experimentson each component of the tracker. Some interesting findingsare helpful for the future research. First, the feature extrac-tor is the most important component in the TIR pedestriantracker, a strong feature can significantly improve the trackingperformance. Second, different motion models have a minoreffect on the tracking results when the feature is strong enough.Third, different observation models also have a minor effecton the tracking performance when the feature is strong. Onthe contrary, when the feature is weak, a stronger observationmodel often can achieve better performance.

The evaluation and validation experimental results help usto understand the TIR pedestrian tracker from several aspects.

This will promote the development of this field. In the future,we are going to extend the dataset to include more thermalsequences and explore more challenge factors in the TIRpedestrian tracking.

ACKNOWLEDGMENT

This research was supported by the National NaturalScience Foundation of China (Grant Nos, 61272252,U1509216, 61472099, 61672183), by the Science andTechnology Planning Project of Guangdong Province (GrantNo. 2016B090918047), by the Natural Science Foundationof Guangdong Province (Grant No. 2015A030313544),and by the Shenzhen Research Council (Grant Nos.JCYJ20170413104556946, JCYJ20160406161948211,JCYJ20160226201453085, JSGG20150331152017052).

REFERENCES

[1] R. Gade and T. B. Moeslund, “Thermal cameras and applications: asurvey,” Machine vision and applications, vol. 25, no. 1, pp. 245–262,2014. 1

[2] J. Li and W. Gong, “Real time pedestrian tracking using thermal infraredimagery.” Journal of Computers, vol. 5, no. 10, pp. 1606–1613, 2010.1, 3

[3] J.-t. Wang, D.-b. Chen, H.-y. Chen, and J.-y. Yang, “On pedestriandetection and tracking in infrared videos,” Pattern Recognition Letters,vol. 33, no. 6, pp. 775–785, 2012. 1, 3

[4] B. C. Ko, J.-Y. Kwak, and J.-Y. Nam, “Human tracking in thermalimages using adaptive particle filters with online random forest learning,”Optical Engineering, vol. 52, no. 11, pp. 113 105–113 105, 2013. 1, 2,3, 4

[5] X. Li, R. Guo, and C. Chen, “Robust pedestrian tracking and recognitionfrom flir video: A unified approach via sparse coding,” Sensors, vol. 14,no. 6, pp. 11 245–11 259, 2014. 1, 2

[6] J. Portmann, S. Lynen, M. Chli, and R. Siegwart, “People detection andtracking from aerial thermal views,” in IEEE International Conferenceon Robotics and Automation (ICRA), 2014, pp. 1794–1800. 1

[7] Y. Ma, X. Wu, G. Yu, Y. Xu, and Y. Wang, “Pedestrian detection andtracking from low-resolution unmanned aerial vehicle thermal imagery,”Sensors, vol. 16, no. 4, p. 446, 2016. 1

[8] T. Yang, D. Fu, and S. Pan, “Pedestrian tracking for infrared imagesequence based on trajectory manifold of spatio-temporal slice,” Multi-media Tools and Applications, vol. 76, no. 8, pp. 11 021–11 035, 2017.1

[9] J. W. Davis and V. Sharma, “Background-subtraction using contour-based fusion of thermal and visible imagery,” Computer Vision andImage Understanding, vol. 106, no. 2, pp. 162–182, 2007. 1, 2, 4

[10] R. Miezianko, “Ieee otcbvs ws series benchmark,” http://vcipl-okstate.org/pbvs/bench/, accessed March 4, 2018. 1, 2, 4

[11] J. Portmann, S. Lynen, M. Chli, and R. Siegwart, “People detection andtracking from aerial thermal views,” in IEEE International Conferenceon Robotics and Automation (ICRA), 2014, pp. 1794–1800. 1, 2, 4

[12] Z. Wu, N. Fuller, D. Theriault, and M. Betke, “A thermal infrared videobenchmark for visual analysis,” in IEEE Conference on Computer Visionand Pattern Recognition (CVPR) Workshops, 2014, pp. 201–208. 1, 2,4

[13] N. Wang, J. Shi, D.-Y. Yeung, and J. Jia, “Understanding and diagnosingvisual tracking systems,” in Proceedings of the IEEE InternationalConference on Computer Vision, 2015, pp. 3101–3109. 1, 3, 5, 7, 8

[14] C. Dai, Y. Zheng, and X. Li, “Pedestrian detection and tracking ininfrared imagery using shape and appearance,” Computer Vision andImage Understanding, vol. 106, no. 2, pp. 288–299, 2007. 2

[15] K. Jungling and M. Arens, “Pedestrian tracking in infrared from movingvehicles,” in IEEE Intelligent Vehicles Symposium, 2010, pp. 470–477.2, 3

[16] C. Liang, Q. Liu, and R. Xu, “Local sparse appearance model withspecific structural information in infrared pedestrian tracking,” in IEEEInternational Conference on Image, Vision and Computing, 2016, pp.19–25. 2

[17] F. Xu, X. Liu, and K. Fujimura, “Pedestrian detection and tracking withnight vision,” IEEE Transactions on Intelligent Transportation Systems,vol. 6, no. 1, pp. 63–71, 2005. 2, 3, 4

Page 10: PREPRINT ON ARXIV 1 PTB-TIR: A Thermal Infrared ...tracking system. The experimental findings are analyized to given some guidelines for future research. The rest of the paper is

PREPRINT ON ARXIV 10

[18] Z. Wang, Y. Wu, J. Wang, and H. Lu, “Target tracking in infrared imagesequences using diverse adaboostsvm,” in International Conference onInnovative Computing, Information and Control, 2006, pp. 233–236. 2,4

[19] X. Wang and Z. Tang, “Modified particle filter-based infrared pedestriantracking,” Infrared Physics & Technology, vol. 53, no. 4, pp. 280–287,2010. 2, 3

[20] D.-E. Kim and D.-S. Kwon, “Pedestrian detection and tracking inthermal images using shape features,” in International Conference onUbiquitous Robots and Ambient Intelligence (URAI), 2015, pp. 22–25.3

[21] S. Hare, S. Golodetz, A. Saffari, V. Vineet, M.-M. Cheng, S. L. Hicks,and P. H. Torr, “Struck: Structured output tracking with kernels,” IEEEtransactions on pattern analysis and machine intelligence, vol. 38,no. 10, pp. 2096–2109, 2016. 3, 8

[22] H. Grabner, M. Grabner, and H. Bischof, “Real-time tracking via on-lineboosting.” in British Machine Vision Conference (BMVC), vol. 1, no. 5,2006, p. 6. 4

[23] B. Babenko, M.-H. Yang, and S. Belongie, “Visual tracking with onlinemultiple instance learning,” in IEEE Conference on Computer Visionand Pattern Recognition, 2009, pp. 983–990. 4

[24] A. Torabi, G. Masse, and G.-A. Bilodeau, “An iterative integratedframework for thermal–visible image registration, sensor fusion, andpeople tracking for video surveillance applications,” Computer Visionand Image Understanding, vol. 116, no. 2, pp. 210–221, 2012. 4

[25] Y. Socarras, S. Ramos, D. Vazquez, A. M. Lopez, and T. Gevers,“Adapting pedestrian detection from synthetic to far infrared images,” inInternational Conference on Computer Vision (ICCV) Workshop, vol. 7,2011. 4

[26] A. Gonzalez, Z. Fang, Y. Socarras et al., “Pedestrian detection atday/night time with visible and fir cameras: A comparison,” Sensors,vol. 16, no. 6, p. 820, 2016. 4

[27] M. Felsberg, M. Kristan, J. Matas, A. Leonardis et al., The ThermalInfrared Visual Object Tracking VOT-TIR2016 Challenge Results, 2016,pp. 824–849. 4

[28] INO, “Ino video analytics dataset,” www.ino.ca/en/video-analytics-dataset/, accessed March 4, 2018. 4

[29] Youtube, “Youtube thermal videos,” www.youtube.com, accessed March4, 2018. 4

[30] Y. Wu, J. Lim, and M.-H. Yang, “Online object tracking: A benchmark,”in IEEE Conference on Computer Vision and Pattern Recognition(CVPR), 2013, pp. 2411–2418. 5

[31] J. F. Henriques, R. Caseiro, P. Martins, and J. Batista, “High-speedtracking with kernelized correlation filters,” IEEE Transactions onPattern Analysis and Machine Intelligence, vol. 37, no. 3, pp. 583–596,2015. 5, 8

[32] M. Danelljan, G. Hager, F. Khan, and M. Felsberg, “Accurate scale esti-mation for robust visual tracking,” in British Machine Vision Conference(BMVC), 2014. 5, 8

[33] M. Danelljan, G. Hager, F. Shahbaz Khan, and M. Felsberg, “Learningspatially regularized correlation filters for visual tracking,” in IEEEInternational Conference on Computer Vision (ICCV), 2015, pp. 4310–4318. 5, 8

[34] Y. Qi, S. Zhang, L. Qin, H. Yao, Q. Huang, J. Lim, and M.-H. Yang,“Hedged deep tracking,” in IEEE Conference on Computer Vision andPattern Recognition (CVPR), 2016, pp. 4303–4311. 5, 8

[35] Q. Liu, X. Lu, Z. He, C. Zhang, and W.-S. Chen, “Deep convolutionalneural networks for thermal infrared object tracking,” Knowledge-BasedSystems, vol. 134, pp. 189–198, 2017. 5, 8

[36] J. Gao, H. Ling, W. Hu, and J. Xing, “Transfer learning based visualtracking with gaussian processes regression,” in European Conferenceon Computer Vision (ECCV), 2014, pp. 188–203. 5, 8

[37] C. Bao, Y. Wu, H. Ling, and H. Ji, “Real time robust l1 trackerusing accelerated proximal gradient approach,” in IEEE Conference onComputer Vision and Pattern Recognition (CVPR), 2012, pp. 1830–1837. 5, 8

[38] D. G. Lowe, “Object recognition from local scale-invariant features,” inIEEE International Conference on Computer Vision (ICCV), 1999, pp.1150–1157. 7

[39] A. Vedaldi and B. Fulkerson, “VLFeat: An open and portable library ofcomputer vision algorithms,” http://www.vlfeat.org/, 2008. 7

[40] P. Viola and M. Jones, “Rapid object detection using a boosted cascadeof simple features,” in IEEE Conference on Computer Vision and PatternRecognition (CVPR), vol. 1, 2001, pp. I–I. 7

[41] M. Haghighat, S. Zonouz, and M. Abdel-Mottaleb, “Cloudid: Trustwor-thy cloud-based and cross-enterprise biometric identification,” ExpertSystems with Applications, vol. 42, no. 21, pp. 7905–7916, 2015. 7

[42] T. Ojala, M. Pietikainen, and D. Harwood, “Performance evaluation oftexture measures with classification based on kullback discrimination ofdistributions,” in IEEE International Conference on Pattern Recognition(ICPR), vol. 1, 1994, pp. 582–585. 7

[43] N. Dalal and B. Triggs, “Histograms of oriented gradients for humandetection,” in IEEE Conference on Computer Vision and Pattern Recog-nition (CVPR), vol. 1, 2005, pp. 886–893. 7

[44] K. Simonyan and A. Zisserman, “Very deep convolutional networks forlarge-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.8

[45] B. Ristic, S. Arulampalam, and N. J. Gordon, Beyond the Kalman filter:Particle filters for tracking applications. Artech house, 2004. 8