IEEE 2012 19th IEEE International Conference on Image Processing (ICIP 2012), Orlando, FL, USA. 978-1-4673-2533-2/12/$26.00 ©2012 IEEE.
FRAGMENT-BASED REAL-TIME OBJECT TRACKING: A SPARSE REPRESENTATION APPROACH
Naresh Kumar M. S., Priti Parate, R. Venkatesh Babu
Supercomputer Education and Research Centre, Indian Institute of Science, Bangalore, India - 560012
ABSTRACT
Real-time object tracking is a critical task in many computer vision applications. Achieving rapid and robust tracking while handling changes in object pose and size, varying illumination and partial occlusion is a challenging task given the limited amount of computational resources. In this paper we propose a real-time object tracker in the l1 framework that addresses these issues. In the proposed approach, dictionaries containing templates of overlapping object fragments are created. The candidate fragments are sparsely represented in the dictionary fragment space by solving the l1 regularized least squares problem. The non-zero coefficients indicate the relative motion between the target and candidate fragments, along with a fidelity measure. The final object motion is obtained by fusing the reliable motion information. The dictionary is updated based on the object likelihood map. The proposed tracking algorithm is tested on various challenging videos and found to outperform the earlier approach.
Index Terms— Object tracking, Fragment tracking, Motion estimation, l1 minimization, Sparse representation
1. INTRODUCTION
Visual tracking is an important task in computer vision with a variety of applications such as surveillance, robotics, human computer interaction and medical imaging. One of the main challenges that limits the performance of a tracker is appearance change caused by variation in pose, illumination or view point. A significant amount of work has been done to address these problems and develop a robust tracker. However, robust object tracking still remains a big challenge in computer vision research.
There have been many proposals towards building a robust tracker; a thorough survey can be found in [1]. In early works, minimizing the SSD (sum of squared differences) between regions was a popular choice for the tracking problem [2], and a gradient descent algorithm was most commonly used to find the minimum SSD. Often in such methods, only a local minimum could be reached. The mean shift tracker [3] uses mean-shift iterations and a similarity measure based on the Bhattacharyya coefficient between the target model and candidate regions to track the object. The incremental tracker [4] and covariance tracker [5] are other examples of tracking methods which use an appearance model to represent the target observations.
One of the recently developed and popular trackers is the l1 tracker [6]. In this work, the authors utilize a particle filter to select the candidate particles and then represent them sparsely in the space spanned by the object templates using l1 minimization. This requires a large number of particles for reliable tracking, which results in a high computational cost and thus brings down the speed of the tracker. An attempt to speed up the tracking by reducing the number of particles only deteriorates the accuracy of the tracker. There have been attempts to improve the performance of [6]. In [7] the authors try to reduce the computation time by decomposing a single object template into the particle space. In [8] hash kernels are used to reduce the dimensionality of the observation.
In this paper, we propose a computationally efficient l1 minimization based real-time and robust tracker. The tracker uses fragments of the object and the candidate to estimate the motion of the object. The number of candidate fragments required to track the object in this method is small, thus reducing the computational burden of the l1 tracker. Further, the fragment based approach, combined with the trivial templates, makes the tracker robust against partial occlusion. The results show that the proposed tracker gives more accurate tracking at much higher execution speeds in comparison to the earlier approach.
The rest of the paper is organized as follows. Section 2 provides an overview of the proposed tracker. Section 3 describes the proposed approach in detail. Section 4 discusses the results and Section 5 concludes the paper.
2. OVERVIEW
The proposed tracking algorithm is essentially a template tracker in the l1 framework. The object is partitioned into overlapping fragments that form the atoms of the dictionary. The candidate fragments are sparsely represented in the space spanned by the dictionary fragments by solving the l1 minimization problem. The resulting sparse representation indicates the flow of fragments between consecutive frames. This flow information, or the motion vectors, is utilized for estimating the object motion between consecutive frames. The proposed algorithm uses only grey scale information for tracking. Similar to the mean-shift tracker [3], the proposed algorithm assumes sufficient overlap between object and candidate regions, such that there is at least one fragment in the candidate area that corresponds to an object fragment. In this approach two dictionaries are used. One is kept static while the other is updated based on the tracking result and a confidence measure computed using histogram models. The dictionaries are initialized with the object selected in the first frame. The proposed algorithm is able to track objects with rapid changes in appearance, illumination and occlusion in real time. Changes in size are also tracked to some extent.
3. PROPOSED APPROACH
3.1. Sparse representation and l1 minimization
The discriminative property of sparse representation has recently been utilized for various computer vision applications such as tracking [6], detection [9] and classification [10]. A candidate vector y can be sparsely represented in the space spanned by the vector elements of
the matrix (called the dictionary) D = [d1, d2, ..., dn] ∈ R^(l×n). Mathematically,

y = Da    (1)

where a = [a1, a2, ..., an]^T ∈ R^n is the coefficient vector of the basis D. In application, the system represented by (1) can be under-determined, since l << n, and there is no unique solution for a. Such a system is solved as an l1 regularized least squares problem, which is known to yield sparse solutions [10]:
min_a ||Da − y||_2^2 + λ||a||_1    (2)

where ||·||_1 and ||·||_2 are the l1 and l2 norms respectively.
3.2. Dictionary creation and object-candidate fragments
In template based tracking in the l1 framework, tracking is achieved by matching the candidate template to one among a set of object templates through sparse representation [6]. The set of object templates forms the dictionary. On similar lines, in our method the dictionaries are initialized with overlapping fragments of the object. We make use of two dictionaries: one is static and the other is updated for the purpose of modelling the appearance changes. The fragment size depends on the original object size and the dictionary array size. Suppose the object size is M×N and we choose the dictionary size to be u×v; then the fragment size will be (M−u+1)×(N−v+1). Figure 1 shows the dictionary created from the overlapping fragments of the object, obtained by going through the object in raster scan order. Each fragment is resized into a template of predefined size and then vectorized into a single column vector dqj ∈ R^l. Each dictionary is a set Dq = [dq1, dq2, ..., dqn] ∈ R^(l×n).

Fig. 1. Object fragment dictionary of size 15×15.

For tracking, the candidate in the current frame is taken as the area of pixels where the object was located in the previous frame. An array of fragments of the candidate is constructed in the same way as for the object dictionary, and the size of this array is the same as that of the dictionary. Only a certain number of fragments, sub-sampled from this array of candidate fragments, are sufficient to estimate the motion of the object and track it.
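A dictionary of this shape can be sketched as follows; the 8×8 template size matches the configuration reported in Section 4, while the nearest-neighbour resizing is an assumption (the paper does not specify the interpolation):

```python
import numpy as np

def build_fragment_dictionary(obj, u=15, v=15, tmpl=(8, 8)):
    """Slice an object patch into a u x v array of overlapping fragments
    (raster order), resize each to the template size `tmpl` by
    nearest-neighbour sampling, and stack the vectorized fragments as
    the columns of the dictionary. Geometry follows Section 3.2: an
    M x N object yields fragments of size (M-u+1) x (N-v+1)."""
    M, N = obj.shape
    fh, fw = M - u + 1, N - v + 1                # fragment height/width
    ri = np.arange(tmpl[0]) * fh // tmpl[0]      # nearest-neighbour row indices
    rj = np.arange(tmpl[1]) * fw // tmpl[1]      # nearest-neighbour col indices
    cols = []
    for i in range(u):                           # raster scan over all offsets
        for j in range(v):
            frag = obj[i:i + fh, j:j + fw]
            cols.append(frag[np.ix_(ri, rj)].ravel().astype(float))
    return np.stack(cols, axis=1)                # shape (l, n): l = 8*8, n = u*v

D1 = build_fragment_dictionary(np.arange(40 * 50, dtype=float).reshape(40, 50))
print(D1.shape)  # (64, 225)
```

With a 15×15 dictionary array every one-pixel offset of the fragment window becomes one atom, which is what lets a matched atom encode a motion vector later on.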
Each candidate fragment is represented as a sparse linear combination of the dictionary fragments. Equation (1) can be rewritten for the kth candidate fragment in a set of p candidate fragments Y = [y1, y2, ..., yp] ∈ R^(l×p), yk ∈ R^l, as

yk = [D1, D2] [A1k^T, A2k^T]^T    (3)

where D1 and D2 are the static and dynamic dictionaries respectively and Aqk = [aqk1, aqk2, ..., aqkn]^T ∈ R^n is the corresponding target coefficient vector. Equation (3) represents an under-determined system since l << 2n, which can be solved for a sparse solution using l1 minimization as described in Section 3.1.
3.3. Handling occlusion, clutter, and changes in illumination, appearance and size
In order to handle mild occlusion and clutter, trivial templates (positive and negative) are used, as proposed in [6]. The negative trivial templates impose a non-negativity constraint on the target vector coefficients. In our approach, fragmentation helps the tracker even under heavy partial occlusion. Occluded candidate fragments with low confidence measures are eliminated before estimating the object motion. Equations (3) and (2) can now be written as
yk = D xk    (4)

min ||D xk − yk||_2^2 + λ||xk||_1    (5)

where D = [D1, D2, I, −I], xk = [A1k^T, A2k^T, (e+k)^T, (e−k)^T]^T, I = [i1, i2, ..., il] ∈ R^(l×l) is the set of trivial templates, and ii ∈ R^l is a vector with only one non-zero element. e+k ∈ R^l and e−k ∈ R^l are the positive and negative trivial coefficient vectors respectively.

The object pose and illumination are prone to changes, and a
static dictionary is unreliable in such cases. A static dictionary loses its object modelling capability once the appearance changes; the error accumulated over time results in a drift from the actual object position. Incorporating a second dictionary and updating it throughout the tracking process helps overcome this problem. Currently our algorithm does not include measures to handle excessive changes in the object size. However, it can cope with small changes in object size, since these are easily captured by the fragment templates that are updated in the second dictionary.
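As a small sketch of the structure in equations (4)-(5), the augmented dictionary is a column-wise concatenation of the two fragment dictionaries and the positive and negative trivial templates (the random atoms below stand in for real fragments):

```python
import numpy as np

# Placeholder static/dynamic dictionaries; in the tracker these hold
# vectorized object fragments (l = template pixels, n = fragments).
l, n = 64, 225
rng = np.random.default_rng(1)
D1 = rng.random((l, n))
D2 = rng.random((l, n))
I = np.eye(l)                       # trivial templates: one non-zero pixel each

# Augmented dictionary of equations (4)-(5): D = [D1, D2, I, -I].
# An occluded pixel in a candidate fragment can then be explained by a
# trivial template instead of distorting the fragment coefficients.
D = np.hstack([D1, D2, I, -I])
print(D.shape)  # (64, 578)
```

Restricting e+k and e−k to be non-negative is what makes the pair I, −I equivalent to a signed per-pixel error term.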
3.4. Motion estimation through fragment matching and confidence measure
A sparse solution for xk indicates which fragment from the dictionaries closely resembles the candidate fragment yk. A sparse reconstruction is obtained for all the p candidate fragments. A set of p′ candidates with the highest confidence measure is chosen out of the p candidates. The confidence measure is computed as

Cconf,k = (Σ_{t=1}^{2n} xk(t)) / (1 + Σ_{t=2n+1}^{2n+2l} |xk(t)|)    (6)
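The confidence measure (6) separates the coefficient vector xk into its dictionary part and its trivial-template part; a minimal sketch, using the fragment counts from Section 4:

```python
import numpy as np

def fragment_confidence(x_k, n, l):
    """Confidence measure Cconf,k of equation (6): the weight placed on
    the real dictionary atoms (first 2n entries of x_k) divided by one
    plus the absolute weight absorbed by the trivial templates (last 2l)."""
    dict_part = x_k[:2 * n].sum()
    trivial_part = np.abs(x_k[2 * n:2 * n + 2 * l]).sum()
    return dict_part / (1.0 + trivial_part)

n, l = 225, 64
x_good = np.zeros(2 * n + 2 * l)
x_good[10] = 0.9                  # matched a real dictionary atom
x_occl = np.zeros(2 * n + 2 * l)
x_occl[2 * n + 5] = 0.9           # explained mostly by a trivial template
print(fragment_confidence(x_good, n, l) > fragment_confidence(x_occl, n, l))  # True
```

A fragment whose reconstruction leans on the trivial templates is likely occluded, so it scores low and is dropped before motion estimation.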
Motion vectors for the p′ candidates are obtained from the offset between the location of each candidate fragment and the corresponding matched fragment in the dictionary. We denote this set of motion vectors as MV = [mv1, mv2, ..., mvp′], where mvr = (xr, yr) is the motion vector of the rth candidate. The set of p′ motion vectors is reduced to (s+2) motion vectors by eliminating directional outliers. Outliers based on magnitude are then removed to get s motion vectors MV′ = [mv′1, mv′2, ..., mv′s]. The motion vector for the object is now estimated from the set MV′ using two methods. In the first method, the resultant motion vector MVobj,1 has its x and y components as the median values of the x and y components of the motion vectors in MV′. In the second method, the resultant motion vector MVobj,2 is computed using

MVobj,2 = (1/s) Σ_{r=1}^{s} |mv′r| · (Σ_{r=1}^{s} mv′r) / max(|Σ_{r=1}^{s} mv′r|, ε)    (7)

which is a vector with magnitude equal to the mean of |MV′| and direction along the resultant of MV′. ε is a small quantity to avoid division by zero.
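Assuming the outlier filtering has already reduced the vectors to the inlier set MV′, the two fusion rules can be sketched as:

```python
import numpy as np

def fuse_motion_vectors(mv, eps=1e-8):
    """Fuse the s inlier motion vectors MV' (array of shape (s, 2))
    into the two candidate object motions of Section 3.4:
    MVobj,1 - per-component median, and
    MVobj,2 - equation (7): mean magnitude along the resultant direction."""
    mv_obj1 = np.median(mv, axis=0)
    resultant = mv.sum(axis=0)
    unit = resultant / max(np.linalg.norm(resultant), eps)  # safe normalization
    mv_obj2 = np.linalg.norm(mv, axis=1).mean() * unit
    return mv_obj1, mv_obj2

# Five inlier vectors pointing roughly to the right
mv_in = np.array([[2.0, 0.0], [2.0, 1.0], [3.0, 0.0], [2.0, -1.0], [1.0, 0.0]])
m1, m2 = fuse_motion_vectors(mv_in)
print(m1)  # [2. 0.]
```

The median rule is robust to a stray inlier, while the resultant rule preserves the average displacement magnitude; Section 3.4 picks between them with the histogram-based confidence of equation (9).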
One of the two motion vectors MVobj,1 and MVobj,2 is chosen based on a confidence measure computed using histogram models of the object and background. The object histogram Pobj, with 20 bins, is constructed from the pixels occupying the central 25% area of the object. The background histogram Pbg, also with 20 bins, is constructed from the pixels in the area surrounding the object, up to 15 pixels away. These histograms are normalized. Figure 2 shows the areas used to construct these histograms. The area between the innermost rectangle and the middle rectangle is not used, as this region contains both object and background pixels, which would add confusion to the models. The likelihood map is calculated using equation (8) for the pixels occupying the central 25% area of the candidate area T. The confidence measure for each motion vector is taken as the sum of the corresponding likelihood values of the pixels, using equation (9).

Fig. 2. Pixels used to build the object and background histograms.
L(x, y) = [Pobj(b(T(x, y)))] / [max(Pbg(b(T(x, y))), ε)]    (8)

Lconf = Σ_x Σ_y L(x, y)    (9)

where the function b maps the pixel at location (x, y) to its bin. Of MVobj,1 and MVobj,2, the motion vector with the larger value of this confidence measure is chosen. A higher confidence measure implies that a larger number of pixels in the corresponding target area belong to the object than in the area pointed to by the other motion vector.
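A minimal sketch of equations (8) and (9); the uniform binning of 256 grey levels is an assumption, since the paper only fixes the number of bins:

```python
import numpy as np

def likelihood_confidence(patch, p_obj, p_bg, n_bins=20, eps=1e-8):
    """Equations (8)-(9): map each grey value in the central candidate
    patch T to its histogram bin b(T(x, y)) and sum the object-to-
    background likelihood ratios. `patch` holds grey levels in [0, 255]."""
    bins = np.minimum(patch.astype(int) * n_bins // 256, n_bins - 1)
    L = p_obj[bins] / np.maximum(p_bg[bins], eps)   # likelihood map, eq. (8)
    return L.sum()                                  # Lconf, eq. (9)

p_obj = np.zeros(20)
p_obj[19] = 1.0                    # object model: only bright pixels
p_bg = np.full(20, 0.05)           # background model: uniform
bright = np.full((2, 2), 255)
print(likelihood_confidence(bright, p_obj, p_bg))
```

A target area full of object-coloured pixels accumulates large ratios, so the motion vector that lands the window on the object wins the comparison.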
3.5. Dictionary update
Fragments in the second dictionary are chosen for update after analysing how well each fragment matched the candidate fragments. This can be inferred from the target coefficient vectors A2k. The maximum value along each row (each row corresponds to a fragment in the dictionary) of the matrix A = [A21, A22, ..., A2p] helps sort the fragments into those that matched very well, those that matched mildly, and those that did not match any candidate fragment at all. Since there are only p candidate fragments, a large portion of the dictionary fragments will not have matched at all, as indicated by their zero coefficient values. A small number of such fragments (depending on the update factor, expressed as a percentage of the total number of elements in each dictionary) are updated, since they contributed nothing in the current iteration. They are updated with the corresponding fragments of the tracking result, after a check on each new fragment based on the histogram models explained in Section 3.4. The likelihood map, inverse likelihood map and confidence measure of each new fragment F are computed as

Lf(x, y) = [Pobj(b(F(x, y)))] / [max(Pbg(b(F(x, y))), ε)]    (10)

ILf(x, y) = [Pbg(b(F(x, y)))] / [max(Pobj(b(F(x, y))), ε)]    (11)

Lconf,f = [Σ_x Σ_y Lf(x, y)] / [Σ_x Σ_y ILf(x, y)]    (12)
The fragment is updated only if the confidence measure Lconf,f > 1 (indicating the fragment has more pixels belonging to the object than to the background), to prevent erroneous updates of the dictionary fragments.
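The update rule can be sketched as follows; the helper name and the exact selection of which unmatched fragments to refresh are illustrative assumptions:

```python
import numpy as np

def update_unmatched_fragments(D2, A2, new_cols, conf, update_frac=0.05):
    """Sketch of the Section 3.5 rule (names are illustrative): replace a
    small fraction of dynamic-dictionary fragments whose row of
    A2 = [A21, ..., A2p] is all zero (they matched no candidate this
    frame), and only when the new fragment's confidence Lconf,f from
    equation (12) exceeds 1, i.e. it is more object than background."""
    row_max = np.abs(A2).max(axis=1)             # best match per dictionary fragment
    unmatched = np.flatnonzero(row_max == 0.0)   # fragments unused this frame
    budget = max(1, int(update_frac * D2.shape[1]))
    for j in unmatched[:budget]:                 # update only a few per frame
        if conf[j] > 1.0:
            D2[:, j] = new_cols[:, j]
    return D2

D2 = np.zeros((4, 3))
A2 = np.array([[0.7, 0.0], [0.0, 0.0], [0.0, 0.0]])  # fragment 0 matched; 1 and 2 did not
new = np.ones((4, 3))
D2 = update_unmatched_fragments(D2, A2, new, conf=np.array([2.0, 2.0, 0.5]),
                                update_frac=1.0)
print(D2[:, 1])  # fragment 1 replaced; fragment 2 blocked by low confidence
```

Updating only unused fragments keeps atoms that are actively explaining the object intact, which limits drift from bad updates.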
Algorithm 1 Proposed Tracking
1: Input: Initial position of the object in the first frame.
2: Initialize: D1 and D2 with overlapping fragments of the object.
3: repeat
4:   In the next frame, select the candidate from the same location as the object in the previous frame and prepare the set of p candidate fragments.
5:   Solve the l1 minimization problem using SPAMS [11] to sparsely reconstruct the candidate fragments in the space spanned by the dictionary fragments.
6:   Compute the confidence measure Cconf,k using equation (6).
7:   Choose the top p′ candidate fragments based on Cconf,k and compute their motion vectors MV.
8:   Remove outliers in MV based on direction and magnitude to get s motion vectors MV′.
9:   Compute motion vector MVobj,1 as the median of the x and y components of MV′.
10:  Compute motion vector MVobj,2 using equation (7).
11:  Choose MVobj,1 or MVobj,2 as the motion vector for the object, whichever gives the higher confidence measure based on the likelihood in (9).
12:  Update fragments of dictionary D2 that did not match any candidate fragment, if Lconf,f > 1.
13: until end of video feed
4. RESULTS AND DISCUSSION
The proposed tracker is implemented in MATLAB and tested on four different video sequences: pktest02 (450 frames), face (206 frames), panda (451 frames) and trellis (226 frames). We use the software (SPAMS) provided by [11] to solve the l1 minimization problem. For evaluating the performance of the proposed tracker, its results are compared with those of the l1 tracker proposed by Mei et al. [6]. The l1 tracker is configured with 300 particles and 10 object templates of size 12×15. The proposed tracker is configured with p = 25 candidate fragments of size 8×8, p′ = 21, s = 5, and an update factor of 5%.
Figure 3 shows the trajectory (position) error plot with respect to ground truth for the four videos using the proposed method and the l1 tracker [6]. Table 1 summarizes the performance of the trackers under consideration. It can be seen that the proposed tracker achieves real-time performance with better accuracy compared to the particle filter based l1 tracker [6] when executed on a PC. The proposed tracker runs 60-70 times faster than [6]. Figures 4, 5, 6 and 7 show the tracking results. The results of the proposed approach and the l1 tracker are shown by blue (solid) and yellow (dashed) windows respectively. In Figure 4, frame 153 shows that the l1 tracker failed when the car was occluded by the tree, and it continues to drift away. The proposed tracker survives the occlusion and gradual pose change, as seen in frames 153, 156, 219 and 430. Figure 5 shows that the proposed tracker is also robust to changes in appearance and illumination, at frames 69, 114 and 192. Figure 6 shows that the proposed tracker was able to track drastic changes in pose when the panda changes its direction of motion, while the tracker in [6] fails at frames 94 and 327. Figure 7 shows the ability of the proposed tracker to track the object even under partial illumination changes, thanks to the fragment based approach. In frame 71, it can be seen that the lower left region is illuminated more. Fragments in the lower left region would give low confidence measures and are discarded before computing the object displacement, whereas the tracker in [6] uses the entire object to build its dictionary of templates and hence fails to track the object under such partial illumination changes. The videos corresponding to the results presented in Figs. 4 to 7 are available at http://www.serc.iisc.ernet.in/∼venky/tracking results/.

Fig. 3. Trajectory position error (absolute error versus frame number) with respect to ground truth for: (a) pktest02, (b) face, (c) panda and (d) trellis sequences.
Table 1. Execution time and trajectory error (RMSE) comparison of the proposed tracker and the l1 tracker [6].

Video (number     Execution time per frame (s)    Trajectory error (RMSE)
of frames)        Proposed      [6]               Proposed      [6]
pktest02 (450)    0.0316        2.0770            2.9878        119.5893
face (206)        0.0308        2.2194            7.0961        9.5666
panda (451)       0.0303        2.2742            4.7350        25.5386
trellis (226)     0.0301        2.1269            12.8113       42.3399
5. CONCLUSION AND FUTURE WORK
In this paper we have proposed a computationally efficient tracking algorithm which makes use of fragments of the object and candidate to track the object. The performance of the proposed tracker has been demonstrated on various complex video sequences, and it is shown to perform better than the earlier tracker in terms of both accuracy and speed. Future work includes improvement of the dictionary and its update mechanism to model the changes in pose, size and illumination of the object more precisely.
6. REFERENCES
[1] A. Yilmaz, O. Javed, and M. Shah, “Object tracking: A survey,” ACM Computing Surveys, vol. 38, no. 4, 2006.
[2] G. D. Hager and P. N. Belhumeur, “Efficient region tracking with parametric models of geometry and illumination,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 10, pp. 1025–1039, 1998.

Fig. 4. Result for pktest02 video at frames 5, 153, 156, 219 and 430. [Color convention for all results: solid blue - proposed tracker; dashed yellow - l1 tracker.]

Fig. 5. Result for face video at frames 3, 10, 69, 114 and 192.

Fig. 6. Result for panda video at frames 4, 45, 94, 327 and 450.

Fig. 7. Result for trellis video at frames 12, 24, 71, 141 and 226.
[3] D. Comaniciu, V. Ramesh, and P. Meer, “Real-time tracking of non-rigid objects using mean shift,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2000, vol. 2, pp. 142–149.

[4] D. A. Ross, J. Lim, R.-S. Lin, and M.-H. Yang, “Incremental learning for robust visual tracking,” International Journal of Computer Vision, vol. 77, no. 1-3, pp. 125–141, 2008.

[5] F. Porikli, O. Tuzel, and P. Meer, “Covariance tracking using model update based on Lie algebra,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2006, vol. 1, pp. 728–735.

[6] X. Mei and H. Ling, “Robust visual tracking using l1 minimization,” in Proceedings of IEEE International Conference on Computer Vision, 2009, pp. 1436–1443.

[7] H. Liu and F. Sun, “Visual tracking using sparsity induced similarity,” in Proceedings of IEEE International Conference on Pattern Recognition, 2010, pp. 1702–1705.

[8] H. Li and C. Shen, “Robust real-time visual tracking with compressed sensing,” in Proceedings of IEEE International Conference on Image Processing, 2010.

[9] R. Xu, B. Zhang, Q. Ye, and J. Jiao, “Human detection in images via l1-norm minimization learning,” in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, 2010, pp. 3566–3569.

[10] J. Wright, A. Y. Yang, A. Ganesh, S. S. Sastry, and Y. Ma, “Robust face recognition via sparse representation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 1, pp. 210–227, 2009.

[11] SPAMS (SPArse Modeling Software), http://www.di.ens.fr/willow/spams/.